Paperid: 1, https://arxiv.org/pdf/2503.24391.pdf   GitHub
Authors:Xingyu Chen, Yue Chen, Yuliang Xiu, Andreas Geiger, Anpei Chen
Title: Easi3R: Estimating Disentangled Motion from DUSt3R Without Training
Abstract:
Recent advances in DUSt3R have enabled robust estimation of dense point clouds and camera parameters of static scenes, leveraging Transformer network architectures and direct supervision on large-scale 3D datasets. In contrast, the limited scale and diversity of available 4D datasets present a major bottleneck for training a highly generalizable 4D model. This constraint has driven conventional 4D methods to fine-tune 3D models on scalable dynamic video data with additional geometric priors such as optical flow and depths. In this work, we take an opposite path and introduce Easi3R, a simple yet efficient training-free method for 4D reconstruction. Our approach applies attention adaptation during inference, eliminating the need for from-scratch pre-training or network fine-tuning. We find that the attention layers in DUSt3R inherently encode rich information about camera and object motion. By carefully disentangling these attention maps, we achieve accurate dynamic region segmentation, camera pose estimation, and 4D dense point map reconstruction. Extensive experiments on real-world dynamic videos demonstrate that our lightweight attention adaptation significantly outperforms previous state-of-the-art methods that are trained or finetuned on extensive dynamic datasets. Our code is publicly available for research purpose at https://easi3r.github.io/
中文: Easi3R提出了一种无需训练的4D重建方法,通过解耦DUSt3R中的注意力机制实现动态场景建模,其性能显著优于依赖大量数据训练的现有方法。
English: Easi3R introduces a training-free 4D reconstruction method that adapts attention mechanisms in DUSt3R during inference, achieving superior dynamic scene modeling without pre-training or fine-tuning.

Authors:Xingyu Chen, Yue Chen, Yuliang Xiu, Andreas Geiger, Anpei Chen
Title: Easi3R: Estimating Disentangled Motion from DUSt3R Without Training
Abstract:
Recent advances in DUSt3R have enabled robust estimation of dense point clouds and camera parameters of static scenes, leveraging Transformer network architectures and direct supervision on large-scale 3D datasets. In contrast, the limited scale and diversity of available 4D datasets present a major bottleneck for training a highly generalizable 4D model. This constraint has driven conventional 4D methods to fine-tune 3D models on scalable dynamic video data with additional geometric priors such as optical flow and depths. In this work, we take an opposite path and introduce Easi3R, a simple yet efficient training-free method for 4D reconstruction. Our approach applies attention adaptation during inference, eliminating the need for from-scratch pre-training or network fine-tuning. We find that the attention layers in DUSt3R inherently encode rich information about camera and object motion. By carefully disentangling these attention maps, we achieve accurate dynamic region segmentation, camera pose estimation, and 4D dense point map reconstruction. Extensive experiments on real-world dynamic videos demonstrate that our lightweight attention adaptation significantly outperforms previous state-of-the-art methods that are trained or finetuned on extensive dynamic datasets. Our code is publicly available for research purpose at https://easi3r.github.io/
中文: Easi3R提出了一种无需训练的4D重建方法,通过解耦DUSt3R中的注意力机制实现动态场景建模,其性能显著优于依赖大量数据训练的现有方法。
English: Easi3R introduces a training-free 4D reconstruction method that adapts attention mechanisms in DUSt3R during inference, achieving superior dynamic scene modeling without pre-training or fine-tuning.

Authors:Chenyang Li, Wenxuan Liu, Guoqiang Gong, Xiaobo Ding, Xian Zhong
Title: SU-YOLO: Spiking Neural Network for Efficient Underwater Object Detection
Abstract:
Underwater object detection is critical for oceanic research and industrial safety inspections. However, the complex optical environment and the limited resources of underwater equipment pose significant challenges to achieving high accuracy and low power consumption. To address these issues, we propose Spiking Underwater YOLO (SU-YOLO), a Spiking Neural Network (SNN) model. Leveraging the lightweight and energy-efficient properties of SNNs, SU-YOLO incorporates a novel spike-based underwater image denoising method based solely on integer addition, which enhances the quality of feature maps with minimal computational overhead. In addition, we introduce Separated Batch Normalization (SeBN), a technique that normalizes feature maps independently across multiple time steps and is optimized for integration with residual structures to capture the temporal dynamics of SNNs more effectively. The redesigned spiking residual blocks integrate the Cross Stage Partial Network (CSPNet) with the YOLO architecture to mitigate spike degradation and enhance the model's feature extraction capabilities. Experimental results on URPC2019 underwater dataset demonstrate that SU-YOLO achieves mAP of 78.8% with 6.97M parameters and an energy consumption of 2.98 mJ, surpassing mainstream SNN models in both detection accuracy and computational efficiency. These results underscore the potential of SNNs for engineering applications. The code is available in https://github.com/lwxfight/snn-underwater.
中文:提出的脉冲水下YOLO(SU-YOLO)模型利用脉冲神经网络的高能效特性,结合创新的去噪和归一化方法,在低功耗条件下实现了卓越的水下目标检测性能。
English: The proposed Spiking Underwater YOLO (SU-YOLO) model leverages spiking neural networks' energy efficiency and introduces novel denoising and normalization techniques to achieve superior underwater object detection accuracy with minimal computational resources.

Authors:Rui Wang, Hongru Wang, Boyang Xue, Jianhui Pang, Shudong Liu, Yi Chen, Jiahao Qiu, Derek Fai Wong, Heng Ji, Kam-Fai Wong
Title: Harnessing the Reasoning Economy: A Survey of Efficient Reasoning for Large Language Models
Abstract:
Recent advancements in Large Language Models (LLMs) have significantly enhanced their ability to perform complex reasoning tasks, transitioning from fast and intuitive thinking (System 1) to slow and deep reasoning (System 2). While System 2 reasoning improves task accuracy, it often incurs substantial computational costs due to its slow thinking nature and inefficient or unnecessary reasoning behaviors. In contrast, System 1 reasoning is computationally efficient but leads to suboptimal performance. Consequently, it is critical to balance the trade-off between performance (benefits) and computational costs (budgets), giving rise to the concept of reasoning economy. In this survey, we provide a comprehensive analysis of reasoning economy in both the post-training and test-time inference stages of LLMs, encompassing i) the cause of reasoning inefficiency, ii) behavior analysis of different reasoning patterns, and iii) potential solutions to achieve reasoning economy. By offering actionable insights and highlighting open challenges, we aim to shed light on strategies for improving the reasoning economy of LLMs, thereby serving as a valuable resource for advancing research in this evolving area. We also provide a public repository to continually track developments in this fast-evolving field.
中文: 本综述通过分析快速直觉思维的计算效率与深度推理的准确性之间的权衡,探究大语言模型的推理经济性,涵盖训练和推理阶段中的效率成因、行为模式及潜在解决方案。
English: This survey explores reasoning economy in Large Language Models by analyzing the trade-offs between System 1's computational efficiency and System 2's accuracy, examining causes of inefficiency, behavioral patterns, and potential solutions across training and inference stages.

Authors:Yi Chen, Yuying Ge, Rui Wang, Yixiao Ge, Lu Qiu, Ying Shan, Xihui Liu
Title: Exploring the Effect of Reinforcement Learning on Video Understanding: Insights from SEED-Bench-R1
Abstract:
Recent advancements in Chain of Thought (COT) generation have significantly improved the reasoning capabilities of Large Language Models (LLMs), with reinforcement learning (RL) emerging as an effective post-training approach. Multimodal Large Language Models (MLLMs) inherit this reasoning potential but remain underexplored in tasks requiring both perception and logical reasoning. To address this, we introduce SEED-Bench-R1, a benchmark designed to systematically evaluate post-training methods for MLLMs in video understanding. It includes intricate real-world videos and complex everyday planning tasks in the format of multiple-choice questions, requiring sophisticated perception and reasoning. SEED-Bench-R1 assesses generalization through a three-level hierarchy: in-distribution, cross-environment, and cross-environment-task scenarios, equipped with a large-scale training dataset with easily verifiable ground-truth answers. Using Qwen2-VL-Instruct-7B as a base model, we compare RL with supervised fine-tuning (SFT), demonstrating RL's data efficiency and superior performance on both in-distribution and out-of-distribution tasks, even outperforming SFT on general video understanding benchmarks like LongVideoBench. Our detailed analysis reveals that RL enhances visual perception but often produces less logically coherent reasoning chains. We identify key limitations such as inconsistent reasoning and overlooked visual cues, and suggest future improvements in base model reasoning, reward modeling, and RL robustness against noisy signals.
Chinese: 强化学习的最新进展提升了多模态大语言模型的推理能力,SEED-Bench-R1基准测试验证了其有效性,但在复杂视觉推理任务中保持逻辑连贯性仍是待解决的挑战。
English: Recent advancements in reinforcement learning have enhanced multimodal large language models' reasoning capabilities, as demonstrated by the SEED-Bench-R1 benchmark, though challenges remain in maintaining logical coherence during complex visual reasoning tasks.

Authors:Shuaizheng Liu, Jianqi Ma, Lingchen Sun, Xiangtao Kong, Lei Zhang
Title: InstructRestore: Region-Customized Image Restoration with Human Instructions
Abstract:
Despite the significant progress in diffusion prior-based image restoration, most existing methods apply uniform processing to the entire image, lacking the capability to perform region-customized image restoration according to user instructions. In this work, we propose a new framework, namely InstructRestore, to perform region-adjustable image restoration following human instructions. To achieve this, we first develop a data generation engine to produce training triplets, each consisting of a high-quality image, the target region description, and the corresponding region mask. With this engine and careful data screening, we construct a comprehensive dataset comprising 536,945 triplets to support the training and evaluation of this task. We then examine how to integrate the low-quality image features under the ControlNet architecture to adjust the degree of image details enhancement. Consequently, we develop a ControlNet-like model to identify the target region and allocate different integration scales to the target and surrounding regions, enabling region-customized image restoration that aligns with user instructions. Experimental results demonstrate that our proposed InstructRestore approach enables effective human-instructed image restoration, such as images with bokeh effects and user-instructed local enhancement. Our work advances the investigation of interactive image restoration and enhancement techniques. Data, code, and models will be found at https://github.com/shuaizhengliu/InstructRestore.git.
中文: 本文提出了InstructRestore新框架,通过类似ControlNet的模型实现基于用户指令的区域定制化图像修复,能够根据描述对特定区域进行差异化细节增强。
English: This paper introduces InstructRestore, a novel framework that enables region-specific image restoration guided by human instructions, utilizing a ControlNet-like model to adaptively enhance details in targeted areas based on user input.

Authors:Adam Schmidt, Mert Asim Karaoglu, Soham Sinha, Mingang Jang, Ho-Gun Ha, Kyungmin Jung, Kyeongmo Gu, Ihsan Ullah, Hyunki Lee, Jonáš Šerých, Michal Neoral, Jiří Matas, Rulin Zhou, Wenlong He, An Wang, Hongliang Ren, Bruno Silva, Sandro Queirós, Estêvão Lima, João L. Vilaça, Shunsuke Kikuchi, Atsushi Kouno, Hiroki Matsuzaki, Tongtong Li, Yulu Chen, Ling Li, Xiang Ma, Xiaojian Li, Mona Sheikh Zeinoddin, Xu Wang, Zafer Tandogdu, Greg Shaw, Evangelos Mazomenos, Danail Stoyanov, Yuxin Chen, Zijian Wu, Alexander Ladikos, Simon DiMaio, Septimiu E. Salcudean, Omid Mohareri
Title: Point Tracking in Surgery--The 2024 Surgical Tattoos in Infrared (STIR) Challenge
Abstract:
Understanding tissue motion in surgery is crucial to enable applications in downstream tasks such as segmentation, 3D reconstruction, virtual tissue landmarking, autonomous probe-based scanning, and subtask autonomy. Labeled data are essential to enabling algorithms in these downstream tasks since they allow us to quantify and train algorithms. This paper introduces a point tracking challenge to address this, wherein participants can submit their algorithms for quantification. The submitted algorithms are evaluated using a dataset named surgical tattoos in infrared (STIR), with the challenge aptly named the STIR Challenge 2024. The STIR Challenge 2024 comprises two quantitative components: accuracy and efficiency. The accuracy component tests the accuracy of algorithms on in vivo and ex vivo sequences. The efficiency component tests the latency of algorithm inference. The challenge was conducted as a part of MICCAI EndoVis 2024. In this challenge, we had 8 total teams, with 4 teams submitting before and 4 submitting after challenge day. This paper details the STIR Challenge 2024, which serves to move the field towards more accurate and efficient algorithms for spatial understanding in surgery. In this paper we summarize the design, submissions, and results from the challenge. The challenge dataset is available here: https://zenodo.org/records/14803158 , and the code for baseline models and metric calculation is available here: https://github.com/athaddius/STIRMetrics
中文: STIR挑战赛2024通过点追踪任务评估算法在STIR数据集上的准确性与效率,旨在推动手术空间理解技术的发展,本文详细记录了八支参赛队伍的设计方案与比赛结果。
English: The STIR Challenge 2024 was introduced to advance surgical spatial understanding by evaluating algorithm accuracy and efficiency through point tracking on the STIR dataset, with results from eight participating teams detailed in this paper.

Authors:Sewoong Lee, Adam Davies, Marc E. Canby, Julia Hockenmaier
Title: Evaluating and Designing Sparse Autoencoders by Approximating Quasi-Orthogonality
Abstract:
Sparse autoencoders (SAEs) are widely used in mechanistic interpretability research for large language models; however, the state-of-the-art method of using $k$-sparse autoencoders lacks a theoretical grounding for selecting the hyperparameter $k$ that represents the number of nonzero activations, often denoted by $\ell_0$. In this paper, we reveal a theoretical link that the $\ell_2$-norm of the sparse feature vector can be approximated with the $\ell_2$-norm of the dense vector with a closed-form error, which allows sparse autoencoders to be trained without the need to manually determine $\ell_0$. Specifically, we validate two applications of our theoretical findings. First, we introduce a new methodology that can assess the feature activations of pre-trained SAEs by computing the theoretically expected value from the input embedding, which has been overlooked by existing SAE evaluation methods and loss functions. Second, we introduce a novel activation function, top-AFA, which builds upon our formulation of approximate feature activation (AFA). This function enables top-$k$ style activation without requiring a constant hyperparameter $k$ to be tuned, dynamically determining the number of activated features for each input. By training SAEs on three intermediate layers to reconstruct GPT2 hidden embeddings for over 80 million tokens from the OpenWebText dataset, we demonstrate the empirical merits of this approach and compare it with current state-of-the-art $k$-sparse autoencoders. Our code is available at: https://github.com/SewoongLee/top-afa-sae.
中文: 本文揭示了稀疏与稠密向量$\ell_2$范数间的理论关联,使得稀疏自编码器无需手动设定$\ell_0$超参数即可训练,并提出了新的评估方法和动态激活函数,在实验中优于当前最优的$k$-稀疏自编码器。
English: This paper establishes a theoretical connection between the $\ell_2$-norms of sparse and dense vectors, enabling sparse autoencoders to be trained without manually setting the $\ell_0$ hyperparameter, and introduces both a new evaluation method and a dynamic activation function that outperforms current $k$-sparse autoencoders.

Authors:Zhengren Wang, Rui Ling, Chufan Wang, Yongan Yu, Sizhe Wang, Zhiyu Li, Feiyu Xiong, Wentao Zhang
Title: MaintainCoder: Maintainable Code Generation Under Dynamic Requirements
Abstract:
Modern code generation has made significant strides in functional correctness and execution efficiency. However, these systems often overlook a critical dimension in real-world software development: \textit{maintainability}. To handle dynamic requirements with minimal rework, we propose \textbf{MaintainCoder} as a pioneering solution. It integrates the Waterfall model, design patterns, and multi-agent collaboration to systematically enhance cohesion, reduce coupling, achieving clear responsibility boundaries and better maintainability. We also introduce \textbf{MaintainBench}, a benchmark comprising requirement changes and novel dynamic metrics on maintenance efforts. Experiments demonstrate that existing code generation methods struggle to meet maintainability standards when requirements evolve. In contrast, MaintainCoder improves dynamic maintainability metrics by more than 60\% with even higher correctness of initial codes. Furthermore, while static metrics fail to accurately reflect maintainability and even contradict each other, our proposed dynamic metrics exhibit high consistency. Our work not only provides the foundation for maintainable code generation, but also highlights the need for more realistic and comprehensive code generation research.
中文摘要:MaintainCoder通过融合瀑布模型、设计模式和多智能体协作,提出了一种创新方法,将动态可维护性指标提升超过60%,同时揭示了现有静态指标在反映代码可维护性方面的不足。
English Summary: MaintainCoder introduces a novel approach integrating the Waterfall model, design patterns, and multi-agent collaboration to significantly enhance code maintainability, improving dynamic metrics by over 60% while demonstrating the limitations of existing static metrics.

Authors:Zhengren Wang, Rui Ling, Chufan Wang, Yongan Yu, Sizhe Wang, Zhiyu Li, Feiyu Xiong, Wentao Zhang
Title: MaintainCoder: Maintainable Code Generation Under Dynamic Requirements
Abstract:
Modern code generation has made significant strides in functional correctness and execution efficiency. However, these systems often overlook a critical dimension in real-world software development: maintainability. To handle dynamic requirements with minimal rework, we propose MaintainCoder as a pioneering solution. It integrates the Waterfall model, design patterns, and multi-agent collaboration to systematically enhance cohesion, reduce coupling, achieving clear responsibility boundaries and better maintainability. We also introduce MaintainCoder, a benchmark comprising requirement changes and novel dynamic metrics on maintenance efforts. Experiments demonstrate that existing code generation methods struggle to meet maintainability standards when requirements evolve. In contrast, MaintainCoder improves dynamic maintainability metrics by more than 60% with even higher correctness of initial codes. Furthermore, while static metrics fail to accurately reflect maintainability and even contradict each other, our proposed dynamic metrics exhibit high consistency. Our work not only provides the foundation for maintainable code generation, but also highlights the need for more realistic and comprehensive code generation research. Resources: https://github.com/IAAR-Shanghai/MaintainCoder.
中文摘要:MaintainCoder通过融合瀑布模型、设计模式和多智能体协作,提出了一种创新方法,将动态可维护性指标提升超过60%,同时揭示了现有静态指标在反映代码可维护性方面的不足。
English Summary: MaintainCoder introduces a novel approach integrating the Waterfall model, design patterns, and multi-agent collaboration to significantly enhance code maintainability, improving dynamic metrics by over 60% while demonstrating the limitations of existing static metrics.

Authors:Qiyuan Zhang, Fuyuan Lyu, Zexu Sun, Lei Wang, Weixu Zhang, Wenyue Hua, Haolun Wu, Zhihan Guo, Yufei Wang, Niklas Muennighoff, Irwin King, Xue Liu, Chen Ma
Title: A Survey on Test-Time Scaling in Large Language Models: What, How, Where, and How Well?
Abstract:
As enthusiasm for scaling computation (data and parameters) in the pretraining era gradually diminished, test-time scaling (TTS), also referred to as ``test-time computing'' has emerged as a prominent research focus. Recent studies demonstrate that TTS can further elicit the problem-solving capabilities of large language models (LLMs), enabling significant breakthroughs not only in specialized reasoning tasks, such as mathematics and coding, but also in general tasks like open-ended Q&A. However, despite the explosion of recent efforts in this area, there remains an urgent need for a comprehensive survey offering a systemic understanding. To fill this gap, we propose a unified, multidimensional framework structured along four core dimensions of TTS research: what to scale, how to scale, where to scale, and how well to scale. Building upon this taxonomy, we conduct an extensive review of methods, application scenarios, and assessment aspects, and present an organized decomposition that highlights the unique functional roles of individual techniques within the broader TTS landscape. From this analysis, we distill the major developmental trajectories of TTS to date and offer hands-on guidelines for practical deployment. Furthermore, we identify several open challenges and offer insights into promising future directions, including further scaling, clarifying the functional essence of techniques, generalizing to more tasks, and more attributions. Our repository is available on https://github.com/testtimescaling/testtimescaling.github.io/
中文摘要:随着预训练阶段计算扩展的热潮渐退,测试时扩展已成为研究热点,它能显著提升大语言模型在推理和通用任务中的能力,本文为此提出系统性框架并总结实践指南。
English Summary: As interest in scaling computation during pretraining wanes, test-time scaling has become a key research area, enhancing LLMs' capabilities in reasoning and general tasks, prompting this comprehensive survey that introduces a framework and practical guidelines for the field.

Authors:Karim Radouane, Hanane Azzag, Mustapha lebbah
Title: MB-ORES: A Multi-Branch Object Reasoner for Visual Grounding in Remote Sensing
Abstract:
We propose a unified framework that integrates object detection (OD) and visual grounding (VG) for remote sensing (RS) imagery. To support conventional OD and establish an intuitive prior for VG task, we fine-tune an open-set object detector using referring expression data, framing it as a partially supervised OD task. In the first stage, we construct a graph representation of each image, comprising object queries, class embeddings, and proposal locations. Then, our task-aware architecture processes this graph to perform the VG task. The model consists of: (i) a multi-branch network that integrates spatial, visual, and categorical features to generate task-aware proposals, and (ii) an object reasoning network that assigns probabilities across proposals, followed by a soft selection mechanism for final referring object localization. Our model demonstrates superior performance on the OPT-RSVG and DIOR-RSVG datasets, achieving significant improvements over state-of-the-art methods while retaining classical OD capabilities. The code will be available in our repository: \url{https://github.com/rd20karim/MB-ORES}.
中文摘要:本研究提出了一种统一框架,将目标检测与视觉定位相结合应用于遥感图像,通过微调开放集检测器和任务感知架构,在基准数据集上实现了优于现有方法的性能,同时保持了传统检测功能。
English Summary: This study introduces a unified framework that combines object detection and visual grounding for remote sensing imagery, utilizing a fine-tuned open-set detector and a task-aware architecture to achieve state-of-the-art performance on benchmark datasets while preserving traditional detection capabilities.

Authors:Jianhao Li, Xianchao Xiu
Title: LLM4FS: Leveraging Large Language Models for Feature Selection
Abstract:
Recent advances in large language models (LLMs) have provided new opportunities for decision-making, particularly in the task of automated feature selection. In this paper, we first comprehensively evaluate LLM-based feature selection methods, covering the state-of-the-art DeepSeek-R1, GPT-o3-mini, and GPT-4.5. Then, we propose a new hybrid strategy called LLM4FS that integrates LLMs with traditional data-driven methods. Specifically, input data samples into LLMs, and directly call traditional data-driven techniques such as random forest and forward sequential selection. Notably, our analysis reveals that the hybrid strategy leverages the contextual understanding of LLMs and the high statistical reliability of traditional data-driven methods to achieve excellent feature selection performance, even surpassing LLMs and traditional data-driven methods. Finally, we point out the limitations of its application in decision-making. Our code is available at https://github.com/xianchaoxiu/LLM4FS.
中文: 本文提出LLM4FS混合策略,将大语言模型与传统数据驱动方法相结合,通过整合语境理解能力和统计可靠性优势,在特征选择任务中实现卓越性能,同时指出了该方法在决策应用中的局限性。
English: This paper introduces LLM4FS, a hybrid strategy that combines large language models with traditional data-driven methods to enhance automated feature selection by leveraging their respective strengths in contextual understanding and statistical reliability, achieving superior performance despite noted limitations in decision-making applications.

Authors:Zhiming Ma, Peidong Wang, Minhua Huang, Jingpeng Wang, Kai Wu, Xiangzhao Lv, Yachun Pang, Yin Yang, Wenjie Tang, Yuchen Kang
Title: TeleAntiFraud-28k: An Audio-Text Slow-Thinking Dataset for Telecom Fraud Detection
Abstract:
The detection of telecom fraud faces significant challenges due to the lack of high-quality multimodal training data that integrates audio signals with reasoning-oriented textual analysis. To address this gap, we present TeleAntiFraud-28k, the first open-source audio-text slow-thinking dataset specifically designed for automated telecom fraud analysis. Our dataset is constructed through three strategies: (1) Privacy-preserved text-truth sample generation using automatically speech recognition (ASR)-transcribed call recordings (with anonymized original audio), ensuring real-world consistency through text-to-speech (TTS) model regeneration; (2) Semantic enhancement via large language model (LLM)-based self-instruction sampling on authentic ASR outputs to expand scenario coverage; (3) Multi-agent adversarial synthesis that simulates emerging fraud tactics through predefined communication scenarios and fraud typologies. The generated dataset contains 28,511 rigorously processed speech-text pairs, complete with detailed annotations for fraud reasoning. The dataset is divided into three tasks: scenario classification, fraud detection, fraud type classification. Furthermore, we construct TeleAntiFraud-Bench, a standardized evaluation benchmark comprising proportionally sampled instances from the dataset, to facilitate systematic testing of model performance on telecom fraud detection tasks. We also contribute a production-optimized supervised fine-tuning (SFT) model trained on hybrid real/synthetic data, while open-sourcing the data processing framework to enable community-driven dataset expansion. This work establishes a foundational framework for multimodal anti-fraud research while addressing critical challenges in data privacy and scenario diversity. The project will be released at https://github.com/JimmyMa99/TeleAntiFraud.
中文: 本文提出了首个开源音频-文本电信反欺诈数据集TeleAntiFraud-28k,通过隐私保护合成和语义增强策略解决数据稀缺问题,并配套开发了评估基准与优化模型,为多模态反欺诈研究奠定基础。
English: This paper introduces TeleAntiFraud-28k, the first open-source audio-text dataset designed for telecom fraud detection, addressing data scarcity through privacy-preserved synthesis and semantic enhancement strategies, along with a benchmark and fine-tuned model for multimodal anti-fraud research.

Authors:Xiangyuan Peng, Miao Tang, Huawei Sun, Kay Bierzynski, Lorenzo Servadei, Robert Wille
Title: 4D mmWave Radar for Sensing Enhancement in Adverse Environments: Advances and Challenges
Abstract:
Intelligent transportation systems require accurate and reliable sensing. However, adverse environments, such as rain, snow, and fog, can significantly degrade the performance of LiDAR and cameras. In contrast, 4D mmWave radar not only provides 3D point clouds and velocity measurements but also maintains robustness in challenging conditions. Recently, research on 4D mmWave radar under adverse environments has been growing, but a comprehensive review is still lacking. To bridge this gap, this work reviews the current research on 4D mmWave radar under adverse environments. First, we present an overview of existing 4D mmWave radar datasets encompassing diverse weather and lighting scenarios. Subsequently, we analyze existing learning-based methods leveraging 4D mmWave radar to enhance performance according to different adverse conditions. Finally, the challenges and potential future directions are discussed for advancing 4D mmWave radar applications in harsh environments. To the best of our knowledge, this is the first review specifically concentrating on 4D mmWave radar in adverse environments. The related studies are listed at: https://github.com/XiangyPeng/4D-mmWave-Radar-in-Adverse-Environments.
中文: 本综述首次专门聚焦于恶劣环境下的4D毫米波雷达研究,系统梳理了多场景数据集与基于学习的增强方法,并探讨了该技术在智能交通系统中面临的挑战与发展方向。
English: This review addresses the lack of comprehensive analysis by summarizing current research on 4D mmWave radar, which maintains robust performance in adverse conditions like rain and fog, covering datasets, learning-based methods, and future challenges for intelligent transportation systems.

Authors:Tong Xie, Jiawang Zhao, Zishen Wan, Zuodong Zhang, Yuan Wang, Runsheng Wang, Ru Huang, Meng Li
Title: ReaLM: Reliable and Efficient Large Language Model Inference with Statistical Algorithm-Based Fault Tolerance
Abstract:
The demand for efficient large language model (LLM) inference has propelled the development of dedicated accelerators. As accelerators are vulnerable to hardware faults due to aging, variation, etc, existing accelerator designs often reserve a large voltage margin or leverage algorithm-based fault tolerance (ABFT) techniques to ensure LLM inference correctness. However, previous methods often overlook the inherent fault tolerance of LLMs, leading to high computation and energy overhead. To enable reliable yet efficient LLM inference, in this paper, we propose a novel algorithm/circuit co-design framework, dubbed ReaLM. For the first time, we systematically characterize the fault tolerance of LLMs by performing a large-scale error injection study of representative LLMs and natural language understanding tasks. Then, we propose a statistical ABFT algorithm that fully leverages the error robustness to minimize error recovery as much as possible. We also customize the error detection circuits to enable a low-cost online collection of error statistics. Extensive experiments show that with only 1.42% circuit area and 1.79% power overhead, our ReaLM can reduce perplexity degradation from 18.54 to 0.29. Compared to existing methods, ReaLM consistently reduces recovery costs across different operating voltages and improves energy efficiency by up to 35.83% without compromising LLM performance. Our error injection code is available at https://github.com/PKU-SEC-Lab/ReaLM_DAC25/
Chinese Summary: 本文提出ReaLM这一算法/电路协同设计框架,通过系统利用大语言模型固有的容错特性,在保证可靠性的同时显著降低能耗,仅需极小的硬件开销即可实现高效推理。
English Summary: This paper introduces ReaLM, a novel algorithm/circuit co-design framework that leverages the inherent fault tolerance of large language models to enable reliable and energy-efficient inference with minimal hardware overhead.

Authors:Wei Gao, Xinyu Zhou, Peng Sun, Tianwei Zhang, Yonggang Wen
Title: Rethinking Key-Value Cache Compression Techniques for Large Language Model Serving
Abstract:
Key-Value cache (\texttt{KV} \texttt{cache}) compression has emerged as a promising technique to optimize Large Language Model (LLM) serving. It primarily decreases the memory consumption of \texttt{KV} \texttt{cache} to reduce the computation cost. Despite the development of many compression algorithms, their applications in production environments are still not prevalent. In this paper, we revisit mainstream \texttt{KV} \texttt{cache} compression solutions from a practical perspective. Our contributions are three-fold. First, we comprehensively review existing algorithmic designs and benchmark studies for \texttt{KV} \texttt{cache} compression and identify missing pieces in their performance measurement, which could hinder their adoption in practice. Second, we empirically evaluate representative \texttt{KV} \texttt{cache} compression methods to uncover two key issues that affect the computational efficiency: (1) while compressing \texttt{KV} \texttt{cache} can reduce memory consumption, current implementations (e.g., FlashAttention, PagedAttention) do not optimize for production-level LLM serving, resulting in suboptimal throughput performance; (2) compressing \texttt{KV} \texttt{cache} may lead to longer outputs, resulting in increased end-to-end latency. We further investigate the accuracy performance of individual samples rather than the overall performance, revealing the intrinsic limitations in \texttt{KV} \texttt{cache} compression when handling specific LLM tasks. Third, we provide tools to shed light on future \texttt{KV} \texttt{cache} compression studies and facilitate their practical deployment in production. They are open-sourced in \href{https://github.com/LLMkvsys/rethink-kv-compression}{https://github.com/LLMkvsys/rethink-kv-compression}.
中文: KV缓存压缩技术虽能降低大语言模型服务中的内存消耗,但在实际部署中存在吞吐量不足和延迟增加等问题,本文通过全面评估并开源工具以推动其未来发展。
English: KV cache compression reduces memory usage in LLM serving but faces practical deployment challenges, including suboptimal throughput and increased latency, prompting a comprehensive evaluation and open-source tools for future improvements.

Authors:Yanbo Wang, Yongtao Chen, Chuan Cao, Tianchen Deng, Wentao Zhao, Jingchuan Wang, Weidong Chen
Title: SALT: A Flexible Semi-Automatic Labeling Tool for General LiDAR Point Clouds with Cross-Scene Adaptability and 4D Consistency
Abstract:
We propose a flexible Semi-Automatic Labeling Tool (SALT) for general LiDAR point clouds with cross-scene adaptability and 4D consistency. Unlike recent approaches that rely on camera distillation, SALT operates directly on raw LiDAR data, automatically generating pre-segmentation results. To achieve this, we propose a novel zero-shot learning paradigm, termed data alignment, which transforms LiDAR data into pseudo-images by aligning with the training distribution of vision foundation models. Additionally, we design a 4D-consistent prompting strategy and 4D non-maximum suppression module to enhance SAM2, ensuring high-quality, temporally consistent presegmentation. SALT surpasses the latest zero-shot methods by 18.4% PQ on SemanticKITTI and achieves nearly 40-50% of human annotator performance on our newly collected low-resolution LiDAR data and on combined data from three LiDAR types, significantly boosting annotation efficiency. We anticipate that SALT's open-sourcing will catalyze substantial expansion of current LiDAR datasets and lay the groundwork for the future development of LiDAR foundation models. Code is available at https://github.com/Cavendish518/SALT.
中文: 我们推出了SALT,一种半自动标注工具,采用零样本学习范式直接处理LiDAR数据,在多种LiDAR数据集上实现了卓越的分割效果,并显著提升了标注效率。
English: We introduce SALT, a semi-automatic labeling tool that uses a zero-shot learning approach to directly process LiDAR data, achieving superior segmentation results and significantly improving annotation efficiency across diverse LiDAR datasets.

Authors:Nima Torbati, Anastasia Meshcheryakova, Ramona Woitek, Sepideh Hatamikia, Diana Mechtcheriakova, Amirreza Mahbod
Title: A Multi-Stage Auto-Context Deep Learning Framework for Tissue and Nuclei Segmentation and Classification in H&E-Stained Histological Images of Advanced Melanoma
Abstract:
Melanoma is the most lethal form of skin cancer, with an increasing incidence rate worldwide. Analyzing histological images of melanoma by localizing and classifying tissues and cell nuclei is considered the gold standard method for diagnosis and treatment options for patients. While many computerized approaches have been proposed for automatic analysis, most perform tissue-based analysis and nuclei (cell)-based analysis as separate tasks, which might be suboptimal. In this work, using the PUMA challenge dataset, we propose a novel multi-stage deep learning approach by combining tissue and nuclei information in a unified framework based on the auto-context concept to perform segmentation and classification in histological images of melanoma. Through pre-training and further post-processing, our approach achieved second and first place rankings in the PUMA challenge, with average micro Dice tissue score and summed nuclei F1-score of 73.40% for Track 1 and 63.48% for Track 2, respectively. Furthermore, through a comprehensive ablation study and additional evaluation on an external dataset, we demonstrated the effectiveness of the framework components as well as the generalization capabilities of the proposed approach. Our implementation for training and testing is available at: https://github.com/NimaTorbati/PumaSubmit
Chinese: 本研究提出了一种多阶段深度学习框架,通过整合组织和细胞核信息进行黑色素瘤诊断,在PUMA挑战赛中取得领先排名,并通过消融实验和外部验证证明了其优异的泛化能力。
English: This study introduces a multi-stage deep learning framework that integrates tissue and nuclei analysis for melanoma diagnosis, achieving top rankings in the PUMA challenge and demonstrating strong generalization through ablation studies and external validation.

Authors:Sebastian Springer, Andre Scaffidi, Maximilian Autenrieth, Gabriella Contardo, Alessandro Laio, Roberto Trotta, Heikki Haario
Title: Detecting Localized Density Anomalies in Multivariate Data via Coin-Flip Statistics
Abstract:
Detecting localized density differences in multivariate data is a crucial task in computational science. Such anomalies can indicate a critical system failure, lead to a groundbreaking scientific discovery, or reveal unexpected changes in data distribution. We introduce EagleEye, an anomaly detection method to compare two multivariate datasets with the aim of identifying local density anomalies, namely over- or under-densities affecting only localised regions of the feature space. Anomalies are detected by modelling, for each point, the ordered sequence of its neighbours' membership label as a coin-flipping process and monitoring deviations from the expected behaviour of such process. A unique advantage of our method is its ability to provide an accurate, entirely unsupervised estimate of the local signal purity. We demonstrate its effectiveness through experiments on both synthetic and real-world datasets. In synthetic data, EagleEye accurately detects anomalies in multiple dimensions even when they affect a tiny fraction of the data. When applied to a challenging resonant anomaly detection benchmark task in simulated Large Hadron Collider data, EagleEye successfully identifies particle decay events present in just 0.3% of the dataset. In global temperature data, EagleEye uncovers previously unidentified, geographically localised changes in temperature fields that occurred in the most recent years. Thanks to its key advantages of conceptual simplicity, computational efficiency, trivial parallelisation, and scalability, EagleEye is widely applicable across many fields.
中文: EagleEye是一种无监督异常检测方法,通过将邻域成员标签序列建模为抛硬币过程来识别多元数据集中的局部密度差异,在合成和真实数据实验中均能有效检测细微异常。
English: EagleEye is an unsupervised anomaly detection method that identifies localized density differences in multivariate datasets by modeling neighbor membership sequences as a coin-flipping process, demonstrating high accuracy in detecting subtle anomalies across synthetic and real-world applications.

Authors:Ruisheng Han, Kanglei Zhou, Amir Atapour-Abarghouei, Xiaohui Liang, Hubert P. H. Shum
Title: FineCausal: A Causal-Based Framework for Interpretable Fine-Grained Action Quality Assessment
Abstract:
Action quality assessment (AQA) is critical for evaluating athletic performance, informing training strategies, and ensuring safety in competitive sports. However, existing deep learning approaches often operate as black boxes and are vulnerable to spurious correlations, limiting both their reliability and interpretability. In this paper, we introduce FineCausal, a novel causal-based framework that achieves state-of-the-art performance on the FineDiving-HM dataset. Our approach leverages a Graph Attention Network-based causal intervention module to disentangle human-centric foreground cues from background confounders, and incorporates a temporal causal attention module to capture fine-grained temporal dependencies across action stages. This dual-module strategy enables FineCausal to generate detailed spatio-temporal representations that not only achieve state-of-the-art scoring performance but also provide transparent, interpretable feedback on which features drive the assessment. Despite its strong performance, FineCausal requires extensive expert knowledge to define causal structures and depends on high-quality annotations, challenges that we discuss and address as future research directions. Code is available at https://github.com/Harrison21/FineCausal.
Chinese: FineCausal提出了一种新颖的因果框架,利用图注意力和时序模块分离前景线索与背景干扰,在提升可解释性的同时实现了动作质量评估的最优性能。
English: FineCausal introduces a novel causal framework using graph attention and temporal modules to enhance interpretability and achieve state-of-the-art action quality assessment by disentangling foreground cues from background confounders.

Authors:Qihan Huang, Weilong Dai, Jinlong Liu, Wanggui He, Hao Jiang, Mingli Song, Jingyuan Chen, Chang Yao, Jie Song
Title: Boosting MLLM Reasoning with Text-Debiased Hint-GRPO
Abstract:
MLLM reasoning has drawn widespread research for its excellent problem-solving capability. Current reasoning methods fall into two types: PRM, which supervises the intermediate reasoning steps, and ORM, which supervises the final results. Recently, DeepSeek-R1 has challenged the traditional view that PRM outperforms ORM, which demonstrates strong generalization performance using an ORM method (i.e., GRPO). However, current MLLM's GRPO algorithms still struggle to handle challenging and complex multimodal reasoning tasks (e.g., mathematical reasoning). In this work, we reveal two problems that impede the performance of GRPO on the MLLM: Low data utilization and Text-bias. Low data utilization refers to that GRPO cannot acquire positive rewards to update the MLLM on difficult samples, and text-bias is a phenomenon that the MLLM bypasses image condition and solely relies on text condition for generation after GRPO training. To tackle these problems, this work proposes Hint-GRPO that improves data utilization by adaptively providing hints for samples of varying difficulty, and text-bias calibration that mitigates text-bias by calibrating the token prediction logits with image condition in test-time. Experiment results on three base MLLMs across eleven datasets demonstrate that our proposed methods advance the reasoning capability of original MLLM by a large margin, exhibiting superior performance to existing MLLM reasoning methods. Our code is available at https://github.com/hqhQAQ/Hint-GRPO.
中文摘要:本文提出的Hint-GRPO方法通过自适应提示和文本偏差校准,有效解决了GRPO在数据利用和图像依赖方面的不足,大幅提升了多模态大语言模型的推理能力。
English Summary: This paper introduces Hint-GRPO and text-bias calibration to overcome GRPO's limitations in low data utilization and text-bias, significantly enhancing multimodal large language models' reasoning performance across diverse datasets.

Authors:Diana Galvan-Sosa, Gabrielle Gaudeau, Pride Kavumba, Yunmeng Li, Hongyi gu, Zheng Yuan, Keisuke Sakaguchi, Paula Buttery
Title: Rubrik's Cube: Testing a New Rubric for Evaluating Explanations on the CUBE dataset
Abstract:
The performance and usability of Large-Language Models (LLMs) are driving their use in explanation generation tasks. However, despite their widespread adoption, LLM explanations have been found to be unreliable, making it difficult for users to distinguish good from bad explanations. To address this issue, we present Rubrik's CUBE, an education-inspired rubric and a dataset of 26k explanations, written and later quality-annotated using the rubric by both humans and six open- and closed-source LLMs. The CUBE dataset focuses on two reasoning and two language tasks, providing the necessary diversity for us to effectively test our proposed rubric. Using Rubrik, we find that explanations are influenced by both task and perceived difficulty. Low quality stems primarily from a lack of conciseness in LLM-generated explanations, rather than cohesion and word choice. The full dataset, rubric, and code are available at https://github.com/RubriksCube/rubriks_cube.
中文: 大型语言模型虽广泛用于生成解释,但其可靠性存疑,为此我们推出Rubrik's CUBE,这是一个基于教育理念的评分标准和包含2.6万条解释的数据集,用于评估不同任务中解释的质量。
English: Large-Language Models are increasingly used for generating explanations but often produce unreliable results, leading to the development of Rubrik's CUBE, a rubric and dataset to evaluate explanation quality across various tasks.

Authors:Yuqiao Tan, Shizhu He, Huanxuan Liao, Jun Zhao, Kang Liu
Title: Dynamic Parametric Retrieval Augmented Generation for Test-time Knowledge Enhancement
Abstract:
Retrieval-augmented generation (RAG) enhances large language models (LLMs) by retrieving relevant documents from external sources and incorporating them into the context. While it improves reliability by providing factual texts, it significantly increases inference costs as context length grows and introduces challenging issue of RAG hallucination, primarily caused by the lack of corresponding parametric knowledge in LLMs. An efficient solution is to enhance the knowledge of LLMs at test-time. Parametric RAG (PRAG) addresses this by embedding document into LLMs parameters to perform test-time knowledge enhancement, effectively reducing inference costs through offline training. However, its high training and storage costs, along with limited generalization ability, significantly restrict its practical adoption. To address these challenges, we propose Dynamic Parametric RAG (DyPRAG), a novel framework that leverages a lightweight parameter translator model to efficiently convert documents into parametric knowledge. DyPRAG not only reduces inference, training, and storage costs but also dynamically generates parametric knowledge, seamlessly enhancing the knowledge of LLMs and resolving knowledge conflicts in a plug-and-play manner at test-time. Extensive experiments on multiple datasets demonstrate the effectiveness and generalization capabilities of DyPRAG, offering a powerful and practical RAG paradigm which enables superior knowledge fusion and mitigates RAG hallucination in real-world applications. Our code is available at https://github.com/Trae1ounG/DyPRAG.
中文: 检索增强生成(RAG)通过引入外部文档提升大语言模型性能,但面临高推理成本和幻觉问题;动态参数化RAG(DyPRAG)通过轻量参数转换器将文档高效转化为参数知识,在测试时降低各类成本并增强模型泛化能力。
English: Retrieval-augmented generation (RAG) improves large language models by incorporating external documents but faces issues like high inference costs and hallucinations, which the proposed Dynamic Parametric RAG (DyPRAG) addresses by efficiently converting documents into parametric knowledge to reduce costs and enhance performance at test-time.

Authors:Wenkang Ji, Huaben Chen, Mingyang Chen, Guobin Zhu, Lufeng Xu, Roderich Groß, Rui Zhou, Ming Cao, Shiyu Zhao
Title: GenSwarm: Scalable Multi-Robot Code-Policy Generation and Deployment via Language Models
Abstract:
The development of control policies for multi-robot systems traditionally follows a complex and labor-intensive process, often lacking the flexibility to adapt to dynamic tasks. This has motivated research on methods to automatically create control policies. However, these methods require iterative processes of manually crafting and refining objective functions, thereby prolonging the development cycle. This work introduces \textit{GenSwarm}, an end-to-end system that leverages large language models to automatically generate and deploy control policies for multi-robot tasks based on simple user instructions in natural language. As a multi-language-agent system, GenSwarm achieves zero-shot learning, enabling rapid adaptation to altered or unseen tasks. The white-box nature of the code policies ensures strong reproducibility and interpretability. With its scalable software and hardware architectures, GenSwarm supports efficient policy deployment on both simulated and real-world multi-robot systems, realizing an instruction-to-execution end-to-end functionality that could prove valuable for robotics specialists and non-specialists alike.The code of the proposed GenSwarm system is available online: https://github.com/WindyLab/GenSwarm.
Chinese: GenSwarm是一种端到端系统,利用大型语言模型根据简单的自然语言指令自动生成并部署多机器人任务的控制策略,实现了零样本学习,并能在仿真和现实应用中高效部署。
English: GenSwarm is an end-to-end system that uses large language models to automatically generate and deploy control policies for multi-robot tasks from simple natural language instructions, enabling zero-shot learning and efficient deployment in both simulations and real-world applications.

Authors:SeonYeong Lee, EonSeung Seong, DongEon Lee, SiYeoul Lee, Yubin Cho, Chunsu Park, Seonho Kim, MinKyung Seo, YoungSin Ko, MinWoo Kim
Title: Learned Image Compression and Restoration for Digital Pathology
Abstract:
Digital pathology images play a crucial role in medical diagnostics, but their ultra-high resolution and large file sizes pose significant challenges for storage, transmission, and real-time visualization. To address these issues, we propose CLERIC, a novel deep learning-based image compression framework designed specifically for whole slide images (WSIs). CLERIC integrates a learnable lifting scheme and advanced convolutional techniques to enhance compression efficiency while preserving critical pathological details. Our framework employs a lifting-scheme transform in the analysis stage to decompose images into low- and high-frequency components, enabling more structured latent representations. These components are processed through parallel encoders incorporating Deformable Residual Blocks (DRB) and Recurrent Residual Blocks (R2B) to improve feature extraction and spatial adaptability. The synthesis stage applies an inverse lifting transform for effective image reconstruction, ensuring high-fidelity restoration of fine-grained tissue structures. We evaluate CLERIC on a digital pathology image dataset and compare its performance against state-of-the-art learned image compression (LIC) models. Experimental results demonstrate that CLERIC achieves superior rate-distortion (RD) performance, significantly reducing storage requirements while maintaining high diagnostic image quality. Our study highlights the potential of deep learning-based compression in digital pathology, facilitating efficient data management and long-term storage while ensuring seamless integration into clinical workflows and AI-assisted diagnostic systems. Code and models are available at: https://github.com/pnu-amilab/CLERIC.
中文:CLERIC是一种基于深度学习的新型数字病理图像压缩框架,通过可学习的提升方案和先进卷积技术,在提升存储效率的同时保持了诊断所需的图像质量。
English: CLERIC is a novel deep learning-based compression framework for digital pathology images that enhances storage efficiency while preserving diagnostic quality through advanced lifting schemes and convolutional techniques.

Authors:Nicolas Gillis, Margherita Porcelli, Giovanni Seraghiti
Title: An extrapolated and provably convergent algorithm for nonlinear matrix decomposition with the ReLU function
Abstract:
Nonlinear matrix decomposition (NMD) with the ReLU function, denoted ReLU-NMD, is the following problem: given a sparse, nonnegative matrix $X$ and a factorization rank $r$, identify a rank-$r$ matrix $Θ$ such that $X\approx \max(0,Θ)$. This decomposition finds application in data compression, matrix completion with entries missing not at random, and manifold learning. The standard ReLU-NMD model minimizes the least squares error, that is, $\|X - \max(0,Θ)\|_F^2$. The corresponding optimization problem is nondifferentiable and highly nonconvex. This motivated Saul to propose an alternative model, Latent-ReLU-NMD, where a latent variable $Z$ is introduced and satisfies $\max(0,Z)=X$ while minimizing $\|Z - Θ\|_F^2$ (``A nonlinear matrix decomposition for mining the zeros of sparse data'', SIAM J. Math. Data Sci., 2022). Our first contribution is to show that the two formulations may yield different low-rank solutions $Θ$; in particular, we show that Latent-ReLU-NMD can be ill-posed when ReLU-NMD is not, meaning that there are instances in which the infimum of Latent-ReLU-NMD is not attained while that of ReLU-NMD is. We also consider another alternative model, called 3B-ReLU-NMD, which parameterizes $Θ=WH$, where $W$ has $r$ columns and $H$ has $r$ rows, allowing one to get rid of the rank constraint in Latent-ReLU-NMD. Our second contribution is to prove the convergence of a block coordinate descent (BCD) applied to 3B-ReLU-NMD and referred to as BCD-NMD. Our third contribution is a novel extrapolated variant of BCD-NMD, dubbed eBCD-NMD, which we prove is also convergent under mild assumptions. We illustrate the significant acceleration effect of eBCD-NMD compared to BCD-NMD, and also show that eBCD-NMD performs well against the state of the art on synthetic and real-world data sets.
中文: 本文分析了基于ReLU的非线性矩阵分解,揭示了Latent-ReLU-NMD模型相比标准ReLU-NMD可能不适定,并提出了一种可证明收敛的外推块坐标下降法(eBCD-NMD),在合成和真实数据集上显著提升了性能。
English: This paper analyzes ReLU-based nonlinear matrix decomposition (NMD), revealing that the Latent-ReLU-NMD model can be ill-posed compared to standard ReLU-NMD, and introduces a provably convergent extrapolated block coordinate descent method (eBCD-NMD) that significantly accelerates performance on synthetic and real-world datasets.

Authors:Emmanouil Georgios Lionis, Jia-Huei Ju
Title: On the Reproducibility of Learned Sparse Retrieval Adaptations for Long Documents
Abstract:
Document retrieval is one of the most challenging tasks in Information Retrieval. It requires handling longer contexts, often resulting in higher query latency and increased computational overhead. Recently, Learned Sparse Retrieval (LSR) has emerged as a promising approach to address these challenges. Some have proposed adapting the LSR approach to longer documents by aggregating segmented document using different post-hoc methods, including n-grams and proximity scores, adjusting representations, and learning to ensemble all signals. In this study, we aim to reproduce and examine the mechanisms of adapting LSR for long documents. Our reproducibility experiments confirmed the importance of specific segments, with the first segment consistently dominating document retrieval performance. Furthermore, We re-evaluate recently proposed methods -- ExactSDM and SoftSDM -- across varying document lengths, from short (up to 2 segments) to longer (3+ segments). We also designed multiple analyses to probe the reproduced methods and shed light on the impact of global information on adapting LSR to longer contexts. The complete code and implementation for this project is available at: https://github.com/lionisakis/Reproducibilitiy-lsr-long.
中文: 本研究复现并分析了针对长文档的习得稀疏检索适应方法,证实了首段落对检索性能的主导作用,并通过评估不同文档长度下的方法来揭示全局信息的影响。
English: This study reproduces and analyzes Learned Sparse Retrieval (LSR) adaptation for long documents, confirming the first segment's dominance in retrieval performance and evaluating methods across varying document lengths to understand global information's impact.

Authors:Yingwei Ma, Yongbin Li, Yihong Dong, Xue Jiang, Rongyu Cao, Jue Chen, Fei Huang, Binhua Li
Title: Thinking Longer, Not Larger: Enhancing Software Engineering Agents via Scaling Test-Time Compute
Abstract:
Recent advancements in software engineering agents have demonstrated promising capabilities in automating program improvements. However, their reliance on closed-source or resource-intensive models introduces significant deployment challenges in private environments, prompting a critical question: \textit{How can personally deployable open-source LLMs achieve comparable code reasoning performance?} To this end, we propose a unified Test-Time Compute scaling framework that leverages increased inference-time computation instead of larger models. Our framework incorporates two complementary strategies: internal TTC and external TTC. Internally, we introduce a \textit{development-contextualized trajectory synthesis} method leveraging real-world software repositories to bootstrap multi-stage reasoning processes, such as fault localization and patch generation. We further enhance trajectory quality through rejection sampling, rigorously evaluating trajectories along accuracy and complexity. Externally, we propose a novel \textit{development-process-based search} strategy guided by reward models and execution verification. This approach enables targeted computational allocation at critical development decision points, overcoming limitations of existing "end-point only" verification methods. Evaluations on SWE-bench Verified demonstrate our \textbf{32B model achieves a 46\% issue resolution rate}, surpassing significantly larger models such as DeepSeek R1 671B and OpenAI o1. Additionally, we provide the empirical validation of the test-time scaling phenomenon within SWE agents, revealing that \textbf{models dynamically allocate more tokens to increasingly challenging problems}, effectively enhancing reasoning capabilities. We publicly release all training data, models, and code to facilitate future research. https://github.com/yingweima2022/SWE-Reasoner
中文摘要:本研究提出了一种测试时计算扩展框架,通过内部轨迹合成和外部搜索策略增强代码推理能力,使32B模型在SWE-bench上实现46%的问题解决率,显著超越了规模更大的模型。
English Summary: This study introduces a Test-Time Compute scaling framework that enables a 32B model to achieve a 46% issue resolution rate on SWE-bench, surpassing much larger models by enhancing code reasoning through internal trajectory synthesis and external search strategies.

Authors:Bosung Kim, Kyuhwan Lee, Isu Jeong, Jungmin Cheon, Yeojin Lee, Seulki Lee
Title: On-device Sora: Enabling Training-Free Diffusion-based Text-to-Video Generation for Mobile Devices
Abstract:
We present On-device Sora, the first model training-free solution for diffusion-based on-device text-to-video generation that operates efficiently on smartphone-grade devices. To address the challenges of diffusion-based text-to-video generation on computation- and memory-limited mobile devices, the proposed On-device Sora applies three novel techniques to pre-trained video generative models. First, Linear Proportional Leap (LPL) reduces the excessive denoising steps required in video diffusion through an efficient leap-based approach. Second, Temporal Dimension Token Merging (TDTM) minimizes intensive token-processing computation in attention layers by merging consecutive tokens along the temporal dimension. Third, Concurrent Inference with Dynamic Loading (CI-DL) dynamically partitions large models into smaller blocks and loads them into memory for concurrent model inference, effectively addressing the challenges of limited device memory. We implement On-device Sora on the iPhone 15 Pro, and the experimental evaluations show that it is capable of generating high-quality videos on the device, comparable to those produced by high-end GPUs. These results show that On-device Sora enables efficient and high-quality video generation on resource-constrained mobile devices. We envision the proposed On-device Sora as a significant first step toward democratizing state-of-the-art generative technologies, enabling video generation on commodity mobile and embedded devices without resource-intensive re-training for model optimization (compression). The code implementation is available at a GitHub repository(https://github.com/eai-lab/On-device-Sora).
中文摘要:On-device Sora 提出首个无需训练的移动端文本到视频生成方案,通过线性比例跳跃、时间维度令牌合并和动态加载并发推理三项创新技术,在智能手机上实现与高端GPU相媲美的高质量视频生成,有效克服了移动设备计算和内存限制。
English Summary: On-device Sora introduces a training-free solution for efficient text-to-video generation on smartphones through three novel techniques—Linear Proportional Leap, Temporal Dimension Token Merging, and Concurrent Inference with Dynamic Loading—enabling high-quality video creation comparable to high-end GPUs on resource-constrained devices.

Authors:Fabian L. Thiemann, Thiago Reschützegger, Massimiliano Esposito, Tseden Taddese, Juan D. Olarte-Plata, Fausto Martelli
Title: Force-Free Molecular Dynamics Through Autoregressive Equivariant Networks
Abstract:
Molecular dynamics (MD) simulations play a crucial role in scientific research. Yet their computational cost often limits the timescales and system sizes that can be explored. Most data-driven efforts have been focused on reducing the computational cost of accurate interatomic forces required for solving the equations of motion. Despite their success, however, these machine learning interatomic potentials (MLIPs) are still bound to small time-steps. In this work, we introduce TrajCast, a transferable and data-efficient framework based on autoregressive equivariant message passing networks that directly updates atomic positions and velocities lifting the constraints imposed by traditional numerical integration. We benchmark our framework across various systems, including a small molecule, crystalline material, and bulk liquid, demonstrating excellent agreement with reference MD simulations for structural, dynamical, and energetic properties. Depending on the system, TrajCast allows for forecast intervals up to $30\times$ larger than traditional MD time-steps, generating over 15 ns of trajectory data per day for a solid with more than 4,000 atoms. By enabling efficient large-scale simulations over extended timescales, TrajCast can accelerate materials discovery and explore physical phenomena beyond the reach of traditional simulations and experiments. An open-source implementation of TrajCast is accessible under https://github.com/IBM/trajcast.
中文摘要:TrajCast采用自回归等变消息传递网络直接更新原子位置和速度,相比传统分子动力学方法将时间步长提升高达30倍,在保持精度的同时实现了更高效的大尺度材料模拟。
English Summary: TrajCast is a novel framework using autoregressive equivariant networks to directly update atomic positions and velocities, enabling simulations with time-steps up to 30 times larger than traditional molecular dynamics while maintaining accuracy across various material systems.

Authors:Haoran Shen, Peixian Zhuang, Jiahao Kou, Yuxin Zeng, Haoying Xu, Jiangyun Li
Title: MGD-SAM2: Multi-view Guided Detail-enhanced Segment Anything Model 2 for High-Resolution Class-agnostic Segmentation
Abstract:
Segment Anything Models (SAMs), as vision foundation models, have demonstrated remarkable performance across various image analysis tasks. Despite their strong generalization capabilities, SAMs encounter challenges in fine-grained detail segmentation for high-resolution class-independent segmentation (HRCS), due to the limitations in the direct processing of high-resolution inputs and low-resolution mask predictions, and the reliance on accurate manual prompts. To address these limitations, we propose MGD-SAM2 which integrates SAM2 with multi-view feature interaction between a global image and local patches to achieve precise segmentation. MGD-SAM2 incorporates the pre-trained SAM2 with four novel modules: the Multi-view Perception Adapter (MPAdapter), the Multi-view Complementary Enhancement Module (MCEM), the Hierarchical Multi-view Interaction Module (HMIM), and the Detail Refinement Module (DRM). Specifically, we first introduce MPAdapter to adapt the SAM2 encoder for enhanced extraction of local details and global semantics in HRCS images. Then, MCEM and HMIM are proposed to further exploit local texture and global context by aggregating multi-view features within and across multi-scales. Finally, DRM is designed to generate gradually restored high-resolution mask predictions, compensating for the loss of fine-grained details resulting from directly upsampling the low-resolution prediction maps. Experimental results demonstrate the superior performance and strong generalization of our model on multiple high-resolution and normal-resolution datasets. Code will be available at https://github.com/sevenshr/MGD-SAM2.
中文: 提出的MGD-SAM2模型通过集成多视角特征交互和细节优化模块,显著提升了SAM2在高分辨率图像上的分割精度,在多个数据集上展现出优越性能。
English: The proposed MGD-SAM2 model enhances SAM2's segmentation precision for high-resolution images by integrating multi-view feature interactions and a detail refinement module, achieving superior performance across diverse datasets.

Authors:Lu Fan, Jiashu Pu, Rongsheng Zhang, Xiao-Ming Wu
Title: LANID: LLM-assisted New Intent Discovery
Abstract:
Task-oriented Dialogue Systems (TODS) often face the challenge of encountering new intents. New Intent Discovery (NID) is a crucial task that aims to identify these novel intents while maintaining the capability to recognize existing ones. Previous efforts to adapt TODS to new intents have struggled with inadequate semantic representation or have depended on external knowledge, which is often not scalable or flexible. Recently, Large Language Models (LLMs) have demonstrated strong zero-shot capabilities; however, their scale can be impractical for real-world applications that involve extensive queries. To address the limitations of existing NID methods by leveraging LLMs, we propose LANID, a framework that enhances the semantic representation of lightweight NID encoders with the guidance of LLMs. Specifically, LANID employs the $K$-nearest neighbors and Density-Based Spatial Clustering of Applications with Noise (DBSCAN) algorithms to sample selective utterance pairs from the training set. It then queries an LLM to ascertain the relationships between these pairs. The data produced from this process is utilized to design a contrastive fine-tuning task, which is then used to train a small encoder with a contrastive triplet loss. Our experimental results demonstrate the efficacy of the proposed method across three distinct NID datasets, surpassing strong baselines in both unsupervised and semi-supervised settings. Our code is available at https://github.com/floatSDSDS/LANID.
Chinese: LANID框架利用大型语言模型(LLMs)生成选择性话语对,并通过对比性微调增强轻量级新意图发现(NID)编码器的语义表示,在无监督和半监督设置下,于多个数据集上实现了卓越性能。
English: The LANID framework enhances lightweight New Intent Discovery (NID) encoders by leveraging Large Language Models (LLMs) to generate selective utterance pairs and employing contrastive fine-tuning, achieving superior performance across multiple datasets in both unsupervised and semi-supervised settings.

Authors:Yoonshik Kim, Jaeyoon Jung
Title: KOFFVQA: An Objectively Evaluated Free-form VQA Benchmark for Large Vision-Language Models in the Korean Language
Abstract:
The recent emergence of Large Vision-Language Models(VLMs) has resulted in a variety of different benchmarks for evaluating such models. Despite this, we observe that most existing evaluation methods suffer from the fact that they either require the model to choose from pre-determined responses, sacrificing open-endedness, or evaluate responses using a judge model, resulting in subjective and unreliable evaluation. In addition, we observe a lack of benchmarks for VLMs in the Korean language, which are necessary as a separate metric from more common English language benchmarks, as the performance of generative language models can differ significantly based on the language being used. Therefore, we present KOFFVQA, a general-purpose free-form visual question answering benchmark in the Korean language for the evaluation of VLMs. Our benchmark consists of 275 carefully crafted questions each paired with an image and grading criteria covering 10 different aspects of VLM performance. The grading criteria eliminate the problem of unreliability by allowing the judge model to grade each response based on a pre-determined set of rules. By defining the evaluation criteria in an objective manner, even a small open-source model can be used to evaluate models on our benchmark reliably. In addition to evaluating a large number of existing VLMs on our benchmark, we also experimentally verify that our method of using pre-existing grading criteria for evaluation is much more reliable than existing methods. Our evaluation code is available at https://github.com/maum-ai/KOFFVQA
中文: 现有视觉语言模型评估方法常限制开放性回答或依赖主观评判,因此我们推出KOFFVQA韩语开放式视觉问答基准,通过预设评分标准实现客观可靠的模型评估。
English: Current VLM evaluations often sacrifice open-endedness or rely on subjective judge models, so we introduce KOFFVQA—a Korean free-form visual question answering benchmark with objective grading criteria to ensure reliable evaluation.

Authors:Hongwei Ren, Xiaopeng Lin, Hongxiang Huang, Yue Zhou, Bojun Cheng
Title: Exploring Temporal Dynamics in Event-based Eye Tracker
Abstract:
Eye-tracking is a vital technology for human-computer interaction, especially in wearable devices such as AR, VR, and XR. The realization of high-speed and high-precision eye-tracking using frame-based image sensors is constrained by their limited temporal resolution, which impairs the accurate capture of rapid ocular dynamics, such as saccades and blinks. Event cameras, inspired by biological vision systems, are capable of perceiving eye movements with extremely low power consumption and ultra-high temporal resolution. This makes them a promising solution for achieving high-speed, high-precision tracking with rich temporal dynamics. In this paper, we propose TDTracker, an effective eye-tracking framework that captures rapid eye movements by thoroughly modeling temporal dynamics from both implicit and explicit perspectives. TDTracker utilizes 3D convolutional neural networks to capture implicit short-term temporal dynamics and employs a cascaded structure consisting of a Frequency-aware Module, GRU, and Mamba to extract explicit long-term temporal dynamics. Ultimately, a prediction heatmap is used for eye coordinate regression. Experimental results demonstrate that TDTracker achieves state-of-the-art (SOTA) performance on the synthetic SEET dataset and secured Third place in the CVPR event-based eye-tracking challenge 2025. Our code is available at https://github.com/rhwxmx/TDTracker.
Chinese: TDTracker是一种创新的眼动追踪框架,通过结合3D卷积神经网络和级联结构来捕捉短期与长期时间动态,利用事件相机实现了高速高精度的眼动追踪,并取得了最先进的性能表现。
English: TDTracker is an innovative eye-tracking framework that employs 3D CNNs and a cascaded structure to capture both short-term and long-term temporal dynamics, achieving state-of-the-art performance in high-speed, high-precision eye movement tracking with event cameras.

Authors:Yi Liu, Wengen Li, Jihong Guan, Shuigeng Zhou, Yichao Zhang
Title: Effective Cloud Removal for Remote Sensing Images by an Improved Mean-Reverting Denoising Model with Elucidated Design Space
Abstract:
Cloud removal (CR) remains a challenging task in remote sensing image processing. Although diffusion models (DM) exhibit strong generative capabilities, their direct applications to CR are suboptimal, as they generate cloudless images from random noise, ignoring inherent information in cloudy inputs. To overcome this drawback, we develop a new CR model EMRDM based on mean-reverting diffusion models (MRDMs) to establish a direct diffusion process between cloudy and cloudless images. Compared to current MRDMs, EMRDM offers a modular framework with updatable modules and an elucidated design space, based on a reformulated forward process and a new ordinary differential equation (ODE)-based backward process. Leveraging our framework, we redesign key MRDM modules to boost CR performance, including restructuring the denoiser via a preconditioning technique, reorganizing the training process, and improving the sampling process by introducing deterministic and stochastic samplers. To achieve multi-temporal CR, we further develop a denoising network for simultaneously denoising sequential images. Experiments on mono-temporal and multi-temporal datasets demonstrate the superior performance of EMRDM. Our code is available at https://github.com/Ly403/EMRDM.
中文: 本文提出EMRDM模型,基于均值回复扩散方法,通过模块化框架和改进流程直接将多云图像转为无云图像,在实验中展现出卓越性能。
English: The paper introduces EMRDM, a cloud removal model based on mean-reverting diffusion, which directly transforms cloudy images into cloudless ones through a modular framework and enhanced processes, demonstrating superior performance in experiments.

Authors:Haitao Tian, Junyang Li, Chenxing Wang, Helong Jiang
Title: Detail-aware multi-view stereo network for depth estimation
Abstract:
Multi-view stereo methods have achieved great success for depth estimation based on the coarse-to-fine depth learning frameworks, however, the existing methods perform poorly in recovering the depth of object boundaries and detail regions. To address these issues, we propose a detail-aware multi-view stereo network (DA-MVSNet) with a coarse-to-fine framework. The geometric depth clues hidden in the coarse stage are utilized to maintain the geometric structural relationships between object surfaces and enhance the expressive capability of image features. In addition, an image synthesis loss is employed to constrain the gradient flow for detailed regions and further strengthen the supervision of object boundaries and texture-rich areas. Finally, we propose an adaptive depth interval adjustment strategy to improve the accuracy of object reconstruction. Extensive experiments on the DTU and Tanks & Temples datasets demonstrate that our method achieves competitive results. The code is available at https://github.com/wsmtht520-/DAMVSNet.
中文: 提出的DA-MVSNet通过利用几何深度线索、采用图像合成损失和自适应深度间隔策略,有效改善了物体边界和细节区域的深度恢复效果,在基准数据集上取得了优异性能。
English: The proposed DA-MVSNet addresses poor depth recovery at object boundaries and detailed regions by leveraging geometric depth clues, employing an image synthesis loss, and implementing an adaptive depth interval strategy, achieving competitive results on benchmark datasets.

Authors:Aly Lidayan, Yuqing Du, Eliza Kosoy, Maria Rufova, Pieter Abbeel, Alison Gopnik
Title: Intrinsically-Motivated Humans and Agents in Open-World Exploration
Abstract:
What drives exploration? Understanding intrinsic motivation is a long-standing challenge in both cognitive science and artificial intelligence; numerous objectives have been proposed and used to train agents, yet there remains a gap between human and agent exploration. We directly compare adults, children, and AI agents in a complex open-ended environment, Crafter, and study how common intrinsic objectives: Entropy, Information Gain, and Empowerment, relate to their behavior. We find that only Entropy and Empowerment are consistently positively correlated with human exploration progress, indicating that these objectives may better inform intrinsic reward design for agents. Furthermore, across agents and humans we observe that Entropy initially increases rapidly, then plateaus, while Empowerment increases continuously, suggesting that state diversity may provide more signal in early exploration, while advanced exploration should prioritize control. Finally, we find preliminary evidence that private speech utterances, and particularly goal verbalizations, may aid exploration in children. Our data is available at https://github.com/alyd/humans_in_crafter_data.
中文摘要:本研究在Crafter环境中对比人类与AI的探索行为,发现熵和赋权是与人类探索进展正相关的关键内在动机,初步证据表明儿童的自言自语可能促进其探索能力。
English Summary: The study compares human and AI exploration in the Crafter environment, finding that Entropy and Empowerment are key intrinsic motivators positively linked to human-like exploration, with preliminary evidence suggesting children's verbalizations may enhance their exploratory behavior.

Authors:Anirudh Satheesh, Keenan Powell
Title: A Constrained Multi-Agent Reinforcement Learning Approach to Autonomous Traffic Signal Control
Abstract:
Traffic congestion in modern cities is exacerbated by the limitations of traditional fixed-time traffic signal systems, which fail to adapt to dynamic traffic patterns. Adaptive Traffic Signal Control (ATSC) algorithms have emerged as a solution by dynamically adjusting signal timing based on real-time traffic conditions. However, the main limitation of such methods is that they are not transferable to environments under real-world constraints, such as balancing efficiency, minimizing collisions, and ensuring fairness across intersections. In this paper, we view the ATSC problem as a constrained multi-agent reinforcement learning (MARL) problem and propose a novel algorithm named Multi-Agent Proximal Policy Optimization with Lagrange Cost Estimator (MAPPO-LCE) to produce effective traffic signal control policies. Our approach integrates the Lagrange multipliers method to balance rewards and constraints, with a cost estimator for stable adjustment. We also introduce three constraints on the traffic network: GreenTime, GreenSkip, and PhaseSkip, which penalize traffic policies that do not conform to real-world scenarios. Our experimental results on three real-world datasets demonstrate that MAPPO-LCE outperforms three baseline MARL algorithms by across all environments and traffic constraints (improving on MAPPO by 12.60%, IPPO by 10.29%, and QTRAN by 13.10%). Our results show that constrained MARL is a valuable tool for traffic planners to deploy scalable and efficient ATSC methods in real-world traffic networks. We provide code at https://github.com/Asatheesh6561/MAPPO-LCE.
中文: 本文提出MAPPO-LCE算法,通过约束多智能体强化学习动态调整交通信号,在平衡现实约束的同时显著优于现有方法,为实际交通网络提供了可扩展的解决方案。
English: The paper proposes MAPPO-LCE, a constrained multi-agent reinforcement learning algorithm that dynamically adjusts traffic signals while balancing real-world constraints, demonstrating superior performance over existing methods in real-world traffic networks.

Authors:Jiahao Li, Yiqiang Chen, Yunbing Xing, Yang Gu, Xiangyuan Lan
Title: A Survey on Unlearnable Data
Abstract:
Unlearnable data (ULD) has emerged as an innovative defense technique to prevent machine learning models from learning meaningful patterns from specific data, thus protecting data privacy and security. By introducing perturbations to the training data, ULD degrades model performance, making it difficult for unauthorized models to extract useful representations. Despite the growing significance of ULD, existing surveys predominantly focus on related fields, such as adversarial attacks and machine unlearning, with little attention given to ULD as an independent area of study. This survey fills that gap by offering a comprehensive review of ULD, examining unlearnable data generation methods, public benchmarks, evaluation metrics, theoretical foundations and practical applications. We compare and contrast different ULD approaches, analyzing their strengths, limitations, and trade-offs related to unlearnability, imperceptibility, efficiency and robustness. Moreover, we discuss key challenges, such as balancing perturbation imperceptibility with model degradation and the computational complexity of ULD generation. Finally, we highlight promising future research directions to advance the effectiveness and applicability of ULD, underscoring its potential to become a crucial tool in the evolving landscape of data protection in machine learning.
中文摘要:本综述首次将不可学习数据(ULD)作为独立领域进行全面探讨,系统梳理了其生成方法、评估指标及应用场景,并剖析了现有技术的局限性及未来研究方向,以推动机器学习数据保护能力的发展。
English Summary: This survey provides a comprehensive review of unlearnable data (ULD) as an independent defense technique, covering generation methods, benchmarks, and applications while analyzing trade-offs and future directions to enhance data protection in machine learning.

Authors:Zhengren Wang, Jiayang Yu, Dongsheng Ma, Zhe Chen, Yu Wang, Zhiyu Li, Feiyu Xiong, Yanfeng Wang, Weinan E, Linpeng Tang, Wentao Zhang
Title: RARE: Retrieval-Augmented Reasoning Modeling
Abstract:
Domain-specific intelligence demands specialized knowledge and sophisticated reasoning for problem-solving, posing significant challenges for large language models (LLMs) that struggle with knowledge hallucination and inadequate reasoning capabilities under constrained parameter budgets. Inspired by Bloom's Taxonomy in educational theory, we propose Retrieval-Augmented Reasoning Modeling (RARE), a novel paradigm that decouples knowledge storage from reasoning optimization. RARE externalizes domain knowledge to retrievable sources and internalizes domain-specific reasoning patterns during training. Specifically, by injecting retrieved knowledge into training prompts with masked losses, RARE transforms learning objectives from rote memorization to contextualized reasoning. It enables models to bypass parameter-intensive memorization and prioritize the development of higher-order cognitive processes. Extensive experiments demonstrate that lightweight RARE-trained models (e.g., Llama-3.1-8B) could achieve state-of-the-art performance, surpassing retrieval-augmented GPT-4 and DeepSeek-R1 up to approximately 20\% accuracy. RARE establishes a paradigm shift where maintainable external knowledge bases synergize with compact, reasoning-optimized models, collectively driving more scalable domain-specific intelligence.
中文: RARE提出了一种将知识存储与推理优化分离的新范式,通过外部化领域知识和内部化推理模式,使轻量级模型实现顶尖性能,准确率超越大型模型高达20%。
English: RARE introduces a paradigm that separates knowledge storage from reasoning optimization, enabling lightweight models to achieve state-of-the-art performance by externalizing domain knowledge and internalizing reasoning patterns, surpassing larger models by up to 20% accuracy.

Authors:Tianming Liang, Haichao Jiang, Wei-Shi Zheng, Jian-Fang Hu
Title: ReferDINO-Plus: 2nd Solution for 4th PVUW MeViS Challenge at CVPR 2025
Abstract:
Referring Video Object Segmentation (RVOS) aims to segment target objects throughout a video based on a text description. This task has attracted increasing attention in the field of computer vision due to its promising applications in video editing and human-agent interaction. Recently, ReferDINO has demonstrated promising performance in this task by adapting object-level vision-language knowledge from pretrained foundational image models. In this report, we further enhance its capabilities by incorporating the advantages of SAM2 in mask quality and object consistency. In addition, to effectively balance performance between single-object and multi-object scenarios, we introduce a conditional mask fusion strategy that adaptively fuses the masks from ReferDINO and SAM2. Our solution, termed ReferDINO-Plus, achieves 60.43 \(\mathcal{J}\&\mathcal{F}\) on MeViS test set, securing 2nd place in the MeViS PVUW challenge at CVPR 2025. The code is available at: https://github.com/iSEE-Laboratory/ReferDINO-Plus.
中文: ReferDINO-Plus通过融合SAM2的优质掩码和条件掩码融合策略,提升了视频对象分割性能,在CVPR 2025 MeViS挑战赛中以60.43分获得第二名。
English: ReferDINO-Plus enhances video object segmentation by integrating SAM2's superior mask quality and a conditional fusion strategy, achieving second place in the CVPR 2025 MeViS challenge with a score of 60.43.

Authors:Ashim Dahal, Saydul Akbar Murad, Nick Rahimi
Title: Embedding Shift Dissection on CLIP: Effects of Augmentations on VLM's Representation Learning
Abstract:
Understanding the representation shift on Vision Language Models like CLIP under different augmentations provides valuable insights on Mechanistic Interpretability. In this study, we show the shift on CLIP's embeddings on 9 common augmentation techniques: noise, blur, color jitter, scale and rotate, flip, elastic and perspective transforms, random brightness and contrast, and coarse dropout of pixel blocks. We scrutinize the embedding shifts under similarity on attention map, patch, edge, detail preservation, cosine similarity, L2 distance, pairwise distance and dendrogram clusters and provide qualitative analysis on sample images. Our findings suggest certain augmentations like noise, perspective transform and shift scaling have higher degree of drastic impact on embedding shift. This study provides a concrete foundation for future work on VLM's robustness for mechanical interpretation and adversarial data defense. The code implementation for this study can be found on \href{https://github.com/ashimdahal/clip-shift-analysis}{https://github.com/ashimdahal/clip-shift-analysis}.
Chinese: 本研究分析了九种常见图像增强技术对CLIP嵌入表示的影响,发现噪声、透视变换和尺度缩放会导致最显著的嵌入偏移,为视觉语言模型的机制可解释性和鲁棒性研究奠定了基础。
English: This study analyzes how nine common image augmentations affect CLIP's embedding shifts, revealing that noise, perspective transforms, and scaling cause the most significant changes, thereby advancing mechanistic interpretability and robustness in vision-language models.

Authors:Haofei Kuang, Yue Pan, Xingguang Zhong, Louis Wiesmann, Jens Behley, Cyrill Stachniss
Title: Improving Indoor Localization Accuracy by Using an Efficient Implicit Neural Map Representation
Abstract:
Globally localizing a mobile robot in a known map is often a foundation for enabling robots to navigate and operate autonomously. In indoor environments, traditional Monte Carlo localization based on occupancy grid maps is considered the gold standard, but its accuracy is limited by the representation capabilities of the occupancy grid map. In this paper, we address the problem of building an effective map representation that allows to accurately perform probabilistic global localization. To this end, we propose an implicit neural map representation that is able to capture positional and directional geometric features from 2D LiDAR scans to efficiently represent the environment and learn a neural network that is able to predict both, the non-projective signed distance and a direction-aware projective distance for an arbitrary point in the mapped environment. This combination of neural map representation with a light-weight neural network allows us to design an efficient observation model within a conventional Monte Carlo localization framework for pose estimation of a robot in real time. We evaluated our approach to indoor localization on a publicly available dataset for global localization and the experimental results indicate that our approach is able to more accurately localize a mobile robot than other localization approaches employing occupancy or existing neural map representations. In contrast to other approaches employing an implicit neural map representation for 2D LiDAR localization, our approach allows to perform real-time pose tracking after convergence and near real-time global localization. The code of our approach is available at: https://github.com/PRBonn/enm-mcl.
Chinese Summary: 本文提出了一种隐式神经地图表示方法,通过神经网络从激光雷达数据中预测几何特征,改进了蒙特卡洛定位算法,实现了室内环境中实时且精确的机器人全局定位。
English Summary: This paper introduces an implicit neural map representation that enhances Monte Carlo localization by using neural networks to predict geometric features from LiDAR data, enabling real-time and accurate global robot localization in indoor environments.

Authors:Ivan Anokhin, Rishav Rishav, Matthew Riemer, Stephen Chung, Irina Rish, Samira Ebrahimi Kahou
Title: Handling Delay in Real-Time Reinforcement Learning
Abstract:
Real-time reinforcement learning (RL) introduces several challenges. First, policies are constrained to a fixed number of actions per second due to hardware limitations. Second, the environment may change while the network is still computing an action, leading to observational delay. The first issue can partly be addressed with pipelining, leading to higher throughput and potentially better policies. However, the second issue remains: if each neuron operates in parallel with an execution time of $τ$, an $N$-layer feed-forward network experiences observation delay of $τN$. Reducing the number of layers can decrease this delay, but at the cost of the network's expressivity. In this work, we explore the trade-off between minimizing delay and network's expressivity. We present a theoretically motivated solution that leverages temporal skip connections combined with history-augmented observations. We evaluate several architectures and show that those incorporating temporal skip connections achieve strong performance across various neuron execution times, reinforcement learning algorithms, and environments, including four Mujoco tasks and all MinAtar games. Moreover, we demonstrate parallel neuron computation can accelerate inference by 6-350% on standard hardware. Our investigation into temporal skip connections and parallel computations paves the way for more efficient RL agents in real-time setting.
中文摘要:本研究通过引入时间跳跃连接和历史增强观测的方法,解决了实时强化学习中观测延迟与网络表达能力之间的权衡问题,该方案在不同环境中均表现出色,并通过并行神经元计算实现了显著的推理加速。
English Summary: This study addresses the trade-off between observational delay and network expressivity in real-time reinforcement learning by proposing a solution using temporal skip connections and history-augmented observations, which achieves robust performance across various settings and enables significant inference acceleration through parallel neuron computation.

Authors:Chenglong Lu, Shen Liang, Xuewei Wang, Wei Wang
Title: Reinforcement Learning-based Token Pruning in Vision Transformers: A Markov Game Approach
Abstract:
Vision Transformers (ViTs) have computational costs scaling quadratically with the number of tokens, calling for effective token pruning policies. Most existing policies are handcrafted, lacking adaptivity to varying inputs. Moreover, they fail to consider the sequential nature of token pruning across multiple layers. In this work, for the first time (as far as we know), we exploit Reinforcement Learning (RL) to data-adaptively learn a pruning policy. Formulating token pruning as a sequential decision-making problem, we model it as a Markov Game and utilize Multi-Agent Proximal Policy Optimization (MAPPO) where each agent makes an individualized pruning decision for a single token. We also develop reward functions that enable simultaneous collaboration and competition of these agents to balance efficiency and accuracy. On the well-known ImageNet-1k dataset, our method improves the inference speed by up to 44% while incurring only a negligible accuracy drop of 0.4%. The source code is available at https://github.com/daashuai/rl4evit.
视觉变换器因令牌数量呈二次方增长而计算成本高昂,为此我们采用强化学习中的MAPPO方法自适应修剪令牌,在几乎不影响精度的情况下将推理速度提升44%。
Vision Transformers face high computational costs due to quadratic token scaling, so we introduce a reinforcement learning approach using MAPPO to adaptively prune tokens, boosting inference speed by 44% with minimal accuracy loss.

Authors:Maofu Liu, Xin Jiang, Xiaokang Zhang
Title: CADFormer: Fine-Grained Cross-modal Alignment and Decoding Transformer for Referring Remote Sensing Image Segmentation
Abstract:
Referring Remote Sensing Image Segmentation (RRSIS) is a challenging task, aiming to segment specific target objects in remote sensing (RS) images based on a given language expression. Existing RRSIS methods typically employ coarse-grained unidirectional alignment approaches to obtain multimodal features, and they often overlook the critical role of language features as contextual information during the decoding process. Consequently, these methods exhibit weak object-level correspondence between visual and language features, leading to incomplete or erroneous predicted masks, especially when handling complex expressions and intricate RS image scenes. To address these challenges, we propose a fine-grained cross-modal alignment and decoding Transformer, CADFormer, for RRSIS. Specifically, we design a semantic mutual guidance alignment module (SMGAM) to achieve both vision-to-language and language-to-vision alignment, enabling comprehensive integration of visual and textual features for fine-grained cross-modal alignment. Furthermore, a textual-enhanced cross-modal decoder (TCMD) is introduced to incorporate language features during decoding, using refined textual information as context to enhance the relationship between cross-modal features. To thoroughly evaluate the performance of CADFormer, especially for inconspicuous targets in complex scenes, we constructed a new RRSIS dataset, called RRSIS-HR, which includes larger high-resolution RS image patches and semantically richer language expressions. Extensive experiments on the RRSIS-HR dataset and the popular RRSIS-D dataset demonstrate the effectiveness and superiority of CADFormer. Datasets and source codes will be available at https://github.com/zxk688.
中文摘要:该研究提出CADFormer模型,通过细粒度跨模态对齐和解码技术,结合语言特征增强遥感图像分割的准确性,并在新的高分辨率数据集上验证了其优越性能。
English Summary: The study introduces CADFormer, a Transformer-based model with fine-grained cross-modal alignment and decoding to enhance Referring Remote Sensing Image Segmentation by improving object-level correspondence and integrating language features, validated on a new high-resolution dataset.

Authors:Junzhu Mao, Yang Shen, Jinyang Guo, Yazhou Yao, Xiansheng Hua
Title: Efficient Token Compression for Vision Transformer with Spatial Information Preserved
Abstract:
Token compression is essential for reducing the computational and memory requirements of transformer models, enabling their deployment in resource-constrained environments. In this work, we propose an efficient and hardware-compatible token compression method called Prune and Merge. Our approach integrates token pruning and merging operations within transformer models to achieve layer-wise token compression. By introducing trainable merge and reconstruct matrices and utilizing shortcut connections, we efficiently merge tokens while preserving important information and enabling the restoration of pruned tokens. Additionally, we introduce a novel gradient-weighted attention scoring mechanism that computes token importance scores during the training phase, eliminating the need for separate computations during inference and enhancing compression efficiency. We also leverage gradient information to capture the global impact of tokens and automatically identify optimal compression structures. Extensive experiments on the ImageNet-1k and ADE20K datasets validate the effectiveness of our approach, achieving significant speed-ups with minimal accuracy degradation compared to state-of-the-art methods. For instance, on DeiT-Small, we achieve a 1.64$\times$ speed-up with only a 0.2\% drop in accuracy on ImageNet-1k. Moreover, by compressing segmenter models and comparing with existing methods, we demonstrate the superior performance of our approach in terms of efficiency and effectiveness. Code and models have been made available at https://github.com/NUST-Machine-Intelligence-Laboratory/prune_and_merge.
Chinese: 本文提出了一种名为“剪枝与合并”的高效令牌压缩方法,通过可训练的合并与重构矩阵及梯度加权注意力机制,在ImageNet-1k和ADE20K数据集上实现了显著加速且精度损失极小。
English: This paper introduces Prune and Merge, an efficient token compression method for transformers that integrates pruning and merging with trainable matrices and a gradient-weighted attention mechanism, achieving significant speed-ups on ImageNet-1k and ADE20K with minimal accuracy loss.

Authors:Maofu Liu, Jiahui Liu, Xiaokang Zhang
Title: Semantic-Spatial Feature Fusion with Dynamic Graph Refinement for Remote Sensing Image Captioning
Abstract:
Remote sensing image captioning aims to generate semantically accurate descriptions that are closely linked to the visual features of remote sensing images. Existing approaches typically emphasize fine-grained extraction of visual features and capturing global information. However, they often overlook the complementary role of textual information in enhancing visual semantics and face challenges in precisely locating objects that are most relevant to the image context. To address these challenges, this paper presents a semantic-spatial feature fusion with dynamic graph refinement (SFDR) method, which integrates the semantic-spatial feature fusion (SSFF) and dynamic graph feature refinement (DGFR) modules. The SSFF module utilizes a multi-level feature representation strategy by leveraging pre-trained CLIP features, grid features, and ROI features to integrate rich semantic and spatial information. In the DGFR module, a graph attention network captures the relationships between feature nodes, while a dynamic weighting mechanism prioritizes objects that are most relevant to the current scene and suppresses less significant ones. Therefore, the proposed SFDR method significantly enhances the quality of the generated descriptions. Experimental results on three benchmark datasets demonstrate the effectiveness of the proposed method. The source code will be available at https://github.com/zxk688}{https://github.com/zxk688.
中文: 本文提出SFDR方法,通过语义空间特征融合和动态图优化模块,有效结合视觉与文本信息,显著提升了遥感图像描述的准确性。
English: This paper introduces the SFDR method, which combines semantic-spatial feature fusion and dynamic graph refinement to enhance remote sensing image captioning by better integrating visual and textual information for more accurate descriptions.

Authors:Yuhang Yang, Ke Fan, Shangkun Sun, Hongxiang Li, Ailing Zeng, FeiLin Han, Wei Zhai, Wei Liu, Yang Cao, Zheng-Jun Zha
Title: VideoGen-Eval: Agent-based System for Video Generation Evaluation
Abstract:
The rapid advancement of video generation has rendered existing evaluation systems inadequate for assessing state-of-the-art models, primarily due to simple prompts that cannot showcase the model's capabilities, fixed evaluation operators struggling with Out-of-Distribution (OOD) cases, and misalignment between computed metrics and human preferences. To bridge the gap, we propose VideoGen-Eval, an agent evaluation system that integrates LLM-based content structuring, MLLM-based content judgment, and patch tools designed for temporal-dense dimensions, to achieve a dynamic, flexible, and expandable video generation evaluation. Additionally, we introduce a video generation benchmark to evaluate existing cutting-edge models and verify the effectiveness of our evaluation system. It comprises 700 structured, content-rich prompts (both T2V and I2V) and over 12,000 videos generated by 20+ models, among them, 8 cutting-edge models are selected as quantitative evaluation for the agent and human. Extensive experiments validate that our proposed agent-based evaluation system demonstrates strong alignment with human preferences and reliably completes the evaluation, as well as the diversity and richness of the benchmark.
中文摘要:视频生成技术的快速发展凸显了现有评估系统的不足,为此我们提出了VideoGen-Eval评估系统,该系统整合了大语言模型和多模态大语言模型技术,通过结构化提示和专用工具实现动态灵活的视频生成评估,并构建包含丰富样本的基准数据集验证了其与人类评估的高度一致性。
English Summary: The rapid progress in video generation has exposed limitations in current evaluation systems, leading to the development of VideoGen-Eval, an agent-based system that integrates LLM and MLLM technologies with specialized tools to provide dynamic and human-aligned assessments, validated through extensive experiments on a comprehensive benchmark.

Authors:Aimira Baitieva, Yacine Bouaouni, Alexandre Briot, Dick Ameln, Souhaiel Khalfaoui, Samet Akcay
Title: Beyond Academic Benchmarks: Critical Analysis and Best Practices for Visual Industrial Anomaly Detection
Abstract:
Anomaly detection (AD) is essential for automating visual inspection in manufacturing. This field of computer vision is rapidly evolving, with increasing attention towards real-world applications. Meanwhile, popular datasets are typically produced in controlled lab environments with artificially created defects, unable to capture the diversity of real production conditions. New methods often fail in production settings, showing significant performance degradation or requiring impractical computational resources. This disconnect between academic results and industrial viability threatens to misdirect visual anomaly detection research. This paper makes three key contributions: (1) we demonstrate the importance of real-world datasets and establish benchmarks using actual production data, (2) we provide a fair comparison of existing SOTA methods across diverse tasks by utilizing metrics that are valuable for practical applications, and (3) we present a comprehensive analysis of recent advancements in this field by discussing important challenges and new perspectives for bridging the academia-industry gap. The code is publicly available at https://github.com/abc-125/viad-benchmark
中文摘要:本文通过建立真实场景基准、公平评估先进方法并提供全面分析,旨在弥合视觉异常检测研究中学术成果与工业应用之间的鸿沟。
English Summary: This paper addresses the disconnect between academic anomaly detection research and industrial needs by establishing real-world benchmarks, fairly evaluating state-of-the-art methods, and providing comprehensive analysis to bridge the gap between theory and practical applications.

Authors:Xin Zuo, Jiaran Jiang, Jifeng Shen, Wankou Yang
Title: Improving underwater semantic segmentation with underwater image quality attention and muti-scale aggregation attention
Abstract:
Underwater image understanding is crucial for both submarine navigation and seabed exploration. However, the low illumination in underwater environments degrades the imaging quality, which in turn seriously deteriorates the performance of underwater semantic segmentation, particularly for outlining the object region boundaries. To tackle this issue, we present UnderWater SegFormer (UWSegFormer), a transformer-based framework for semantic segmentation of low-quality underwater images. Firstly, we propose the Underwater Image Quality Attention (UIQA) module. This module enhances the representation of highquality semantic information in underwater image feature channels through a channel self-attention mechanism. In order to address the issue of loss of imaging details due to the underwater environment, the Multi-scale Aggregation Attention(MAA) module is proposed. This module aggregates sets of semantic features at different scales by extracting discriminative information from high-level features,thus compensating for the semantic loss of detail in underwater objects. Finally, during training, we introduce Edge Learning Loss (ELL) in order to enhance the model's learning of underwater object edges and improve the model's prediction accuracy. Experiments conducted on the SUIM and DUT-USEG (DUT) datasets have demonstrated that the proposed method has advantages in terms of segmentation completeness, boundary clarity, and subjective perceptual details when compared to SOTA methods. In addition, the proposed method achieves the highest mIoU of 82.12 and 71.41 on the SUIM and DUT datasets, respectively. Code will be available at https://github.com/SAWRJJ/UWSegFormer.
中文: UWSegFormer框架通过引入水下图像质量注意力、多尺度聚合模块和边缘学习损失,有效解决了水下低照度导致的语义分割边界模糊问题,在多个数据集上取得了最优性能。
English: The proposed UWSegFormer framework enhances underwater semantic segmentation by integrating quality attention, multi-scale feature aggregation, and edge learning loss to address low illumination challenges, achieving state-of-the-art performance on benchmark datasets.

Authors:Ximu Zeng, Liwei Deng, Penghao Chen, Xu Chen, Han Su, Kai Zheng
Title: LIRA: A Learning-based Query-aware Partition Framework for Large-scale ANN Search
Abstract:
Approximate nearest neighbor search is fundamental in information retrieval. Previous partition-based methods enhance search efficiency by probing partial partitions, yet they face two common issues. In the query phase, a common strategy is to probe partitions based on the distance ranks of a query to partition centroids, which inevitably probes irrelevant partitions as it ignores data distribution. In the partition construction phase, all partition-based methods face the boundary problem that separates a query's nearest neighbors to multiple partitions, resulting in a long-tailed kNN distribution and degrading the optimal nprobe (i.e., the number of probing partitions). To address this gap, we propose LIRA, a LearnIng-based queRy-aware pArtition framework. Specifically, we propose a probing model to directly probe the partitions containing the kNN of a query, which can reduce probing waste and allow for query-aware probing with nprobe individually. Moreover, we incorporate the probing model into a learning-based redundancy strategy to mitigate the adverse impact of the long-tailed kNN distribution on search efficiency. Extensive experiments on real-world vector datasets demonstrate the superiority of LIRA in the trade-off among accuracy, latency, and query fan-out. The codes are available at https://github.com/SimoneZeng/LIRA-ANN-search.
中文:LIRA是一种基于学习的查询感知分区框架,通过直接定位相关分区并缓解长尾kNN分布,有效提升了近似最近邻搜索的效率和准确性。
English: LIRA is a learning-based query-aware partition framework that improves approximate nearest neighbor search by directly targeting relevant partitions and mitigating the long-tailed kNN distribution, enhancing efficiency and accuracy.

Authors:Haiduo Huang, Yadong Zhang, Pengju Ren
Title: KernelDNA: Dynamic Kernel Sharing via Decoupled Naive Adapters
Abstract:
Dynamic convolution enhances model capacity by adaptively combining multiple kernels, yet faces critical trade-offs: prior works either (1) incur significant parameter overhead by scaling kernel numbers linearly, (2) compromise inference speed through complex kernel interactions, or (3) struggle to jointly optimize dynamic attention and static kernels. We also observe that pre-trained Convolutional Neural Networks (CNNs) exhibit inter-layer redundancy akin to that in Large Language Models (LLMs). Specifically, dense convolutional layers can be efficiently replaced by derived ``child" layers generated from a shared ``parent" convolutional kernel through an adapter. To address these limitations and implement the weight-sharing mechanism, we propose a lightweight convolution kernel plug-in, named KernelDNA. It decouples kernel adaptation into input-dependent dynamic routing and pre-trained static modulation, ensuring both parameter efficiency and hardware-friendly inference. Unlike existing dynamic convolutions that expand parameters via multi-kernel ensembles, our method leverages cross-layer weight sharing and adapter-based modulation, enabling dynamic kernel specialization without altering the standard convolution structure. This design preserves the native computational efficiency of standard convolutions while enhancing representation power through input-adaptive kernel adjustments. Experiments on image classification and dense prediction tasks demonstrate that KernelDNA achieves state-of-the-art accuracy-efficiency balance among dynamic convolution variants. Our codes are available at https://github.com/haiduo/KernelDNA.
中文: KernelDNA提出了一种轻量级卷积插件,通过跨层权重共享和基于适配器的调制实现动态核适应,在不改变标准卷积结构的情况下达到了最优的精度-效率平衡。
English: KernelDNA introduces a lightweight convolution plug-in that enables dynamic kernel adaptation through cross-layer weight sharing and adapter-based modulation, achieving state-of-the-art accuracy-efficiency balance without altering standard convolution structures.

Authors:Hang Guo, Yawei Li, Taolin Zhang, Jiangshan Wang, Tao Dai, Shu-Tao Xia, Luca Benini
Title: FastVAR: Linear Visual Autoregressive Modeling via Cached Token Pruning
Abstract:
Visual Autoregressive (VAR) modeling has gained popularity for its shift towards next-scale prediction. However, existing VAR paradigms process the entire token map at each scale step, leading to the complexity and runtime scaling dramatically with image resolution. To address this challenge, we propose FastVAR, a post-training acceleration method for efficient resolution scaling with VARs. Our key finding is that the majority of latency arises from the large-scale step where most tokens have already converged. Leveraging this observation, we develop the cached token pruning strategy that only forwards pivotal tokens for scale-specific modeling while using cached tokens from previous scale steps to restore the pruned slots. This significantly reduces the number of forwarded tokens and improves the efficiency at larger resolutions. Experiments show the proposed FastVAR can further speedup FlashAttention-accelerated VAR by 2.7$\times$ with negligible performance drop of <1%. We further extend FastVAR to zero-shot generation of higher resolution images. In particular, FastVAR can generate one 2K image with 15GB memory footprints in 1.5s on a single NVIDIA 3090 GPU. Code is available at https://github.com/csguoh/FastVAR.
Chinese: FastVAR提出了一种训练后加速方法,通过缓存令牌剪枝策略仅在前向传播中处理关键令牌,显著降低了大规模步骤的延迟,实现了2.7倍加速且性能损失可忽略。
English: FastVAR introduces a post-training acceleration method that uses cached token pruning to significantly reduce latency by forwarding only pivotal tokens at large-scale steps, achieving a 2.7× speedup with minimal performance loss.

Authors:Hongxiang Jiang, Jihao Yin, Qixiong Wang, Jiaqi Feng, Guo Chen
Title: EagleVision: Object-level Attribute Multimodal LLM for Remote Sensing
Abstract:
Recent advances in multimodal large language models (MLLMs) have demonstrated impressive results in various visual tasks. However, in remote sensing (RS), high resolution and small proportion of objects pose challenges to existing MLLMs, which struggle with object-centric tasks, particularly in precise localization and fine-grained attribute description for each object. These RS MLLMs have not yet surpassed classical visual perception models, as they only provide coarse image understanding, leading to limited gains in real-world scenarios. To address this gap, we establish EagleVision, an MLLM tailored for remote sensing that excels in object detection and attribute comprehension. Equipped with the Attribute Disentangle module, EagleVision learns disentanglement vision tokens to express distinct attributes. To support object-level visual-language alignment, we construct EVAttrs-95K, the first large-scale object attribute understanding dataset in RS for instruction tuning, along with a novel evaluation benchmark, EVBench. EagleVision achieves state-of-the-art performance on both fine-grained object detection and object attribute understanding tasks, highlighting the mutual promotion between detection and understanding capabilities in MLLMs. The code, model, data, and demo will be available at https://github.com/XiangTodayEatsWhat/EagleVision.
Chinese: EagleVision是专为遥感领域设计的多模态大语言模型,通过引入属性解耦模块和新数据集,在目标检测和属性理解方面取得突破,实现了细粒度任务的最优性能。
English: EagleVision is a multimodal large language model specifically designed for remote sensing that overcomes challenges in object detection and attribute understanding by introducing an Attribute Disentangle module and a new dataset, achieving state-of-the-art performance in fine-grained tasks.

Authors:Hyunsik Jeon, Satoshi Koide, Yu Wang, Zhankui He, Julian McAuley
Title: LaViC: Adapting Large Vision-Language Models to Visually-Aware Conversational Recommendation
Abstract:
Conversational recommender systems engage users in dialogues to refine their needs and provide more personalized suggestions. Although textual information suffices for many domains, visually driven categories such as fashion or home decor potentially require detailed visual information related to color, style, or design. To address this challenge, we propose LaViC (Large Vision-Language Conversational Recommendation Framework), a novel approach that integrates compact image representations into dialogue-based recommendation systems. LaViC leverages a large vision-language model in a two-stage process: (1) visual knowledge self-distillation, which condenses product images from hundreds of tokens into a small set of visual tokens in a self-distillation manner, significantly reducing computational overhead, and (2) recommendation prompt tuning, which enables the model to incorporate both dialogue context and distilled visual tokens, providing a unified mechanism for capturing textual and visual features. To support rigorous evaluation of visually-aware conversational recommendation, we construct a new dataset by aligning Reddit conversations with Amazon product listings across multiple visually oriented categories (e.g., fashion, beauty, and home). This dataset covers realistic user queries and product appearances in domains where visual details are crucial. Extensive experiments demonstrate that LaViC significantly outperforms text-only conversational recommendation methods and open-source vision-language baselines. Moreover, LaViC achieves competitive or superior accuracy compared to prominent proprietary baselines (e.g., GPT-3.5-turbo, GPT-4o-mini, and GPT-4o), demonstrating the necessity of explicitly using visual data for capturing product attributes and showing the effectiveness of our vision-language integration. Our code and dataset are available at https://github.com/jeon185/LaViC.
中文:LaViC框架通过两阶段流程整合精炼视觉标记与对话上下文,在视觉导向领域显著优于纯文本推荐方法,并与主流专有模型达到同等或更高准确率。
English: The LaViC framework enhances conversational recommender systems by integrating distilled visual tokens and dialogue context through a two-stage process, significantly outperforming text-only methods and achieving competitive accuracy with proprietary models in visually-oriented domains.

Authors:Lu Yu, Haoyu Han, Zhe Tao, Hantao Yao, Changsheng Xu
Title: Language Guided Concept Bottleneck Models for Interpretable Continual Learning
Abstract:
Continual learning (CL) aims to enable learning systems to acquire new knowledge constantly without forgetting previously learned information. CL faces the challenge of mitigating catastrophic forgetting while maintaining interpretability across tasks. Most existing CL methods focus primarily on preserving learned knowledge to improve model performance. However, as new information is introduced, the interpretability of the learning process becomes crucial for understanding the evolving decision-making process, yet it is rarely explored. In this paper, we introduce a novel framework that integrates language-guided Concept Bottleneck Models (CBMs) to address both challenges. Our approach leverages the Concept Bottleneck Layer, aligning semantic consistency with CLIP models to learn human-understandable concepts that can generalize across tasks. By focusing on interpretable concepts, our method not only enhances the models ability to retain knowledge over time but also provides transparent decision-making insights. We demonstrate the effectiveness of our approach by achieving superior performance on several datasets, outperforming state-of-the-art methods with an improvement of up to 3.06% in final average accuracy on ImageNet-subset. Additionally, we offer concept visualizations for model predictions, further advancing the understanding of interpretable continual learning.
中文: 本文提出了一种新颖的持续学习框架,通过整合语言引导的概念瓶颈模型来学习跨任务的人类可理解概念,不仅有效缓解灾难性遗忘并提升模型性能(在ImageNet子集上准确率最高提升3.06%),同时提供了透明的决策解释能力。
English: This paper introduces a novel continual learning framework that integrates language-guided Concept Bottleneck Models to simultaneously mitigate catastrophic forgetting and enhance interpretability by learning human-understandable concepts across tasks, achieving superior performance with up to 3.06% accuracy improvement on ImageNet-subset.

Authors:Björn Möller, Lucas Görnhardt, Tim Fingscheidt
Title: A Lightweight Image Super-Resolution Transformer Trained on Low-Resolution Images Only
Abstract:
Transformer architectures prominently lead single-image super-resolution (SISR) benchmarks, reconstructing high-resolution (HR) images from their low-resolution (LR) counterparts. Their strong representative power, however, comes with a higher demand for training data compared to convolutional neural networks (CNNs). For many real-world SR applications, the availability of high-quality HR training images is not given, sparking interest in LR-only training methods. The LR-only SISR benchmark mimics this condition by allowing only low-resolution (LR) images for model training. For a 4x super-resolution, this effectively reduces the amount of available training data to 6.25% of the HR image pixels, which puts the employment of a data-hungry transformer model into question. In this work, we are the first to utilize a lightweight vision transformer model with LR-only training methods addressing the unsupervised SISR LR-only benchmark. We adopt and configure a recent LR-only training method from microscopy image super-resolution to macroscopic real-world data, resulting in our multi-scale training method for bicubic degradation (MSTbic). Furthermore, we compare it with reference methods and prove its effectiveness both for a transformer and a CNN model. We evaluate on the classic SR benchmark datasets Set5, Set14, BSD100, Urban100, and Manga109, and show superior performance over state-of-the-art (so far: CNN-based) LR-only SISR methods. The code is available on GitHub: https://github.com/ifnspaml/SuperResolutionMultiscaleTraining.
中文: 本研究提出了一种轻量级视觉变换器及多尺度训练方法,用于无监督单图像超分辨率任务,在仅使用低分辨率图像训练的条件下超越了现有基于卷积神经网络的方法性能。
English: This study introduces a lightweight vision transformer with a multi-scale training method for unsupervised single-image super-resolution, achieving superior performance on LR-only benchmarks compared to existing CNN-based approaches.

Authors:Reza Esfandiarpoor, George Zerveas, Ruochen Zhang, Macton Mgonzo, Carsten Eickhoff, Stephen H. Bach
Title: Beyond Contrastive Learning: Synthetic Data Enables List-wise Training with Multiple Levels of Relevance
Abstract:
Recent advancements in large language models (LLMs) have allowed the augmentation of information retrieval (IR) pipelines with synthetic data in various ways. Yet, the main training paradigm remains: contrastive learning with binary relevance labels and the InfoNCE loss, where one positive document is compared against one or more negatives. This objective treats all documents that are not explicitly annotated as relevant on an equally negative footing, regardless of their actual degree of relevance, thus (a) missing subtle nuances that are useful for ranking and (b) being susceptible to annotation noise. To overcome this limitation, in this work we forgo real training documents and annotations altogether and use open-source LLMs to directly generate synthetic documents that answer real user queries according to several different levels of relevance. This fully synthetic ranking context of graduated relevance, together with an appropriate list-wise loss (Wasserstein distance), enables us to train dense retrievers in a way that better captures the ranking task. Experiments on various IR datasets show that our proposed approach outperforms conventional training with InfoNCE by a large margin. Without using any real documents for training, our dense retriever significantly outperforms the same retriever trained through self-supervision. More importantly, it matches the performance of the same retriever trained on real, labeled training documents of the same dataset, while being more robust to distribution shift and clearly outperforming it when evaluated zero-shot on the BEIR dataset collection.
中文摘要:本研究提出了一种新方法,通过使用大语言模型生成具有分级相关性的全合成文档,结合列表式Wasserstein损失函数训练密集检索器,该方法显著优于传统对比学习方法,在达到与真实标注数据训练相当性能的同时,对分布偏移表现出更强的鲁棒性。
English Summary: This study introduces a novel approach to training dense retrievers by generating fully synthetic documents with graduated relevance levels using LLMs, which, combined with a list-wise Wasserstein loss, significantly outperforms traditional contrastive learning methods and matches the performance of training on real labeled data while offering greater robustness to distribution shifts.

Authors:Alessio Borgi, Luca Maiano, Irene Amerini
Title: Z-SASLM: Zero-Shot Style-Aligned SLI Blending Latent Manipulation
Abstract:
We introduce Z-SASLM, a Zero-Shot Style-Aligned SLI (Spherical Linear Interpolation) Blending Latent Manipulation pipeline that overcomes the limitations of current multi-style blending methods. Conventional approaches rely on linear blending, assuming a flat latent space leading to suboptimal results when integrating multiple reference styles. In contrast, our framework leverages the non-linear geometry of the latent space by using SLI Blending to combine weighted style representations. By interpolating along the geodesic on the hypersphere, Z-SASLM preserves the intrinsic structure of the latent space, ensuring high-fidelity and coherent blending of diverse styles - all without the need for fine-tuning. We further propose a new metric, Weighted Multi-Style DINO ViT-B/8, designed to quantitatively evaluate the consistency of the blended styles. While our primary focus is on the theoretical and practical advantages of SLI Blending for style manipulation, we also demonstrate its effectiveness in a multi-modal content fusion setting through comprehensive experimental studies. Experimental results show that Z-SASLM achieves enhanced and robust style alignment. The implementation code can be found at: https://github.com/alessioborgi/Z-SASLM.
中文: Z-SASLM提出了一种基于球面线性插值的零样本风格混合方法,通过沿超球面测地线插值克服传统线性混合缺陷,无需微调即可实现高质量多风格融合,并设计了新的量化评估指标。
English: Z-SASLM introduces a zero-shot style blending pipeline using spherical linear interpolation to overcome linear blending limitations, achieving high-fidelity multi-style fusion without fine-tuning while proposing a new metric for quantitative evaluation.

Authors:Yiqian Wu, Yujie Liu, Yi Yin, Muhan Zeng, Zhentao Ye, Xin Zhang, Yingfei Xiong, Lu Zhang
Title: SmartFL: Semantics Based Probabilistic Fault Localization
Abstract:
Testing-based fault localization has been a research focus in software engineering in the past decades. It localizes faulty program elements based on a set of passing and failing test executions. Since whether a fault could be triggered and detected by a test is related to program semantics, it is crucial to model program semantics in fault localization approaches. Existing approaches either consider the full semantics of the program (e.g., mutation-based fault localization and angelic debugging), leading to scalability issues, or ignore the semantics of the program (e.g., spectrum-based fault localization), leading to imprecise localization results. Our key idea is: by modeling only the correctness of program values but not their full semantics, a balance could be reached between effectiveness and scalability. To realize this idea, we introduce a probabilistic model by efficient approximation of program semantics and several techniques to address scalability challenges. Our approach, SmartFL(SeMantics bAsed pRobabilisTic Fault Localization), is evaluated on a real-world dataset, Defects4J 2.0. The top-1 statement-level accuracy of our approach is {14\%}, which improves 130\% over the best SBFL and MBFL methods. The average time cost is {205} seconds per fault, which is half of SBFL methods. After combining our approach with existing approaches using the CombineFL framework, the performance of the combined approach is significantly boosted by an average of 10\% on top-1, top-3, and top-5 accuracy compared to state-of-the-art combination methods.
中文: SmartFL通过概率模型近似程序语义,在故障定位的准确性与可扩展性之间取得平衡,其准确率比现有最佳方法提升130%,同时将平均耗时减半。
English: SmartFL introduces a probabilistic model that approximates program semantics to balance effectiveness and scalability, achieving a 130% improvement in fault localization accuracy over existing methods while reducing time costs by half.

Authors:Marc-Antoine Lavoie, Anas Mahmoud, Steven L. Waslander
Title: Large Self-Supervised Models Bridge the Gap in Domain Adaptive Object Detection
Abstract:
The current state-of-the-art methods in domain adaptive object detection (DAOD) use Mean Teacher self-labelling, where a teacher model, directly derived as an exponential moving average of the student model, is used to generate labels on the target domain which are then used to improve both models in a positive loop. This couples learning and generating labels on the target domain, and other recent works also leverage the generated labels to add additional domain alignment losses. We believe this coupling is brittle and excessively constrained: there is no guarantee that a student trained only on source data can generate accurate target domain labels and initiate the positive feedback loop, and much better target domain labels can likely be generated by using a large pretrained network that has been exposed to much more data. Vision foundational models are exactly such models, and they have shown impressive task generalization capabilities even when frozen. We want to leverage these models for DAOD and introduce DINO Teacher, which consists of two components. First, we train a new labeller on source data only using a large frozen DINOv2 backbone and show it generates more accurate labels than Mean Teacher. Next, we align the student's source and target image patch features with those from a DINO encoder, driving source and target representations closer to the generalizable DINO representation. We obtain state-of-the-art performance on multiple DAOD datasets. Code available at https://github.com/TRAILab/DINO_Teacher
中文: 作者提出DINO Teacher方法,利用大型冻结DINOv2骨干网络生成更准确的目标域标签,并将学生模型特征与DINO表征对齐,在多个域自适应目标检测数据集上实现了最先进的性能。
English: The authors propose DINO Teacher, a novel method for domain adaptive object detection that leverages a large frozen DINOv2 backbone to generate more accurate target domain labels and align student features with DINO representations, achieving state-of-the-art performance on multiple datasets.

Authors:Vincent Gbouna Zakka, Zhuangzhuang Dai, Luis J. Manso
Title: Action Recognition in Real-World Ambient Assisted Living Environment
Abstract:
The growing ageing population and their preference to maintain independence by living in their own homes require proactive strategies to ensure safety and support. Ambient Assisted Living (AAL) technologies have emerged to facilitate ageing in place by offering continuous monitoring and assistance within the home. Within AAL technologies, action recognition plays a crucial role in interpreting human activities and detecting incidents like falls, mobility decline, or unusual behaviours that may signal worsening health conditions. However, action recognition in practical AAL applications presents challenges, including occlusions, noisy data, and the need for real-time performance. While advancements have been made in accuracy, robustness to noise, and computation efficiency, achieving a balance among them all remains a challenge. To address this challenge, this paper introduces the Robust and Efficient Temporal Convolution network (RE-TCN), which comprises three main elements: Adaptive Temporal Weighting (ATW), Depthwise Separable Convolutions (DSC), and data augmentation techniques. These elements aim to enhance the model's accuracy, robustness against noise and occlusion, and computational efficiency within real-world AAL contexts. RE-TCN outperforms existing models in terms of accuracy, noise and occlusion robustness, and has been validated on four benchmark datasets: NTU RGB+D 60, Northwestern-UCLA, SHREC'17, and DHG-14/28. The code is publicly available at: https://github.com/Gbouna/RE-TCN
Chinese: 本文针对环境辅助生活(AAL)中行为识别的挑战,提出了鲁棒高效时序卷积网络(RE-TCN),通过自适应时序加权和深度可分离卷积等组件提升精度、抗噪性和计算效率,并在多个基准数据集上验证了其优越性能。
English: To address the challenges of action recognition in Ambient Assisted Living (AAL), this paper proposes the Robust and Efficient Temporal Convolution network (RE-TCN), which enhances accuracy, noise robustness, and computational efficiency through innovative components and has been validated on multiple benchmark datasets.

Authors:Pengyu Chen, Sicheng Wang, Cuizhen Wang, Senrong Wang, Beiao Huang, Lu Huang, Zhe Zang
Title: A GAN-Enhanced Deep Learning Framework for Rooftop Detection from Historical Aerial Imagery
Abstract:
Precise detection of rooftops from historical aerial imagery is essential for analyzing long-term urban development and human settlement patterns. Nonetheless, black-and-white analog photographs present considerable challenges for modern object detection frameworks due to their limited spatial resolution, absence of color information, and archival degradation. To address these challenges, this research introduces a two-stage image enhancement pipeline based on Generative Adversarial Networks (GANs): image colorization utilizing DeOldify, followed by super-resolution enhancement with Real-ESRGAN. The enhanced images were subsequently employed to train and evaluate rooftop detection models, including Faster R-CNN, DETReg, and YOLOv11n. The results demonstrate that the combination of colorization with super-resolution significantly enhances detection performance, with YOLOv11n achieving a mean Average Precision (mAP) exceeding 85\%. This signifies an enhancement of approximately 40\% over the original black-and-white images and 20\% over images enhanced solely through colorization. The proposed method effectively bridges the gap between archival imagery and contemporary deep learning techniques, facilitating more reliable extraction of building footprints from historical aerial photographs. Code and resources for reproducing our results are publicly available at \href{https://github.com/Pengyu-gis/Historical-Aerial-Photos}{github.com/Pengyu-gis/Historical-Aerial-Photos}.
中文: 本研究提出了一种基于生成对抗网络的两阶段图像增强方法,通过着色和超分辨率处理显著提升了历史航拍图像中屋顶检测的精度,使YOLOv11n模型的平均精度均值超过85%,有效弥合了档案影像与现代深度学习技术之间的差距。
English: This study develops a two-stage GAN-based enhancement method combining colorization and super-resolution to significantly improve rooftop detection in historical aerial images, achieving over 85% mAP with YOLOv11n and bridging archival imagery with modern deep learning techniques.

Authors:Shota Hirose, Kazuki Kotoyori, Kasidis Arunruangsirilert, Fangzheng Lin, Heming Sun, Jiro Katto
Title: Real-time Video Prediction With Fast Video Interpolation Model and Prediction Training
Abstract:
Transmission latency significantly affects users' quality of experience in real-time interaction and actuation. As latency is principally inevitable, video prediction can be utilized to mitigate the latency and ultimately enable zero-latency transmission. However, most of the existing video prediction methods are computationally expensive and impractical for real-time applications. In this work, we therefore propose real-time video prediction towards the zero-latency interaction over networks, called IFRVP (Intermediate Feature Refinement Video Prediction). Firstly, we propose three training methods for video prediction that extend frame interpolation models, where we utilize a simple convolution-only frame interpolation network based on IFRNet. Secondly, we introduce ELAN-based residual blocks into the prediction models to improve both inference speed and accuracy. Our evaluations show that our proposed models perform efficiently and achieve the best trade-off between prediction accuracy and computational speed among the existing video prediction methods. A demonstration movie is also provided at http://bit.ly/IFRVPDemo. The code will be released at https://github.com/FykAikawa/IFRVP.
中文摘要:本文提出IFRVP实时视频预测方法,通过改进帧插值技术和引入ELAN残差块,在预测精度与计算速度之间实现了最佳平衡,致力于实现网络零延迟交互。
English Summary: This paper introduces IFRVP, a real-time video prediction method that uses enhanced frame interpolation and ELAN-based residual blocks to achieve optimal balance between accuracy and computational efficiency for zero-latency network interactions.

Authors:Zewen Liu, Xiaoda Wang, Bohan Wang, Zijie Huang, Carl Yang, Wei Jin
Title: Graph ODEs and Beyond: A Comprehensive Survey on Integrating Differential Equations with Graph Neural Networks
Abstract:
Graph Neural Networks (GNNs) and differential equations (DEs) are two rapidly advancing areas of research that have shown remarkable synergy in recent years. GNNs have emerged as powerful tools for learning on graph-structured data, while differential equations provide a principled framework for modeling continuous dynamics across time and space. The intersection of these fields has led to innovative approaches that leverage the strengths of both, enabling applications in physics-informed learning, spatiotemporal modeling, and scientific computing. This survey aims to provide a comprehensive overview of the burgeoning research at the intersection of GNNs and DEs. We will categorize existing methods, discuss their underlying principles, and highlight their applications across domains such as molecular modeling, traffic prediction, and epidemic spreading. Furthermore, we identify open challenges and outline future research directions to advance this interdisciplinary field. A comprehensive paper list is provided at https://github.com/Emory-Melody/Awesome-Graph-NDEs. This survey serves as a resource for researchers and practitioners seeking to understand and contribute to the fusion of GNNs and DEs
中文摘要:本综述系统探讨图神经网络与微分方程的协同融合,分类研究方法与应用领域,并为这一交叉学科指明未来发展方向。
English Summary: This survey comprehensively explores the synergistic integration of Graph Neural Networks and differential equations, categorizing methods and applications while identifying future research directions for this interdisciplinary field.

Authors:Anjiang Wei, Tarun Suresh, Jiannan Cao, Naveen Kannan, Yuheng Wu, Kai Yan, Thiago S. F. X. Teixeira, Ke Wang, Alex Aiken
Title: CodeARC: Benchmarking Reasoning Capabilities of LLM Agents for Inductive Program Synthesis
Abstract:
Inductive program synthesis, or programming by example, requires synthesizing functions from input-output examples that generalize to unseen inputs. While large language model agents have shown promise in programming tasks guided by natural language, their ability to perform inductive program synthesis is underexplored. Existing evaluation protocols rely on static sets of examples and held-out tests, offering no feedback when synthesized functions are incorrect and failing to reflect real-world scenarios such as reverse engineering. We propose CodeARC, the Code Abstraction and Reasoning Challenge, a new evaluation framework where agents interact with a hidden target function by querying it with new inputs, synthesizing candidate functions, and iteratively refining their solutions using a differential testing oracle. This interactive setting encourages agents to perform function calls and self-correction based on feedback. We construct the first large-scale benchmark for general-purpose inductive program synthesis, featuring 1114 functions. Among 18 models evaluated, o3-mini performs best with a success rate of 52.7%, highlighting the difficulty of this task. Fine-tuning LLaMA-3.1-8B-Instruct on curated synthesis traces yields up to a 31% relative performance gain. CodeARC provides a more realistic and challenging testbed for evaluating LLM-based program synthesis and inductive reasoning. Our code, data, and models are publicly available at https://github.com/Anjiang-Wei/CodeARC
中文: CodeARC提出了一个交互式评估框架,用于归纳程序合成,智能体通过隐藏目标函数的反馈迭代优化方案,实验表明最佳模型成功率仅达52.7%,而微调可带来显著性能提升。
English: CodeARC introduces an interactive evaluation framework for inductive program synthesis, where agents iteratively refine functions using feedback from a hidden target, with experiments showing top-performing models achieving 52.7% success and fine-tuning yielding significant improvements.

Authors:Ao Wang, Hui Chen, Zijia Lin, Jungong Han, Guiguang Ding
Title: LSNet: See Large, Focus Small
Abstract:
Vision network designs, including Convolutional Neural Networks and Vision Transformers, have significantly advanced the field of computer vision. Yet, their complex computations pose challenges for practical deployments, particularly in real-time applications. To tackle this issue, researchers have explored various lightweight and efficient network designs. However, existing lightweight models predominantly leverage self-attention mechanisms and convolutions for token mixing. This dependence brings limitations in effectiveness and efficiency in the perception and aggregation processes of lightweight networks, hindering the balance between performance and efficiency under limited computational budgets. In this paper, we draw inspiration from the dynamic heteroscale vision ability inherent in the efficient human vision system and propose a ``See Large, Focus Small'' strategy for lightweight vision network design. We introduce LS (\textbf{L}arge-\textbf{S}mall) convolution, which combines large-kernel perception and small-kernel aggregation. It can efficiently capture a wide range of perceptual information and achieve precise feature aggregation for dynamic and complex visual representations, thus enabling proficient processing of visual information. Based on LS convolution, we present LSNet, a new family of lightweight models. Extensive experiments demonstrate that LSNet achieves superior performance and efficiency over existing lightweight networks in various vision tasks. Codes and models are available at https://github.com/jameslahm/lsnet.
中文摘要:本文提出LSNet轻量视觉网络,采用"见大聚焦小"策略,通过LS卷积结合大核感知与小核聚合,在多种视觉任务中实现了优于现有轻量模型的性能与效率。
English Summary: The paper introduces LSNet, a lightweight vision network that employs a novel "See Large, Focus Small" strategy through LS convolution, combining large-kernel perception with small-kernel aggregation to achieve superior efficiency and performance across various vision tasks.

Authors:Alexander Vogel, Omar Moured, Yufan Chen, Jiaming Zhang, Rainer Stiefelhagen
Title: RefChartQA: Grounding Visual Answer on Chart Images through Instruction Tuning
Abstract:
Recently, Vision Language Models (VLMs) have increasingly emphasized document visual grounding to achieve better human-computer interaction, accessibility, and detailed understanding. However, its application to visualizations such as charts remains under-explored due to the inherent complexity of interleaved visual-numerical relationships in chart images. Existing chart understanding methods primarily focus on answering questions without explicitly identifying the visual elements that support their predictions. To bridge this gap, we introduce RefChartQA, a novel benchmark that integrates Chart Question Answering (ChartQA) with visual grounding, enabling models to refer elements at multiple granularities within chart images. Furthermore, we conduct a comprehensive evaluation by instruction-tuning 5 state-of-the-art VLMs across different categories. Our experiments demonstrate that incorporating spatial awareness via grounding improves response accuracy by over 15%, reducing hallucinations, and improving model reliability. Additionally, we identify key factors influencing text-spatial alignment, such as architectural improvements in TinyChart, which leverages a token-merging module for enhanced feature fusion. Our dataset is open-sourced for community development and further advancements. All models and code will be publicly available at https://github.com/moured/RefChartQA.
Chinese: RefChartQA提出了一个结合图表问答与视觉定位的新基准,通过明确识别空间元素将模型准确率提升超过15%,解决了现有方法忽视图表视觉证据的不足。
English: RefChartQA introduces a novel benchmark combining chart question answering with visual grounding to enhance model accuracy by over 15% through explicit spatial element identification, addressing gaps in current methods that overlook visual evidence in charts.

Authors:Guohong Huang, Ling-An Zeng, Zexin Zheng, Shengbo Gu, Wei-Shi Zheng
Title: Efficient Explicit Joint-level Interaction Modeling with Mamba for Text-guided HOI Generation
Abstract:
We propose a novel approach for generating text-guided human-object interactions (HOIs) that achieves explicit joint-level interaction modeling in a computationally efficient manner. Previous methods represent the entire human body as a single token, making it difficult to capture fine-grained joint-level interactions and resulting in unrealistic HOIs. However, treating each individual joint as a token would yield over twenty times more tokens, increasing computational overhead. To address these challenges, we introduce an Efficient Explicit Joint-level Interaction Model (EJIM). EJIM features a Dual-branch HOI Mamba that separately and efficiently models spatiotemporal HOI information, as well as a Dual-branch Condition Injector for integrating text semantics and object geometry into human and object motions. Furthermore, we design a Dynamic Interaction Block and a progressive masking mechanism to iteratively filter out irrelevant joints, ensuring accurate and nuanced interaction modeling. Extensive quantitative and qualitative evaluations on public datasets demonstrate that EJIM surpasses previous works by a large margin while using only 5\% of the inference time. Code is available \href{https://github.com/Huanggh531/EJIM}{here}.
Chinese: 我们提出了一种高效的显式关节级交互模型(EJIM),该方法能够以高计算效率对文本引导的人体-物体交互进行精细建模,在仅使用5%推理时间的情况下,大幅超越了现有方法的性能。
English: We introduce the Efficient Explicit Joint-level Interaction Model (EJIM), a novel method that effectively models text-guided human-object interactions at the joint level with high computational efficiency, significantly outperforming previous approaches while using only 5% of the inference time.

Authors:Xiaolu Liu, Ruizi Yang, Song Wang, Wentong Li, Junbo Chen, Jianke Zhu
Title: Uncertainty-Instructed Structure Injection for Generalizable HD Map Construction
Abstract:
Reliable high-definition (HD) map construction is crucial for the driving safety of autonomous vehicles. Although recent studies demonstrate improved performance, their generalization capability across unfamiliar driving scenes remains unexplored. To tackle this issue, we propose UIGenMap, an uncertainty-instructed structure injection approach for generalizable HD map vectorization, which concerns the uncertainty resampling in statistical distribution and employs explicit instance features to reduce excessive reliance on training data. Specifically, we introduce the perspective-view (PV) detection branch to obtain explicit structural features, in which the uncertainty-aware decoder is designed to dynamically sample probability distributions considering the difference in scenes. With probabilistic embedding and selection, UI2DPrompt is proposed to construct PV-learnable prompts. These PV prompts are integrated into the map decoder by designed hybrid injection to compensate for neglected instance structures. To ensure real-time inference, a lightweight Mimic Query Distillation is designed to learn from PV prompts, which can serve as an efficient alternative to the flow of PV branches. Extensive experiments on challenging geographically disjoint (geo-based) data splits demonstrate that our UIGenMap achieves superior performance, with +5.7 mAP improvement on the nuScenes dataset. Source code will be available at https://github.com/xiaolul2/UIGenMap.
Chinese: UIGenMap通过不确定性引导的结构注入方法,利用动态分布重采样和显式实例特征来提升高精地图的泛化能力,在nuScenes数据集上实现了mAP指标5.7的性能提升,有效保障自动驾驶的行车安全。
English: UIGenMap enhances autonomous vehicle safety by introducing an uncertainty-instructed structure injection method that improves generalization across diverse driving scenes through dynamic uncertainty resampling and explicit feature integration, achieving a +5.7 mAP boost on nuScenes.

Authors:Paul Caillon, Erwan Fagnou, Alexandre Allauzen
Title: Fast Training of Recurrent Neural Networks with Stationary State Feedbacks
Abstract:
Recurrent neural networks (RNNs) have recently demonstrated strong performance and faster inference than Transformers at comparable parameter budgets. However, the recursive gradient computation with the backpropagation through time (or BPTT) algorithm remains the major computational bottleneck. In this work, we propose a novel method that replaces BPTT with a fixed gradient feedback mechanism, yielding an efficient approximation of the exact gradient propagation based on the assumption of time stationarity. Our approach leverages state-space model (SSM) principles to define a structured feedback matrix that directly propagates gradients from future time steps. This formulation bypasses the need for recursive gradient backpropagation, significantly reducing training overhead while preserving the network's ability to capture long-term dependencies. The experiments on language modeling benchmarks exhibit competitive perplexity scores, while significantly reducing the training costs. These promising results suggest that designing a feedback method like an SSM can fully exploit the efficiency advantages of RNNs for many practical applications.
中文: 本文提出一种基于状态空间模型的固定梯度反馈方法,替代循环神经网络中计算成本高的时间反向传播,以降低训练成本的同时保持竞争力。
English: This paper introduces a fixed gradient feedback method based on state-space model principles to replace the computationally expensive backpropagation through time in recurrent neural networks, achieving competitive performance with reduced training costs.

Authors:Yue Liu, Jiaying Wu, Yufei He, Ruihan Gong, Jun Xia, Liang Li, Hongcheng Gao, Hongyu Chen, Baolong Bi, Jiaheng Zhang, Zhiqi Huang, Bryan Hooi, Stan Z. Li, Keqin Li
Title: Efficient Inference for Large Reasoning Models: A Survey
Abstract:
Large Reasoning Models (LRMs) significantly improve the reasoning ability of Large Language Models (LLMs) by learning to reason, exhibiting promising performance in solving complex tasks. However, their deliberative reasoning process leads to inefficiencies in token usage, memory consumption, and inference time. Thus, this survey provides a review of efficient inference methods designed specifically for LRMs, focusing on mitigating token inefficiency while preserving the reasoning quality. The overview structure of this paper is shown in Figure~\ref{fig:paper_structure}. First, we introduce a taxonomy to group the recent methods into two main categories: (a) explicit compact Chain-of-Thought (CoT), which reduces tokens while keeping the explicit reasoning structure, and (b) implicit latent CoT, which encodes reasoning steps within hidden representations instead of explicit tokens. Meanwhile, we discuss their strengths and weaknesses. Then, we conduct empirical analyses on existing methods from reasoning scenarios, object functions, and performance \& efficiency aspects. Besides, we present open challenges in this field, including human-centric controllable reasoning, trade-off between interpretability and efficiency of reasoning, ensuring the safety of efficient reasoning, and broader applications of efficient reasoning. In addition, we highlight key insights for enhancing LRMs' inference efficiency via techniques such as model merging, new architectures, and agent routers. We hope this work serves as a valuable guide, helping researchers overcome challenges in this vibrant field. A collection of efficient reasoning methods for LRMs (papers and codes) is provided at this link: https://github.com/yueliu1999/Awesome-Efficient-Inference-for-LRMs.
中文: 本综述针对大型推理模型在推理过程中存在的令牌低效问题,系统评述了保持推理质量的高效推理方法,将其分为显式紧凑思维链和隐式潜在思维链两类,并探讨了其优劣、实证分析及未来挑战。
English: This survey reviews efficient inference methods for Large Reasoning Models (LRMs) to address token inefficiency while maintaining reasoning quality, categorizing approaches into explicit compact Chain-of-Thought and implicit latent CoT while analyzing their trade-offs and future challenges.

Authors:Yuyang Liang, Yankai Chen, Yixiang Fang, Laks V. S. Lakshmanan, Chenhao Ma
Title: TRACE: Intra-visit Clinical Event Nowcasting via Effective Patient Trajectory Encoding
Abstract:
Electronic Health Records (EHR) have become a valuable resource for a wide range of predictive tasks in healthcare. However, existing approaches have largely focused on inter-visit event predictions, overlooking the importance of intra-visit nowcasting, which provides prompt clinical insights during an ongoing patient visit. To address this gap, we introduce the task of laboratory measurement prediction within a hospital visit. We study the laboratory data that, however, remained underexplored in previous work. We propose TRACE, a Transformer-based model designed for clinical event nowcasting by encoding patient trajectories. TRACE effectively handles long sequences and captures temporal dependencies through a novel timestamp embedding that integrates decay properties and periodic patterns of data. Additionally, we introduce a smoothed mask for denoising, improving the robustness of the model. Experiments on two large-scale electronic health record datasets demonstrate that the proposed model significantly outperforms previous methods, highlighting its potential for improving patient care through more accurate laboratory measurement nowcasting. The code is available at https://github.com/Amehi/TRACE.
中文摘要:该研究提出TRACE模型,一种基于Transformer的临床事件即时预测方法,通过处理长序列数据和整合时间衰减特性与周期模式的新型时间戳嵌入,显著提升了医院就诊期间实验室测量的预测准确性。
English Summary: The study introduces TRACE, a Transformer-based model for intra-visit laboratory measurement nowcasting in Electronic Health Records, which outperforms existing methods by handling long sequences and capturing temporal dependencies through innovative timestamp embeddings and denoising techniques.

Authors:Zijun Ding, Mingdie Xiong, Congcong Zhu, Jingrun Chen
Title: STSA: Spatial-Temporal Semantic Alignment for Visual Dubbing
Abstract:
Existing audio-driven visual dubbing methods have achieved great success. Despite this, we observe that the semantic ambiguity between spatial and temporal domains significantly degrades the synthesis stability for the dynamic faces. We argue that aligning the semantic features from spatial and temporal domains is a promising approach to stabilizing facial motion. To achieve this, we propose a Spatial-Temporal Semantic Alignment (STSA) method, which introduces a dual-path alignment mechanism and a differentiable semantic representation. The former leverages a Consistent Information Learning (CIL) module to maximize the mutual information at multiple scales, thereby reducing the manifold differences between spatial and temporal domains. The latter utilizes probabilistic heatmap as ambiguity-tolerant guidance to avoid the abnormal dynamics of the synthesized faces caused by slight semantic jittering. Extensive experimental results demonstrate the superiority of the proposed STSA, especially in terms of image quality and synthesis stability. Pre-trained weights and inference code are available at https://github.com/SCAILab-USTC/STSA.
中文: 提出的时空语义对齐(STSA)方法通过双路径对齐机制和概率热图来协调空间与时间语义特征,有效稳定了面部运动合成,显著提升了图像质量与合成稳定性。
English: The proposed Spatial-Temporal Semantic Alignment (STSA) method stabilizes facial motion synthesis by aligning spatial and temporal semantic features through dual-path alignment and probabilistic heatmaps, significantly improving image quality and stability.

Authors:Ziang Lu, Lei Guo, Xu Yu, Zhiyong Cheng, Xiaohui Han, Lei Zhu
Title: Federated Semantic Learning for Privacy-preserving Cross-domain Recommendation
Abstract:
In the evolving landscape of recommender systems, the challenge of effectively conducting privacy-preserving Cross-Domain Recommendation (CDR), especially under strict non-overlapping constraints, has emerged as a key focus. Despite extensive research has made significant progress, several limitations still exist: 1) Previous semantic-based methods fail to deeply exploit rich textual information, since they quantize the text into codes, losing its original rich semantics. 2) The current solution solely relies on the text-modality, while the synergistic effects with the ID-modality are ignored. 3) Existing studies do not consider the impact of irrelevant semantic features, leading to inaccurate semantic representation. To address these challenges, we introduce federated semantic learning and devise FFMSR as our solution. For Limitation 1, we locally learn items'semantic encodings from their original texts by a multi-layer semantic encoder, and then cluster them on the server to facilitate the transfer of semantic knowledge between domains. To tackle Limitation 2, we integrate both ID and Text modalities on the clients, and utilize them to learn different aspects of items. To handle Limitation 3, a Fast Fourier Transform (FFT)-based filter and a gating mechanism are developed to alleviate the impact of irrelevant semantic information in the local model. We conduct extensive experiments on two real-world datasets, and the results demonstrate the superiority of our FFMSR method over other SOTA methods. Our source codes are publicly available at: https://github.com/Sapphire-star/FFMSR.
中文摘要:本文提出FFMSR方法,通过联邦语义学习整合ID与文本模态,采用多层语义编码器和快速傅里叶变换过滤机制,有效解决跨域推荐中语义信息利用不足和隐私保护问题。
English Summary: This paper introduces FFMSR, a federated semantic learning method that addresses limitations in cross-domain recommendation by integrating ID and text modalities, using a multi-layer semantic encoder and FFT-based filtering to enhance semantic representation and privacy preservation.

Authors:Beibei Wang, Boyue Cui, Shiqu Chen, Xuan Wang, Yadong Wang, Junyi Li
Title: MSNGO: multi-species protein function annotation based on 3D protein structure and network propagation
Abstract:
Motivation: In recent years, protein function prediction has broken through the bottleneck of sequence features, significantly improving prediction accuracy using high-precision protein structures predicted by AlphaFold2. While single-species protein function prediction methods have achieved remarkable success, multi-species protein function prediction methods are still in the stage of using PPI networks and sequence features. Providing effective cross-species label propagation for species with sparse protein annotations remains a challenging issue. To address this problem, we propose the MSNGO model, which integrates structural features and network propagation methods. Our validation shows that using structural features can significantly improve the accuracy of multi-species protein function prediction. Results: We employ graph representation learning techniques to extract amino acid representations from protein structure contact maps and train a structural model using a graph convolution pooling module to derive protein-level structural features. After incorporating the sequence features from ESM-2, we apply a network propagation algorithm to aggregate information and update node representations within a heterogeneous network. The results demonstrate that MSNGO outperforms previous multi-species protein function prediction methods that rely on sequence features and PPI networks. Availability: https://github.com/blingbell/MSNGO.
中文:MSNGO模型通过整合AlphaFold2预测的结构特征与序列数据,显著提升了跨物种蛋白质功能预测的准确性,优于以往依赖序列特征和蛋白质相互作用网络的方法。
English: The MSNGO model enhances multi-species protein function prediction by integrating structural features from AlphaFold2-derived contact maps with sequence data, achieving superior accuracy over traditional sequence-based and PPI network methods.

Authors:Gabriel Recchia, Chatrik Singh Mangat, Issac Li, Gayatri Krishnakumar
Title: FindTheFlaws: Annotated Errors for Detecting Flawed Reasoning and Scalable Oversight Research
Abstract:
As AI models tackle increasingly complex problems, ensuring reliable human oversight becomes more challenging due to the difficulty of verifying solutions. Approaches to scaling AI supervision include debate, in which two agents engage in structured dialogue to help a judge evaluate claims; critique, in which models identify potential flaws in proposed solutions; and prover-verifier games, in which a capable 'prover' model generates solutions that must be verifiable by a less capable 'verifier'. Evaluations of the scalability of these and similar approaches to difficult problems benefit from datasets that include (1) long-form expert-verified correct solutions and (2) long-form flawed solutions with annotations highlighting specific errors, but few are available. To address this gap, we present FindTheFlaws, a group of five diverse datasets spanning medicine, mathematics, science, coding, and the Lojban language. Each dataset contains questions and long-form solutions with expert annotations validating their correctness or identifying specific error(s) in the reasoning. We evaluate frontier models' critiquing capabilities and observe a range of performance that can be leveraged for scalable oversight experiments: models performing more poorly on particular datasets can serve as judges/verifiers for more capable models. Additionally, for some task/dataset combinations, expert baselines exceed even top model performance, making them more beneficial for scalable oversight experiments.
中文: 为解决可扩展AI监督中专家标注数据稀缺的问题,FindTheFlaws数据集集合提供了五个涵盖多领域的标注数据集,包含已验证解决方案和错误标注,既能评估模型的批判能力,又可通过让较弱模型验证较强模型来支持可扩展监督实验。
English: To address the scarcity of expert-annotated datasets for scalable AI oversight, the FindTheFlaws collection provides five diverse datasets with validated solutions and error annotations, enabling evaluation of model critiquing capabilities and supporting scalable oversight experiments by pairing weaker models as verifiers for stronger ones.

Authors:Peiyu Chen, Fuling Lin, Weipeng Guan, Peng Lu
Title: SuperEIO: Self-Supervised Event Feature Learning for Event Inertial Odometry
Abstract:
Event cameras asynchronously output low-latency event streams, promising for state estimation in high-speed motion and challenging lighting conditions. As opposed to frame-based cameras, the motion-dependent nature of event cameras presents persistent challenges in achieving robust event feature detection and matching. In recent years, learning-based approaches have demonstrated superior robustness over traditional handcrafted methods in feature detection and matching, particularly under aggressive motion and HDR scenarios. In this paper, we propose SuperEIO, a novel framework that leverages the learning-based event-only detection and IMU measurements to achieve event-inertial odometry. Our event-only feature detection employs a convolutional neural network under continuous event streams. Moreover, our system adopts the graph neural network to achieve event descriptor matching for loop closure. The proposed system utilizes TensorRT to accelerate the inference speed of deep networks, which ensures low-latency processing and robust real-time operation on resource-limited platforms. Besides, we evaluate our method extensively on multiple public datasets, demonstrating its superior accuracy and robustness compared to other state-of-the-art event-based methods. We have also open-sourced our pipeline to facilitate research in the field: https://github.com/arclab-hku/SuperEIO.
Chinese: 事件相机在高速运动和复杂光照条件下提供低延迟事件流,而SuperEIO框架通过基于学习的事件检测结合IMU数据,在资源受限平台上实现了实时、鲁棒的事件-惯性里程计。
English: Event cameras offer low-latency event streams ideal for state estimation in high-speed and challenging lighting conditions, and the proposed SuperEIO framework uses learning-based event-only detection with IMU data to achieve robust, real-time event-inertial odometry on resource-limited platforms.

Authors:Behrooz Moosavi Ramezanzadeh
Title: Optimal Control of an Epidemic with Intervention Design
Abstract:
In this paper, I propose a controlled SEIR model that advances epidemic management through optimal control theory. I improve the traditional framework by incorporating practical intervention constraints and economic considerations. Approaching this problem using modern methods of calculus of variations, I first conduct a rigorous mathematical analysis of the controlled system. Then, I formulate an infinite time horizon control problem and investigate its mathematical connections with finite time, setting the stage for applying the Hamiltonian procedure.
中文: 本文通过最优控制理论提出改进的SEIR传染病模型,结合实际干预和经济因素,采用变分法和哈密顿方法进行严谨数学分析。
English: This paper introduces an enhanced SEIR epidemic model using optimal control theory, incorporating practical interventions and economic factors through rigorous mathematical analysis and Hamiltonian methods.

Authors:Ke Zhang, Vishal M. Patel
Title: MedCL: Learning Consistent Anatomy Distribution for Scribble-supervised Medical Image Segmentation
Abstract:
Curating large-scale fully annotated datasets is expensive, laborious, and cumbersome, especially for medical images. Several methods have been proposed in the literature that make use of weak annotations in the form of scribbles. However, these approaches require large amounts of scribble annotations, and are only applied to the segmentation of regular organs, which are often unavailable for the disease species that fall in the long-tailed distribution. Motivated by the fact that the medical labels have anatomy distribution priors, we propose a scribble-supervised clustering-based framework, called MedCL, to learn the inherent anatomy distribution of medical labels. Our approach consists of two steps: i) Mix the features with intra- and inter-image mix operations, and ii) Perform feature clustering and regularize the anatomy distribution at both local and global levels. Combined with a small amount of weak supervision, the proposed MedCL is able to segment both regular organs and challenging irregular pathologies. We implement MedCL based on SAM and UNet backbones, and evaluate the performance on three open datasets of regular structure (MSCMRseg), multiple organs (BTCV) and irregular pathology (MyoPS). It is shown that even with less scribble supervision, MedCL substantially outperforms the conventional segmentation methods. Our code is available at https://github.com/BWGZK/MedCL.
中文摘要:MedCL框架通过特征混合与聚类技术,结合少量标注和医学先验知识,能够同时精准分割常规器官与不规则病灶,在有限监督下显著优于传统分割方法。
English Summary: The proposed MedCL framework leverages scribble annotations and anatomical priors through feature mixing and clustering to effectively segment both regular organs and irregular pathologies with minimal supervision, outperforming traditional methods.

Authors:Lauren Shrack, Timm Haucke, Antoine Salaün, Arjun Subramonian, Sara Beery
Title: Pairwise Matching of Intermediate Representations for Fine-grained Explainability
Abstract:
The differences between images belonging to fine-grained categories are often subtle and highly localized, and existing explainability techniques for deep learning models are often too diffuse to provide useful and interpretable explanations. We propose a new explainability method (PAIR-X) that leverages both intermediate model activations and backpropagated relevance scores to generate fine-grained, highly-localized pairwise visual explanations. We use animal and building re-identification (re-ID) as a primary case study of our method, and we demonstrate qualitatively improved results over a diverse set of explainability baselines on 35 public re-ID datasets. In interviews, animal re-ID experts found PAIR-X to be a meaningful improvement over existing baselines for deep model explainability, and suggested that its visualizations would be directly applicable to their work. We also propose a novel quantitative evaluation metric for our method, and demonstrate that PAIR-X visualizations appear more plausible for correct image matches than incorrect ones even when the model similarity score for the pairs is the same. By improving interpretability, PAIR-X enables humans to better distinguish correct and incorrect matches. Our code is available at: https://github.com/pairx-explains/pairx
中文:提出的PAIR-X方法通过结合模型激活和相关性评分生成细粒度、局部化的视觉解释,在动物和建筑物重识别任务中展现出优于现有技术的可解释性。
English: The proposed PAIR-X method generates fine-grained, localized visual explanations by combining model activations and relevance scores, demonstrating superior interpretability over existing techniques in animal and building re-identification tasks.

Authors:Hung-Yueh Chiang, Chi-Chih Chang, Natalia Frumkin, Kai-Chiang Wu, Mohamed S. Abdelfattah, Diana Marculescu
Title: Quamba2: A Robust and Scalable Post-training Quantization Framework for Selective State Space Models
Abstract:
State Space Models (SSMs) are emerging as a compelling alternative to Transformers because of their consistent memory usage and high performance. Despite this, scaling up SSMs on cloud services or limited-resource devices is challenging due to their storage requirements and computational power. To overcome this, quantizing SSMs with low bit-width data formats can reduce model size and benefit from hardware acceleration. As SSMs are prone to quantization-induced errors, recent efforts have focused on optimizing a particular model or bit-width for efficiency without sacrificing performance. However, distinct bit-width configurations are essential for different scenarios, like W4A8 for boosting large-batch decoding speed, and W4A16 for enhancing generation speed in short prompt applications for a single user. To this end, we present Quamba2, compatible with W8A8, W4A8, and W4A16 for both Mamba1 and Mamba2 backbones, addressing the growing demand for SSM deployment on various platforms. Based on the channel order preserving and activation persistence of SSMs, we propose an offline approach to quantize inputs of a linear recurrence in 8-bit by sorting and clustering for input $x$, combined with a per-state-group quantization for input-dependent parameters $B$ and $C$. To ensure compute-invariance in the SSM output, we rearrange weights offline according to the clustering sequence. The experiments show that Quamba2-8B outperforms two state-of-the-art SSM quantization methods and delivers 1.3$\times$ and 3$\times$ speed-ups in the pre-filling and generation stages, respectively, while offering 4$\times$ memory reduction with only a $1.6\%$ average accuracy drop. The evaluation on MMLU shows the generalizability and robustness of our framework. The code and quantized models will be released at: https://github.com/enyac-group/Quamba.
中文:Quamba2是一种适用于状态空间模型的多位宽量化框架,通过离线排序聚类和权重重排技术,在实现4倍内存压缩和显著加速的同时,仅造成1.6%的精度损失,有效支持不同部署场景的需求。
English: Quamba2 is a versatile quantization framework for State Space Models that supports multiple bit-width configurations to reduce memory usage and accelerate computation while maintaining accuracy, achieving significant speed-ups and memory savings with minimal performance loss.

Authors:Nina Weng, Aasa Feragen, Siavash Bigdeli
Title: Patronus: Bringing Transparency to Diffusion Models with Prototypes
Abstract:
Diffusion-based generative models, such as Denoising Diffusion Probabilistic Models (DDPMs), have achieved remarkable success in image generation, but their step-by-step denoising process remains opaque, leaving critical aspects of the generation mechanism unexplained. To address this, we introduce \emph{Patronus}, an interpretable diffusion model inspired by ProtoPNet. Patronus integrates a prototypical network into DDPMs, enabling the extraction of prototypes and conditioning of the generation process on their prototype activation vector. This design enhances interpretability by showing the learned prototypes and how they influence the generation process. Additionally, the model supports downstream tasks like image manipulation, enabling more transparent and controlled modifications. Moreover, Patronus could reveal shortcut learning in the generation process by detecting unwanted correlations between learned prototypes. Notably, Patronus operates entirely without any annotations or text prompts. This work opens new avenues for understanding and controlling diffusion models through prototype-based interpretability. Our code is available at \href{https://github.com/nina-weng/patronus}{https://github.com/nina-weng/patronus}.
中文: Patronus是一种可解释的扩散模型,它将原型网络集成到DDPM中,通过展示学习到的原型及其对生成过程的影响来提高透明度,同时支持图像操控和检测捷径学习,且无需任何标注。
English: Patronus is an interpretable diffusion model that integrates a prototypical network into DDPMs to enhance transparency by revealing learned prototypes and their influence on image generation, while enabling manipulation and detecting shortcut learning without requiring annotations.

Authors:Amr Alshatnawi, Remi Sampaleanu, David Liebovitz
Title: MediTools -- Medical Education Powered by LLMs
Abstract:
Artificial Intelligence (AI) has been advancing rapidly and with the advent of large language models (LLMs) in late 2022, numerous opportunities have emerged for adopting this technology across various domains, including medicine. These innovations hold immense potential to revolutionize and modernize medical education. Our research project leverages large language models to enhance medical education and address workflow challenges through the development of MediTools - AI Medical Education. This prototype application focuses on developing interactive tools that simulate real-life clinical scenarios, provide access to medical literature, and keep users updated with the latest medical news. Our first tool is a dermatology case simulation tool that uses real patient images depicting various dermatological conditions and enables interaction with LLMs acting as virtual patients. This platform allows users to practice their diagnostic skills and enhance their clinical decision-making abilities. The application also features two additional tools: an AI-enhanced PubMed tool for engaging with LLMs to gain deeper insights into research papers, and a Google News tool that offers LLM generated summaries of articles for various medical specialties. A comprehensive survey has been conducted among medical professionals and students to gather initial feedback on the effectiveness and user satisfaction of MediTools, providing insights for further development and refinement of the application. This research demonstrates the potential of AI-driven tools in transforming and revolutionizing medical education, offering a scalable and interactive platform for continuous learning and skill development.
中文: 本研究利用大语言模型开发了MediTools人工智能医学教育平台,通过临床模拟工具、智能文献检索和医学新闻摘要三大功能,为医学教育提供可扩展的交互式学习方案,展现了AI技术推动医学教育变革的巨大潜力。
English: This research utilizes large language models to develop MediTools, an AI-powered medical education platform featuring interactive clinical simulations, enhanced literature access, and real-time news summaries, demonstrating AI's transformative potential in modernizing medical training through scalable learning tools.

Authors:Yuying Duan, Gelei Xu, Yiyu Shi, Michael Lemmon
Title: The Cost of Local and Global Fairness in Federated Learning
Abstract:
With the emerging application of Federated Learning (FL) in finance, hiring and healthcare, FL models are regulated to be fair, preventing disparities with respect to legally protected attributes such as race or gender. Two concepts of fairness are important in FL: global and local fairness. Global fairness addresses the disparity across the entire population and local fairness is concerned with the disparity within each client. Prior fair FL frameworks have improved either global or local fairness without considering both. Furthermore, while the majority of studies on fair FL focuses on binary settings, many real-world applications are multi-class problems. This paper proposes a framework that investigates the minimum accuracy lost for enforcing a specified level of global and local fairness in multi-class FL settings. Our framework leads to a simple post-processing algorithm that derives fair outcome predictors from the Bayesian optimal score functions. Experimental results show that our algorithm outperforms the current state of the art (SOTA) with regard to the accuracy-fairness tradoffs, computational and communication costs. Codes are available at: https://github.com/papersubmission678/The-cost-of-local-and-global-fairness-in-FL .
中文摘要:本文提出一个联邦学习框架,在保证全局与局部公平性的同时最小化多分类场景中的精度损失,其提出的后处理算法在公平性-精度权衡及计算效率方面均优于现有方法。
English Summary: This paper introduces a framework for enforcing both global and local fairness in multi-class federated learning with minimal accuracy loss, featuring a post-processing algorithm that outperforms current methods in fairness-accuracy tradeoffs and efficiency.

Authors:Pinlong Zhao, Weiyao Zhu, Pengfei Jiao, Di Gao, Ou Wu
Title: Data Poisoning in Deep Learning: A Survey
Abstract:
Deep learning has become a cornerstone of modern artificial intelligence, enabling transformative applications across a wide range of domains. As the core element of deep learning, the quality and security of training data critically influence model performance and reliability. However, during the training process, deep learning models face the significant threat of data poisoning, where attackers introduce maliciously manipulated training data to degrade model accuracy or lead to anomalous behavior. While existing surveys provide valuable insights into data poisoning, they generally adopt a broad perspective, encompassing both attacks and defenses, but lack a dedicated, in-depth analysis of poisoning attacks specifically in deep learning. In this survey, we bridge this gap by presenting a comprehensive and targeted review of data poisoning in deep learning. First, this survey categorizes data poisoning attacks across multiple perspectives, providing an in-depth analysis of their characteristics and underlying design princinples. Second, the discussion is extended to the emerging area of data poisoning in large language models(LLMs). Finally, we explore critical open challenges in the field and propose potential research directions to advance the field further. To support further exploration, an up-to-date repository of resources on data poisoning in deep learning is available at https://github.com/Pinlong-Zhao/Data-Poisoning.
中文摘要:本综述针对深度学习中的数据投毒攻击进行了系统梳理,从多角度分类分析其特性与设计原理,并延伸探讨大型语言模型中的投毒问题及未来研究方向。
English Summary: This survey provides a comprehensive review of data poisoning attacks in deep learning, categorizing them by characteristics and design principles while extending the discussion to large language models and identifying future research directions.

Authors:Gongzhu Yin, Hongli Zhang, Yi Luo, Yuchen Yang, Kun Lu, Chao Meng
Title: Ignite Forecasting with SPARK: An Efficient Generative Framework for Refining LLMs in Temporal Knowledge Graph Forecasting
Abstract:
Temporal Knowledge Graph (TKG) forecasting is crucial for predicting future events using historical data. With the surge of Large Language Models (LLMs), recent studies have begun exploring their integration into TKG forecasting and achieved some success. However, they still face limitations such as limited input length, inefficient output generation, and resource-intensive refinement, which undermine their performance and practical applicability. To address these limitations, we introduce SPARK, a Sequence-level Proxy-Adapting framework for Refining LLMs in TKG forecasting. Inspired by inference-time algorithms adopted in controlling generation, SPARK offers a cost-effective, plug-and-play solution through two key innovations: (1) Beam Sequence-Level Generation, which reframes TKG forecasting as a top-K sequence-level generation task, using beam search for efficiently generating next-entity distribution in a single forward pass. (2) TKG Adapter for Refinement, which employs traditional TKG models as trainable proxy adapters to leverage global graph information and refine LLM outputs, overcoming both the input length and the resource-intensive fine-tuning problems. Experiments across diverse datasets validate SPARK's forecasting performance, robust generalization capabilities, and high efficiency. We release source codes at https://github.com/yin-gz/SPARK.
Chinese Summary: SPARK提出了一种经济高效的即插即用框架,通过集成束序列级生成和TKG适配器,有效解决了大型语言模型在时序知识图谱预测中的输入限制和效率低下等问题。
English Summary: SPARK introduces a cost-effective, plug-and-play framework that enhances TKG forecasting by integrating beam sequence-level generation and TKG adapters to overcome limitations of LLMs like input constraints and inefficiency.

Authors:Xu Yang, Rui Wang, Kaiwen Li, Wenhua Li, Tao Zhang, Fujun He
Title: PlatMetaX: An Integrated MATLAB platform for Meta-Black-Box Optimization
Abstract:
The landscape of optimization problems has become increasingly complex, necessitating the development of advanced optimization techniques. Meta-Black-Box Optimization (MetaBBO), which involves refining the optimization algorithms themselves via meta-learning, has emerged as a promising approach. Recognizing the limitations in existing platforms, we presents PlatMetaX, a novel MATLAB platform for MetaBBO with reinforcement learning. PlatMetaX integrates the strengths of MetaBox and PlatEMO, offering a comprehensive framework for developing, evaluating, and comparing optimization algorithms. The platform is designed to handle a wide range of optimization problems, from single-objective to multi-objective, and is equipped with a rich set of baseline algorithms and evaluation metrics. We demonstrate the utility of PlatMetaX through extensive experiments and provide insights into its design and implementation. PlatMetaX is available at: \href{https://github.com/Yxxx616/PlatMetaX}{https://github.com/Yxxx616/PlatMetaX}.
Chinese Summary: PlatMetaX是一个新颖的MATLAB平台,它整合了MetaBox和PlatEMO的优势,为基于强化学习的元黑盒优化算法开发、评估和比较提供了全面框架,并通过大量实验验证了其实用性。
English Summary: PlatMetaX is a novel MATLAB platform that integrates MetaBox and PlatEMO to provide a comprehensive framework for developing, evaluating, and comparing meta-black-box optimization algorithms using reinforcement learning, demonstrated through extensive experiments.

Authors:Sarah Martinson, Lingkai Kong, Cheol Woo Kim, Aparna Taneja, Milind Tambe
Title: LLM-based Agent Simulation for Maternal Health Interventions: Uncertainty Estimation and Decision-focused Evaluation
Abstract:
Agent-based simulation is crucial for modeling complex human behavior, yet traditional approaches require extensive domain knowledge and large datasets. In data-scarce healthcare settings where historic and counterfactual data are limited, large language models (LLMs) offer a promising alternative by leveraging broad world knowledge. This study examines an LLM-driven simulation of a maternal mobile health program, predicting beneficiaries' listening behavior when they receive health information via automated messages (control) or live representatives (intervention). Since uncertainty quantification is critical for decision-making in health interventions, we propose an LLM epistemic uncertainty estimation method based on binary entropy across multiple samples. We enhance model robustness through ensemble approaches, improving F1 score and model calibration compared to individual models. Beyond direct evaluation, we take a decision-focused approach, demonstrating how LLM predictions inform intervention feasibility and trial implementation in data-limited settings. The proposed method extends to public health, disaster response, and other domains requiring rapid intervention assessment under severe data constraints. All code and prompts used for this work can be found at https://github.com/sarahmart/LLM-ABS-ARMMAN-prediction.
中文: 本研究提出了一种基于大语言模型的模拟方法,用于预测母婴健康项目中受益者的行为,通过认知不确定性估计和集成技术,在数据稀缺的医疗环境中提升决策能力。
English: This study introduces an LLM-based simulation method for predicting beneficiary behavior in a maternal health program, incorporating epistemic uncertainty estimation and ensemble techniques to enhance decision-making in data-scarce healthcare settings.

Authors:Weiqi Li, Xuanyu Zhang, Shijie Zhao, Yabin Zhang, Junlin Li, Li Zhang, Jian Zhang
Title: Q-Insight: Understanding Image Quality via Visual Reinforcement Learning
Abstract:
Image quality assessment (IQA) focuses on the perceptual visual quality of images, playing a crucial role in downstream tasks such as image reconstruction, compression, and generation. The rapid advancement of multi-modal large language models (MLLMs) has significantly broadened the scope of IQA, moving toward comprehensive image quality understanding that incorporates content analysis, degradation perception, and comparison reasoning beyond mere numerical scoring. Previous MLLM-based methods typically either generate numerical scores lacking interpretability or heavily rely on supervised fine-tuning (SFT) using large-scale annotated datasets to provide descriptive assessments, limiting their flexibility and applicability. In this paper, we propose Q-Insight, a reinforcement learning-based model built upon group relative policy optimization (GRPO), which demonstrates strong visual reasoning capability for image quality understanding while requiring only a limited amount of rating scores and degradation labels. By jointly optimizing score regression and degradation perception tasks with carefully designed reward functions, our approach effectively exploits their mutual benefits for enhanced performance. Extensive experiments demonstrate that Q-Insight substantially outperforms existing state-of-the-art methods in both score regression and degradation perception tasks, while exhibiting impressive zero-shot generalization to comparison reasoning tasks. Code will be available at https://github.com/lwq20020127/Q-Insight.
Chinese: 本文提出Q-Insight模型,采用基于群体相对策略优化的强化学习方法,通过少量标注数据联合优化评分回归与退化感知任务,在图像质量评估中显著超越现有方法并展现出优秀的零样本推理能力。
English: This paper introduces Q-Insight, a reinforcement learning model using group relative policy optimization that advances image quality assessment by integrating score regression and degradation perception with minimal supervision, outperforming existing methods and showing strong zero-shot generalization.

Authors:Belinda Z. Li, Been Kim, Zi Wang
Title: QuestBench: Can LLMs ask the right question to acquire information in reasoning tasks?
Abstract:
Recently, a large amount of work has focused on improving large language models' (LLMs') performance on reasoning benchmarks such as math and logic. However, past work has largely assumed that tasks are well-defined. In the real world, queries to LLMs are often underspecified, only solvable through acquiring missing information. We formalize this as a constraint satisfaction problem (CSP) with missing variable assignments. Using a special case of this formalism where only one necessary variable assignment is missing, we can rigorously evaluate an LLM's ability to identify the minimal necessary question to ask and quantify axes of difficulty levels for each problem. We present QuestBench, a set of underspecified reasoning tasks solvable by asking at most one question, which includes: (1) Logic-Q: Logical reasoning tasks with one missing proposition, (2) Planning-Q: PDDL planning problems with initial states that are partially-observed, (3) GSM-Q: Human-annotated grade school math problems with one missing variable assignment, and (4) GSME-Q: a version of GSM-Q where word problems are translated into equations by human annotators. The LLM is tasked with selecting the correct clarification question(s) from a list of options. While state-of-the-art models excel at GSM-Q and GSME-Q, their accuracy is only 40-50% on Logic-Q and Planning-Q. Analysis demonstrates that the ability to solve well-specified reasoning problems may not be sufficient for success on our benchmark: models have difficulty identifying the right question to ask, even when they can solve the fully specified version of the problem. Furthermore, in the Planning-Q domain, LLMs tend not to hedge, even when explicitly presented with the option to predict ``not sure.'' This highlights the need for deeper investigation into models' information acquisition capabilities.
中文: 最新研究提出QuestBench基准,用于评估大语言模型在信息不全的推理任务中识别最小必要问题的能力,发现即使最先进的模型在逻辑和规划问题上表现不佳,尽管它们在明确定义的任务中表现出色。
English: Recent research introduces QuestBench, a benchmark evaluating LLMs' ability to identify minimal necessary questions in underspecified reasoning tasks, revealing that even state-of-the-art models struggle with logic and planning problems despite excelling in well-defined scenarios.

Authors:Jianguo Zhang, Thai Hoang, Ming Zhu, Zuxin Liu, Shiyu Wang, Tulika Awalgaonkar, Akshara Prabhakar, Haolin Chen, Weiran Yao, Zhiwei Liu, Juntao Tan, Juan Carlos Niebles, Shelby Heinecke, Huan Wang, Silvio Savarese, Caiming Xiong
Title: ActionStudio: A Lightweight Framework for Data and Training of Large Action Models
Abstract:
Large Action models are essential for enabling autonomous agents to perform complex tasks. However, training such models remains challenging due to the diversity of agent environments and the complexity of noisy agentic data. Existing infrastructure offers limited support for scalable, agent-specific fine-tuning and standardized agent data processing. We introduce ActionStudio, a lightweight and extensible data and training framework designed for large action models. ActionStudio unifies diverse agent trajectories using our proposed Unified Format 2.0, supports a range of training workflows with optimized multi-node distributed setup, and integrates robust preprocessing and real-time verification tools. ActionStudio demonstrates up to 9x higher throughput compared to existing agentic training frameworks, and our trained models yield top performances across public and realistic agent benchmarks. To support the broader research community, we open-source the ActionStudio framework and release actionstudio-98k, a curated dataset of 98k high-quality trajectories. Code: https://github.com/SalesforceAIResearch/xLAM.
中文摘要:ActionStudio是一个轻量级可扩展框架,通过统一智能体轨迹、优化分布式训练流程,将吞吐量提升高达9倍,并开源了工具和包含9.8万条轨迹的数据集,显著提升大动作模型的训练效率。
English Summary: ActionStudio is a lightweight and extensible framework that enhances large action model training by unifying agent trajectories, optimizing distributed workflows, and achieving up to 9x higher throughput while releasing open-source tools and a 98k trajectory dataset.

Authors:Francesca Pezzuti, Sean MacAvaney, Nicola Tonellotto
Title: Exploring the Effectiveness of Multi-stage Fine-tuning for Cross-encoder Re-rankers
Abstract:
State-of-the-art cross-encoders can be fine-tuned to be highly effective in passage re-ranking. The typical fine-tuning process of cross-encoders as re-rankers requires large amounts of manually labelled data, a contrastive learning objective, and a set of heuristically sampled negatives. An alternative recent approach for fine-tuning instead involves teaching the model to mimic the rankings of a highly effective large language model using a distillation objective. These fine-tuning strategies can be applied either individually, or in sequence. In this work, we systematically investigate the effectiveness of point-wise cross-encoders when fine-tuned independently in a single stage, or sequentially in two stages. Our experiments show that the effectiveness of point-wise cross-encoders fine-tuned using contrastive learning is indeed on par with that of models fine-tuned with multi-stage approaches. Code is available for reproduction at https://github.com/fpezzuti/multistage-finetuning.
中文: 通过单阶段对比学习或多阶段方法微调的交叉编码器在段落重排序任务中表现出相当的效能,相关代码已开源供复现。
English: Fine-tuning cross-encoders for passage re-ranking achieves comparable effectiveness whether using single-stage contrastive learning or multi-stage approaches, with code available for reproduction.

Authors:Xiaomin Yu, Pengxiang Ding, Wenjie Zhang, Siteng Huang, Songyang Gao, Chengwei Qin, Kejian Wu, Zhaoxin Fan, Ziyue Qiao, Donglin Wang
Title: Unicorn: Text-Only Data Synthesis for Vision Language Model Training
Abstract:
Training vision-language models (VLMs) typically requires large-scale, high-quality image-text pairs, but collecting or synthesizing such data is costly. In contrast, text data is abundant and inexpensive, prompting the question: can high-quality multimodal training data be synthesized purely from text? To tackle this, we propose a cross-integrated three-stage multimodal data synthesis framework, which generates two datasets: Unicorn-1.2M and Unicorn-471K-Instruction. In Stage 1: Diverse Caption Data Synthesis, we construct 1.2M semantically diverse high-quality captions by expanding sparse caption seeds using large language models (LLMs). In Stage 2: Instruction-Tuning Data Generation, we further process 471K captions into multi-turn instruction-tuning tasks to support complex reasoning. Finally, in Stage 3: Modality Representation Transfer, these textual captions representations are transformed into visual representations, resulting in diverse synthetic image representations. This three-stage process enables us to construct Unicorn-1.2M for pretraining and Unicorn-471K-Instruction for instruction-tuning, without relying on real images. By eliminating the dependency on real images while maintaining data quality and diversity, our framework offers a cost-effective and scalable solution for VLMs training. Code is available at https://github.com/Yu-xm/Unicorn.git.
中文: 该研究提出三阶段框架,仅通过文本生成高质量多模态训练数据,无需真实图像即可构建用于预训练的Unicorn-1.2M和指令调优的Unicorn-471K-Instruction数据集,为视觉语言模型训练提供了经济高效的解决方案。
English: The proposed three-stage framework synthesizes high-quality multimodal training data from text alone, generating Unicorn-1.2M for pretraining and Unicorn-471K-Instruction for instruction-tuning without real images, offering a cost-effective solution for vision-language model training.

Authors:Zhendi Gong, Susan Francis, Eleanor Cox, Stamatios N. Sotiropoulos, Dorothee P. Auer, Guoping Qiu, Andrew P. French, Xin Chen
Title: MO-CTranS: A unified multi-organ segmentation model learning from multiple heterogeneously labelled datasets
Abstract:
Multi-organ segmentation holds paramount significance in many clinical tasks. In practice, compared to large fully annotated datasets, multiple small datasets are often more accessible and organs are not labelled consistently. Normally, an individual model is trained for each of these datasets, which is not an effective way of using data for model learning. It remains challenging to train a single model that can robustly learn from several partially labelled datasets due to label conflict and data imbalance problems. We propose MO-CTranS: a single model that can overcome such problems. MO-CTranS contains a CNN-based encoder and a Transformer-based decoder, which are connected in a multi-resolution manner. Task-specific tokens are introduced in the decoder to help differentiate label discrepancies. Our method was evaluated and compared to several baseline models and state-of-the-art (SOTA) solutions on abdominal MRI datasets that were acquired in different views (i.e. axial and coronal) and annotated for different organs (i.e. liver, kidney, spleen). Our method achieved better performance (most were statistically significant) than the compared methods. Github link: https://github.com/naisops/MO-CTranS.
中文: 多器官分割在临床任务中至关重要,但利用多个部分标注数据集训练单一模型常面临标签冲突和数据不平衡的挑战,而提出的MO-CTranS模型通过结合CNN与Transformer架构并引入任务特定标记,在腹部MRI数据上相比基线方法和现有最优方案取得了更优性能。
English: Multi-organ segmentation is crucial in clinical practice, yet training a single model on multiple partially labeled datasets faces challenges like label conflicts and data imbalance, which the proposed MO-CTranS model overcomes by integrating CNN and Transformer architectures with task-specific tokens, achieving superior performance on abdominal MRI datasets compared to baseline and SOTA methods.

Authors:Jing Li, Hao Sun
Title: Learnable cut flow for high energy physics
Abstract:
Neural networks have emerged as a powerful paradigm for tasks in high energy physics, yet their opaque training process renders them as a black box. In contrast, the traditional cut flow method offers simplicity and interpretability but requires extensive manual tuning to identify optimal cut boundaries. To merge the strengths of both approaches, we propose the Learnable Cut Flow (LCF), a neural network that transforms the traditional cut selection into a fully differentiable, data-driven process. LCF implements two cut strategies-parallel, where observable distributions are treated independently, and sequential, where prior cuts shape subsequent ones-to flexibly determine optimal boundaries. Building on this strategy, we introduce the Learnable Importance, a metric that quantifies feature importance and adjusts their contributions to the loss accordingly, offering model-driven insights unlike ad-hoc metrics. To ensure differentiability, a modified loss function replaces hard cuts with mask operations, preserving data shape throughout the training process. LCF is tested on six varied mock datasets and a realistic diboson vs. QCD dataset. Results demonstrate that LCF 1. accurately learns cut boundaries across typical feature distributions in both parallel and sequential strategies, 2. assigns higher importance to discriminative features with minimal overlap, 3. handles redundant or correlated features robustly, and 4. performs effectively in real-world scenarios. In the diboson dataset, LCF initially underperforms boosted decision trees and multiplayer perceptrons when using all observables. LCF bridges the gap between traditional cut flow method and modern black-box neural networks, delivering actionable insights into the training process and feature importance. Source code and experimental data are available at https://github.com/Star9daisy/learnable-cut-flow.
Chinese: 可学习截断流(LCF)方法将传统的截断选择转化为可微分的数据驱动过程,结合了截断流的可解释性与神经网络的高效性,并通过实际测试提供了对特征重要性的深入理解。
English: The Learnable Cut Flow (LCF) method transforms traditional cut selection into a differentiable, data-driven process, combining the interpretability of cut flows with neural network efficiency and providing insights into feature importance through real-world testing.

Authors:Jing Li, Hao Sun
Title: Learnable cut flow for high energy physics
Abstract:
Neural networks have emerged as a powerful paradigm for tasks in high energy physics, yet their opaque training process renders them as a black box. In contrast, the traditional cut flow method offers simplicity and interpretability but requires extensive manual tuning to identify optimal cut boundaries. To merge the strengths of both approaches, we propose the Learnable Cut Flow (LCF), a neural network that transforms the traditional cut selection into a fully differentiable, data-driven process. LCF implements two cut strategies-parallel, where observable distributions are treated independently, and sequential, where prior cuts shape subsequent ones-to flexibly determine optimal boundaries. Building on this strategy, we introduce the Learnable Importance, a metric that quantifies feature importance and adjusts their contributions to the loss accordingly, offering model-driven insights unlike ad-hoc metrics. To ensure differentiability, a modified loss function replaces hard cuts with mask operations, preserving data shape throughout the training process. LCF is tested on six varied mock datasets and a realistic diboson vs. QCD dataset. Results demonstrate that LCF 1. accurately learns cut boundaries across typical feature distributions in both parallel and sequential strategies, 2. assigns higher importance to discriminative features with minimal overlap, 3. handles redundant or correlated features robustly, and 4. performs effectively in real-world scenarios. In the diboson dataset, LCF initially underperforms boosted decision trees and multiplayer perceptrons when using all observables. LCF bridges the gap between traditional cut flow method and modern black-box neural networks, delivering actionable insights into the training process and feature importance. Source code and experimental data are available at https://github.com/Star9daisy/learnable-cut-flow.
Chinese: 可学习截断流(LCF)方法将传统的截断选择转化为可微分的数据驱动过程,结合了截断流的可解释性与神经网络的高效性,并通过实际测试提供了对特征重要性的深入理解。
English: The Learnable Cut Flow (LCF) method transforms traditional cut selection into a differentiable, data-driven process, combining the interpretability of cut flows with neural network efficiency and providing insights into feature importance through real-world testing.

Authors:Reza Nematirad, Anil Pahwa, Balasubramaniam Natarajan
Title: SPDNet: Seasonal-Periodic Decomposition Network for Advanced Residential Demand Forecasting
Abstract:
Residential electricity demand forecasting is critical for efficient energy management and grid stability. Accurate predictions enable utility companies to optimize planning and operations. However, real-world residential electricity demand data often exhibit intricate temporal variability, including multiple seasonalities, periodicities, and abrupt fluctuations, which pose significant challenges for forecasting models. Previous models that rely on statistical methods, recurrent, convolutional neural networks, and transformers often struggle to capture these intricate temporal dynamics. To address these challenges, we propose the Seasonal-Periodic Decomposition Network (SPDNet), a novel deep learning framework consisting of two main modules. The first is the Seasonal-Trend Decomposition Module (STDM), which decomposes the input data into trend, seasonal, and residual components. The second is the Periodical Decomposition Module (PDM), which employs the Fast Fourier Transform to identify the dominant periods. For each dominant period, 1D input data is reshaped into a 2D tensor, where rows represent periods and columns correspond to frequencies. The 2D representations are then processed through three submodules: a 1D convolution to capture sharp fluctuations, a transformer-based encoder to model global patterns, and a 2D convolution to capture interactions between periods. Extensive experiments conducted on real-world residential electricity load data demonstrate that SPDNet outperforms traditional and advanced models in both forecasting accuracy and computational efficiency. The code is available in this repository: https://github.com/Tims2D/SPDNet.
中文: 提出的季节性周期分解网络(SPDNet)通过将数据分解为季节性、趋势和周期分量,有效应对住宅电力需求的复杂时序变化,在预测精度和计算效率上均优于现有模型。
English: The proposed Seasonal-Periodic Decomposition Network (SPDNet) effectively addresses complex temporal variations in residential electricity demand by decomposing data into seasonal, trend, and periodic components, outperforming existing models in accuracy and efficiency.

Authors:Diego Coello de Portugal Mecke, Haya Alyoussef, Maximilian Stubbemann, Ilia Koloiarov, Tom Hanika, Lars Schmidt-Thieme
Title: STADE: Standard Deviation as a Pruning Metric
Abstract:
Recently, Large Language Models (LLMs) have become very widespread and are used to solve a wide variety of tasks. To successfully handle these tasks, LLMs require longer training times and larger model sizes. This makes LLMs ideal candidates for pruning methods that reduce computational demands while maintaining performance. Previous methods require a retraining phase after pruning to maintain the original model's performance. However, state-of-the-art pruning methods, such as Wanda, prune the model without retraining, making the pruning process faster and more efficient. Building upon Wanda's work, this study provides a theoretical explanation of why the method is effective and leverages these insights to enhance the pruning process. Specifically, a theoretical analysis of the pruning problem reveals a common scenario in Machine Learning where Wanda is the optimal pruning method. Furthermore, this analysis is extended to cases where Wanda is no longer optimal, leading to the development of a new method, STADE, based on the standard deviation of the input. From a theoretical standpoint, STADE demonstrates better generality across different scenarios. Finally, extensive experiments on Llama and Open Pre-trained Transformers (OPT) models validate these theoretical findings, showing that depending on the training conditions, Wanda's optimal performance varies as predicted by the theoretical framework. These insights contribute to a more robust understanding of pruning strategies and their practical implications. Code is available at: https://github.com/Coello-dev/STADE/
Chinese: 本研究在Wanda剪枝方法的基础上,从理论上解释了其有效性,并提出基于输入标准差的新方法STADE,该方法具有更广泛的适用性,并通过在Llama和OPT等模型上的实验验证了理论发现。
English: This study builds on the Wanda pruning method by providing a theoretical justification for its effectiveness and introduces STADE, a new pruning approach based on input standard deviation that offers broader applicability and is validated through experiments on models like Llama and OPT.

Authors:Wei-Jin Huang, Yuan-Ming Li, Zhi-Wei Xia, Yu-Ming Tang, Kun-Yu Lin, Jian-Fang Hu, Wei-Shi Zheng
Title: Modeling Multiple Normal Action Representations for Error Detection in Procedural Tasks
Abstract:
Error detection in procedural activities is essential for consistent and correct outcomes in AR-assisted and robotic systems. Existing methods often focus on temporal ordering errors or rely on static prototypes to represent normal actions. However, these approaches typically overlook the common scenario where multiple, distinct actions are valid following a given sequence of executed actions. This leads to two issues: (1) the model cannot effectively detect errors using static prototypes when the inference environment or action execution distribution differs from training; and (2) the model may also use the wrong prototypes to detect errors if the ongoing action label is not the same as the predicted one. To address this problem, we propose an Adaptive Multiple Normal Action Representation (AMNAR) framework. AMNAR predicts all valid next actions and reconstructs their corresponding normal action representations, which are compared against the ongoing action to detect errors. Extensive experiments demonstrate that AMNAR achieves state-of-the-art performance, highlighting the effectiveness of AMNAR and the importance of modeling multiple valid next actions in error detection. The code is available at https://github.com/iSEE-Laboratory/AMNAR.
Chinese: 提出的自适应多正常动作表示(AMNAR)框架通过动态预测所有有效后续动作并与当前动作进行比较,解决了现有错误检测方法的局限性,大量实验证明其实现了最先进的性能。
English: The proposed Adaptive Multiple Normal Action Representation (AMNAR) framework addresses limitations in existing error detection methods by dynamically predicting all valid next actions and comparing them with ongoing actions, achieving state-of-the-art performance through extensive experiments.

Authors:Ada Gorgun, Bernt Schiele, Jonas Fischer
Title: VITAL: More Understandable Feature Visualization through Distribution Alignment and Relevant Information Flow
Abstract:
Neural networks are widely adopted to solve complex and challenging tasks. Especially in high-stakes decision-making, understanding their reasoning process is crucial, yet proves challenging for modern deep networks. Feature visualization (FV) is a powerful tool to decode what information neurons are responding to and hence to better understand the reasoning behind such networks. In particular, in FV we generate human-understandable images that reflect the information detected by neurons of interest. However, current methods often yield unrecognizable visualizations, exhibiting repetitive patterns and visual artifacts that are hard to understand for a human. To address these problems, we propose to guide FV through statistics of real image features combined with measures of relevant network flow to generate prototypical images. Our approach yields human-understandable visualizations that both qualitatively and quantitatively improve over state-of-the-art FVs across various architectures. As such, it can be used to decode which information the network uses, complementing mechanistic circuits that identify where it is encoded. Code is available at: https://github.com/adagorgun/VITAL
神经网络在复杂任务中至关重要,但其推理过程难以解释;我们的方法通过利用真实图像统计和网络流来改进特征可视化,生成更清晰、更易理解的图像,性能优于现有技术。
Neural networks are crucial for complex tasks, but their reasoning is hard to interpret; our method improves feature visualization by using real image statistics and network flow to generate clearer, more understandable images that outperform current techniques.

Authors:Jiahao Xia, Min Xu, Wenjian Huang, Jianguo Zhang, Haimin Zhang, Chunxia Xiao
Title: Mitigating Knowledge Discrepancies among Multiple Datasets for Task-agnostic Unified Face Alignment
Abstract:
Despite the similar structures of human faces, existing face alignment methods cannot learn unified knowledge from multiple datasets with different landmark annotations. The limited training samples in a single dataset commonly result in fragile robustness in this field. To mitigate knowledge discrepancies among different datasets and train a task-agnostic unified face alignment (TUFA) framework, this paper presents a strategy to unify knowledge from multiple datasets. Specifically, we calculate a mean face shape for each dataset. To explicitly align these mean shapes on an interpretable plane based on their semantics, each shape is then incorporated with a group of semantic alignment embeddings. The 2D coordinates of these aligned shapes can be viewed as the anchors of the plane. By encoding them into structure prompts and further regressing the corresponding facial landmarks using image features, a mapping from the plane to the target faces is finally established, which unifies the learning target of different datasets. Consequently, multiple datasets can be utilized to boost the generalization ability of the model. The successful mitigation of discrepancies also enhances the efficiency of knowledge transferring to a novel dataset, significantly boosts the performance of few-shot face alignment. Additionally, the interpretable plane endows TUFA with a task-agnostic characteristic, enabling it to locate landmarks unseen during training in a zero-shot manner. Extensive experiments are carried on seven benchmarks and the results demonstrate an impressive improvement in face alignment brought by knowledge discrepancies mitigation. The code is available at https://github.com/Jiahao-UTS/TUFA.
中文摘要:本文提出了一种任务无关的统一人脸对齐(TUFA)框架,通过将不同数据集的平均人脸形状在可解释平面上对齐来消除知识差异,从而提升模型的泛化能力、小样本学习效果,并具备零样本定位未训练关键点的能力。
English Summary: This paper introduces a Task-agnostic Unified Face Alignment (TUFA) framework that mitigates knowledge discrepancies across datasets by aligning mean face shapes on an interpretable plane, enabling improved generalization, few-shot learning, and zero-shot capability for unseen landmarks.

Authors:Yubo Li, Yidi Miao, Xueying Ding, Ramayya Krishnan, Rema Padman
Title: Firm or Fickle? Evaluating Large Language Models Consistency in Sequential Interactions
Abstract:
Large Language Models (LLMs) have shown remarkable capabilities across various tasks, but their deployment in high-stake domains requires consistent and coherent behavior across multiple rounds of user interaction. This paper introduces a comprehensive framework for evaluating and improving LLM response consistency, making three key contributions. Code and data are available at: https://github.com/yubol-bobo/MT-Consistency. First, we introduce Position-Weighted Consistency (PWC), a metric designed to capture both the importance of early-stage stability and recovery patterns in multi-turn interactions. Second, we present MT-Consistency, a carefully curated benchmark dataset spanning diverse domains and difficulty levels, specifically designed to evaluate LLM consistency under various challenging follow-up scenarios. Third, we introduce Confidence-Aware Response Generation (CARG), a framework that significantly improves response stability by explicitly integrating internal model confidence scores during the generation process. Experimental results demonstrate that CARG significantly improves response stability without sacrificing accuracy, offering a practical path toward more dependable LLM behavior in critical, real-world deployments.
中文摘要:本文提出了一个包含三项关键创新的综合框架——新型一致性评估指标、专业基准数据集和置信度感知生成方法,可在不牺牲准确性的前提下显著提升大语言模型在多轮对话中的稳定性和可靠性。
English Summary: This paper presents a comprehensive framework with three key innovations—a novel consistency metric, a specialized benchmark dataset, and a confidence-aware generation method—to significantly enhance the stability and reliability of Large Language Models in multi-turn interactions without compromising accuracy.

Authors:Zhihang Lin, Mingbao Lin, Yuan Xie, Rongrong Ji
Title: CPPO: Accelerating the Training of Group Relative Policy Optimization-Based Reasoning Models
Abstract:
This paper introduces Completion Pruning Policy Optimization (CPPO) to accelerate the training of reasoning models based on Group Relative Policy Optimization (GRPO). GRPO, while effective, incurs high training costs due to the need for sampling multiple completions for each question. Our experiment and theoretical analysis reveals that the number of completions impacts model accuracy yet increases training time multiplicatively, and not all completions contribute equally to policy training -- their contribution depends on their relative advantage. To address these issues, we propose CPPO, which prunes completions with low absolute advantages, significantly reducing the number needed for gradient calculation and updates. Additionally, we introduce a dynamic completion allocation strategy to maximize GPU utilization by incorporating additional questions, further enhancing training efficiency. Experimental results demonstrate that CPPO achieves up to $8.32\times$ speedup on GSM8K and $3.51\times$ on Math while preserving or even enhancing the accuracy compared to the original GRPO. We release our code at https://github.com/lzhxmu/CPPO.
Chinese: 本文提出完成剪枝策略优化(CPPO),通过剪枝低优势完成项并动态分配问题,在GSM8K上实现最高8.32倍加速,同时保持甚至提升相较于GRPO的准确率。
English: This paper proposes Completion Pruning Policy Optimization (CPPO), a method that accelerates reasoning model training by pruning low-advantage completions and dynamically allocating questions, achieving up to 8.32× speedup on GSM8K while maintaining accuracy compared to GRPO.

Authors:Louis Owen, Nilabhra Roy Chowdhury, Abhay Kumar, Fabian Güra
Title: A Refined Analysis of Massive Activations in LLMs
Abstract:
Motivated in part by their relevance for low-precision training and quantization, massive activations in large language models (LLMs) have recently emerged as a topic of interest. However, existing analyses are limited in scope, and generalizability across architectures is unclear. This paper helps address some of these gaps by conducting an analysis of massive activations across a broad range of LLMs, including both GLU-based and non-GLU-based architectures. Our findings challenge several prior assumptions, most importantly: (1) not all massive activations are detrimental, i.e. suppressing them does not lead to an explosion of perplexity or a collapse in downstream task performance; (2) proposed mitigation strategies such as Attention KV bias are model-specific and ineffective in certain cases. We consequently investigate novel hybrid mitigation strategies; in particular pairing Target Variance Rescaling (TVR) with Attention KV bias or Dynamic Tanh (DyT) successfully balances the mitigation of massive activations with preserved downstream model performance in the scenarios we investigated. Our code is available at: https://github.com/bluorion-com/refine_massive_activations.
中文: 本研究分析了多种大型语言模型中的大规模激活现象,挑战了先前假设,指出并非所有此类激活都有害且现有缓解策略具有模型特异性,同时提出了结合目标方差重缩放与注意力KV偏置或动态Tanh的混合方法,能在不影响性能的情况下有效管理激活。
English: This study analyzes massive activations across various large language models, challenging previous assumptions by showing that not all are harmful and that existing mitigation strategies are model-specific, while proposing hybrid approaches like combining Target Variance Rescaling with Attention KV bias or Dynamic Tanh to effectively manage activations without compromising performance.

Authors:Yancong Lin, Shiming Wang, Liangliang Nan, Julian Kooij, Holger Caesar
Title: VoteFlow: Enforcing Local Rigidity in Self-Supervised Scene Flow
Abstract:
Scene flow estimation aims to recover per-point motion from two adjacent LiDAR scans. However, in real-world applications such as autonomous driving, points rarely move independently of others, especially for nearby points belonging to the same object, which often share the same motion. Incorporating this locally rigid motion constraint has been a key challenge in self-supervised scene flow estimation, which is often addressed by post-processing or appending extra regularization. While these approaches are able to improve the rigidity of predicted flows, they lack an architectural inductive bias for local rigidity within the model structure, leading to suboptimal learning efficiency and inferior performance. In contrast, we enforce local rigidity with a lightweight add-on module in neural network design, enabling end-to-end learning. We design a discretized voting space that accommodates all possible translations and then identify the one shared by nearby points by differentiable voting. Additionally, to ensure computational efficiency, we operate on pillars rather than points and learn representative features for voting per pillar. We plug the Voting Module into popular model designs and evaluate its benefit on Argoverse 2 and Waymo datasets. We outperform baseline works with only marginal compute overhead. Code is available at https://github.com/tudelft-iv/VoteFlow.
中文摘要: 本文提出一种轻量级投票模块,通过离散化投票空间和可微分投票机制在场景流估计中强化局部刚性约束,以微小计算代价在自动驾驶数据集上实现性能提升。
English Summary: This paper introduces a lightweight Voting Module that enforces local rigidity in scene flow estimation through a discretized voting space and differentiable voting, improving performance on autonomous driving datasets with minimal computational overhead.

Authors:Ruiguang Pei, Junjie Wu, Dan Peng, Min Fang, Jianan Zhang, Zhihui Fu, Jun Wang
Title: SimDC: A High-Fidelity Device Simulation Platform for Device-Cloud Collaborative Computing
Abstract:
The advent of edge intelligence and escalating concerns for data privacy protection have sparked a surge of interest in device-cloud collaborative computing. Large-scale device deployments to validate prototype solutions are often prohibitively expensive and practically challenging, resulting in a pronounced demand for simulation tools that can emulate realworld scenarios. However, existing simulators predominantly rely solely on high-performance servers to emulate edge computing devices, overlooking (1) the discrepancies between virtual computing units and actual heterogeneous computing devices and (2) the simulation of device behaviors in real-world environments. In this paper, we propose a high-fidelity device simulation platform, called SimDC, which uses a hybrid heterogeneous resource and integrates high-performance servers and physical mobile phones. Utilizing this platform, developers can simulate numerous devices for functional testing cost-effectively and capture precise operational responses from varied real devices. To simulate real behaviors of heterogeneous devices, we offer a configurable device behavior traffic controller that dispatches results on devices to the cloud using a user-defined operation strategy. Comprehensive experiments on the public dataset show the effectiveness of our simulation platform and its great potential for application. The code is available at https://github.com/opas-lab/olearning-sim.
Chinese: 本文提出了SimDC高保真设备仿真平台,通过整合服务器和物理手机来精确模拟异构设备及其真实行为,弥补现有仿真工具的不足,实现经济高效的功能测试。
English: The paper introduces SimDC, a high-fidelity device simulation platform that integrates servers and physical mobile phones to accurately emulate heterogeneous devices and their real-world behaviors, addressing gaps in existing simulators and enabling cost-effective testing.

Authors:Xinghua Liu, Ming Cao
Title: Robust simultaneous UWB-anchor calibration and robot localization for emergency situations
Abstract:
In this work, we propose a factor graph optimization (FGO) framework to simultaneously solve the calibration problem for Ultra-WideBand (UWB) anchors and the robot localization problem. Calibrating UWB anchors manually can be time-consuming and even impossible in emergencies or those situations without special calibration tools. Therefore, automatic estimation of the anchor positions becomes a necessity. The proposed method enables the creation of a soft sensor providing the position information of the anchors in a UWB network. This soft sensor requires only UWB and LiDAR measurements measured from a moving robot. The proposed FGO framework is suitable for the calibration of an extendable large UWB network. Moreover, the anchor calibration problem and robot localization problem can be solved simultaneously, which saves time for UWB network deployment. The proposed framework also helps to avoid artificial errors in the UWB-anchor position estimation and improves the accuracy and robustness of the robot-pose. The experimental results of the robot localization using LiDAR and a UWB network in a 3D environment are discussed, demonstrating the performance of the proposed method. More specifically, the anchor calibration problem with four anchors and the robot localization problem can be solved simultaneously and automatically within 30 seconds by the proposed framework. The supplementary video and codes can be accessed via https://github.com/LiuxhRobotAI/Simultaneous_calibration_localization.
中文: 本研究提出了一种因子图优化框架,仅通过移动机器人的激光雷达和超宽带测量数据,即可在30秒内同步完成四锚点系统的自动校准与机器人定位,实现了高效部署。
English: This study introduces a factor graph optimization framework that simultaneously calibrates Ultra-WideBand anchor positions and performs robot localization using only LiDAR and UWB measurements from a moving robot, achieving full automation within 30 seconds for a four-anchor setup.

Authors:Dongping Liao, Xitong Gao, Yabo Xu, Chengzhong Xu
Title: FLIP: Towards Comprehensive and Reliable Evaluation of Federated Prompt Learning
Abstract:
The increasing emphasis on privacy and data security has driven the adoption of federated learning, a decentralized approach to train machine learning models without sharing raw data. Prompt learning, which fine-tunes prompt embeddings of pretrained models, offers significant advantages in federated settings by reducing computational costs and communication overheads while leveraging the strong performance and generalization capabilities of vision-language models such as CLIP. This paper addresses the intersection of federated learning and prompt learning, particularly for vision-language models. In this work, we introduce a comprehensive framework, named FLIP, to evaluate federated prompt learning algorithms. FLIP assesses the performance of 8 state-of-the-art federated prompt learning methods across 4 federated learning protocols and 12 open datasets, considering 6 distinct evaluation scenarios. Our findings demonstrate that prompt learning maintains strong generalization performance in both in-distribution and out-of-distribution settings with minimal resource consumption. This work highlights the effectiveness of federated prompt learning in environments characterized by data scarcity, unseen classes, and cross-domain distributional shifts. We open-source the code for all implemented algorithms in FLIP to facilitate further research in this domain.
中文: 联邦提示学习通过FLIP框架展示了其在保护隐私的同时,利用预训练视觉语言模型以最少资源消耗实现高效训练,并在多种数据场景下保持强大的泛化能力。
English: Federated prompt learning, exemplified by the FLIP framework, enables efficient and privacy-preserving model training by leveraging pre-trained vision-language models with minimal resource usage while maintaining robust generalization across diverse data scenarios.

Authors:Jaewoo Jeong, Seohee Lee, Daehee Park, Giwon Lee, Kuk-Jin Yoon
Title: Multi-modal Knowledge Distillation-based Human Trajectory Forecasting
Abstract:
Pedestrian trajectory forecasting is crucial in various applications such as autonomous driving and mobile robot navigation. In such applications, camera-based perception enables the extraction of additional modalities (human pose, text) to enhance prediction accuracy. Indeed, we find that textual descriptions play a crucial role in integrating additional modalities into a unified understanding. However, online extraction of text requires the use of VLM, which may not be feasible for resource-constrained systems. To address this challenge, we propose a multi-modal knowledge distillation framework: a student model with limited modality is distilled from a teacher model trained with full range of modalities. The comprehensive knowledge of a teacher model trained with trajectory, human pose, and text is distilled into a student model using only trajectory or human pose as a sole supplement. In doing so, we separately distill the core locomotion insights from intra-agent multi-modality and inter-agent interaction. Our generalizable framework is validated with two state-of-the-art models across three datasets on both ego-view (JRDB, SIT) and BEV-view (ETH/UCY) setups, utilizing both annotated and VLM-generated text captions. Distilled student models show consistent improvement in all prediction metrics for both full and instantaneous observations, improving up to ~13%. The code is available at https://github.com/Jaewoo97/KDTF.
Chinese: 该研究提出了一种多模态知识蒸馏框架,通过将融合轨迹、姿态和文本的教师模型知识提炼到仅使用轨迹或姿态的学生模型中,在多个数据集上实现了行人轨迹预测性能高达13%的提升。
English: The study introduces a multi-modal knowledge distillation framework where a student model, limited to trajectory or human pose data, is trained by a teacher model that integrates trajectory, pose, and text, achieving up to 13% improvement in pedestrian trajectory forecasting across various datasets.

Authors:Chanhyuk Lee, Jiho Choi, Chanryeol Lee, Donggyun Kim, Seunghoon Hong
Title: AdaRank: Adaptive Rank Pruning for Enhanced Model Merging
Abstract:
Model merging has emerged as a promising approach for unifying independently fine-tuned models into an integrated framework, significantly enhancing computational efficiency in multi-task learning. Recently, several SVD-based techniques have been introduced to exploit low-rank structures for enhanced merging, but their reliance on such manually designed rank selection often leads to cross-task interference and suboptimal performance. In this paper, we propose AdaRank, a novel model merging framework that adaptively selects the most beneficial singular directions of task vectors to merge multiple models. We empirically show that the dominant singular components of task vectors can cause critical interference with other tasks, and that naive truncation across tasks and layers degrades performance. In contrast, AdaRank dynamically prunes the singular components that cause interference and offers an optimal amount of information to each task vector by learning to prune ranks during test-time via entropy minimization. Our analysis demonstrates that such method mitigates detrimental overlaps among tasks, while empirical results show that AdaRank consistently achieves state-of-the-art performance with various backbones and number of tasks, reducing the performance gap between fine-tuned models to nearly 1%.
中文摘要:AdaRank是一种自适应模型融合框架,通过熵最小化动态选择任务向量的最优奇异分量,有效减少任务间干扰,以接近最优性能的表现将性能差距缩小至约1%。
English Summary: AdaRank is an adaptive model merging framework that dynamically selects optimal singular components from task vectors through entropy minimization, effectively reducing cross-task interference and achieving near state-of-the-art performance with minimal performance gap.

Authors:Zhanke Zhou, Zhaocheng Zhu, Xuan Li, Mikhail Galkin, Xiao Feng, Sanmi Koyejo, Jian Tang, Bo Han
Title: Landscape of Thoughts: Visualizing the Reasoning Process of Large Language Models
Abstract:
Numerous applications of large language models (LLMs) rely on their ability to perform step-by-step reasoning. However, the reasoning behavior of LLMs remains poorly understood, posing challenges to research, development, and safety. To address this gap, we introduce landscape of thoughts-the first visualization tool for users to inspect the reasoning paths of chain-of-thought and its derivatives on any multi-choice dataset. Specifically, we represent the states in a reasoning path as feature vectors that quantify their distances to all answer choices. These features are then visualized in two-dimensional plots using t-SNE. Qualitative and quantitative analysis with the landscape of thoughts effectively distinguishes between strong and weak models, correct and incorrect answers, as well as different reasoning tasks. It also uncovers undesirable reasoning patterns, such as low consistency and high uncertainty. Additionally, users can adapt our tool to a model that predicts the property they observe. We showcase this advantage by adapting our tool to a lightweight verifier that evaluates the correctness of reasoning paths. Empirically, this verifier boosts the accuracy of reasoning as well as the test-time scaling effect. The code is publicly available at: https://github.com/tmlr-group/landscape-of-thoughts.
Chinese Summary: 该研究提出了“思维景观”可视化工具,通过将大语言模型的推理路径转化为二维特征向量图,有效区分模型强弱与答案正误,并可通过适配验证器提升推理准确性和测试扩展效果。
English Summary: The study introduces "landscape of thoughts," a visualization tool that enables users to analyze LLM reasoning paths by mapping them as feature vectors in 2D plots, revealing performance patterns and enhancing model evaluation through an adaptable verifier.

Authors:Ximing Wen, Mallika Mainali, Anik Sen
Title: How Well Can Vison-Language Models Understand Humans' Intention? An Open-ended Theory of Mind Question Evaluation Benchmark
Abstract:
Vision Language Models (VLMs) have demonstrated strong reasoning capabilities in Visual Question Answering (VQA) tasks; however, their ability to perform Theory of Mind (ToM) tasks, such as inferring human intentions, beliefs, and mental states, remains underexplored. We propose an open-ended question framework to evaluate VLMs' performance across diverse categories of ToM tasks. We curated and annotated a benchmark dataset of 30 images and evaluated the performance of four VLMs of varying sizes. Our results show that the GPT-4 model outperformed all the others, with only one smaller model, GPT-4o-mini, achieving comparable performance. We observed that VLMs often struggle to infer intentions in complex scenarios such as bullying or cheating. Our findings reveal that smaller models can sometimes infer correct intentions despite relying on incorrect visual cues. The dataset is available at https://github.com/ximingwen/ToM-AAAI25-Multimodal.
Chinese: 视觉语言模型在视觉问答中表现出色,但在心智理论任务中仍有不足,GPT-4表现最佳,而小型模型有时能通过错误视觉线索推断正确意图。
English: Vision Language Models show promising reasoning in Visual Question Answering but face challenges in Theory of Mind tasks, with GPT-4 leading performance while smaller models sometimes succeed despite incorrect visual cues.

Authors:Ziyue Huang, Hongxi Yan, Qiqi Zhan, Shuai Yang, Mingming Zhang, Chenkai Zhang, YiMing Lei, Zeming Liu, Qingjie Liu, Yunhong Wang
Title: A Survey on Remote Sensing Foundation Models: From Vision to Multimodality
Abstract:
The rapid advancement of remote sensing foundation models, particularly vision and multimodal models, has significantly enhanced the capabilities of intelligent geospatial data interpretation. These models combine various data modalities, such as optical, radar, and LiDAR imagery, with textual and geographic information, enabling more comprehensive analysis and understanding of remote sensing data. The integration of multiple modalities allows for improved performance in tasks like object detection, land cover classification, and change detection, which are often challenged by the complex and heterogeneous nature of remote sensing data. However, despite these advancements, several challenges remain. The diversity in data types, the need for large-scale annotated datasets, and the complexity of multimodal fusion techniques pose significant obstacles to the effective deployment of these models. Moreover, the computational demands of training and fine-tuning multimodal models require significant resources, further complicating their practical application in remote sensing image interpretation tasks. This paper provides a comprehensive review of the state-of-the-art in vision and multimodal foundation models for remote sensing, focusing on their architecture, training methods, datasets and application scenarios. We discuss the key challenges these models face, such as data alignment, cross-modal transfer learning, and scalability, while also identifying emerging research directions aimed at overcoming these limitations. Our goal is to provide a clear understanding of the current landscape of remote sensing foundation models and inspire future research that can push the boundaries of what these models can achieve in real-world applications. The list of resources collected by the paper can be found in the https://github.com/IRIP-BUAA/A-Review-for-remote-sensing-vision-language-models.
中文:遥感基础模型的快速发展通过融合多模态数据提升了地理空间数据分析能力,但仍面临数据多样性和计算需求等挑战,本文综述了这些模型的现状并指出了未来研究方向。
English: The rapid development of remote sensing foundation models enhances geospatial data analysis by integrating multiple data modalities, yet faces challenges like data diversity and computational demands, which this review addresses while outlining future research directions.

Authors:Ukcheol Shin, Jinsun Park
Title: Deep Depth Estimation from Thermal Image: Dataset, Benchmark, and Challenges
Abstract:
Achieving robust and accurate spatial perception under adverse weather and lighting conditions is crucial for the high-level autonomy of self-driving vehicles and robots. However, existing perception algorithms relying on the visible spectrum are highly affected by weather and lighting conditions. A long-wave infrared camera (i.e., thermal imaging camera) can be a potential solution to achieve high-level robustness. However, the absence of large-scale datasets and standardized benchmarks remains a significant bottleneck to progress in active research for robust visual perception from thermal images. To this end, this manuscript provides a large-scale Multi-Spectral Stereo (MS$^2$) dataset that consists of stereo RGB, stereo NIR, stereo thermal, stereo LiDAR data, and GNSS/IMU information along with semi-dense depth ground truth. MS$^2$ dataset includes 162K synchronized multi-modal data pairs captured across diverse locations (e.g., urban city, residential area, campus, and high-way road) at different times (e.g., morning, daytime, and nighttime) and under various weather conditions (e.g., clear-sky, cloudy, and rainy). Secondly, we conduct a thorough evaluation of monocular and stereo depth estimation networks across RGB, NIR, and thermal modalities to establish standardized benchmark results on MS$^2$ depth test sets (e.g., day, night, and rainy). Lastly, we provide in-depth analyses and discuss the challenges revealed by the benchmark results, such as the performance variability for each modality under adverse conditions, domain shift between different sensor modalities, and potential research direction for thermal perception. Our dataset and source code are publicly available at https://sites.google.com/view/multi-spectral-stereo-dataset and https://github.com/UkcheolShin/SupDepth4Thermal.
中文: 本文介绍了MS²数据集,这是一个大规模多光谱立体数据集,旨在解决热成像视觉感知研究中缺乏标准化基准的问题,并对不同模态在恶劣条件下的深度估计进行了全面评估和分析。
English: This manuscript introduces the MS² dataset, a large-scale multi-spectral stereo collection designed to address the lack of standardized benchmarks for robust visual perception using thermal imaging, and it provides comprehensive evaluations and analyses of depth estimation across various modalities under adverse conditions.

Authors:Chung-En Sun, Ge Yan, Tsui-Wei Weng
Title: ThinkEdit: Interpretable Weight Editing to Mitigate Overly Short Thinking in Reasoning Models
Abstract:
Recent studies have shown that Large Language Models (LLMs) augmented with chain-of-thought (CoT) reasoning demonstrate impressive problem-solving abilities. However, in this work, we identify a recurring issue where these models occasionally generate overly short reasoning, leading to degraded performance on even simple mathematical problems. Specifically, we investigate how reasoning length is embedded in the hidden representations of reasoning models and its impact on accuracy. Our analysis reveals that reasoning length is governed by a linear direction in the representation space, allowing us to induce overly short reasoning by steering the model along this direction. Building on this insight, we introduce \textbf{\textit{ThinkEdit}}, a simple yet effective weight-editing approach to mitigate the issue of overly short reasoning. We first identify a small subset of attention heads (approximately 4%) that predominantly drive short reasoning behavior. We then edit the output projection weights of these heads to remove the short reasoning direction. With changes to only 0.2% of the model's parameters, \textbf{\textit{ThinkEdit}} effectively reduces overly short reasoning and yields notable accuracy gains for short reasoning outputs (+6.39%), along with an overall improvement across multiple math benchmarks (+3.34%). Our findings provide new mechanistic insights into how reasoning length is controlled within LLMs and highlight the potential of fine-grained model interventions to improve reasoning quality. Our code is available at: https://github.com/Trustworthy-ML-Lab/ThinkEdit\
中文: 近期研究发现思维链增强的大语言模型中推理过短会降低性能,为此提出的ThinkEdit方法通过选择性编辑少量注意力头权重,有效减少过短推理并显著提升数学任务准确率。
English: Recent research identifies that overly short reasoning in chain-of-thought augmented LLMs impairs performance, leading to the development of ThinkEdit, a weight-editing method that selectively modifies a small subset of attention heads to effectively reduce this issue and improve accuracy on mathematical tasks.

Authors:Chung-En Sun, Ge Yan, Tsui-Wei Weng
Title: ThinkEdit: Interpretable Weight Editing to Mitigate Overly Short Thinking in Reasoning Models
Abstract:
Recent studies have shown that Large Language Models (LLMs) augmented with chain-of-thought (CoT) reasoning demonstrate impressive problem-solving abilities. However, in this work, we identify a recurring issue where these models occasionally generate overly short reasoning, leading to degraded performance on even simple mathematical problems. Specifically, we investigate how reasoning length is embedded in the hidden representations of reasoning models and its impact on accuracy. Our analysis reveals that reasoning length is governed by a linear direction in the representation space, allowing us to induce overly short reasoning by steering the model along this direction. Building on this insight, we introduce ThinkEdit, a simple yet effective weight-editing approach to mitigate the issue of overly short reasoning. We first identify a small subset of attention heads (approximately 4%) that predominantly drive short reasoning behavior. We then edit the output projection weights of these heads to remove the short reasoning direction. With changes to only 0.2% of the model's parameters, ThinkEdit effectively reduces overly short reasoning and yields notable accuracy gains for short reasoning outputs (+6.39%), along with an overall improvement across multiple math benchmarks (+3.34%). Our findings provide new mechanistic insights into how reasoning length is controlled within LLMs and highlight the potential of fine-grained model interventions to improve reasoning quality. Our code is available at: https://github.com/Trustworthy-ML-Lab/ThinkEdit
中文: 近期研究发现思维链增强的大语言模型中推理过短会降低性能,为此提出的ThinkEdit方法通过选择性编辑少量注意力头权重,有效减少过短推理并显著提升数学任务准确率。
English: Recent research identifies that overly short reasoning in chain-of-thought augmented LLMs impairs performance, leading to the development of ThinkEdit, a weight-editing method that selectively modifies a small subset of attention heads to effectively reduce this issue and improve accuracy on mathematical tasks.

Authors:Heejin Kook, Junyoung Kim, Seongmin Park, Jongwuk Lee
Title: Empowering Retrieval-based Conversational Recommendation with Contrasting User Preferences
Abstract:
Conversational recommender systems (CRSs) are designed to suggest the target item that the user is likely to prefer through multi-turn conversations. Recent studies stress that capturing sentiments in user conversations improves recommendation accuracy. However, they employ a single user representation, which may fail to distinguish between contrasting user intentions, such as likes and dislikes, potentially leading to suboptimal performance. To this end, we propose a novel conversational recommender model, called COntrasting user pReference expAnsion and Learning (CORAL). Firstly, CORAL extracts the user's hidden preferences through contrasting preference expansion using the reasoning capacity of the LLMs. Based on the potential preference, CORAL explicitly differentiates the contrasting preferences and leverages them into the recommendation process via preference-aware learning. Extensive experiments show that CORAL significantly outperforms existing methods in three benchmark datasets, improving up to 99.72% in Recall@10. The code and datasets are available at https://github.com/kookeej/CORAL
Chinese: 提出的CORAL模型通过大型语言模型对比和扩展用户偏好来增强对话推荐系统,显著提高了推荐准确性,并在Recall@10指标上以高达99.72%的优势超越现有方法。
English: The proposed CORAL model enhances conversational recommender systems by using large language models to contrast and expand user preferences, significantly improving recommendation accuracy and outperforming existing methods by up to 99.72% in Recall@10.

Authors:Hang Zhou, Xinxin Zuo, Rui Ma, Li Cheng
Title: BOOTPLACE: Bootstrapped Object Placement with Detection Transformers
Abstract:
In this paper, we tackle the copy-paste image-to-image composition problem with a focus on object placement learning. Prior methods have leveraged generative models to reduce the reliance for dense supervision. However, this often limits their capacity to model complex data distributions. Alternatively, transformer networks with a sparse contrastive loss have been explored, but their over-relaxed regularization often leads to imprecise object placement. We introduce BOOTPLACE, a novel paradigm that formulates object placement as a placement-by-detection problem. Our approach begins by identifying suitable regions of interest for object placement. This is achieved by training a specialized detection transformer on object-subtracted backgrounds, enhanced with multi-object supervisions. It then semantically associates each target compositing object with detected regions based on their complementary characteristics. Through a boostrapped training approach applied to randomly object-subtracted images, our model enforces meaningful placements through extensive paired data augmentation. Experimental results on established benchmarks demonstrate BOOTPLACE's superior performance in object repositioning, markedly surpassing state-of-the-art baselines on Cityscapes and OPA datasets with notable improvements in IOU scores. Additional ablation studies further showcase the compositionality and generalizability of our approach, supported by user study evaluations.
中文: 本文提出BOOTPLACE这一新颖的检测式放置范式,通过在用物体剔除背景训练的特化检测变换器中识别合适放置区域,并根据互补特性将目标合成物体与检测区域语义关联,在Cityscapes和OPA等基准测试中实现了优越性能。
English: This paper introduces BOOTPLACE, a novel placement-by-detection paradigm that identifies suitable regions for object placement using a detection transformer trained on object-subtracted backgrounds and associates objects with regions based on complementary characteristics, achieving superior performance on benchmarks like Cityscapes and OPA datasets.

Authors:Size Wu, Wenwei Zhang, Lumin Xu, Sheng Jin, Zhonghua Wu, Qingyi Tao, Wentao Liu, Wei Li, Chen Change Loy
Title: Harmonizing Visual Representations for Unified Multimodal Understanding and Generation
Abstract:
Unifying visual understanding and generation within a single multimodal framework remains a significant challenge, as the two inherently heterogeneous tasks require representations at different levels of granularity. Current approaches that utilize vector quantization (VQ) or variational autoencoders (VAE) for unified visual representation prioritize intrinsic imagery features over semantics, compromising understanding performance. In this work, we take inspiration from masked image modelling (MIM) that learns rich semantics via a mask-and-reconstruct pre-training and its successful extension to masked autoregressive (MAR) image generation. A preliminary study on the MAR encoder's representation reveals exceptional linear probing accuracy and precise feature response to visual concepts, which indicates MAR's potential for visual understanding tasks beyond its original generation role. Based on these insights, we present \emph{Harmon}, a unified autoregressive framework that harmonizes understanding and generation tasks with a shared MAR encoder. Through a three-stage training procedure that progressively optimizes understanding and generation capabilities, Harmon achieves state-of-the-art image generation results on the GenEval, MJHQ30K and WISE benchmarks while matching the performance of methods with dedicated semantic encoders (e.g., Janus) on image understanding benchmarks. Our code and models will be available at https://github.com/wusize/Harmon.
Chinese: 本研究提出了Harmon框架,通过共享掩码自回归编码器统一视觉理解与生成任务,在GenEval、MJHQ30K和WISE基准测试中实现图像生成的顶尖性能,同时在理解任务上达到专用语义编码器的同等水平。
English: This work introduces Harmon, a unified autoregressive framework that integrates visual understanding and generation using a shared masked autoregressive encoder, achieving state-of-the-art results in image generation on benchmarks like GenEval, MJHQ30K, and WISE while matching the performance of dedicated semantic encoders in understanding tasks.

Authors:Johannes Seiffarth, Katharina Nöh
Title: PyUAT: Open-source Python framework for efficient and scalable cell tracking
Abstract:
Tracking individual cells in live-cell imaging provides fundamental insights, inevitable for studying causes and consequences of phenotypic heterogeneity, responses to changing environmental conditions or stressors. Microbial cell tracking, characterized by stochastic cell movements and frequent cell divisions, remains a challenging task when imaging frame rates must be limited to avoid counterfactual results. A promising way to overcome this limitation is uncertainty-aware tracking (UAT), which uses statistical models, calibrated to empirically observed cell behavior, to predict likely cell associations. We present PyUAT, an efficient and modular Python implementation of UAT for tracking microbial cells in time-lapse imaging. We demonstrate its performance on a large 2D+t data set and investigate the influence of modular biological models and imaging intervals on the tracking performance. The open-source PyUAT software is available at https://github.com/JuBiotech/PyUAT, including example notebooks for immediate use in Google Colab.
Chinese: PyUAT是一种高效、模块化的Python工具,通过基于观测行为校准的统计模型预测细胞关联,实现了不确定性感知追踪,以解决微生物细胞追踪中的挑战。
English: PyUAT is an efficient, modular Python tool that implements uncertainty-aware tracking to overcome challenges in microbial cell tracking by predicting cell associations using statistical models calibrated to observed behavior.

Authors:Taufiq Ahmed, Abhishek Kumar, Constantino Álvarez Casado, Anlan Zhang, Tuomo Hänninen, Lauri Loven, Miguel Bordallo López, Sasu Tarkoma
Title: Exponentially Weighted Instance-Aware Repeat Factor Sampling for Long-Tailed Object Detection Model Training in Unmanned Aerial Vehicles Surveillance Scenarios
Abstract:
Object detection models often struggle with class imbalance, where rare categories appear significantly less frequently than common ones. Existing sampling-based rebalancing strategies, such as Repeat Factor Sampling (RFS) and Instance-Aware Repeat Factor Sampling (IRFS), mitigate this issue by adjusting sample frequencies based on image and instance counts. However, these methods are based on linear adjustments, which limit their effectiveness in long-tailed distributions. This work introduces Exponentially Weighted Instance-Aware Repeat Factor Sampling (E-IRFS), an extension of IRFS that applies exponential scaling to better differentiate between rare and frequent classes. E-IRFS adjusts sampling probabilities using an exponential function applied to the geometric mean of image and instance frequencies, ensuring a more adaptive rebalancing strategy. We evaluate E-IRFS on a dataset derived from the Fireman-UAV-RGBT Dataset and four additional public datasets, using YOLOv11 object detection models to identify fire, smoke, people and lakes in emergency scenarios. The results show that E-IRFS improves detection performance by 22\% over the baseline and outperforms RFS and IRFS, particularly for rare categories. The analysis also highlights that E-IRFS has a stronger effect on lightweight models with limited capacity, as these models rely more on data sampling strategies to address class imbalance. The findings demonstrate that E-IRFS improves rare object detection in resource-constrained environments, making it a suitable solution for real-time applications such as UAV-based emergency monitoring. The code is available at: https://github.com/futurians/E-IRFS.
中文: 本文提出的E-IRFS方法通过指数加权采样策略,在线性方法RFS和IRFS基础上显著提升长尾分布中稀有类别的检测性能,在无人机应急监测任务中实现22%的性能提升。
English: This paper introduces E-IRFS, an exponentially weighted sampling method that enhances rare class detection in imbalanced datasets by outperforming linear approaches like RFS and IRFS, achieving a 22% improvement in emergency monitoring scenarios.

Authors:Hongyi Zeng, Wenxuan Liu, Tianhua Xia, Jinhui Chen, Ziyun Li, Sai Qian Zhang
Title: Foveated Instance Segmentation
Abstract:
Instance segmentation is essential for augmented reality and virtual reality (AR/VR) as it enables precise object recognition and interaction, enhancing the integration of virtual and real-world elements for an immersive experience. However, the high computational overhead of segmentation limits its application on resource-constrained AR/VR devices, causing large processing latency and degrading user experience. In contrast to conventional scenarios, AR/VR users typically focus on only a few regions within their field of view before shifting perspective, allowing segmentation to be concentrated on gaze-specific areas. This insight drives the need for efficient segmentation methods that prioritize processing instance of interest, reducing computational load and enhancing real-time performance. In this paper, we present a foveated instance segmentation (FovealSeg) framework that leverages real-time user gaze data to perform instance segmentation exclusively on instance of interest, resulting in substantial computational savings. Evaluation results show that FSNet achieves an IoU of 0.56 on ADE20K and 0.54 on LVIS, notably outperforming the baseline. The code is available at https://github.com/SAI-
中文:FovealSeg框架利用实时视线数据,仅对用户关注的区域进行实例分割,在显著降低计算负担的同时,在基准测试中保持了高精度表现。
English: FovealSeg is a framework that uses real-time gaze data to perform instance segmentation only on areas of interest, significantly reducing computational load while maintaining high accuracy on benchmarks.

Authors:Alessandro Conti, Massimiliano Mancini, Enrico Fini, Yiming Wang, Paolo Rota, Elisa Ricci
Title: On Large Multimodal Models as Open-World Image Classifiers
Abstract:
Traditional image classification requires a predefined list of semantic categories. In contrast, Large Multimodal Models (LMMs) can sidestep this requirement by classifying images directly using natural language (e.g., answering the prompt "What is the main object in the image?"). Despite this remarkable capability, most existing studies on LMM classification performance are surprisingly limited in scope, often assuming a closed-world setting with a predefined set of categories. In this work, we address this gap by thoroughly evaluating LMM classification performance in a truly open-world setting. We first formalize the task and introduce an evaluation protocol, defining various metrics to assess the alignment between predicted and ground truth classes. We then evaluate 13 models across 10 benchmarks, encompassing prototypical, non-prototypical, fine-grained, and very fine-grained classes, demonstrating the challenges LMMs face in this task. Further analyses based on the proposed metrics reveal the types of errors LMMs make, highlighting challenges related to granularity and fine-grained capabilities, showing how tailored prompting and reasoning can alleviate them.
中文: 本研究在开放世界环境下评估大型多模态模型的图像分类能力,揭示了其在细粒度识别方面的挑战,并表明优化提示策略可有效提升分类效果。
English: This study evaluates Large Multimodal Models in an open-world image classification setting, revealing their challenges with granularity and fine-grained recognition while demonstrating how enhanced prompting can improve performance.

Authors:Yesmine Abdennadher, Giovanni Perin, Riccardo Mazzieri, Jacopo Pegoraro, Michele Rossi
Title: LightSNN: Lightweight Architecture Search for Sparse and Accurate Spiking Neural Networks
Abstract:
Spiking Neural Networks (SNNs) are highly regarded for their energy efficiency, inherent activation sparsity, and suitability for real-time processing in edge devices. However, most current SNN methods adopt architectures resembling traditional artificial neural networks (ANNs), leading to suboptimal performance when applied to SNNs. While SNNs excel in energy efficiency, they have been associated with lower accuracy levels than traditional ANNs when utilizing conventional architectures. In response, in this work we present LightSNN, a rapid and efficient Neural Network Architecture Search (NAS) technique specifically tailored for SNNs that autonomously leverages the most suitable architecture, striking a good balance between accuracy and efficiency by enforcing sparsity. Based on the spiking NAS network (SNASNet) framework, a cell-based search space including backward connections is utilized to build our training-free pruning-based NAS mechanism. Our technique assesses diverse spike activation patterns across different data samples using a sparsity-aware Hamming distance fitness evaluation. Thorough experiments are conducted on both static (CIFAR10 and CIFAR100) and neuromorphic datasets (DVS128-Gesture). Our LightSNN model achieves state-of-the-art results on CIFAR10 and CIFAR100, improves performance on DVS128Gesture by 4.49\%, and significantly reduces search time most notably offering a $98\times$ speedup over SNASNet and running 30\% faster than the best existing method on DVS128Gesture. Code is available on Github at: https://github.com/YesmineAbdennadher/LightSNN.
中文:LightSNN提出了一种快速高效的脉冲神经网络架构搜索方法,通过基于稀疏性的适应度评估优化架构,在静态和神经形态数据集上实现顶尖精度,同时大幅缩短搜索时间。
English: LightSNN introduces a fast and efficient Neural Architecture Search method for Spiking Neural Networks, optimizing architecture with a sparsity-aware fitness evaluation to achieve top accuracy on static and neuromorphic datasets while drastically reducing search time.

Authors:Mohammad Amin Khalafi, Seyed Amir Ahmad Safavi-Naini, Ameneh Salehi, Nariman Naderi, Dorsa Alijanzadeh, Pardis Ketabi Moghadam, Kaveh Kavosi, Negar Golestani, Shabnam Shahrokh, Soltanali Fallah, Jamil S Samaan, Nicholas P. Tatonetti, Nicholas Hoerter, Girish Nadkarni, Hamid Asadzadeh Aghdaei, Ali Soroush
Title: Vision Language Models versus Machine Learning Models Performance on Polyp Detection and Classification in Colonoscopy Images
Abstract:
Introduction: This study provides a comprehensive performance assessment of vision-language models (VLMs) against established convolutional neural networks (CNNs) and classic machine learning models (CMLs) for computer-aided detection (CADe) and computer-aided diagnosis (CADx) of colonoscopy polyp images. Method: We analyzed 2,258 colonoscopy images with corresponding pathology reports from 428 patients. We preprocessed all images using standardized techniques (resizing, normalization, and augmentation) and implemented a rigorous comparative framework evaluating 11 distinct models: ResNet50, 4 CMLs (random forest, support vector machine, logistic regression, decision tree), two specialized contrastive vision language encoders (CLIP, BiomedCLIP), and three general-purpose VLMs ( GPT-4 Gemini-1.5-Pro, Claude-3-Opus). Our performance assessment focused on two clinical tasks: polyp detection (CADe) and classification (CADx). Result: In polyp detection, ResNet50 achieved the best performance (F1: 91.35%, AUROC: 0.98), followed by BiomedCLIP (F1: 88.68%, AUROC: [AS1] ). GPT-4 demonstrated comparable effectiveness to traditional machine learning approaches (F1: 81.02%, AUROC: [AS2] ), outperforming other general-purpose VLMs. For polyp classification, performance rankings remained consistent but with lower overall metrics. ResNet50 maintained the highest efficacy (weighted F1: 74.94%), while GPT-4 demonstrated moderate capability (weighted F1: 41.18%), significantly exceeding other VLMs (Claude-3-Opus weighted F1: 25.54%, Gemini 1.5 Pro weighted F1: 6.17%). Conclusion: CNNs remain superior for both CADx and CADe tasks. However, VLMs like BioMedCLIP and GPT-4 may be useful for polyp detection tasks where training CNNs is not feasible.
中文: 本研究评估了视觉语言模型在结肠镜息肉检测与分类中相对于卷积神经网络和传统机器学习的表现,发现ResNet50等CNN模型仍最优,但BiomedCLIP和GPT-4等视觉语言模型在无法训练CNN时具有检测潜力。
English: This study evaluates vision-language models against CNNs and traditional machine learning for colonoscopy polyp detection and classification, finding CNNs like ResNet50 superior but noting VLMs such as BiomedCLIP and GPT-4 show promise for detection when CNN training is impractical.

Authors:Haolong Yan, Kaijun Tan, Yeqing Shen, Xin Huang, Zheng Ge, Xiangyu Zhang, Si Li, Daxin Jiang
Title: M-DocSum: Do LVLMs Genuinely Comprehend Interleaved Image-Text in Document Summarization?
Abstract:
We investigate a critical yet under-explored question in Large Vision-Language Models (LVLMs): Do LVLMs genuinely comprehend interleaved image-text in the document? Existing document understanding benchmarks often assess LVLMs using question-answer formats, which are information-sparse and difficult to guarantee the coverage of long-range dependencies. To address this issue, we introduce a novel and challenging Multimodal Document Summarization Benchmark (M-DocSum-Bench), which comprises 500 high-quality arXiv papers, along with interleaved multimodal summaries aligned with human preferences. M-DocSum-Bench is a reference-based generation task and necessitates the generation of interleaved image-text summaries using provided reference images, thereby simultaneously evaluating capabilities in understanding, reasoning, localization, and summarization within complex multimodal document scenarios. To facilitate this benchmark, we develop an automated framework to construct summaries and propose a fine-grained evaluation method called M-DocEval. Moreover, we further develop a robust summarization baseline, i.e., M-DocSum-7B, by progressive two-stage training with diverse instruction and preference data. The extensive results on our M-DocSum-Bench reveal that the leading LVLMs struggle to maintain coherence and accurately integrate information within long and interleaved contexts, often exhibiting confusion between similar images and a lack of robustness. Notably, M-DocSum-7B achieves state-of-the-art performance compared to larger and closed-source models (including GPT-4o, Gemini Pro, Claude-3.5-Sonnet and Qwen2.5-VL-72B, etc.), demonstrating the potential of LVLMs for improved interleaved image-text understanding. The code, data, and models are available at https://github.com/stepfun-ai/M-DocSum-Bench.
Chinese Summary: 本研究提出了新型多模态文档摘要基准(M-DocSum-Bench),用于评估大视觉语言模型对图文交织文档的理解能力,发现现有模型在复杂多模态场景中难以保持连贯性和准确整合信息。
English Summary: This study introduces a novel Multimodal Document Summarization Benchmark (M-DocSum-Bench) to evaluate Large Vision-Language Models' ability to understand interleaved image-text documents, revealing that current models struggle with coherence and information integration in complex multimodal contexts.

Authors:Jiancheng Zhao, Xingda Yu, Zhen Yang
Title: MSPLoRA: A Multi-Scale Pyramid Low-Rank Adaptation for Efficient Model Fine-Tuning
Abstract:
Parameter-Efficient Fine-Tuning (PEFT) has become an essential approach for adapting large-scale pre-trained models while reducing computational costs. Among PEFT methods, LoRA significantly reduces trainable parameters by decomposing weight updates into low-rank matrices. However, traditional LoRA applies a fixed rank across all layers, failing to account for the varying complexity of hierarchical information, which leads to inefficient adaptation and redundancy. To address this, we propose MSPLoRA (Multi-Scale Pyramid LoRA), which introduces Global Shared LoRA, Mid-Level Shared LoRA, and Layer-Specific LoRA to capture global patterns, mid-level features, and fine-grained information, respectively. This hierarchical structure reduces inter-layer redundancy while maintaining strong adaptation capability. Experiments on various NLP tasks demonstrate that MSPLoRA achieves more efficient adaptation and better performance while significantly reducing the number of trainable parameters. Furthermore, additional analyses based on Singular Value Decomposition validate its information decoupling ability, highlighting MSPLoRA as a scalable and effective optimization strategy for parameter-efficient fine-tuning in large language models. Our code is available at https://github.com/Oblivioniss/MSPLoRA.
中文:MSPLoRA提出了一种分层参数高效微调方法,通过采用多尺度LoRA模块分别处理全局、中层级和层特定特征,在减少冗余的同时提升适应能力,在多种自然语言处理任务中以更少参数量实现了更优性能。
English: MSPLoRA introduces a hierarchical parameter-efficient fine-tuning method that reduces redundancy and improves adaptation by employing multi-scale LoRA modules for global, mid-level, and layer-specific features, achieving superior performance with fewer parameters across NLP tasks.

Authors:Ivan Diaz, Florin Scherer, Yanik Berli, Roland Wiest, Helly Hammer, Robert Hoepner, Alejandro Leon Betancourt, Piotr Radojewski, Richard McKinley
Title: Learning from spatially inhomogenous data: resolution-adaptive convolutions for multiple sclerosis lesion segmentation
Abstract:
In the setting of clinical imaging, differences in between vendors, hospitals and sequences can yield highly inhomogeneous imaging data. In MRI in particular, voxel dimension, slice spacing and acquisition plane can vary substantially. For clinical applications, therefore, algorithms must be trained to handle data with various voxel resolutions. The usual strategy to deal with heterogeneity of resolution is harmonization: resampling imaging data to a common (usually isovoxel) resolution. This can lead to loss of fidelity arising from interpolation artifacts out-of-plane and downsampling in-plane. We present in this paper a network architecture designed to be able to learn directly from spatially heterogeneous data, without resampling: a segmentation network based on the e3nn framework that leverages a spherical harmonic, rather than voxel-grid, parameterization of convolutional kernels, with a fixed physical radius. Networks based on these kernels can be resampled to their input voxel dimensions. We trained and tested our network on a publicly available dataset assembled from three centres, and on an in-house dataset of Multiple Sclerosis cases with a high degree of spatial inhomogeneity. We compared our approach to a standard U-Net with two strategies for handling inhomogeneous data: training directly on the data without resampling, and resampling to a common resolution of 1mm isovoxels. We show that our network is able to learn from various combinations of voxel sizes and outperforms classical U-Nets on 2D testing cases and most 3D testing cases. This shows an ability to generalize well when tested on image resolutions not seen during training. Our code can be found at: http://github.com/SCAN-NRAD/e3nn\_U-Net.
中文摘要:本文提出了一种新型神经网络架构,可直接处理空间异构的MRI数据而无需重采样,利用球谐卷积核在不同体素分辨率下表现优于传统方法。
English Summary: This paper introduces a novel neural network architecture that directly learns from spatially heterogeneous MRI data without resampling, using spherical harmonic convolutional kernels to outperform traditional methods in handling diverse voxel resolutions.

Authors:Haitong Liu, Kuofeng Gao, Yang Bai, Jinmin Li, Jinxiao Shan, Tao Dai, Shu-Tao Xia
Title: Protecting Your Video Content: Disrupting Automated Video-based LLM Annotations
Abstract:
Recently, video-based large language models (video-based LLMs) have achieved impressive performance across various video comprehension tasks. However, this rapid advancement raises significant privacy and security concerns, particularly regarding the unauthorized use of personal video data in automated annotation by video-based LLMs. These unauthorized annotated video-text pairs can then be used to improve the performance of downstream tasks, such as text-to-video generation. To safeguard personal videos from unauthorized use, we propose two series of protective video watermarks with imperceptible adversarial perturbations, named Ramblings and Mutes. Concretely, Ramblings aim to mislead video-based LLMs into generating inaccurate captions for the videos, thereby degrading the quality of video annotations through inconsistencies between video content and captions. Mutes, on the other hand, are designed to prompt video-based LLMs to produce exceptionally brief captions, lacking descriptive detail. Extensive experiments demonstrate that our video watermarking methods effectively protect video data by significantly reducing video annotation performance across various video-based LLMs, showcasing both stealthiness and robustness in protecting personal video content. Our code is available at https://github.com/ttthhl/Protecting_Your_Video_Content.
Chinese: 研究人员开发了两种名为Ramblings和Mutes的隐形视频水印,通过误导性描述或过度简化的字幕来降低视频大语言模型的标注质量,从而有效保护个人视频内容。
English: Researchers have developed two imperceptible video watermarks, Ramblings and Mutes, that effectively protect personal videos by degrading the annotation quality of video-based large language models through misleading captions or excessive brevity.

Authors:Yide Di, Yun Liao, Hao Zhou, Kaijun Zhu, Qing Duan, Junhui Liu, Mingyu Lu
Title: UFM: Unified Feature Matching Pre-training with Multi-Modal Image Assistants
Abstract:
Image feature matching, a foundational task in computer vision, remains challenging for multimodal image applications, often necessitating intricate training on specific datasets. In this paper, we introduce a Unified Feature Matching pre-trained model (UFM) designed to address feature matching challenges across a wide spectrum of modal images. We present Multimodal Image Assistant (MIA) transformers, finely tunable structures adept at handling diverse feature matching problems. UFM exhibits versatility in addressing both feature matching tasks within the same modal and those across different modals. Additionally, we propose a data augmentation algorithm and a staged pre-training strategy to effectively tackle challenges arising from sparse data in specific modals and imbalanced modal datasets. Experimental results demonstrate that UFM excels in generalization and performance across various feature matching tasks. The code will be released at:https://github.com/LiaoYun0x0/UFM.
Chinese: 本文提出了一种统一特征匹配预训练模型(UFM)及多模态图像助手变换器,通过数据增强和分阶段预训练策略,有效解决了跨模态图像特征匹配问题,展现出优异的泛化能力和性能表现。
English: This paper introduces a Unified Feature Matching pre-trained model (UFM) with Multimodal Image Assistant transformers, which effectively handles feature matching across various image modalities through data augmentation and staged pre-training, demonstrating strong generalization and performance.

Authors:Wenjie Qiu, Hongshu Guo, Zeyuan Ma, Yue-Jiao Gong
Title: A Novel Two-Phase Cooperative Co-evolution Framework for Large-Scale Global Optimization with Complex Overlapping
Abstract:
Cooperative Co-evolution, through the decomposition of the problem space, is a primary approach for solving large-scale global optimization problems. Typically, when the subspaces are disjoint, the algorithms demonstrate significantly both effectiveness and efficiency compared to non-decomposition algorithms. However, the presence of overlapping variables complicates the decomposition process and adversely affects the performance of cooperative co-evolution. In this study, we propose a novel two-phase cooperative co-evolution framework to address large-scale global optimization problems with complex overlapping. An effective method for decomposing overlapping problems, grounded in their mathematical properties, is embedded within the framework. Additionally, a customizable benchmark for overlapping problems is introduced to extend existing benchmarks and facilitate experimentation. Extensive experiments demonstrate that the algorithm instantiated within our framework significantly outperforms existing algorithms. The results reveal the characteristics of overlapping problems and highlight the differing strengths of cooperative co-evolution and non-decomposition algorithms. Our work is open-source and accessible at: https://github.com/GMC-DRL/HCC.
中文: 本研究提出了一种新颖的两阶段协同进化框架,通过嵌入基于数学特性的重叠问题分解方法,有效解决了具有复杂重叠变量的大规模全局优化问题,并在可定制基准测试中显著优于现有算法。
English: This study introduces a novel two-phase cooperative co-evolution framework that effectively addresses large-scale global optimization with overlapping variables by embedding a mathematically grounded decomposition method and outperforms existing algorithms, as demonstrated through extensive experiments on a customizable benchmark.

Authors:Yizhen Luo, Jiashuo Wang, Siqi Fan, Zaiqing Nie
Title: PharMolixFM: All-Atom Foundation Models for Molecular Modeling and Generation
Abstract:
Structural biology relies on accurate three-dimensional biomolecular structures to advance our understanding of biological functions, disease mechanisms, and therapeutics. While recent advances in deep learning have enabled the development of all-atom foundation models for molecular modeling and generation, existing approaches face challenges in generalization due to the multi-modal nature of atomic data and the lack of comprehensive analysis of training and sampling strategies. To address these limitations, we propose PharMolixFM, a unified framework for constructing all-atom foundation models based on multi-modal generative techniques. Our framework includes three variants using state-of-the-art multi-modal generative models. By formulating molecular tasks as a generalized denoising process with task-specific priors, PharMolixFM achieves robust performance across various structural biology applications. Experimental results demonstrate that PharMolixFM-Diff achieves competitive prediction accuracy in protein-small-molecule docking (83.9% vs. 90.2% RMSD < 2Å, given pocket) with significantly improved inference speed. Moreover, we explore the empirical inference scaling law by introducing more sampling repeats or steps. Our code and model are available at https://github.com/PharMolix/OpenBioMed.
中文:PharMolixFM提出了一种统一的多模态生成框架,解决了全原子分子建模中的泛化难题,在蛋白质-小分子对接中实现了具有竞争力的精度和更快的推理速度。
English: PharMolixFM introduces a unified multi-modal generative framework that overcomes generalization challenges in all-atom molecular modeling, achieving competitive accuracy in protein-small-molecule docking with enhanced inference speed.

Authors:Guanjie Huang, Danny Hin Kwok Tsang, Li Liu
Title: Lend a Hand: Semi Training-Free Cued Speech Recognition via MLLM-Driven Hand Modeling for Barrier-free Communication
Abstract:
Cued Speech (CS) is an innovative visual communication system that integrates lip-reading with hand coding, designed to enhance effective communication for individuals with hearing impairments. Automatic CS Recognition (ACSR) refers to the AI-driven process of automatically recognizing hand gestures and lip movements in CS, converting them into text. However, previous work often relies on complex fusion modules and training techniques. Additionally, due to the limited amount of data in CS, the extraction of hand features, as well as recognition modeling, has consistently been subpar, significantly limiting the effectiveness of ACSR. To address this issue, we have innovatively explored the capabilities of Multimodal large language models (MLLMs) in recognizing hand shapes and positions in CS. More precisely, we propose a new Semi Training-Free paradigm for ACSR, named STF-ACSR. This approach leverages zero-shot recognition of hand movements through the Chinese CS Prompt Module (CCSPM), which equipped a training-free keyframe filtering and customized prompt engineering based on MLLM. It then integrates the recognition results into the lip-reading model using a Minimalist Fusion Module (MFM), effectively achieving superior recognition results. Furthermore, specifically for this study, we have supplemented the existing dataset of 6 normal hearing CS cuers by recording additional data from 8 cuers with hearing impairments, resulting in a new mixed dataset. Extensive experiments have demonstrated that STF-ACSR significantly outperforms previous methods on both normal and hearing-impaired data. Implementation and checkpoints are available at https://github.com/DennisHgj/STF_ACSR.
中文: 本研究提出STF-ACSR半免训练范式,利用多模态大语言模型通过中文提示模块实现零样本手势识别,并采用极简融合模块与唇读模型结合,在扩充的混合数据集上显著提升了自动暗示语音识别性能。
English: The study introduces STF-ACSR, a semi training-free paradigm using multimodal large language models to enhance automatic Cued Speech recognition by integrating zero-shot hand movement analysis with lip-reading through minimalist fusion, validated on an expanded mixed dataset showing superior performance.

Authors:Abdelrahman Shaker, Muhammad Maaz, Chenhui Gou, Hamid Rezatofighi, Salman Khan, Fahad Shahbaz Khan
Title: Mobile-VideoGPT: Fast and Accurate Video Understanding Language Model
Abstract:
Video understanding models often struggle with high computational requirements, extensive parameter counts, and slow inference speed, making them inefficient for practical use. To tackle these challenges, we propose Mobile-VideoGPT, an efficient multimodal framework designed to operate with fewer than a billion parameters. Unlike traditional video large multimodal models (LMMs), Mobile-VideoGPT consists of lightweight dual visual encoders, efficient projectors, and a small language model (SLM), enabling real-time throughput. To further improve efficiency, we present an Attention-Based Frame Scoring mechanism to select the key-frames, along with an efficient token projector that prunes redundant visual tokens and preserves essential contextual cues. We evaluate our model across well-established six video understanding benchmarks (e.g., MVBench, EgoSchema, NextQA, and PercepTest). Our results show that Mobile-VideoGPT-0.5B can generate up to 46 tokens per second while outperforming existing state-of-the-art 0.5B-parameter models by 6 points on average with 40% fewer parameters and more than 2x higher throughput. Our code and models are publicly available at: https://github.com/Amshaker/Mobile-VideoGPT.
Chinese: Mobile-VideoGPT是一种高效的多模态框架,通过轻量级双视觉编码器、注意力帧评分机制和令牌剪枝技术,在减少40%参数的同时实现实时处理,并在六大视频理解基准测试中平均超越现有0.5B参数模型6个点。
English: Mobile-VideoGPT is an efficient multimodal framework that reduces computational demands and parameters while achieving higher throughput and outperforming existing 0.5B-parameter models by an average of 6 points across video understanding benchmarks.

Authors:Reza Qorbani, Gianluca Villani, Theodoros Panagiotakopoulos, Marc Botet Colomer, Linus Härenstam-Nielsen, Mattia Segu, Pier Luigi Dovesi, Jussi Karlgren, Daniel Cremers, Federico Tombari, Matteo Poggi
Title: Semantic Library Adaptation: LoRA Retrieval and Fusion for Open-Vocabulary Semantic Segmentation
Abstract:
Open-vocabulary semantic segmentation models associate vision and text to label pixels from an undefined set of classes using textual queries, providing versatile performance on novel datasets. However, large shifts between training and test domains degrade their performance, requiring fine-tuning for effective real-world applications. We introduce Semantic Library Adaptation (SemLA), a novel framework for training-free, test-time domain adaptation. SemLA leverages a library of LoRA-based adapters indexed with CLIP embeddings, dynamically merging the most relevant adapters based on proximity to the target domain in the embedding space. This approach constructs an ad-hoc model tailored to each specific input without additional training. Our method scales efficiently, enhances explainability by tracking adapter contributions, and inherently protects data privacy, making it ideal for sensitive applications. Comprehensive experiments on a 20-domain benchmark built over 10 standard datasets demonstrate SemLA's superior adaptability and performance across diverse settings, establishing a new standard in domain adaptation for open-vocabulary semantic segmentation.
中文: 提出的语义库适应(SemLA)框架通过基于CLIP嵌入相似性动态融合相关LoRA适配器,实现了开放词汇语义分割的无训练域适应,在跨域性能、可扩展性、可解释性和数据隐私保护方面均表现出色。
English: The proposed Semantic Library Adaptation (SemLA) framework enables training-free domain adaptation for open-vocabulary semantic segmentation by dynamically merging relevant LoRA adapters based on CLIP embedding proximity, achieving superior performance across diverse domains while ensuring scalability, explainability, and data privacy.

Authors:Jiahao Xie, Alessio Tonioni, Nathalie Rauschmayr, Federico Tombari, Bernt Schiele
Title: Test-Time Visual In-Context Tuning
Abstract:
Visual in-context learning (VICL), as a new paradigm in computer vision, allows the model to rapidly adapt to various tasks with only a handful of prompts and examples. While effective, the existing VICL paradigm exhibits poor generalizability under distribution shifts. In this work, we propose test-time Visual In-Context Tuning (VICT), a method that can adapt VICL models on the fly with a single test sample. Specifically, we flip the role between the task prompts and the test sample and use a cycle consistency loss to reconstruct the original task prompt output. Our key insight is that a model should be aware of a new test distribution if it can successfully recover the original task prompts. Extensive experiments on six representative vision tasks ranging from high-level visual understanding to low-level image processing, with 15 common corruptions, demonstrate that our VICT can improve the generalizability of VICL to unseen new domains. In addition, we show the potential of applying VICT for unseen tasks at test time. Code: https://github.com/Jiahao000/VICT.
中文: 本文提出测试时视觉上下文调优(VICT)方法,通过单一样本自适应和循环一致性损失重构原始任务提示,有效提升了视觉上下文学习模型在未知领域的泛化能力。
English: This paper introduces test-time Visual In-Context Tuning (VICT), a method that enhances the generalizability of visual in-context learning models to unseen domains by adapting them with a single test sample and using cycle consistency loss to recover original task prompts.

Authors:Kaituo Feng, Kaixiong Gong, Bohao Li, Zonghao Guo, Yibing Wang, Tianshuo Peng, Junfei Wu, Xiaoying Zhang, Benyou Wang, Xiangyu Yue
Title: Video-R1: Reinforcing Video Reasoning in MLLMs
Abstract:
Inspired by DeepSeek-R1's success in eliciting reasoning abilities through rule-based reinforcement learning (RL), we introduce Video-R1 as the first attempt to systematically explore the R1 paradigm for incentivizing video reasoning within multimodal large language models (MLLMs). However, directly applying RL training with the GRPO algorithm to video reasoning presents two primary challenges: (i) a lack of temporal modeling for video reasoning, and (ii) the scarcity of high-quality video-reasoning data. To address these issues, we first propose the T-GRPO algorithm, which encourages models to utilize temporal information in videos for reasoning. Additionally, instead of relying solely on video data, we incorporate high-quality image-reasoning data into the training process. We have constructed two datasets: Video-R1-CoT-165k for SFT cold start and Video-R1-260k for RL training, both comprising image and video data. Experimental results demonstrate that Video-R1 achieves significant improvements on video reasoning benchmarks such as VideoMMMU and VSI-Bench, as well as on general video benchmarks including MVBench and TempCompass, etc. Notably, Video-R1-7B attains a 37.1% accuracy on video spatial reasoning benchmark VSI-bench, surpassing the commercial proprietary model GPT-4o. All code, models, and data are released in: https://github.com/tulerfeng/Video-R1.
中文:基于DeepSeek-R1的成功,Video-R1率先将R1范式引入多模态大模型的视频推理领域,通过提出T-GRPO算法加强时序建模并融合图像推理数据,在多个基准测试中实现突破性进展,尤其在VSI-Bench视频空间推理基准上以37.1%的准确率超越商用模型GPT-4o。
English: Building on DeepSeek-R1's success, Video-R1 pioneers the R1 paradigm for video reasoning in MLLMs by introducing the T-GRPO algorithm to enhance temporal modeling and incorporating image-reasoning data, achieving state-of-the-art performance on multiple benchmarks including surpassing GPT-4o on VSI-Bench with 37.1% accuracy.

Authors:Jianning Pei, Han Hu, Shuyang Gu
Title: Optimal Stepsize for Diffusion Sampling
Abstract:
Diffusion models achieve remarkable generation quality but suffer from computational intensive sampling due to suboptimal step discretization. While existing works focus on optimizing denoising directions, we address the principled design of stepsize schedules. This paper proposes Optimal Stepsize Distillation, a dynamic programming framework that extracts theoretically optimal schedules by distilling knowledge from reference trajectories. By reformulating stepsize optimization as recursive error minimization, our method guarantees global discretization bounds through optimal substructure exploitation. Crucially, the distilled schedules demonstrate strong robustness across architectures, ODE solvers, and noise schedules. Experiments show 10x accelerated text-to-image generation while preserving 99.4% performance on GenEval. Our code is available at https://github.com/bebebe666/OptimalSteps.
Chinese: 本文提出最优步长蒸馏法,通过动态编程框架优化扩散模型的步长调度,在保持99.4%性能的同时实现文本到图像生成速度提升10倍。
English: This paper introduces Optimal Stepsize Distillation, a dynamic programming framework that accelerates diffusion model sampling by optimizing stepsize schedules, achieving 10x faster text-to-image generation with minimal performance loss.

Authors:Hongkai Lin, Dingkang Liang, Zhenghao Qi, Xiang Bai
Title: A Unified Image-Dense Annotation Generation Model for Underwater Scenes
Abstract:
Underwater dense prediction, especially depth estimation and semantic segmentation, is crucial for gaining a comprehensive understanding of underwater scenes. Nevertheless, high-quality and large-scale underwater datasets with dense annotations remain scarce because of the complex environment and the exorbitant data collection costs. This paper proposes a unified Text-to-Image and DEnse annotation generation method (TIDE) for underwater scenes. It relies solely on text as input to simultaneously generate realistic underwater images and multiple highly consistent dense annotations. Specifically, we unify the generation of text-to-image and text-to-dense annotations within a single model. The Implicit Layout Sharing mechanism (ILS) and cross-modal interaction method called Time Adaptive Normalization (TAN) are introduced to jointly optimize the consistency between image and dense annotations. We synthesize a large-scale underwater dataset using TIDE to validate the effectiveness of our method in underwater dense prediction tasks. The results demonstrate that our method effectively improves the performance of existing underwater dense prediction models and mitigates the scarcity of underwater data with dense annotations. We hope our method can offer new perspectives on alleviating data scarcity issues in other fields. The code is available at https://github.com/HongkLin/TIDE
中文: 本文提出TIDE方法,通过文本输入统一生成水下图像与密集标注,有效缓解水下密集预测任务中的数据匮乏问题,显著提升了现有模型的性能表现。
English: This paper introduces TIDE, a unified text-to-image and dense annotation generation method that creates realistic underwater images and consistent annotations to address data scarcity in underwater dense prediction tasks, significantly improving model performance.

Authors:Minghui Lin, Xiang Wang, Yishan Wang, Shu Wang, Fengqi Dai, Pengxiang Ding, Cunxiang Wang, Zhengrong Zuo, Nong Sang, Siteng Huang, Donglin Wang
Title: Exploring the Evolution of Physics Cognition in Video Generation: A Survey
Abstract:
Recent advancements in video generation have witnessed significant progress, especially with the rapid advancement of diffusion models. Despite this, their deficiencies in physical cognition have gradually received widespread attention - generated content often violates the fundamental laws of physics, falling into the dilemma of ''visual realism but physical absurdity". Researchers began to increasingly recognize the importance of physical fidelity in video generation and attempted to integrate heuristic physical cognition such as motion representations and physical knowledge into generative systems to simulate real-world dynamic scenarios. Considering the lack of a systematic overview in this field, this survey aims to provide a comprehensive summary of architecture designs and their applications to fill this gap. Specifically, we discuss and organize the evolutionary process of physical cognition in video generation from a cognitive science perspective, while proposing a three-tier taxonomy: 1) basic schema perception for generation, 2) passive cognition of physical knowledge for generation, and 3) active cognition for world simulation, encompassing state-of-the-art methods, classical paradigms, and benchmarks. Subsequently, we emphasize the inherent key challenges in this domain and delineate potential pathways for future research, contributing to advancing the frontiers of discussion in both academia and industry. Through structured review and interdisciplinary analysis, this survey aims to provide directional guidance for developing interpretable, controllable, and physically consistent video generation paradigms, thereby propelling generative models from the stage of ''visual mimicry'' towards a new phase of ''human-like physical comprehension''.
中文摘要:近期视频生成技术虽在扩散模型推动下取得进展,但存在物理认知缺陷,导致生成内容违反物理规律;本综述通过建立三层认知分类体系,系统梳理了物理认知融合方法,旨在推动视频生成从“视觉模仿”迈向“类人物理理解”的新阶段。
English Summary: Recent video generation advancements using diffusion models face challenges in physical realism, prompting research into integrating physical cognition for more accurate simulations, with this survey providing a systematic taxonomy and future directions to bridge the gap between visual and physical fidelity.

Authors:Qi Qin, Le Zhuo, Yi Xin, Ruoyi Du, Zhen Li, Bin Fu, Yiting Lu, Jiakang Yuan, Xinyue Li, Dongyang Liu, Xiangyang Zhu, Manyuan Zhang, Will Beddow, Erwann Millon, Victor Perez, Wenhai Wang, Conghui He, Bo Zhang, Xiaohong Liu, Hongsheng Li, Yu Qiao, Chang Xu, Peng Gao
Title: Lumina-Image 2.0: A Unified and Efficient Image Generative Framework
Abstract:
We introduce Lumina-Image 2.0, an advanced text-to-image generation framework that achieves significant progress compared to previous work, Lumina-Next. Lumina-Image 2.0 is built upon two key principles: (1) Unification - it adopts a unified architecture (Unified Next-DiT) that treats text and image tokens as a joint sequence, enabling natural cross-modal interactions and allowing seamless task expansion. Besides, since high-quality captioners can provide semantically well-aligned text-image training pairs, we introduce a unified captioning system, Unified Captioner (UniCap), specifically designed for T2I generation tasks. UniCap excels at generating comprehensive and accurate captions, accelerating convergence and enhancing prompt adherence. (2) Efficiency - to improve the efficiency of our proposed model, we develop multi-stage progressive training strategies and introduce inference acceleration techniques without compromising image quality. Extensive evaluations on academic benchmarks and public text-to-image arenas show that Lumina-Image 2.0 delivers strong performances even with only 2.6B parameters, highlighting its scalability and design efficiency. We have released our training details, code, and models at https://github.com/Alpha-VLLM/Lumina-Image-2.0.
Lumina-Image 2.0 采用统一架构和高效训练策略,通过跨模态交互优化实现了卓越的文本到图像生成性能,仅用少量参数即可展现强大效果。
Lumina-Image 2.0 introduces a unified text-to-image generation framework with enhanced cross-modal interaction and efficiency, achieving superior performance with minimal parameters.

Authors:Dian Zheng, Ziqi Huang, Hongbo Liu, Kai Zou, Yinan He, Fan Zhang, Lulu Gu, Yuanhan Zhang, Jingwen He, Wei-Shi Zheng, Yu Qiao, Ziwei Liu
Title: VBench-2.0: Advancing Video Generation Benchmark Suite for Intrinsic Faithfulness
Abstract:
Video generation has advanced significantly, evolving from producing unrealistic outputs to generating videos that appear visually convincing and temporally coherent. To evaluate these video generative models, benchmarks such as VBench have been developed to assess their faithfulness, measuring factors like per-frame aesthetics, temporal consistency, and basic prompt adherence. However, these aspects mainly represent superficial faithfulness, which focus on whether the video appears visually convincing rather than whether it adheres to real-world principles. While recent models perform increasingly well on these metrics, they still struggle to generate videos that are not just visually plausible but fundamentally realistic. To achieve real "world models" through video generation, the next frontier lies in intrinsic faithfulness to ensure that generated videos adhere to physical laws, commonsense reasoning, anatomical correctness, and compositional integrity. Achieving this level of realism is essential for applications such as AI-assisted filmmaking and simulated world modeling. To bridge this gap, we introduce VBench-2.0, a next-generation benchmark designed to automatically evaluate video generative models for their intrinsic faithfulness. VBench-2.0 assesses five key dimensions: Human Fidelity, Controllability, Creativity, Physics, and Commonsense, each further broken down into fine-grained capabilities. Tailored to individual dimensions, our evaluation framework integrates generalists such as SOTA VLMs and LLMs, and specialists, including anomaly detection methods proposed for video generation. We conduct extensive human annotations to ensure evaluation alignment with human judgment. By pushing beyond superficial faithfulness toward intrinsic faithfulness, VBench-2.0 aims to set a new standard for the next generation of video generative models in pursuit of intrinsic faithfulness.
中文: 视频生成已从生成不真实内容发展到视觉逼真,但现有基准如VBench仅关注表面真实性,因此推出VBench-2.0评估物理规律和常识等内在真实性,以实现真正的现实模拟。
English: Video generation has progressed from unrealistic outputs to visually convincing videos, yet current benchmarks like VBench focus on superficial aspects, prompting the introduction of VBench-2.0 to evaluate intrinsic faithfulness in areas like physics and commonsense for true realism.

Authors:Yuhan Zhang, Mengchen Zhang, Tong Wu, Tengfei Wang, Gordon Wetzstein, Dahua Lin, Ziwei Liu
Title: 3DGen-Bench: Comprehensive Benchmark Suite for 3D Generative Models
Abstract:
3D generation is experiencing rapid advancements, while the development of 3D evaluation has not kept pace. How to keep automatic evaluation equitably aligned with human perception has become a well-recognized challenge. Recent advances in the field of language and image generation have explored human preferences and showcased respectable fitting ability. However, the 3D domain still lacks such a comprehensive preference dataset over generative models. To mitigate this absence, we develop 3DGen-Arena, an integrated platform in a battle manner. Then, we carefully design diverse text and image prompts and leverage the arena platform to gather human preferences from both public users and expert annotators, resulting in a large-scale multi-dimension human preference dataset 3DGen-Bench. Using this dataset, we further train a CLIP-based scoring model, 3DGen-Score, and a MLLM-based automatic evaluator, 3DGen-Eval. These two models innovatively unify the quality evaluation of text-to-3D and image-to-3D generation, and jointly form our automated evaluation system with their respective strengths. Extensive experiments demonstrate the efficacy of our scoring model in predicting human preferences, exhibiting a superior correlation with human ranks compared to existing metrics. We believe that our 3DGen-Bench dataset and automated evaluation system will foster a more equitable evaluation in the field of 3D generation, further promoting the development of 3D generative models and their downstream applications. Project page is available at https://zyh482.github.io/3DGen-Bench/.
中文: 为解决三维生成领域缺乏人类偏好数据的问题,研究者开发了3DGen-Arena平台和3DGen-Bench数据集,并基于此构建了3DGen-Score与3DGen-Eval自动评估系统,实验证明其评估结果比现有指标更符合人类判断标准。
English: The authors introduce 3DGen-Arena and 3DGen-Bench to address the lack of human preference data in 3D generation, developing automated evaluators 3DGen-Score and 3DGen-Eval that demonstrate superior alignment with human judgment compared to existing metrics.

Authors:Yongce Li, Chung-En Sun, Tsui-Wei Weng
Title: Effective Skill Unlearning through Intervention and Abstention
Abstract:
Large language Models (LLMs) have demonstrated remarkable skills across various domains. Understanding the mechanisms behind their abilities and implementing controls over them is becoming increasingly important for developing better models. In this paper, we focus on skill unlearning in LLMs, specifically unlearning a particular skill while retaining their overall capabilities. We introduce two lightweight, training-free machine skill unlearning techniques for LLMs. First, we observe that the pre-activation distribution of neurons in each Feed-Forward Layer (FFL) differs when the model demonstrates different skills. Additionally, we find that queries triggering the same skill cluster within the FFL key space and can be separated from other queries using a hypercube. Based on these observations, we propose two lightweight, training-free skill unlearning methods via \textit{intervention} and \textit{abstention} respectively: \texttt{Neuron Adjust} and \texttt{Key Space Detection}. We evaluate our methods on unlearning math-solving, Python-coding, and comprehension skills across seven different languages. The results demonstrate their strong unlearning capabilities for the designated skills. Specifically, \texttt{Key Space Detection} achieves over 80\% relative performance drop on the forgetting skill and less than 10\% relative performance drop on other skills and the model's general knowledge (MMLU) for most unlearning tasks. Our code is available at https://github.com/Trustworthy-ML-Lab/effective_skill_unlearning
中文: 本文提出了两种轻量级、无需训练的大语言模型技能遗忘方法,能有效消除特定技能,同时保持模型的整体能力和通用知识。
English: This paper introduces two lightweight, training-free methods for skill unlearning in large language models, which effectively eliminate specific skills while preserving overall capabilities and general knowledge.

Authors:Pietro Tropeano, Maria Maistro, Tuukka Ruotsalo, Christina Lioma
Title: As easy as PIE: understanding when pruning causes language models to disagree
Abstract:
Language Model (LM) pruning compresses the model by removing weights, nodes, or other parts of its architecture. Typically, pruning focuses on the resulting efficiency gains at the cost of effectiveness. However, when looking at how individual data points are affected by pruning, it turns out that a particular subset of data points always bears most of the brunt (in terms of reduced accuracy) when pruning, but this effect goes unnoticed when reporting the mean accuracy of all data points. These data points are called PIEs and have been studied in image processing, but not in NLP. In a study of various NLP datasets, pruning methods, and levels of compression, we find that PIEs impact inference quality considerably, regardless of class frequency, and that BERT is more prone to this than BiLSTM. We also find that PIEs contain a high amount of data points that have the largest influence on how well the model generalises to unseen data. This means that when pruning, with seemingly moderate loss to accuracy across all data points, we in fact hurt tremendously those data points that matter the most. We trace what makes PIEs both hard and impactful to inference to their overall longer and more semantically complex text. These findings are novel and contribute to understanding how LMs are affected by pruning. The code is available at: https://github.com/pietrotrope/AsEasyAsPIE
中文: 语言模型剪枝常忽视对特定数据点(称为PIEs)造成的显著准确性下降,这些数据点对模型泛化至关重要,且在BERT等模型中因文本更长、语义更复杂而更易受影响。
English: Language model pruning often overlooks the disproportionate accuracy loss on specific data points called PIEs, which are crucial for generalization and more affected in models like BERT due to their longer and semantically complex texts.

Authors:Wenqi Zhang, Mengna Wang, Gangao Liu, Xu Huixin, Yiwei Jiang, Yongliang Shen, Guiyang Hou, Zhe Zheng, Hang Zhang, Xin Li, Weiming Lu, Peng Li, Yueting Zhuang
Title: Embodied-Reasoner: Synergizing Visual Search, Reasoning, and Action for Embodied Interactive Tasks
Abstract:
Recent advances in deep thinking models have demonstrated remarkable reasoning capabilities on mathematical and coding tasks. However, their effectiveness in embodied domains which require continuous interaction with environments through image action interleaved trajectories remains largely -unexplored. We present Embodied Reasoner, a model that extends o1 style reasoning to interactive embodied search tasks. Unlike mathematical reasoning that relies primarily on logical deduction, embodied scenarios demand spatial understanding, temporal reasoning, and ongoing self-reflection based on interaction history. To address these challenges, we synthesize 9.3k coherent Observation-Thought-Action trajectories containing 64k interactive images and 90k diverse thinking processes (analysis, spatial reasoning, reflection, planning, and verification). We develop a three-stage training pipeline that progressively enhances the model's capabilities through imitation learning, self-exploration via rejection sampling, and self-correction through reflection tuning. The evaluation shows that our model significantly outperforms those advanced visual reasoning models, e.g., it exceeds OpenAI o1, o3-mini, and Claude-3.7 by +9\%, 24\%, and +13\%. Analysis reveals our model exhibits fewer repeated searches and logical inconsistencies, with particular advantages in complex long-horizon tasks. Real-world environments also show our superiority while exhibiting fewer repeated searches and logical inconsistency cases.
中文摘要:Embodied Reasoner模型通过合成数千条观察-思考-行动轨迹并采用三阶段训练流程,将深度推理能力扩展到交互式具身任务中,在评估中显著超越了先进的视觉推理模型。
English Summary: The Embodied Reasoner model extends deep reasoning to interactive embodied tasks by synthesizing thousands of Observation-Thought-Action trajectories and employing a three-stage training pipeline, significantly outperforming advanced visual reasoning models in evaluations.

Authors:Zhiyuan Ma, Xinyue Liang, Rongyuan Wu, Xiangyu Zhu, Zhen Lei, Lei Zhang
Title: Progressive Rendering Distillation: Adapting Stable Diffusion for Instant Text-to-Mesh Generation without 3D Data
Abstract:
It is highly desirable to obtain a model that can generate high-quality 3D meshes from text prompts in just seconds. While recent attempts have adapted pre-trained text-to-image diffusion models, such as Stable Diffusion (SD), into generators of 3D representations (e.g., Triplane), they often suffer from poor quality due to the lack of sufficient high-quality 3D training data. Aiming at overcoming the data shortage, we propose a novel training scheme, termed as Progressive Rendering Distillation (PRD), eliminating the need for 3D ground-truths by distilling multi-view diffusion models and adapting SD into a native 3D generator. In each iteration of training, PRD uses the U-Net to progressively denoise the latent from random noise for a few steps, and in each step it decodes the denoised latent into 3D output. Multi-view diffusion models, including MVDream and RichDreamer, are used in joint with SD to distill text-consistent textures and geometries into the 3D outputs through score distillation. Since PRD supports training without 3D ground-truths, we can easily scale up the training data and improve generation quality for challenging text prompts with creative concepts. Meanwhile, PRD can accelerate the inference speed of the generation model in just a few steps. With PRD, we train a Triplane generator, namely TriplaneTurbo, which adds only $2.5\%$ trainable parameters to adapt SD for Triplane generation. TriplaneTurbo outperforms previous text-to-3D generators in both efficiency and quality. Specifically, it can produce high-quality 3D meshes in 1.2 seconds and generalize well for challenging text input. The code is available at https://github.com/theEricMa/TriplaneTurbo.
Chinese: 提出的渐进式渲染蒸馏(PRD)方法通过融合多视角扩散模型实现无需3D真值数据的文本到3D生成,其训练的TriplaneTurbo模型仅需1.2秒即可生成高质量3D网格且泛化能力强。
English: The proposed Progressive Rendering Distillation (PRD) method enables efficient text-to-3D generation by distilling multi-view diffusion models without requiring 3D ground-truth data, resulting in TriplaneTurbo that produces high-quality 3D meshes in just 1.2 seconds.

Authors:Yassir Lairgi
Title: When Astronomy Meets AI: Manazel For Crescent Visibility Prediction in Morocco
Abstract:
The accurate determination of the beginning of each Hijri month is essential for religious, cultural, and administrative purposes. Manazel (The code and datasets are available at https://github.com/lairgiyassir/manazel) addresses this challenge in Morocco by leveraging 13 years of crescent visibility data to refine the ODEH criterion, a widely used standard for lunar crescent visibility prediction. The study integrates two key features, the Arc of Vision (ARCV) and the total width of the crescent (W), to enhance the accuracy of lunar visibility assessments. A machine learning approach utilizing the Logistic Regression algorithm is employed to classify crescent visibility conditions, achieving a predictive accuracy of 98.83%. This data-driven methodology offers a robust and reliable framework for determining the start of the Hijri month, comparing different data classification tools, and improving the consistency of lunar calendar calculations in Morocco. The findings demonstrate the effectiveness of machine learning in astronomical applications and highlight the potential for further enhancements in the modeling of crescent visibility.
Chinese: Manazel项目通过机器学习分析13年新月可见性数据,优化了ODEH准则,以98.83%的预测准确率提升了摩洛哥伊斯兰历月初判定的精确度。
English: The Manazel project enhances the accuracy of determining Hijri months in Morocco by applying machine learning to 13 years of lunar data, achieving 98.83% predictive accuracy through refined visibility criteria.

Authors:Zhengxi Lu, Yuxiang Chai, Yaxuan Guo, Xi Yin, Liang Liu, Hao Wang, Han Xiao, Shuai Ren, Guanjing Xiong, Hongsheng Li
Title: UI-R1: Enhancing Efficient Action Prediction of GUI Agents by Reinforcement Learning
Abstract:
The recent DeepSeek-R1 has showcased the emergence of reasoning capabilities in LLMs through reinforcement learning (RL) with rule-based rewards. Despite its success in language models, its application in multi-modal domains, particularly in graphic user interface (GUI) agent tasks, remains under-explored. To address this issue, we propose UI-R1, the first framework to explore how rule-based RL can enhance the reasoning capabilities of multimodal large language models (MLLMs) for GUI action prediction tasks. Specifically, UI-R1 introduces a novel rule-based action reward, enabling model optimization via policy-based algorithms such as Group Relative Policy Optimization (GRPO). For efficient training, we curate a small yet high-quality dataset of 136 challenging tasks, encompassing five common action types on mobile devices. Experimental results demonstrate that our proposed UI-R1-3B achieves significant improvements over the base model (i.e. Qwen2.5-VL-3B) on both in-domain (ID) and out-of-domain (OOD) tasks, with average accuracy gains of 22.1% on ScreenSpot, 6.0% on ScreenSpot-Pro, and 12.7% on ANDROIDCONTROL. Furthermore, UI-R1-3B delivers competitive performance compared to larger models (e.g., OS-Atlas-7B) trained via supervised fine-tuning (SFT) on 76K samples. We additionally develop an optimized version, UI-R1-E-3B, which significantly improves both grounding efficiency and accuracy. These results underscore the potential of rule-based reinforcement learning to advance GUI understanding and control, paving the way for future research in this domain. Code website: https://github.com/lll6gg/UI-R1.
中文:UI-R1框架首次采用基于规则的强化学习增强多模态大语言模型在图形界面操作预测中的推理能力,仅用少量高质量训练数据即在多项基准测试中实现显著精度提升。
English: The UI-R1 framework employs rule-based reinforcement learning to significantly enhance multimodal large language models' reasoning for GUI action prediction, achieving notable accuracy improvements across various benchmarks with a compact training dataset.

Authors:Alimjan Mattursun, Liejun Wang, Yinfeng Yu, Chunyang Ma
Title: Magnitude-Phase Dual-Path Speech Enhancement Network based on Self-Supervised Embedding and Perceptual Contrast Stretch Boosting
Abstract:
Speech self-supervised learning (SSL) has made great progress in various speech processing tasks, but there is still room for improvement in speech enhancement (SE). This paper presents BSP-MPNet, a dual-path framework that combines self-supervised features with magnitude-phase information for SE. The approach starts by applying the perceptual contrast stretching (PCS) algorithm to enhance the magnitude-phase spectrum. A magnitude-phase 2D coarse (MP-2DC) encoder then extracts coarse features from the enhanced spectrum. Next, a feature-separating self-supervised learning (FS-SSL) model generates self-supervised embeddings for the magnitude and phase components separately. These embeddings are fused to create cross-domain feature representations. Finally, two parallel RNN-enhanced multi-attention (REMA) mask decoders refine the features, apply them to the mask, and reconstruct the speech signal. We evaluate BSP-MPNet on the VoiceBank+DEMAND and WHAMR! datasets. Experimental results show that BSP-MPNet outperforms existing methods under various noise conditions, providing new directions for self-supervised speech enhancement research. The implementation of the BSP-MPNet code is available online\footnote[2]{https://github.com/AlimMat/BSP-MPNet. \label{s1}}
中文: 本文提出BSP-MPNet双路径框架,通过结合自监督特征与幅度-相位信息进行语音增强,在多种噪声环境的基准测试中展现出优于现有方法的性能。
English: This paper introduces BSP-MPNet, a dual-path framework that integrates self-supervised learning with magnitude-phase processing to advance speech enhancement, demonstrating superior performance across multiple noise conditions on benchmark datasets.

Authors:Jonathan Lee, Bolivar Solarte, Chin-Hsuan Wu, Jin-Cheng Jhang, Fu-En Wang, Yi-Hsuan Tsai, Min Sun
Title: uLayout: Unified Room Layout Estimation for Perspective and Panoramic Images
Abstract:
We present uLayout, a unified model for estimating room layout geometries from both perspective and panoramic images, whereas traditional solutions require different model designs for each image type. The key idea of our solution is to unify both domains into the equirectangular projection, particularly, allocating perspective images into the most suitable latitude coordinate to effectively exploit both domains seamlessly. To address the Field-of-View (FoV) difference between the input domains, we design uLayout with a shared feature extractor with an extra 1D-Convolution layer to condition each domain input differently. This conditioning allows us to efficiently formulate a column-wise feature regression problem regardless of the FoV input. This simple yet effective approach achieves competitive performance with current state-of-the-art solutions and shows for the first time a single end-to-end model for both domains. Extensive experiments in the real-world datasets, LSUN, Matterport3D, PanoContext, and Stanford 2D-3D evidence the contribution of our approach. Code is available at https://github.com/JonathanLee112/uLayout.
中文总结:uLayout是一种统一模型,通过将透视和全景图像转换为等距柱状投影,并采用共享特征提取器与领域特定调节,实现了对两种图像类型的房间布局几何的精确估计。
English Summary: uLayout is a unified model that accurately estimates room layouts from both perspective and panoramic images by converting them into equirectangular projections and using a shared feature extractor with domain-specific conditioning.

Authors:Juliana Costa-Silva, David Menotti, Fabricio M. Lopes
Title: consexpressionR: an R package for consensus differential gene expression analysis
Abstract:
Motivation: Bulk RNA-Seq is a widely used method for studying gene expression across a variety of contexts. The significance of RNA-Seq studies has grown with the advent of high-throughput sequencing technologies. Computational methods have been developed for each stage of the identification of differentially expressed genes. Nevertheless, there are few studies exploring the association between different types of methods. In this study, we evaluated the impact of the association of methodologies in the results of differential expression analysis. By adopting two data sets with qPCR data (to gold-standard reference), seven methods were implemented and assessed in R packages (EBSeq, edgeR, DESeq2, limma, SAMseq, NOISeq, and Knowseq), which was performed and assessed separately and in association. The results were evaluated considering the adopted qPCR data. Results: Here, we introduce consexpressionR, an R package that automates differential expression analysis using consensus of at least seven methodologies, producing more assertive results with a significant reduction in false positives. Availability: consexpressionR is an R package available via source code and support are available at GitHub (https://github.com/costasilvati/consexpressionR).
中文: 本研究推出了consexpressionR这一R软件包,它通过整合七种方法自动化进行差异表达分析,以提高准确性并减少假阳性结果。
English: The study introduces consexpressionR, an R package that automates differential expression analysis by combining seven methods to improve accuracy and reduce false positives.

Authors:Yuwei Yin, EunJeong Hwang, Giuseppe Carenini
Title: SWI: Speaking with Intent in Large Language Models
Abstract:
Intent, typically clearly formulated and planned, functions as a cognitive framework for communication and problem-solving. This paper introduces the concept of Speaking with Intent (SWI) in large language models (LLMs), where the explicitly generated intent encapsulates the model's underlying intention and provides high-level planning to guide subsequent analysis and action. By emulating deliberate and purposeful thoughts in the human mind, SWI is hypothesized to enhance the reasoning capabilities and generation quality of LLMs. Extensive experiments on text summarization, multi-task question answering, and mathematical reasoning benchmarks consistently demonstrate the effectiveness and generalizability of Speaking with Intent over direct generation without explicit intent. Further analysis corroborates the generalizability of SWI under different experimental settings. Moreover, human evaluations verify the coherence, effectiveness, and interpretability of the intent produced by SWI. The promising results in enhancing LLMs with explicit intents pave a new avenue for boosting LLMs' generation and reasoning abilities with cognitive notions.
中文: 本文提出“有意图对话”方法,通过显式生成意图来增强大语言模型的推理能力和生成质量,实验和人工评估均证实其有效性。
English: This paper proposes Speaking with Intent (SWI) for large language models, where explicit intent generation enhances reasoning and output quality, as validated by experiments and human evaluation.

Authors:Achint Soni, Meet Soni, Sirisha Rambhatla
Title: LOCATEdit: Graph Laplacian Optimized Cross Attention for Localized Text-Guided Image Editing
Abstract:
Text-guided image editing aims to modify specific regions of an image according to natural language instructions while maintaining the general structure and the background fidelity. Existing methods utilize masks derived from cross-attention maps generated from diffusion models to identify the target regions for modification. However, since cross-attention mechanisms focus on semantic relevance, they struggle to maintain the image integrity. As a result, these methods often lack spatial consistency, leading to editing artifacts and distortions. In this work, we address these limitations and introduce LOCATEdit, which enhances cross-attention maps through a graph-based approach utilizing self-attention-derived patch relationships to maintain smooth, coherent attention across image regions, ensuring that alterations are limited to the designated items while retaining the surrounding structure. LOCATEdit consistently and substantially outperforms existing baselines on PIE-Bench, demonstrating its state-of-the-art performance and effectiveness on various editing tasks. Code can be found on https://github.com/LOCATEdit/LOCATEdit/
中文: 本文提出LOCATEdit方法,通过基于图的自注意力关系增强交叉注意力图,在文本引导图像编辑中保持空间一致性并减少伪影,在PIE-Bench基准测试中实现了最先进的性能。
English: This paper introduces LOCATEdit, a novel text-guided image editing method that improves cross-attention maps using graph-based self-attention relationships to maintain spatial consistency and minimize artifacts, achieving state-of-the-art performance on PIE-Bench.

Authors:Haote Yang, Xingjian Wei, Jiang Wu, Noémi Ligeti-Nagy, Jiaxing Sun, Yinfan Wang, Zijian Győző Yang, Junyuan Gao, Jingchao Wang, Bowen Jiang, Shasha Wang, Nanjun Yu, Zihao Zhang, Shixin Hong, Hongwei Liu, Wei Li, Songyang Zhang, Dahua Lin, Lijun Wu, Gábor Prószéky, Conghui He
Title: OpenHuEval: Evaluating Large Language Model on Hungarian Specifics
Abstract:
We introduce OpenHuEval, the first benchmark for LLMs focusing on the Hungarian language and specifics. OpenHuEval is constructed from a vast collection of Hungarian-specific materials sourced from multiple origins. In the construction, we incorporated the latest design principles for evaluating LLMs, such as using real user queries from the internet, emphasizing the assessment of LLMs' generative capabilities, and employing LLM-as-judge to enhance the multidimensionality and accuracy of evaluations. Ultimately, OpenHuEval encompasses eight Hungarian-specific dimensions, featuring five tasks and 3953 questions. Consequently, OpenHuEval provides the comprehensive, in-depth, and scientifically accurate assessment of LLM performance in the context of the Hungarian language and its specifics. We evaluated current mainstream LLMs, including both traditional LLMs and recently developed Large Reasoning Models. The results demonstrate the significant necessity for evaluation and model optimization tailored to the Hungarian language and specifics. We also established the framework for analyzing the thinking processes of LRMs with OpenHuEval, revealing intrinsic patterns and mechanisms of these models in non-English languages, with Hungarian serving as a representative example. We will release OpenHuEval at https://github.com/opendatalab/OpenHuEval .
中文: OpenHuEval是首个针对匈牙利语的大语言模型评测基准,通过真实用户查询和多维度评估方法,揭示了针对匈牙利语特性进行模型优化的必要性。
English: OpenHuEval is the first comprehensive benchmark designed to evaluate large language models on Hungarian language proficiency, incorporating real user queries and multidimensional assessments to highlight the need for language-specific model optimization.

Authors:Huacheng Li, Jingyong Su, Kai Wang
Title: Advancing CAN Network Security through RBM-Based Synthetic Attack Data Generation for Intrusion Detection Systems
Abstract:
The rapid development of network technologies and industrial intelligence has augmented the connectivity and intelligence within the automotive industry. Notably, in the Internet of Vehicles (IoV), the Controller Area Network (CAN), which is crucial for the communication of electronic control units but lacks inbuilt security measures, has become extremely vulnerable to severe cybersecurity threats. Meanwhile, the efficacy of Intrusion Detection Systems (IDS) is hampered by the scarcity of sufficient attack data for robust model training. To overcome this limitation, we introduce a novel methodology leveraging the Restricted Boltzmann Machine (RBM) to generate synthetic CAN attack data, thereby producing training datasets with a more balanced sample distribution. Specifically, we design a CAN Data Processing Module for transforming raw CAN data into an RBM-trainable format, and a Negative Sample Generation Module to generate data reflecting the distribution of CAN data frames denoting network intrusions. Experimental results show the generated data significantly improves IDS performance, with CANet accuracy rising from 0.6477 to 0.9725 and EfficientNet from 0.1067 to 0.1555. Code is available at https://github.com/wangkai-tech23/CANDataSynthetic.
中文摘要:本研究提出了一种利用受限玻尔兹曼机生成合成CAN攻击数据的新方法,有效平衡了训练数据集,显著提升了车联网中入侵检测系统的性能。
English Summary: This study introduces a novel method using Restricted Boltzmann Machine to generate synthetic CAN attack data, effectively balancing training datasets and significantly enhancing Intrusion Detection System performance in the Internet of Vehicles.

Authors:Shuming Liu, Chen Zhao, Tianqi Xu, Bernard Ghanem
Title: BOLT: Boost Large Vision-Language Model Without Training for Long-form Video Understanding
Abstract:
Large video-language models (VLMs) have demonstrated promising progress in various video understanding tasks. However, their effectiveness in long-form video analysis is constrained by limited context windows. Traditional approaches, such as uniform frame sampling, often inevitably allocate resources to irrelevant content, diminishing their effectiveness in real-world scenarios. In this paper, we introduce BOLT, a method to BOost Large VLMs without additional Training through a comprehensive study of frame selection strategies. First, to enable a more realistic evaluation of VLMs in long-form video understanding, we propose a multi-source retrieval evaluation setting. Our findings reveal that uniform sampling performs poorly in noisy contexts, underscoring the importance of selecting the right frames. Second, we explore several frame selection strategies based on query-frame similarity and analyze their effectiveness at inference time. Our results show that inverse transform sampling yields the most significant performance improvement, increasing accuracy on the Video-MME benchmark from 53.8% to 56.1% and MLVU benchmark from 58.9% to 63.4%. Our code is available at https://github.com/sming256/BOLT.
中文: BOLT方法通过逆向变换采样策略优化视频语言模型的长视频分析能力,无需额外训练即可显著提升多个基准测试的准确率。
English: BOLT enhances large video-language models for long-form video analysis by employing inverse transform sampling for frame selection, boosting accuracy on benchmarks without requiring additional training.

Authors:Yuxiao Sun, Yao Zhao, Meiqin Liu, Chao Yao, Weisi Lin
Title: Embedding Compression Distortion in Video Coding for Machines
Abstract:
Currently, video transmission serves not only the Human Visual System (HVS) for viewing but also machine perception for analysis. However, existing codecs are primarily optimized for pixel-domain and HVS-perception metrics rather than the needs of machine vision tasks. To address this issue, we propose a Compression Distortion Representation Embedding (CDRE) framework, which extracts machine-perception-related distortion representation and embeds it into downstream models, addressing the information lost during compression and improving task performance. Specifically, to better analyze the machine-perception-related distortion, we design a compression-sensitive extractor that identifies compression degradation in the feature domain. For efficient transmission, a lightweight distortion codec is introduced to compress the distortion information into a compact representation. Subsequently, the representation is progressively embedded into the downstream model, enabling it to be better informed about compression degradation and enhancing performance. Experiments across various codecs and downstream tasks demonstrate that our framework can effectively boost the rate-task performance of existing codecs with minimal overhead in terms of bitrate, execution time, and number of parameters. Our codes and supplementary materials are released in https://github.com/Ws-Syx/CDRE/.
中文: 提出的CDRE框架通过将压缩失真表征嵌入下游模型来增强机器视觉任务性能,有效弥补压缩过程中的信息损失,并以极低开销实现性能提升。
English: The proposed CDRE framework enhances machine vision tasks by embedding compression distortion representations into downstream models, effectively compensating for information loss and improving performance with minimal overhead.

Authors:Tin T. Tran, V. Snasel
Title: Improvement Graph Convolution Collaborative Filtering with Weighted addition input
Abstract:
Graph Neural Networks have been extensively applied in the field of machine learning to find features of graphs, and recommendation systems are no exception. The ratings of users on considered items can be represented by graphs which are input for many efficient models to find out the characteristics of the users and the items. From these insights, relevant items are recommended to users. However, user's decisions on the items have varying degrees of effects on different users, and this information should be learned so as not to be lost in the process of information mining. In this publication, we propose to build an additional graph showing the recommended weight of an item to a target user to improve the accuracy of GNN models. Although the users' friendships were not recorded, their correlation was still evident through the commonalities in consumption behavior. We build a model WiGCN (Weighted input GCN) to describe and experiment on well-known datasets. Conclusions will be stated after comparing our results with state-of-the-art such as GCMC, NGCF and LightGCN. The source code is also included at https://github.com/trantin84/WiGCN.
中文: 本文提出WiGCN加权图神经网络模型,通过行为关联捕捉用户决策的不同影响来提升推荐精度,并通过对标前沿方法的实验验证了其有效性。
English: This paper introduces WiGCN, a weighted graph neural network model that enhances recommendation accuracy by capturing varying user decision influences through behavioral correlations, with experimental validation against leading methods.

Authors:Ryan Marinelli, Josef Pichlmeier, Tamas Bisztray
Title: Harnessing Chain-of-Thought Metadata for Task Routing and Adversarial Prompt Detection
Abstract:
In this work, we propose a metric called Number of Thoughts (NofT) to determine the difficulty of tasks pre-prompting and support Large Language Models (LLMs) in production contexts. By setting thresholds based on the number of thoughts, this metric can discern the difficulty of prompts and support more effective prompt routing. A 2% decrease in latency is achieved when routing prompts from the MathInstruct dataset through quantized, distilled versions of Deepseek with 1.7 billion, 7 billion, and 14 billion parameters. Moreover, this metric can be used to detect adversarial prompts used in prompt injection attacks with high efficacy. The Number of Thoughts can inform a classifier that achieves 95% accuracy in adversarial prompt detection. Our experiments ad datasets used are available on our GitHub page: https://github.com/rymarinelli/Number_Of_Thoughts/tree/main.
中文: 本文提出“思维数量”指标来评估任务难度并优化大语言模型的提示路由,实现了2%的延迟降低,并在对抗性提示检测中达到95%的准确率。
English: This paper introduces the Number of Thoughts (NofT) metric to assess task difficulty and enhance prompt routing for LLMs, achieving a 2% latency reduction and 95% accuracy in detecting adversarial prompts.

Authors:Junyu Luo, Weizhi Zhang, Ye Yuan, Yusheng Zhao, Junwei Yang, Yiyang Gu, Bohan Wu, Binqi Chen, Ziyue Qiao, Qingqing Long, Rongcheng Tu, Xiao Luo, Wei Ju, Zhiping Xiao, Yifan Wang, Meng Xiao, Chenwu Liu, Jingyang Yuan, Shichang Zhang, Yiqiao Jin, Fan Zhang, Xian Wu, Hanqing Zhao, Dacheng Tao, Philip S. Yu, Ming Zhang
Title: Large Language Model Agent: A Survey on Methodology, Applications and Challenges
Abstract:
The era of intelligent agents is upon us, driven by revolutionary advancements in large language models. Large Language Model (LLM) agents, with goal-driven behaviors and dynamic adaptation capabilities, potentially represent a critical pathway toward artificial general intelligence. This survey systematically deconstructs LLM agent systems through a methodology-centered taxonomy, linking architectural foundations, collaboration mechanisms, and evolutionary pathways. We unify fragmented research threads by revealing fundamental connections between agent design principles and their emergent behaviors in complex environments. Our work provides a unified architectural perspective, examining how agents are constructed, how they collaborate, and how they evolve over time, while also addressing evaluation methodologies, tool applications, practical challenges, and diverse application domains. By surveying the latest developments in this rapidly evolving field, we offer researchers a structured taxonomy for understanding LLM agents and identify promising directions for future research. The collection is available at https://github.com/luo-junyu/Awesome-Agent-Papers.
中文摘要:本综述通过以方法论为中心的分类体系,系统解构了LLM智能体系统的架构基础、协作机制与演化路径,为理解其发展提供了统一框架并指明了未来研究方向。
English Summary: This survey systematically analyzes LLM agent systems by examining their architecture, collaboration, and evolution, offering a unified framework for understanding their development and identifying future research directions.

Authors:Xiaoqin Wang, Xusen Ma, Xianxu Hou, Meidan Ding, Yudong Li, Junliang Chen, Wenting Chen, Xiaoyang Peng, Linlin Shen
Title: FaceBench: A Multi-View Multi-Level Facial Attribute VQA Dataset for Benchmarking Face Perception MLLMs
Abstract:
Multimodal large language models (MLLMs) have demonstrated remarkable capabilities in various tasks. However, effectively evaluating these MLLMs on face perception remains largely unexplored. To address this gap, we introduce FaceBench, a dataset featuring hierarchical multi-view and multi-level attributes specifically designed to assess the comprehensive face perception abilities of MLLMs. Initially, we construct a hierarchical facial attribute structure, which encompasses five views with up to three levels of attributes, totaling over 210 attributes and 700 attribute values. Based on the structure, the proposed FaceBench consists of 49,919 visual question-answering (VQA) pairs for evaluation and 23,841 pairs for fine-tuning. Moreover, we further develop a robust face perception MLLM baseline, Face-LLaVA, by training with our proposed face VQA data. Extensive experiments on various mainstream MLLMs and Face-LLaVA are conducted to test their face perception ability, with results also compared against human performance. The results reveal that, the existing MLLMs are far from satisfactory in understanding the fine-grained facial attributes, while our Face-LLaVA significantly outperforms existing open-source models with a small amount of training data and is comparable to commercial ones like GPT-4o and Gemini. The dataset will be released at https://github.com/CVI-SZU/FaceBench.
中文: FaceBench是一个专门设计用于评估多模态大语言模型面部感知能力的数据集,通过分层属性和视觉问答对进行测试,所开发的Face-LLaVA模型在性能上显著优于现有开源模型,并与商业模型相媲美。
English: FaceBench is a specialized dataset introduced to evaluate multimodal large language models' face perception abilities through hierarchical attributes and visual question-answering pairs, with the developed Face-LLaVA model showing superior performance compared to existing models.

Authors:Changjian Zhou, Yuexi Qiu, Tongtong Ling, Jiafeng Li, Shuanghe Liu, Xiangjing Wang, Jia Song, Wensheng Xiang
Title: CMADiff: Cross-Modal Aligned Diffusion for Controllable Protein Generation
Abstract:
AI-assisted protein design has emerged as a critical tool for advancing biotechnology, as deep generative models have demonstrated their reliability in this domain. However, most existing models primarily utilize protein sequence or structural data for training, neglecting the physicochemical properties of proteins.Moreover, they are deficient to control the generation of proteins in intuitive conditions. To address these limitations,we propose CMADiff here, a novel framework that enables controllable protein generation by aligning the physicochemical properties of protein sequences with text-based descriptions through a latent diffusion process. Specifically, CMADiff employs a Conditional Variational Autoencoder (CVAE) to integrate physicochemical features as conditional input, forming a robust latent space that captures biological traits. In this latent space, we apply a conditional diffusion process, which is guided by BioAligner, a contrastive learning-based module that aligns text descriptions with protein features, enabling text-driven control over protein sequence generation. Validated by a series of evaluations including AlphaFold3, the experimental results indicate that CMADiff outperforms protein sequence generation benchmarks and holds strong potential for future applications. The implementation and code are available at https://github.com/HPC-NEAU/PhysChemDiff.
中文摘要:CMADiff是一种创新框架,通过潜在扩散过程将蛋白质的物理化学特性与文本描述对齐,实现了可控的蛋白质生成,其性能优于现有基准并展现出强大的应用潜力。
English Summary: CMADiff is a novel framework that enables controllable protein generation by aligning physicochemical properties with text descriptions using a latent diffusion process, outperforming existing benchmarks and showing strong application potential.

Authors:Lucas Nunes, Rodrigo Marcuzzi, Jens Behley, Cyrill Stachniss
Title: Towards Generating Realistic 3D Semantic Training Data for Autonomous Driving
Abstract:
Semantic scene understanding is crucial for robotics and computer vision applications. In autonomous driving, 3D semantic segmentation plays an important role for enabling safe navigation. Despite significant advances in the field, the complexity of collecting and annotating 3D data is a bottleneck in this developments. To overcome that data annotation limitation, synthetic simulated data has been used to generate annotated data on demand. There is still however a domain gap between real and simulated data. More recently, diffusion models have been in the spotlight, enabling close-to-real data synthesis. Those generative models have been recently applied to the 3D data domain for generating scene-scale data with semantic annotations. Still, those methods either rely on image projection or decoupled models trained with different resolutions in a coarse-to-fine manner. Such intermediary representations impact the generated data quality due to errors added in those transformations. In this work, we propose a novel approach able to generate 3D semantic scene-scale data without relying on any projection or decoupled trained multi-resolution models, achieving more realistic semantic scene data generation compared to previous state-of-the-art methods. Besides improving 3D semantic scene-scale data synthesis, we thoroughly evaluate the use of the synthetic scene samples as labeled data to train a semantic segmentation network. In our experiments, we show that using the synthetic annotated data generated by our method as training data together with the real semantic segmentation labels, leads to an improvement in the semantic segmentation model performance. Our results show the potential of generated scene-scale point clouds to generate more training data to extend existing datasets, reducing the data annotation effort. Our code is available at https://github.com/PRBonn/3DiSS.
Chinese: 本研究提出了一种无需依赖投影或解耦模型的新方法,能够生成逼真的三维语义场景尺度数据,不仅提升了数据合成质量,还通过将其作为训练数据有效提高了语义分割模型的性能。
English: This study introduces a novel method for generating realistic 3D semantic scene-scale data without relying on projections or decoupled models, enhancing both data synthesis quality and the performance of semantic segmentation models when used as training data.

Authors:David P. Hofmeyr
Title: Nearest Neighbour Equilibrium Clustering
Abstract:
A novel and intuitive nearest neighbours based clustering algorithm is introduced, in which a cluster is defined in terms of an equilibrium condition which balances its size and cohesiveness. The formulation of the equilibrium condition allows for a quantification of the strength of alignment of each point to a cluster, with these cluster alignment strengths leading naturally to a model selection criterion which renders the proposed approach fully automatable. The algorithm is simple to implement and computationally efficient, and produces clustering solutions of extremely high quality in comparison with relevant benchmarks from the literature. R code to implement the approach is available from https://github.com/DavidHofmeyr/NNEC.
中文摘要:一种基于最近邻的新聚类算法通过平衡簇的大小和内聚性来定义簇,提供自动化模型选择并生成高质量结果,附有可用的R代码。
English Summary: A new clustering algorithm using nearest neighbors defines clusters by balancing size and cohesiveness, offering automated model selection and high-quality results with available R code.

Authors:Tong Nie, Jian Sun, Wei Ma
Title: Exploring the Roles of Large Language Models in Reshaping Transportation Systems: A Survey, Framework, and Roadmap
Abstract:
Modern transportation systems face pressing challenges due to increasing demand, dynamic environments, and heterogeneous information integration. The rapid evolution of Large Language Models (LLMs) offers transformative potential to address these challenges. Extensive knowledge and high-level capabilities derived from pretraining evolve the default role of LLMs as text generators to become versatile, knowledge-driven task solvers for intelligent transportation systems. This survey first presents LLM4TR, a novel conceptual framework that systematically categorizes the roles of LLMs in transportation into four synergetic dimensions: information processors, knowledge encoders, component generators, and decision facilitators. Through a unified taxonomy, we systematically elucidate how LLMs bridge fragmented data pipelines, enhance predictive analytics, simulate human-like reasoning, and enable closed-loop interactions across sensing, learning, modeling, and managing tasks in transportation systems. For each role, our review spans diverse applications, from traffic prediction and autonomous driving to safety analytics and urban mobility optimization, highlighting how emergent capabilities of LLMs such as in-context learning and step-by-step reasoning can enhance the operation and management of transportation systems. We further curate practical guidance, including available resources and computational guidelines, to support real-world deployment. By identifying challenges in existing LLM-based solutions, this survey charts a roadmap for advancing LLM-driven transportation research, positioning LLMs as central actors in the next generation of cyber-physical-social mobility ecosystems. Online resources can be found in the project page: https://github.com/tongnie/awesome-llm4tr.
中文摘要:本综述提出LLM4TR框架,系统阐述大语言模型通过信息处理、知识编码、组件生成和决策支持四大协同维度,在交通预测、自动驾驶等场景中提升交通系统智能化水平,并规划了技术发展路线图。
English Summary: This survey introduces the LLM4TR framework, demonstrating how Large Language Models can transform transportation systems by processing information, encoding knowledge, generating components, and facilitating decisions across various applications while addressing current challenges and future directions.

Authors:Erik Wallin, Fredrik Kahl, Lars Hammarstrand
Title: ProHOC: Probabilistic Hierarchical Out-of-Distribution Classification via Multi-Depth Networks
Abstract:
Out-of-distribution (OOD) detection in deep learning has traditionally been framed as a binary task, where samples are either classified as belonging to the known classes or marked as OOD, with little attention given to the semantic relationships between OOD samples and the in-distribution (ID) classes. We propose a framework for detecting and classifying OOD samples in a given class hierarchy. Specifically, we aim to predict OOD data to their correct internal nodes of the class hierarchy, whereas the known ID classes should be predicted as their corresponding leaf nodes. Our approach leverages the class hierarchy to create a probabilistic model and we implement this model by using networks trained for ID classification at multiple hierarchy depths. We conduct experiments on three datasets with predefined class hierarchies and show the effectiveness of our method. Our code is available at https://github.com/walline/prohoc.
中文摘要:本文提出了一种在类别层次结构中检测并分类分布外样本的框架,能够将未知样本归类至层次结构的内部节点,同时保持对已知类别的叶节点准确预测,并在多个数据集上验证了方法的有效性。
English Summary: This paper introduces a framework that detects out-of-distribution (OOD) samples by classifying them into appropriate internal nodes of a class hierarchy, while maintaining accurate leaf-level predictions for in-distribution data, demonstrating effectiveness across multiple datasets.

Authors:Haoxiang Sun, Yingqian Min, Zhipeng Chen, Wayne Xin Zhao, Lei Fang, Zheng Liu, Zhongyuan Wang, Ji-Rong Wen
Title: Challenging the Boundaries of Reasoning: An Olympiad-Level Math Benchmark for Large Language Models
Abstract:
In recent years, the rapid development of large reasoning models has resulted in the saturation of existing benchmarks for evaluating mathematical reasoning, highlighting the urgent need for more challenging and rigorous evaluation frameworks. To address this gap, we introduce OlymMATH, a novel Olympiad-level mathematical benchmark, designed to rigorously test the complex reasoning capabilities of LLMs. OlymMATH features 200 meticulously curated problems, each manually verified and available in parallel English and Chinese versions. The problems are systematically organized into two distinct difficulty tiers: (1) AIME-level problems (easy) that establish a baseline for mathematical reasoning assessment, and (2) significantly more challenging problems (hard) designed to push the boundaries of current state-of-the-art models. In our benchmark, these problems span four core mathematical fields, each including a verifiable numerical solution to enable objective, rule-based evaluation. Empirical results underscore the significant challenge presented by OlymMATH, with state-of-the-art models including DeepSeek-R1, OpenAI's o3-mini and Gemini 2.5 Pro Exp demonstrating notably limited accuracy on the hard subset. Furthermore, the benchmark facilitates comprehensive bilingual assessment of mathematical reasoning abilities-a critical dimension that remains largely unaddressed in mainstream mathematical reasoning benchmarks. We release the benchmark, evaluation code, detailed results and a data visualization tool at https://github.com/RUCAIBox/OlymMATH.
中文: 针对现有数学推理基准饱和的问题,OlymMATH推出了一个具有挑战性的奥林匹克级别数学基准,包含200道双语题目和两个难度级别,揭示了顶尖模型的显著局限性,并支持全面的双语评估。
English: To address the saturation of existing benchmarks, OlymMATH introduces a challenging Olympiad-level mathematical benchmark with 200 bilingual problems across two difficulty tiers, revealing significant limitations in state-of-the-art models and enabling comprehensive bilingual evaluation.

Authors:Zhenxiang Ma, Zhenyu Yang, Miao Tao, Yuanzhen Zhou, Zeyu He, Yuchang Zhang, Rong Fu, Hengjie Li
Title: LandMarkSystem Technical Report
Abstract:
3D reconstruction is vital for applications in autonomous driving, virtual reality, augmented reality, and the metaverse. Recent advancements such as Neural Radiance Fields(NeRF) and 3D Gaussian Splatting (3DGS) have transformed the field, yet traditional deep learning frameworks struggle to meet the increasing demands for scene quality and scale. This paper introduces LandMarkSystem, a novel computing framework designed to enhance multi-scale scene reconstruction and rendering. By leveraging a componentized model adaptation layer, LandMarkSystem supports various NeRF and 3DGS structures while optimizing computational efficiency through distributed parallel computing and model parameter offloading. Our system addresses the limitations of existing frameworks, providing dedicated operators for complex 3D sparse computations, thus facilitating efficient training and rapid inference over extensive scenes. Key contributions include a modular architecture, a dynamic loading strategy for limited resources, and proven capabilities across multiple representative algorithms.This comprehensive solution aims to advance the efficiency and effectiveness of 3D reconstruction tasks.To facilitate further research and collaboration, the source code and documentation for the LandMarkSystem project are publicly available in an open-source repository, accessing the repository at: https://github.com/InternLandMark/LandMarkSystem.
Chinese: 本文提出LandMarkSystem这一新型计算框架,通过分布式计算和优化算子支持多种神经辐射场与3D高斯泼溅结构,有效提升多尺度三维场景重建与渲染效率,解决了现有深度学习框架在场景质量与规模方面的局限性。
English: This paper introduces LandMarkSystem, a novel computing framework that enhances multi-scale 3D scene reconstruction and rendering by supporting various Neural Radiance Fields and 3D Gaussian Splatting structures through distributed computing and optimized operators, addressing limitations in existing deep learning frameworks.

Authors:Yehui Shen, Lei Zhang, Qingqiu Li, Xiongwei Zhao, Yue Wang, Huimin Lu, Xieyuanli Chen
Title: UGNA-VPR: A Novel Training Paradigm for Visual Place Recognition Based on Uncertainty-Guided NeRF Augmentation
Abstract:
Visual place recognition (VPR) is crucial for robots to identify previously visited locations, playing an important role in autonomous navigation in both indoor and outdoor environments. However, most existing VPR datasets are limited to single-viewpoint scenarios, leading to reduced recognition accuracy, particularly in multi-directional driving or feature-sparse scenes. Moreover, obtaining additional data to mitigate these limitations is often expensive. This paper introduces a novel training paradigm to improve the performance of existing VPR networks by enhancing multi-view diversity within current datasets through uncertainty estimation and NeRF-based data augmentation. Specifically, we initially train NeRF using the existing VPR dataset. Then, our devised self-supervised uncertainty estimation network identifies places with high uncertainty. The poses of these uncertain places are input into NeRF to generate new synthetic observations for further training of VPR networks. Additionally, we propose an improved storage method for efficient organization of augmented and original training data. We conducted extensive experiments on three datasets and tested three different VPR backbone networks. The results demonstrate that our proposed training paradigm significantly improves VPR performance by fully utilizing existing data, outperforming other training approaches. We further validated the effectiveness of our approach on self-recorded indoor and outdoor datasets, consistently demonstrating superior results. Our dataset and code have been released at \href{https://github.com/nubot-nudt/UGNA-VPR}{https://github.com/nubot-nudt/UGNA-VPR}.
中文: 本文提出一种训练范式,通过不确定性估计和基于NeRF的数据增强从现有数据集中生成合成视图,显著提升了视觉位置识别在不同环境中的性能表现。
English: This paper introduces a training paradigm that enhances visual place recognition by using uncertainty estimation and NeRF-based data augmentation to generate synthetic views from existing datasets, significantly improving performance across various environments.

Authors:Zixu Li, Zhiheng Fu, Yupeng Hu, Zhiwei Chen, Haokun Wen, Liqiang Nie
Title: FineCIR: Explicit Parsing of Fine-Grained Modification Semantics for Composed Image Retrieval
Abstract:
Composed Image Retrieval (CIR) facilitates image retrieval through a multimodal query consisting of a reference image and modification text. The reference image defines the retrieval context, while the modification text specifies desired alterations. However, existing CIR datasets predominantly employ coarse-grained modification text (CoarseMT), which inadequately captures fine-grained retrieval intents. This limitation introduces two key challenges: (1) ignoring detailed differences leads to imprecise positive samples, and (2) greater ambiguity arises when retrieving visually similar images. These issues degrade retrieval accuracy, necessitating manual result filtering or repeated queries. To address these limitations, we develop a robust fine-grained CIR data annotation pipeline that minimizes imprecise positive samples and enhances CIR systems' ability to discern modification intents accurately. Using this pipeline, we refine the FashionIQ and CIRR datasets to create two fine-grained CIR datasets: Fine-FashionIQ and Fine-CIRR. Furthermore, we introduce FineCIR, the first CIR framework explicitly designed to parse the modification text. FineCIR effectively captures fine-grained modification semantics and aligns them with ambiguous visual entities, enhancing retrieval precision. Extensive experiments demonstrate that FineCIR consistently outperforms state-of-the-art CIR baselines on both fine-grained and traditional CIR benchmark datasets. Our FineCIR code and fine-grained CIR datasets are available at https://github.com/SDU-L/FineCIR.git.
Chinese: FineCIR提出了一种新颖的框架和细粒度数据集,解决了组合图像检索中粗粒度修改文本的局限性,通过精确捕捉详细修改意图显著提升了检索精度。
English: FineCIR introduces a novel framework and fine-grained datasets to address the limitations of coarse modification text in composed image retrieval, significantly improving retrieval accuracy by precisely capturing detailed modification intents.

Authors:Shuaijie She, Junxiao Liu, Yifeng Liu, Jiajun Chen, Xin Huang, Shujian Huang
Title: R-PRM: Reasoning-Driven Process Reward Modeling
Abstract:
Large language models (LLMs) inevitably make mistakes when performing step-by-step mathematical reasoning. Process Reward Models (PRMs) have emerged as a promising solution by evaluating each reasoning step. However, existing PRMs typically output evaluation scores directly, limiting both learning efficiency and evaluation accuracy, which is further exacerbated by the scarcity of annotated data. To address these issues, we propose Reasoning-Driven Process Reward Modeling (R-PRM). First, we leverage stronger LLMs to generate seed data from limited annotations, effectively bootstrapping our model's reasoning capabilities and enabling comprehensive step-by-step evaluation. Second, we further enhance performance through preference optimization, without requiring additional annotated data. Third, we introduce inference-time scaling to fully harness the model's reasoning potential. Extensive experiments demonstrate R-PRM's effectiveness: on ProcessBench and PRMBench, it surpasses strong baselines by 11.9 and 8.5 points in F1 scores, respectively. When applied to guide mathematical reasoning, R-PRM achieves consistent accuracy improvements of over 8.5 points across six challenging datasets. Further analysis reveals that R-PRM exhibits more comprehensive evaluation and stronger generalization capabilities, thereby highlighting its significant potential.
中文: R-PRM通过利用有限标注生成种子数据、优化偏好和采用推理时扩展,显著提升了数学推理评估性能,在多个基准测试中远超现有方法。
English: R-PRM enhances mathematical reasoning evaluation by generating seed data from limited annotations, optimizing preferences, and employing inference-time scaling, achieving significant performance gains over existing methods.

Authors:Hanyue Tu, Siqi Wu, Li Li, Wengang Zhou, Houqiang Li
Title: Multi-Scale Invertible Neural Network for Wide-Range Variable-Rate Learned Image Compression
Abstract:
Autoencoder-based structures have dominated recent learned image compression methods. However, the inherent information loss associated with autoencoders limits their rate-distortion performance at high bit rates and restricts their flexibility of rate adaptation. In this paper, we present a variable-rate image compression model based on invertible transform to overcome these limitations. Specifically, we design a lightweight multi-scale invertible neural network, which bijectively maps the input image into multi-scale latent representations. To improve the compression efficiency, a multi-scale spatial-channel context model with extended gain units is devised to estimate the entropy of the latent representation from high to low levels. Experimental results demonstrate that the proposed method achieves state-of-the-art performance compared to existing variable-rate methods, and remains competitive with recent multi-model approaches. Notably, our method is the first learned image compression solution that outperforms VVC across a very wide range of bit rates using a single model, especially at high bit rates. The source code is available at https://github.com/hytu99/MSINN-VRLIC.
中文摘要:本文提出了一种基于多尺度可逆神经网络的变速率图像压缩模型,通过双射映射克服了自编码器的局限性,在广泛码率范围内以单一模型实现了优于现有方法的性能,尤其在高速率下首次超越VVC标准。
English Summary: This paper introduces a variable-rate image compression model using a multi-scale invertible neural network that overcomes autoencoder limitations by bijectively mapping images into latent representations, achieving state-of-the-art performance across a wide bit rate range with a single model.

Authors:Jiaqi Han, Jingwen Ye, Shunyu Liu, Haofei Zhang, Jie Song, Zunlei Feng, Mingli Song
Title: Reinforced Model Merging
Abstract:
The success of large language models has garnered widespread attention for model merging techniques, especially training-free methods which combine model capabilities within the parameter space. However, two challenges remain: (1) uniform treatment of all parameters leads to performance degradation; (2) search-based algorithms are often inefficient. In this paper, we present an innovative framework termed Reinforced Model Merging (RMM), which encompasses an environment and agent tailored for merging tasks. These components interact to execute layer-wise merging actions, aiming to search the optimal merging architecture. Notably, RMM operates without any gradient computations on the original models, rendering it feasible for edge devices. Furthermore, by utilizing data subsets during the evaluation process, we addressed the bottleneck in the reward feedback phase, thereby accelerating RMM by up to 100 times. Extensive experiments demonstrate that RMM achieves state-of-the-art performance across various vision and NLP datasets and effectively overcomes the limitations of the existing baseline methods. Our code is available at https://github.com/WuDiHJQ/Reinforced-Model-Merging.
Chinese: 本文提出强化模型融合(RMM)框架,通过智能体执行分层融合操作以搜索最优架构,无需梯度计算即可在边缘设备上高效实现顶尖性能,并克服现有基线方法的局限性。
English: The paper introduces Reinforced Model Merging (RMM), a training-free framework that uses an agent to perform layer-wise merging for optimal architecture search, achieving state-of-the-art performance efficiently without gradient computations on edge devices.

Authors:Zhaokai Wang, Chenxi Bao, Le Zhuo, Jingrui Han, Yang Yue, Yihong Tang, Victor Shea-Jay Huang, Yue Liao
Title: Vision-to-Music Generation: A Survey
Abstract:
Vision-to-music Generation, including video-to-music and image-to-music tasks, is a significant branch of multimodal artificial intelligence demonstrating vast application prospects in fields such as film scoring, short video creation, and dance music synthesis. However, compared to the rapid development of modalities like text and images, research in vision-to-music is still in its preliminary stage due to its complex internal structure and the difficulty of modeling dynamic relationships with video. Existing surveys focus on general music generation without comprehensive discussion on vision-to-music. In this paper, we systematically review the research progress in the field of vision-to-music generation. We first analyze the technical characteristics and core challenges for three input types: general videos, human movement videos, and images, as well as two output types of symbolic music and audio music. We then summarize the existing methodologies on vision-to-music generation from the architecture perspective. A detailed review of common datasets and evaluation metrics is provided. Finally, we discuss current challenges and promising directions for future research. We hope our survey can inspire further innovation in vision-to-music generation and the broader field of multimodal generation in academic research and industrial applications. To follow latest works and foster further innovation in this field, we are continuously maintaining a GitHub repository at https://github.com/wzk1015/Awesome-Vision-to-Music-Generation.
中文: 本文系统综述了视觉到音乐的生成领域,通过分析技术特征、方法体系、数据集与评估指标,指出当前挑战与未来方向,旨在推动多模态人工智能应用的创新发展。
English: This paper systematically reviews the vision-to-music generation field, analyzing technical characteristics, methodologies, datasets, and evaluation metrics while identifying current challenges and future directions to inspire innovation in multimodal AI applications.

Authors:Bashar Tahir, Philipp Svoboda, Markus Rupp
Title: PLAIN: Scalable Estimation Architecture for Integrated Sensing and Communication
Abstract:
Integrated sensing and communication (ISAC) is envisioned be to one of the paradigms upon which next-generation mobile networks will be built, extending localization and tracking capabilities, as well as giving birth to environment-aware wireless access. A key aspect of sensing integration is parameter estimation, which involves extracting information about the surrounding environment, such as the direction, distance, and velocity of various objects within. This is typically of a high-dimensional nature, which leads to significant computational complexity, if performed jointly across multiple sensing dimensions, such as space, frequency, and time. Additionally, due to the incorporation of sensing on top of the data transmission, the time window available for sensing is likely to be short, resulting in an estimation problem where only a single snapshot is accessible. In this work, we propose PLAIN, a tensor-based estimation architecture that flexibly scales with multiple sensing dimensions and can handle high dimensionality, limited measurement time, and super-resolution requirements. It consists of three stages: a compression stage, where the high dimensional input is converted into lower dimensionality, without sacrificing resolution; a decoupled estimation stage, where the parameters across the different dimensions are estimated in parallel with low complexity; an input-based fusion stage, where the decoupled parameters are fused together to form a paired multidimensional estimate. We investigate the performance of the architecture for different configurations and compare it against practical sequential and joint estimation baselines, as well as theoretical bounds. Our results show that PLAIN, using tools from tensor algebra, subspace-based processing, and compressed sensing, can scale flexibly with dimensionality, while operating with low complexity and maintaining super-resolution.
中文: PLAIN是一种基于张量的估计架构,通过数据压缩、解耦参数估计和结果融合,高效处理ISAC中的高维感知问题,实现了低复杂度和超分辨率性能。
English: PLAIN is a tensor-based estimation architecture that efficiently handles high-dimensional sensing in ISAC by compressing data, decoupling parameter estimation, and fusing results, achieving low complexity and super-resolution performance.

Authors:Yuntao Gui, Peiqi Yin, Xiao Yan, Chaorui Zhang, Weixi Zhang, James Cheng
Title: PilotANN: Memory-Bounded GPU Acceleration for Vector Search
Abstract:
Approximate Nearest Neighbor Search (ANNS) has become fundamental to modern deep learning applications, having gained particular prominence through its integration into recent generative models that work with increasingly complex datasets and higher vector dimensions. Existing CPU-only solutions, even the most efficient graph-based ones, struggle to meet these growing computational demands, while GPU-only solutions face memory constraints. As a solution, we propose PilotANN, a hybrid CPU-GPU system for graph-based ANNS that utilizes both CPU's abundant RAM and GPU's parallel processing capabilities. Our approach decomposes the graph traversal process of top-$k$ search into three stages: GPU-accelerated subgraph traversal using SVD-reduced vectors, CPU refinement and precise search using complete vectors. Furthermore, we introduce fast entry selection to improve search starting points while maximizing GPU utilization. Experimental results demonstrate that PilotANN achieves $3.9 - 5.4 \times$ speedup in throughput on 100-million scale datasets, and is able to handle datasets up to $12 \times$ larger than the GPU memory. We offer a complete open-source implementation at https://github.com/ytgui/PilotANN.
中文: PilotANN是一种混合CPU-GPU系统,通过将图遍历分解为GPU加速子图搜索和CPU精确搜索,实现了最高5.4倍的吞吐量提升,并能处理比GPU内存大12倍的数据集。
English: PilotANN is a hybrid CPU-GPU system that accelerates approximate nearest neighbor search by decomposing graph traversal into GPU-accelerated subgraph search and CPU refinement, achieving up to 5.4× throughput improvement and handling datasets 12× larger than GPU memory.

Authors:Yimin Xu, Fan Yang, Bin Xu
Title: DSU-Net:An Improved U-Net Model Based on DINOv2 and SAM2 with Multi-scale Cross-model Feature Enhancement
Abstract:
Despite the significant advancements in general image segmentation achieved by large-scale pre-trained foundation models (such as Meta's Segment Any-thing Model (SAM) series and DINOv2), their performance in specialized fields remains limited by two critical issues: the excessive training costs due to large model parameters, and the insufficient ability to represent specific domain characteristics. This paper proposes a multi-scale feature collabora-tion framework guided by DINOv2 for SAM2, with core innovations in three aspects: (1) Establishing a feature collaboration mechanism between DINOv2 and SAM2 backbones, where high-dimensional semantic features extracted by the self-supervised model guide multi-scale feature fusion; (2) Designing lightweight adapter modules and cross-modal, cross-layer feature fusion units to inject cross-domain knowledge while freezing the base model parameters; (3) Constructing a U-shaped network structure based on U-net, which utilizes attention mechanisms to achieve adaptive aggregation decoding of multi-granularity features. This framework surpasses existing state-of-the-art meth-ods in downstream tasks such as camouflage target detection and salient ob-ject detection, without requiring costly training processes. It provides a tech-nical pathway for efficient deployment of visual image segmentation, demon-strating significant application value in a wide range of downstream tasks and specialized fields within image segmentation.Project page: https://github.com/CheneyXuYiMin/SAM2DINO-Seg
中文: 本文提出了一种由DINOv2引导的SAM2多尺度特征协作框架,通过轻量级适配器和注意力机制注入跨领域知识,无需昂贵训练即可在专业图像分割任务中实现领先性能。
English: This paper introduces a multi-scale feature collaboration framework guided by DINOv2 for SAM2, which enhances specialized image segmentation by integrating cross-domain knowledge through lightweight adapters and attention mechanisms, achieving state-of-the-art performance without costly training.

Authors:Jiahao Lyu, Minghua Zhao, Jing Hu, Xuewen Huang, Yifei Chen, Shuangli Du
Title: VADMamba: Exploring State Space Models for Fast Video Anomaly Detection
Abstract:
Video anomaly detection (VAD) methods are mostly CNN-based or Transformer-based, achieving impressive results, but the focus on detection accuracy often comes at the expense of inference speed. The emergence of state space models in computer vision, exemplified by the Mamba model, demonstrates improved computational efficiency through selective scans and showcases the great potential for long-range modeling. Our study pioneers the application of Mamba to VAD, dubbed VADMamba, which is based on multi-task learning for frame prediction and optical flow reconstruction. Specifically, we propose the VQ-Mamba Unet (VQ-MaU) framework, which incorporates a Vector Quantization (VQ) layer and Mamba-based Non-negative Visual State Space (NVSS) block. Furthermore, two individual VQ-MaU networks separately predict frames and reconstruct corresponding optical flows, further boosting accuracy through a clip-level fusion evaluation strategy. Experimental results validate the efficacy of the proposed VADMamba across three benchmark datasets, demonstrating superior performance in inference speed compared to previous work. Code is available at https://github.com/jLooo/VADMamba.
中文: 本研究提出VADMamba这一新型视频异常检测方法,通过基于Mamba模型的帧预测与光流重建双任务架构,在多个基准测试中实现了推理速度与检测精度的双重提升。
English: The study introduces VADMamba, a novel video anomaly detection method that utilizes Mamba-based models for frame prediction and optical flow reconstruction, achieving enhanced inference speed and accuracy across multiple benchmarks.

Authors:Junjie Chen, Weilong Chen, Yifan Zuo, Yuming Fang
Title: Recurrent Feature Mining and Keypoint Mixup Padding for Category-Agnostic Pose Estimation
Abstract:
Category-agnostic pose estimation aims to locate keypoints on query images according to a few annotated support images for arbitrary novel classes. Existing methods generally extract support features via heatmap pooling, and obtain interacted features from support and query via cross-attention. Hence, these works neglect to mine fine-grained and structure-aware (FGSA) features from both support and query images, which are crucial for pixel-level keypoint localization. To this end, we propose a novel yet concise framework, which recurrently mines FGSA features from both support and query images. Specifically, we design a FGSA mining module based on deformable attention mechanism. On the one hand, we mine fine-grained features by applying deformable attention head over multi-scale feature maps. On the other hand, we mine structure-aware features by offsetting the reference points of keypoints to their linked keypoints. By means of above module, we recurrently mine FGSA features from support and query images, and thus obtain better support features and query estimations. In addition, we propose to use mixup keypoints to pad various classes to a unified keypoint number, which could provide richer supervision than the zero padding used in existing works. We conduct extensive experiments and in-depth studies on large-scale MP-100 dataset, and outperform SOTA method dramatically (+3.2\%PCK@0.05). Code is avaiable at https://github.com/chenbys/FMMP.
中文: 本文提出了一种新颖框架,通过可变形注意力机制从支持和查询图像中循环挖掘细粒度结构感知特征,在姿态估计任务上显著超越了现有方法。
English: This paper introduces a novel framework that recurrently mines fine-grained and structure-aware features from both support and query images using a deformable attention mechanism, significantly outperforming existing methods on pose estimation tasks.

Authors:Jiajie Quan, Ao Tong, Yuxuan Cai, Xinwei He, Yulong Wang, Yang Zhou
Title: Omni-AD: Learning to Reconstruct Global and Local Features for Multi-class Anomaly Detection
Abstract:
In multi-class unsupervised anomaly detection(MUAD), reconstruction-based methods learn to map input images to normal patterns to identify anomalous pixels. However, this strategy easily falls into the well-known "learning shortcut" issue when decoders fail to capture normal patterns and reconstruct both normal and abnormal samples naively. To address that, we propose to learn the input features in global and local manners, forcing the network to memorize the normal patterns more comprehensively. Specifically, we design a two-branch decoder block, named Omni-block. One branch corresponds to global feature learning, where we serialize two self-attention blocks but replace the query and (key, value) with learnable tokens, respectively, thus capturing global features of normal patterns concisely and thoroughly. The local branch comprises depth-separable convolutions, whose locality enables effective and efficient learning of local features for normal patterns. By stacking Omni-blocks, we build a framework, Omni-AD, to learn normal patterns of different granularity and reconstruct them progressively. Comprehensive experiments on public anomaly detection benchmarks show that our method outperforms state-of-the-art approaches in MUAD. Code is available at https://github.com/easyoo/Omni-AD.git
中文摘要:提出的Omni-AD框架通过采用双分支解码器,分别使用序列化自注意力机制和深度可分离卷积来全面学习全局与局部正常模式,有效解决了多类无监督异常检测中的学习捷径问题,在基准测试中实现了最优性能。
English summary: The proposed Omni-AD framework addresses learning shortcuts in multi-class unsupervised anomaly detection by employing a two-branch decoder that comprehensively captures both global and local normal patterns through serialized self-attention and depth-separable convolutions, achieving state-of-the-art performance on benchmarks.

Authors:Yun Zhu, Le Hui, Hang Yang, Jianjun Qian, Jin Xie, Jian Yang
Title: Learning Class Prototypes for Unified Sparse Supervised 3D Object Detection
Abstract:
Both indoor and outdoor scene perceptions are essential for embodied intelligence. However, current sparse supervised 3D object detection methods focus solely on outdoor scenes without considering indoor settings. To this end, we propose a unified sparse supervised 3D object detection method for both indoor and outdoor scenes through learning class prototypes to effectively utilize unlabeled objects. Specifically, we first propose a prototype-based object mining module that converts the unlabeled object mining into a matching problem between class prototypes and unlabeled features. By using optimal transport matching results, we assign prototype labels to high-confidence features, thereby achieving the mining of unlabeled objects. We then present a multi-label cooperative refinement module to effectively recover missed detections through pseudo label quality control and prototype label cooperation. Experiments show that our method achieves state-of-the-art performance under the one object per scene sparse supervised setting across indoor and outdoor datasets. With only one labeled object per scene, our method achieves about 78%, 90%, and 96% performance compared to the fully supervised detector on ScanNet V2, SUN RGB-D, and KITTI, respectively, highlighting the scalability of our method. Code is available at https://github.com/zyrant/CPDet3D.
中文摘要:本文提出了一种统一的稀疏监督三维物体检测方法,通过类别原型学习有效利用未标记物体,在室内外场景中仅需每场景单个标注即可实现接近全监督方法的性能。
English Summary: This paper introduces a unified sparse supervised 3D object detection method that utilizes class prototypes to effectively mine unlabeled objects for both indoor and outdoor scenes, achieving state-of-the-art performance with minimal labeled data.

Authors:Haoming Xu, Shuxun Wang, Yanqiu Zhao, Yi Zhong, Ziyan Jiang, Ningyuan Zhao, Shumin Deng, Huajun Chen, Ningyu Zhang
Title: ZJUKLAB at SemEval-2025 Task 4: Unlearning via Model Merging
Abstract:
This paper presents the ZJUKLAB team's submission for SemEval-2025 Task 4: Unlearning Sensitive Content from Large Language Models. This task aims to selectively erase sensitive knowledge from large language models, avoiding both over-forgetting and under-forgetting issues. We propose an unlearning system that leverages Model Merging (specifically TIES-Merging), combining two specialized models into a more balanced unlearned model. Our system achieves competitive results, ranking second among 26 teams, with an online score of 0.944 for Task Aggregate and 0.487 for overall Aggregate. In this paper, we also conduct local experiments and perform a comprehensive analysis of the unlearning process, examining performance trajectories, loss dynamics, and weight perspectives, along with several supplementary experiments, to understand the effectiveness of our method. Furthermore, we analyze the shortcomings of our method and evaluation metrics, emphasizing that MIA scores and ROUGE-based metrics alone are insufficient to fully evaluate successful unlearning. Finally, we emphasize the need for more comprehensive evaluation methodologies and rethinking of unlearning objectives in future research. Code is available at https://github.com/zjunlp/unlearn/tree/main/semeval25.
中文: ZJUKLAB团队针对SemEval-2025任务四提出了基于模型融合的遗忘系统,在26支队伍中荣获第二名,其方法在有效消除敏感内容的同时,揭示了当前评估指标的不足,并呼吁建立更全面的评估体系。
English: The ZJUKLAB team introduced a model merging-based unlearning system for SemEval-2025 Task 4, achieving second place by effectively removing sensitive content while highlighting limitations in current evaluation metrics and calling for improved assessment methods.

Authors:Yusong Hu, Zichen Liang, Fei Yang, Qibin Hou, Xialei Liu, Ming-Ming Cheng
Title: KAC: Kolmogorov-Arnold Classifier for Continual Learning
Abstract:
Continual learning requires models to train continuously across consecutive tasks without forgetting. Most existing methods utilize linear classifiers, which struggle to maintain a stable classification space while learning new tasks. Inspired by the success of Kolmogorov-Arnold Networks (KAN) in preserving learning stability during simple continual regression tasks, we set out to explore their potential in more complex continual learning scenarios. In this paper, we introduce the Kolmogorov-Arnold Classifier (KAC), a novel classifier developed for continual learning based on the KAN structure. We delve into the impact of KAN's spline functions and introduce Radial Basis Functions (RBF) for improved compatibility with continual learning. We replace linear classifiers with KAC in several recent approaches and conduct experiments across various continual learning benchmarks, all of which demonstrate performance improvements, highlighting the effectiveness and robustness of KAC in continual learning. The code is available at https://github.com/Ethanhuhuhu/KAC.
中文: 本文提出基于KAN结构的Kolmogorov-Arnold分类器(KAC),通过采用样条函数和径向基函数增强稳定性与兼容性,在持续学习中替代线性分类器,在多项基准测试中均表现出性能提升和鲁棒性优势。
English: This paper introduces the Kolmogorov-Arnold Classifier (KAC), a novel classifier based on KAN structure that replaces linear classifiers in continual learning, demonstrating improved performance and robustness across various benchmarks through enhanced stability and compatibility with spline and radial basis functions.

Authors:Ooha Lakkadi Reddy
Title: Rerouting Connection: Hybrid Computer Vision Analysis Reveals Visual Similarity Between Indus and Tibetan-Yi Corridor Writing Systems
Abstract:
This thesis employs a hybrid CNN-Transformer architecture, alongside a detailed anthropological framework, to investigate potential historical connections between the visual morphology of the Indus Valley script and pictographic systems of the Tibetan-Yi Corridor. Through an ensemble methodology of three target scripts across 15 independently trained models, we demonstrate that Tibetan-Yi Corridor scripts exhibit approximately six-fold higher visual similarity to the Indus script (0.635) than to the Bronze Age Proto-Cuneiform (0.102) or Proto-Elamite (0.078). Contrary to expectations, when measured through direct script-to-script embedding comparisons, the Indus script maps closer to Tibetan-Yi Corridor scripts with a mean cosine similarity of 0.930 (CI: [0.917, 0.942]) than to contemporaneous West Asian signaries, which recorded mean similarities of 0.887 (CI: [0.863, 0.911]) and 0.855 (CI: [0.818, 0.891]). Across dimensionality reduction and clustering methods, the Indus script consistently clusters closest to Tibetan-Yi Corridor scripts. These computational findings align with observed pictorial parallels in numeral systems, gender markers, and iconographic elements. Archaeological evidence of contact networks along the ancient Shu-Shendu road, coinciding with the Indus Civilization's decline, provides a plausible transmission pathway. While alternate explanations cannot be ruled out, the specificity and consistency of similarities suggest more complex cultural transmission networks between South and East Asia than previously recognized.
中文: 本研究采用混合CNN-Transformer模型,发现印度河文字与藏彝走廊文字在视觉形态上存在显著相似性,其关联强度远超同期西亚文字,暗示了沿古代通道可能存在文化传播网络。
English: This study uses a hybrid CNN-Transformer model to reveal that the Indus Valley script shares significantly stronger visual and structural similarities with Tibetan-Yi Corridor scripts than with contemporaneous West Asian scripts, suggesting potential historical cultural transmission along ancient pathways.

Authors:Judy X Yang, Jing Wang, Zhuanfeng, Li, Chenhong Sui Zekun Long, Jun Zhou
Title: HSLiNets: Evaluating Band Ordering Strategies in Hyperspectral and LiDAR Fusion
Abstract:
The integration of hyperspectral imaging (HSI) and Light Detection and Ranging (LiDAR) data provides complementary spectral and spatial information for remote sensing applications. While previous studies have explored the role of band selection and grouping in HSI classification, little attention has been given to how the spectral sequence or band order affects classification outcomes when fused with LiDAR. In this work, we systematically investigate the influence of band order on HSI-LiDAR fusion performance. Through extensive experiments, we demonstrate that band order significantly impacts classification accuracy, revealing a previously overlooked factor in fusion-based models. Motivated by this observation, we propose a novel fusion architecture that not only integrates HSI and LiDAR data but also learns from multiple band order configurations. The proposed method enhances feature representation by adaptively fusing different spectral sequences, leading to improved classification accuracy. Experimental results on the Houston 2013 and Trento datasets show that our approach outperforms state-of-the-art fusion models. Data and code are available at https://github.com/Judyxyang/HSLiNets.
中文: 本研究揭示了在高光谱与激光雷达数据融合中,光谱波段顺序对分类精度具有显著影响,并提出了一种新型融合架构,通过自适应整合多种波段序列来增强特征表征能力,从而超越现有模型的性能。
English: This study reveals that the order of spectral bands significantly influences classification accuracy in hyperspectral and LiDAR data fusion, leading to the development of a novel architecture that adaptively integrates multiple band sequences to enhance feature representation and outperform existing models.

Authors:Caspar Meijer, Jiyue Huang, Shreshtha Sharma, Elena Lazovik, Lydia Y. Chen
Title: TS-Inverse: A Gradient Inversion Attack Tailored for Federated Time Series Forecasting Models
Abstract:
Federated learning (FL) for time series forecasting (TSF) enables clients with privacy-sensitive time series (TS) data to collaboratively learn accurate forecasting models, for example, in energy load prediction. Unfortunately, privacy risks in FL persist, as servers can potentially reconstruct clients' training data through gradient inversion attacks (GIA). Although GIA is demonstrated for image classification tasks, little is known about time series regression tasks. In this paper, we first conduct an extensive empirical study on inverting TS data across 4 TSF models and 4 datasets, identifying the unique challenges of reconstructing both observations and targets of TS data. We then propose TS-Inverse, a novel GIA that improves the inversion of TS data by (i) learning a gradient inversion model that outputs quantile predictions, (ii) a unique loss function that incorporates periodicity and trend regularization, and (iii) regularization according to the quantile predictions. Our evaluations demonstrate a remarkable performance of TS-Inverse, achieving at least a 2x-10x improvement in terms of the sMAPE metric over existing GIA methods on TS data. Code repository: https://github.com/Capsar/ts-inverse
Chinese: 针对时间序列预测中的联邦学习隐私风险,本研究提出TS-Inverse方法,通过引入分位数预测和正则化技术,显著提升了梯度反演攻击对时序数据的重构精度,相比现有方法性能提升2-10倍。
English: Federated learning for time series forecasting faces privacy risks from gradient inversion attacks, which this study addresses by proposing TS-Inverse, a novel method that significantly improves data reconstruction accuracy by incorporating quantile predictions and regularization techniques.

Authors:Amaya Gallagher-Syed, Henry Senior, Omnia Alwazzan, Elena Pontarini, Michele Bombardieri, Costantino Pitzalis, Myles J. Lewis, Michael R. Barnes, Luca Rossi, Gregory Slabaugh
Title: BioX-CPath: Biologically-driven Explainable Diagnostics for Multistain IHC Computational Pathology
Abstract:
The development of biologically interpretable and explainable models remains a key challenge in computational pathology, particularly for multistain immunohistochemistry (IHC) analysis. We present BioX-CPath, an explainable graph neural network architecture for whole slide image (WSI) classification that leverages both spatial and semantic features across multiple stains. At its core, BioX-CPath introduces a novel Stain-Aware Attention Pooling (SAAP) module that generates biologically meaningful, stain-aware patient embeddings. Our approach achieves state-of-the-art performance on both Rheumatoid Arthritis and Sjogren's Disease multistain datasets. Beyond performance metrics, BioX-CPath provides interpretable insights through stain attention scores, entropy measures, and stain interaction scores, that permit measuring model alignment with known pathological mechanisms. This biological grounding, combined with strong classification performance, makes BioX-CPath particularly suitable for clinical applications where interpretability is key. Source code and documentation can be found at: https://github.com/AmayaGS/BioX-CPath.
Chinese: BioX-CPath是一种可解释的图神经网络,通过引入染色感知注意力池化模块,在全切片图像分类中实现最优性能,同时提供生物学可解释性见解,适用于临床诊断。
English: BioX-CPath is an explainable graph neural network for whole slide image classification that introduces a stain-aware attention pooling module, achieving state-of-the-art performance while providing biologically interpretable insights for clinical applications.

Authors:Syed Ariff Syed Hesham, Yun Liu, Guolei Sun, Henghui Ding, Jing Yang, Ender Konukoglu, Xue Geng, Xudong Jiang
Title: Exploiting Temporal State Space Sharing for Video Semantic Segmentation
Abstract:
Video semantic segmentation (VSS) plays a vital role in understanding the temporal evolution of scenes. Traditional methods often segment videos frame-by-frame or in a short temporal window, leading to limited temporal context, redundant computations, and heavy memory requirements. To this end, we introduce a Temporal Video State Space Sharing (TV3S) architecture to leverage Mamba state space models for temporal feature sharing. Our model features a selective gating mechanism that efficiently propagates relevant information across video frames, eliminating the need for a memory-heavy feature pool. By processing spatial patches independently and incorporating shifted operation, TV3S supports highly parallel computation in both training and inference stages, which reduces the delay in sequential state space processing and improves the scalability for long video sequences. Moreover, TV3S incorporates information from prior frames during inference, achieving long-range temporal coherence and superior adaptability to extended sequences. Evaluations on the VSPW and Cityscapes datasets reveal that our approach outperforms current state-of-the-art methods, establishing a new standard for VSS with consistent results across long video sequences. By achieving a good balance between accuracy and efficiency, TV3S shows a significant advancement in spatiotemporal modeling, paving the way for efficient video analysis. The code is publicly available at https://github.com/Ashesham/TV3S.git.
Chinese: TV3S架构通过引入时序视频状态空间共享模型和选择性门控机制,有效跨帧传播信息以增强视频语义分割,在基准数据集上实现了优越性能,同时平衡了准确性与效率。
English: The TV3S architecture introduces a temporal video state space sharing model with a selective gating mechanism that enhances video semantic segmentation by efficiently propagating information across frames, achieving superior performance on benchmark datasets while balancing accuracy and efficiency.

Authors:Joonhyun Jeong, Seyun Bae, Yeonsung Jung, Jaeryong Hwang, Eunho Yang
Title: Playing the Fool: Jailbreaking LLMs and Multimodal LLMs with Out-of-Distribution Strategy
Abstract:
Despite the remarkable versatility of Large Language Models (LLMs) and Multimodal LLMs (MLLMs) to generalize across both language and vision tasks, LLMs and MLLMs have shown vulnerability to jailbreaking, generating textual outputs that undermine safety, ethical, and bias standards when exposed to harmful or sensitive inputs. With the recent advancement of safety alignment via preference-tuning from human feedback, LLMs and MLLMs have been equipped with safety guardrails to yield safe, ethical, and fair responses with regard to harmful inputs. However, despite the significance of safety alignment, research on the vulnerabilities remains largely underexplored. In this paper, we investigate the unexplored vulnerability of the safety alignment, examining its ability to consistently provide safety guarantees for out-of-distribution(OOD)-ifying harmful inputs that may fall outside the aligned data distribution. Our key observation is that OOD-ifying the vanilla harmful inputs highly increases the uncertainty of the model to discern the malicious intent within the input, leading to a higher chance of being jailbroken. Exploiting this vulnerability, we propose JOOD, a new Jailbreak framework via OOD-ifying inputs beyond the safety alignment. We explore various off-the-shelf visual and textual transformation techniques for OOD-ifying the harmful inputs. Notably, we observe that even simple mixing-based techniques such as image mixup prove highly effective in increasing the uncertainty of the model, thereby facilitating the bypass of the safety alignment. Experiments across diverse jailbreak scenarios demonstrate that JOOD effectively jailbreaks recent proprietary LLMs and MLLMs such as GPT-4 and o1 with high attack success rate, which previous attack approaches have consistently struggled to jailbreak. Code is available at https://github.com/naver-ai/JOOD.
中文: 大型语言模型和多模态模型存在安全漏洞,通过超出分布范围的输入可增加模型不确定性,从而绕过安全防护,JOOD框架有效实现了此类攻击。
English: Large language models and multimodal models are vulnerable to jailbreaking through out-of-distribution inputs that increase model uncertainty, bypassing safety alignments as demonstrated by the proposed JOOD framework.

Authors:Xiaoming Qi, Jingyang Zhang, Huazhu Fu, Guanyu Yang, Shuo Li, Yueming Jin
Title: Dynamic Allocation Hypernetwork with Adaptive Model Recalibration for Federated Continual Learning
Abstract:
Federated continual learning (FCL) offers an emerging pattern to facilitate the applicability of federated learning (FL) in real-world scenarios, where tasks evolve dynamically and asynchronously across clients, especially in medical scenario. Existing server-side FCL methods in nature domain construct a continually learnable server model by client aggregation on all-involved tasks. However, they are challenged by: (1) Catastrophic forgetting for previously learned tasks, leading to error accumulation in server model, making it difficult to sustain comprehensive knowledge across all tasks. (2) Biased optimization due to asynchronous tasks handled across different clients, leading to the collision of optimization targets of different clients at the same time steps. In this work, we take the first step to propose a novel server-side FCL pattern in medical domain, Dynamic Allocation Hypernetwork with adaptive model recalibration (FedDAH). It is to facilitate collaborative learning under the distinct and dynamic task streams across clients. To alleviate the catastrophic forgetting, we propose a dynamic allocation hypernetwork (DAHyper) where a continually updated hypernetwork is designed to manage the mapping between task identities and their associated model parameters, enabling the dynamic allocation of the model across clients. For the biased optimization, we introduce a novel adaptive model recalibration (AMR) to incorporate the candidate changes of historical models into current server updates, and assign weights to identical tasks across different time steps based on the similarity for continual optimization. Extensive experiments on the AMOS dataset demonstrate the superiority of our FedDAH to other FCL methods on sites with different task streams. The code is available:https://github.com/jinlab-imvr/FedDAH.
中文: 联邦持续学习在动态医疗场景下面临灾难性遗忘和优化偏差的挑战,FedDAH方法通过动态分配超网络和自适应模型重校准,有效提升了跨客户端协同学习能力。
English: Federated continual learning (FCL) faces challenges of catastrophic forgetting and biased optimization in dynamic medical scenarios, which the proposed FedDAH method addresses through a dynamic allocation hypernetwork and adaptive model recalibration to enhance collaborative learning across clients.

Authors:Shuhao Zhang, Bo Cheng, Jiale Han, Yuli Chen, Zhixuan Wu, Changbao Li, Pingli Gu
Title: CEFW: A Comprehensive Evaluation Framework for Watermark in Large Language Models
Abstract:
Text watermarking provides an effective solution for identifying synthetic text generated by large language models. However, existing techniques often focus on satisfying specific criteria while ignoring other key aspects, lacking a unified evaluation. To fill this gap, we propose the Comprehensive Evaluation Framework for Watermark (CEFW), a unified framework that comprehensively evaluates watermarking methods across five key dimensions: ease of detection, fidelity of text quality, minimal embedding cost, robustness to adversarial attacks, and imperceptibility to prevent imitation or forgery. By assessing watermarks according to all these key criteria, CEFW offers a thorough evaluation of their practicality and effectiveness. Moreover, we introduce a simple and effective watermarking method called Balanced Watermark (BW), which guarantees robustness and imperceptibility through balancing the way watermark information is added. Extensive experiments show that BW outperforms existing methods in overall performance across all evaluation dimensions. We release our code to the community for future research. https://github.com/DrankXs/BalancedWatermark.
中文摘要:本文提出CEFW统一框架从五个关键维度全面评估文本水印方法,同时实验表明所提出的平衡水印方法在整体性能上优于现有技术。
English Summary: The CEFW framework is introduced to comprehensively evaluate text watermarking methods across five key dimensions, while the proposed Balanced Watermark method demonstrates superior overall performance in experiments.

Authors:Sondos Mahmoud Bsharat, Mukul Ranjan, Aidar Myrzakhan, Jiacheng Liu, Bowei Guo, Shengkun Tang, Zhuang Liu, Yuanzhi Li, Zhiqiang Shen
Title: Mobile-MMLU: A Mobile Intelligence Language Understanding Benchmark
Abstract:
Rapid advancements in large language models (LLMs) have increased interest in deploying them on mobile devices for on-device AI applications. Mobile users interact differently with LLMs compared to desktop users, creating unique expectations and data biases. Current benchmark datasets primarily target at server and desktop environments, and there is a notable lack of extensive datasets specifically designed for mobile contexts. Additionally, mobile devices face strict limitations in storage and computing resources, constraining model size and capabilities, thus requiring optimized efficiency and prioritized knowledge. To address these challenges, we introduce Mobile-MMLU, a large-scale benchmark dataset tailored for mobile intelligence. It consists of 16,186 questions across 80 mobile-related fields, designed to evaluate LLM performance in realistic mobile scenarios. A challenging subset, Mobile-MMLU-Pro, provides advanced evaluation similar in size to MMLU-Pro but significantly more difficult than our standard full set. Both benchmarks use multiple-choice, order-invariant questions focused on practical mobile interactions, such as recipe suggestions, travel planning, and essential daily tasks. The dataset emphasizes critical mobile-specific metrics like inference latency, energy consumption, memory usage, and response quality, offering comprehensive insights into model performance under mobile constraints. Moreover, it prioritizes privacy and adaptability, assessing models' ability to perform on-device processing, maintain user privacy, and adapt to personalized usage patterns. Mobile-MMLU family offers a standardized framework for developing and comparing mobile-optimized LLMs, enabling advancements in productivity and decision-making within mobile computing environments. Our code and data are available at: https://github.com/VILA-Lab/Mobile-MMLU.
中文: Mobile-MMLU基准数据集专为移动智能设计,包含16,186个涵盖80个领域的问题,通过评估模型在资源受限环境下的延迟、能耗等关键指标,为移动端优化的大语言模型开发提供标准化框架。
English: The Mobile-MMLU benchmark dataset is introduced to evaluate large language models in mobile contexts, addressing unique challenges like resource constraints and user interaction biases with 16,186 questions across 80 fields, while prioritizing privacy and on-device performance metrics.

Authors:Tianqi Liu, Zihao Huang, Zhaoxi Chen, Guangcong Wang, Shoukang Hu, Liao Shen, Huiqiang Sun, Zhiguo Cao, Wei Li, Ziwei Liu
Title: Free4D: Tuning-free 4D Scene Generation with Spatial-Temporal Consistency
Abstract:
We present Free4D, a novel tuning-free framework for 4D scene generation from a single image. Existing methods either focus on object-level generation, making scene-level generation infeasible, or rely on large-scale multi-view video datasets for expensive training, with limited generalization ability due to the scarcity of 4D scene data. In contrast, our key insight is to distill pre-trained foundation models for consistent 4D scene representation, which offers promising advantages such as efficiency and generalizability. 1) To achieve this, we first animate the input image using image-to-video diffusion models followed by 4D geometric structure initialization. 2) To turn this coarse structure into spatial-temporal consistent multiview videos, we design an adaptive guidance mechanism with a point-guided denoising strategy for spatial consistency and a novel latent replacement strategy for temporal coherence. 3) To lift these generated observations into consistent 4D representation, we propose a modulation-based refinement to mitigate inconsistencies while fully leveraging the generated information. The resulting 4D representation enables real-time, controllable rendering, marking a significant advancement in single-image-based 4D scene generation.
中文摘要:Free4D提出了一种无需调优的框架,通过提取预训练基础模型从单张图像生成一致的4D场景,实现了无需昂贵训练的高效通用实时渲染能力。
English Summary: Free4D is a tuning-free framework that generates consistent 4D scenes from single images by distilling pre-trained foundation models, enabling efficient and generalizable real-time rendering without expensive training.

Authors:Zichen Liu, Changyu Chen, Wenjun Li, Penghui Qi, Tianyu Pang, Chao Du, Wee Sun Lee, Min Lin
Title: Understanding R1-Zero-Like Training: A Critical Perspective
Abstract:
DeepSeek-R1-Zero has shown that reinforcement learning (RL) at scale can directly enhance the reasoning capabilities of LLMs without supervised fine-tuning. In this work, we critically examine R1-Zero-like training by analyzing its two core components: base models and RL. We investigate a wide range of base models, including DeepSeek-V3-Base, to understand how pretraining characteristics influence RL performance. Our analysis reveals that DeepSeek-V3-Base already exhibit ''Aha moment'', while Qwen2.5 base models demonstrate strong reasoning capabilities even without prompt templates, suggesting potential pretraining biases. Additionally, we identify an optimization bias in Group Relative Policy Optimization (GRPO), which artificially increases response length (especially for incorrect outputs) during training. To address this, we introduce Dr. GRPO, an unbiased optimization method that improves token efficiency while maintaining reasoning performance. Leveraging these insights, we present a minimalist R1-Zero recipe that achieves 43.3% accuracy on AIME 2024 with a 7B base model, establishing a new state-of-the-art. Our code is available at https://github.com/sail-sg/understand-r1-zero.
中文摘要:本研究分析了R1-Zero训练的核心组件,发现GRPO存在优化偏差并提出无偏的Dr. GRPO方法,在保持推理性能的同时提升效率,最终通过极简配方使7B模型在AIME 2024上达到43.3%的最新最优准确率。
English Summary: This study analyzes R1-Zero training components and identifies optimization bias in GRPO, introducing Dr. GRPO to improve efficiency while achieving state-of-the-art 43.3% accuracy on AIME 2024 with a minimalist 7B model recipe.

Authors:Zichen Liu, Changyu Chen, Wenjun Li, Penghui Qi, Tianyu Pang, Chao Du, Wee Sun Lee, Min Lin
Title: Understanding R1-Zero-Like Training: A Critical Perspective
Abstract:
DeepSeek-R1-Zero has shown that reinforcement learning (RL) at scale can directly enhance the reasoning capabilities of LLMs without supervised fine-tuning. In this work, we critically examine R1-Zero-like training by analyzing its two core components: base models and RL. We investigate a wide range of base models, including DeepSeek-V3-Base, to understand how pretraining characteristics influence RL performance. Our analysis reveals that DeepSeek-V3-Base already exhibit ''Aha moment'', while Qwen2.5 base models demonstrate strong reasoning capabilities even without prompt templates, suggesting potential pretraining biases. Additionally, we identify an optimization bias in Group Relative Policy Optimization (GRPO), which artificially increases response length (especially for incorrect outputs) during training. To address this, we introduce Dr. GRPO, an unbiased optimization method that improves token efficiency while maintaining reasoning performance. Leveraging these insights, we present a minimalist R1-Zero recipe that achieves 43.3% accuracy on AIME 2024 with a 7B base model, establishing a new state-of-the-art. Our code is available at https://github.com/sail-sg/understand-r1-zero.
中文摘要:本研究分析了R1-Zero训练的核心组件,发现GRPO存在优化偏差并提出无偏的Dr. GRPO方法,在保持推理性能的同时提升效率,最终通过极简配方使7B模型在AIME 2024上达到43.3%的最新最优准确率。
English Summary: This study analyzes R1-Zero training components and identifies optimization bias in GRPO, introducing Dr. GRPO to improve efficiency while achieving state-of-the-art 43.3% accuracy on AIME 2024 with a minimalist 7B model recipe.

Authors:Yulu Pan, Ce Zhang, Gedas Bertasius
Title: BASKET: A Large-Scale Video Dataset for Fine-Grained Skill Estimation
Abstract:
We present BASKET, a large-scale basketball video dataset for fine-grained skill estimation. BASKET contains 4,477 hours of video capturing 32,232 basketball players from all over the world. Compared to prior skill estimation datasets, our dataset includes a massive number of skilled participants with unprecedented diversity in terms of gender, age, skill level, geographical location, etc. BASKET includes 20 fine-grained basketball skills, challenging modern video recognition models to capture the intricate nuances of player skill through in-depth video analysis. Given a long highlight video (8-10 minutes) of a particular player, the model needs to predict the skill level (e.g., excellent, good, average, fair, poor) for each of the 20 basketball skills. Our empirical analysis reveals that the current state-of-the-art video models struggle with this task, significantly lagging behind the human baseline. We believe that BASKET could be a useful resource for developing new video models with advanced long-range, fine-grained recognition capabilities. In addition, we hope that our dataset will be useful for domain-specific applications such as fair basketball scouting, personalized player development, and many others. Dataset and code are available at https://github.com/yulupan00/BASKET.
中文:BASKET是一个大规模篮球视频数据集,包含来自全球32,232名球员的4,477小时视频,旨在通过20项精细篮球技能评估推动视频识别模型在长序列细粒度分析能力的发展。
English: BASKET is a comprehensive basketball video dataset featuring 4,477 hours of footage from 32,232 globally diverse players, designed to challenge video recognition models with fine-grained skill estimation across 20 specific basketball abilities.

Authors:Chenxi Wang, Jizhan Fang, Xiang Chen, Bozhong Tian, Ziwen Xu, Huajun Chen, Ningyu Zhang
Title: ADS-Edit: A Multimodal Knowledge Editing Dataset for Autonomous Driving Systems
Abstract:
Recent advancements in Large Multimodal Models (LMMs) have shown promise in Autonomous Driving Systems (ADS). However, their direct application to ADS is hindered by challenges such as misunderstanding of traffic knowledge, complex road conditions, and diverse states of vehicle. To address these challenges, we propose the use of Knowledge Editing, which enables targeted modifications to a model's behavior without the need for full retraining. Meanwhile, we introduce ADS-Edit, a multimodal knowledge editing dataset specifically designed for ADS, which includes various real-world scenarios, multiple data types, and comprehensive evaluation metrics. We conduct comprehensive experiments and derive several interesting conclusions. We hope that our work will contribute to the further advancement of knowledge editing applications in the field of autonomous driving. Code and data are available in https://github.com/zjunlp/EasyEdit/blob/main/examples/ADSEdit.md.
中文摘要:本研究提出知识编辑方法和ADS-Edit数据集,通过针对性修正模型行为来解决自动驾驶中交通知识误解与复杂路况问题,无需完整重新训练即可提升多模态大模型性能。
English Summary: This study introduces Knowledge Editing and the ADS-Edit dataset to enhance Large Multimodal Models for autonomous driving by addressing traffic knowledge gaps and complex road scenarios without full retraining.

Authors:Chen Tang, Xinzhu Ma, Encheng Su, Xiufeng Song, Xiaohong Liu, Wei-Hong Li, Lei Bai, Wanli Ouyang, Xiangyu Yue
Title: UniSTD: Towards Unified Spatio-Temporal Learning across Diverse Disciplines
Abstract:
Traditional spatiotemporal models generally rely on task-specific architectures, which limit their generalizability and scalability across diverse tasks due to domain-specific design requirements. In this paper, we introduce \textbf{UniSTD}, a unified Transformer-based framework for spatiotemporal modeling, which is inspired by advances in recent foundation models with the two-stage pretraining-then-adaption paradigm. Specifically, our work demonstrates that task-agnostic pretraining on 2D vision and vision-text datasets can build a generalizable model foundation for spatiotemporal learning, followed by specialized joint training on spatiotemporal datasets to enhance task-specific adaptability. To improve the learning capabilities across domains, our framework employs a rank-adaptive mixture-of-expert adaptation by using fractional interpolation to relax the discrete variables so that can be optimized in the continuous space. Additionally, we introduce a temporal module to incorporate temporal dynamics explicitly. We evaluate our approach on a large-scale dataset covering 10 tasks across 4 disciplines, demonstrating that a unified spatiotemporal model can achieve scalable, cross-task learning and support up to 10 tasks simultaneously within one model while reducing training costs in multi-domain applications. Code will be available at https://github.com/1hunters/UniSTD.
中文: UniSTD是一种基于Transformer的统一框架,通过任务无关的预训练和专业化联合训练,实现了跨任务、可扩展的时空学习,支持多领域应用。
English: UniSTD is a unified Transformer-based framework that leverages task-agnostic pretraining and specialized joint training to enable scalable, cross-task spatiotemporal learning across multiple domains.

Authors:Masane Fuchi, Tomohiro Takagi
Title: RecTable: Fast Modeling Tabular Data with Rectified Flow
Abstract:
Score-based or diffusion models generate high-quality tabular data, surpassing GAN-based and VAE-based models. However, these methods require substantial training time. In this paper, we introduce RecTable, which uses the rectified flow modeling, applied in such as text-to-image generation and text-to-video generation. RecTable features a simple architecture consisting of a few stacked gated linear unit blocks. Additionally, our training strategies are also simple, incorporating a mixed-type noise distribution and a logit-normal timestep distribution. Our experiments demonstrate that RecTable achieves competitive performance compared to the several state-of-the-art diffusion and score-based models while reducing the required training time. Our code is available at https://github.com/fmp453/rectable.
Chinese: RecTable采用简化的整流流模型,具有简单的架构和训练策略,在表格数据生成中实现了与最先进的扩散和基于分数的模型相竞争的性能,同时显著减少了训练时间。
English: RecTable introduces a simplified rectified flow model with a straightforward architecture and training strategies, achieving competitive performance in tabular data generation while significantly reducing training time compared to state-of-the-art diffusion and score-based models.

Authors:Yankai Chen, Taotao Wang, Yixiang Fang, Yunyu Xiao
Title: Semi-supervised Node Importance Estimation with Informative Distribution Modeling for Uncertainty Regularization
Abstract:
Node importance estimation, a classical problem in network analysis, underpins various web applications. Previous methods either exploit intrinsic topological characteristics, e.g., graph centrality, or leverage additional information, e.g., data heterogeneity, for node feature enhancement. However, these methods follow the supervised learning setting, overlooking the fact that ground-truth node-importance data are usually partially labeled in practice. In this work, we propose the first semi-supervised node importance estimation framework, i.e., EASING, to improve learning quality for unlabeled data in heterogeneous graphs. Different from previous approaches, EASING explicitly captures uncertainty to reflect the confidence of model predictions. To jointly estimate the importance values and uncertainties, EASING incorporates DJE, a deep encoder-decoder neural architecture. DJE introduces distribution modeling for graph nodes, where the distribution representations derive both importance and uncertainty estimates. Additionally, DJE facilitates effective pseudo-label generation for the unlabeled data to enrich the training samples. Based on labeled and pseudo-labeled data, EASING develops effective semi-supervised heteroscedastic learning with varying node uncertainty regularization. Extensive experiments on three real-world datasets highlight the superior performance of EASING compared to competing methods. Codes are available via https://github.com/yankai-chen/EASING.
Chinese: 本文提出EASING,首个半监督节点重要性评估框架,通过深度编码器-解码器架构对异质图中未标记数据进行分布建模和伪标签生成,联合估计节点重要性及预测不确定性。
English: This paper introduces EASING, a semi-supervised framework that estimates node importance and uncertainty in heterogeneous graphs using a deep encoder-decoder architecture to handle partially labeled data through distribution modeling and pseudo-label generation.

Authors:Han Wang, Yongjie Ye, Bingru Li, Yuxiang Nie, Jinghui Lu, Jingqun Tang, Yanjie Wang, Can Huang
Title: Vision as LoRA
Abstract:
We introduce Vision as LoRA (VoRA), a novel paradigm for transforming an LLM into an MLLM. Unlike prevalent MLLM architectures that rely on external vision modules for vision encoding, VoRA internalizes visual capabilities by integrating vision-specific LoRA layers directly into the LLM. This design allows the added parameters to be seamlessly merged into the LLM during inference, eliminating structural complexity and minimizing computational overhead. Moreover, inheriting the LLM's ability of handling flexible context, VoRA can process inputs at arbitrary resolutions. To further strengthen VoRA's visual capabilities, we introduce a block-wise distillation method that transfers visual priors from a pre-trained ViT into the LoRA layers, effectively accelerating training by injecting visual knowledge. Additionally, we apply bi-directional attention masks to better capture the context information of an image. We successfully demonstrate that with additional pre-training data, VoRA can perform comparably with conventional encode-based MLLMs. All training data, codes, and model weights will be released at https://github.com/Hon-Wong/VoRA.
Chinese: VoRA是一种创新方法,通过将视觉专用LoRA层直接集成到大型语言模型中,使其转变为多模态模型,实现了推理过程中参数的无缝融合和任意分辨率输入的处理能力。
English: VoRA is a novel approach that transforms a large language model into a multimodal model by integrating vision-specific LoRA layers directly into the LLM, enabling seamless parameter merging and flexible input resolution processing during inference.

Authors:Gongzhu Yin, Hongli Zhang, Yuchen Yang, Yi Luo
Title: Inductive Link Prediction on N-ary Relational Facts via Semantic Hypergraph Reasoning
Abstract:
N-ary relational facts represent semantic correlations among more than two entities. While recent studies have developed link prediction (LP) methods to infer missing relations for knowledge graphs (KGs) containing n-ary relational facts, they are generally limited to transductive settings. Fully inductive settings, where predictions are made on previously unseen entities, remain a significant challenge. As existing methods are mainly entity embedding-based, they struggle to capture entity-independent logical rules. To fill in this gap, we propose an n-ary subgraph reasoning framework for fully inductive link prediction (ILP) on n-ary relational facts. This framework reasons over local subgraphs and has a strong inductive inference ability to capture n-ary patterns. Specifically, we introduce a novel graph structure, the n-ary semantic hypergraph, to facilitate subgraph extraction. Moreover, we develop a subgraph aggregating network, NS-HART, to effectively mine complex semantic correlations within subgraphs. Theoretically, we provide a thorough analysis from the score function optimization perspective to shed light on NS-HART's effectiveness for n-ary ILP tasks. Empirically, we conduct extensive experiments on a series of inductive benchmarks, including transfer reasoning (with and without entity features) and pairwise subgraph reasoning. The results highlight the superiority of the n-ary subgraph reasoning framework and the exceptional inductive ability of NS-HART. The source code of this paper has been made publicly available at https://github.com/yin-gz/Nary-Inductive-SubGraph.
中文: 本文提出了一种用于n元关系事实全归纳链接预测的n元子图推理框架,通过创新的n元语义超图和NS-HART网络有效挖掘复杂语义关联,展现出卓越的归纳推理能力。
English: This paper introduces an n-ary subgraph reasoning framework for fully inductive link prediction on n-ary relational facts, which utilizes a novel n-ary semantic hypergraph and the NS-HART network to effectively capture complex semantic patterns and demonstrate strong inductive capabilities.

Authors:Han Wu, Yuxuan Yao, Shuqi Liu, Zehua Liu, Xiaojin Fu, Xiongwei Han, Xing Li, Hui-Ling Zhen, Tao Zhong, Mingxuan Yuan
Title: Unlocking Efficient Long-to-Short LLM Reasoning with Model Merging
Abstract:
The transition from System 1 to System 2 reasoning in large language models (LLMs) has marked significant advancements in handling complex tasks through deliberate, iterative thinking. However, this progress often comes at the cost of efficiency, as models tend to overthink, generating redundant reasoning steps without proportional improvements in output quality. Long-to-Short (L2S) reasoning has emerged as a promising solution to this challenge, aiming to balance reasoning depth with practical efficiency. While existing approaches, such as supervised fine-tuning (SFT), reinforcement learning (RL), and prompt engineering, have shown potential, they are either computationally expensive or unstable. Model merging, on the other hand, offers a cost-effective and robust alternative by integrating the quick-thinking capabilities of System 1 models with the methodical reasoning of System 2 models. In this work, we present a comprehensive empirical study on model merging for L2S reasoning, exploring diverse methodologies, including task-vector-based, SVD-based, and activation-informed merging. Our experiments reveal that model merging can reduce average response length by up to 55% while preserving or even improving baseline performance. We also identify a strong correlation between model scale and merging efficacy with extensive evaluations on 1.5B/7B/14B/32B models. Furthermore, we investigate the merged model's ability to self-critique and self-correct, as well as its adaptive response length based on task complexity. Our findings highlight model merging as a highly efficient and effective paradigm for L2S reasoning, offering a practical solution to the overthinking problem while maintaining the robustness of System 2 reasoning. This work can be found on Github https://github.com/hahahawu/Long-to-Short-via-Model-Merging.
中文: 模型融合通过结合系统一的快速反应与系统二的深度推理,有效减少大语言模型的过度思考问题,能在保持或提升性能的同时将平均响应长度缩短高达55%。
English: Model merging offers an efficient solution to reduce overthinking in large language models by combining System 1's speed with System 2's depth, achieving up to 55% shorter responses while maintaining or improving performance.

Authors:Hao Fu, Hanbin Zhao, Jiahua Dong, Henghui Ding, Chao Zhang, Hui Qian
Title: IAP: Improving Continual Learning of Vision-Language Models via Instance-Aware Prompting
Abstract:
Recent pre-trained vision-language models (PT-VLMs) often face a Multi-Domain Task Incremental Learning (MTIL) scenario in practice, where several classes and domains of multi-modal tasks are incrementally arrived. Without access to previously seen tasks and unseen tasks, memory-constrained MTIL suffers from forward and backward forgetting. To alleviate the above challenges, parameter-efficient fine-tuning techniques (PEFT), such as prompt tuning, are employed to adapt the PT-VLM to the diverse incrementally learned tasks. To achieve effective new task adaptation, existing methods only consider the effect of PEFT strategy selection, but neglect the influence of PEFT parameter setting (e.g., prompting). In this paper, we tackle the challenge of optimizing prompt designs for diverse tasks in MTIL and propose an Instance-Aware Prompting (IAP) framework. Specifically, our Instance-Aware Gated Prompting (IA-GP) strategy enhances adaptation to new tasks while mitigating forgetting by adaptively assigning prompts across transformer layers at the instance level. Our Instance-Aware Class-Distribution-Driven Prompting (IA-CDDP) improves the task adaptation process by determining an accurate task-label-related confidence score for each instance. Experimental evaluations across 11 datasets, using three performance metrics, demonstrate the effectiveness of our proposed method. The source codes are available at https://github.com/FerdinandZJU/IAP.
Chinese: 本文提出了一种实例感知提示框架,通过自适应实例级策略优化视觉语言模型中的多领域任务增量学习提示设计,在增强新任务适应性的同时有效缓解遗忘问题。
English: This paper introduces an Instance-Aware Prompting (IAP) framework to optimize prompt designs for multi-domain task incremental learning in vision-language models, enhancing new task adaptation while mitigating forgetting through adaptive instance-level strategies.

Authors:Trung Duc Ha, Sidney Bender
Title: Diffusion Counterfactuals for Image Regressors
Abstract:
Counterfactual explanations have been successfully applied to create human interpretable explanations for various black-box models. They are handy for tasks in the image domain, where the quality of the explanations benefits from recent advances in generative models. Although counterfactual explanations have been widely applied to classification models, their application to regression tasks remains underexplored. We present two methods to create counterfactual explanations for image regression tasks using diffusion-based generative models to address challenges in sparsity and quality: 1) one based on a Denoising Diffusion Probabilistic Model that operates directly in pixel-space and 2) another based on a Diffusion Autoencoder operating in latent space. Both produce realistic, semantic, and smooth counterfactuals on CelebA-HQ and a synthetic data set, providing easily interpretable insights into the decision-making process of the regression model and reveal spurious correlations. We find that for regression counterfactuals, changes in features depend on the region of the predicted value. Large semantic changes are needed for significant changes in predicted values, making it harder to find sparse counterfactuals than with classifiers. Moreover, pixel space counterfactuals are more sparse while latent space counterfactuals are of higher quality and allow bigger semantic changes.
中文摘要:针对图像回归任务,本文采用基于扩散的像素空间和潜在空间模型生成反事实解释,虽能提供对模型决策的可解释性洞察并揭示伪相关性,但在稀疏性和质量方面相比分类任务更具挑战性。
English Summary: Counterfactual explanations for image regression tasks are generated using diffusion-based models in pixel and latent spaces, producing realistic insights into model decisions while revealing challenges in achieving sparsity and quality compared to classification tasks.

Authors:Carlos Gomes, Benedikt Blumenstiel, Joao Lucas de Sousa Almeida, Pedro Henrique de Oliveira, Paolo Fraccaro, Francesc Marti Escofet, Daniela Szwarcman, Naomi Simumba, Romeo Kienzler, Bianca Zadrozny
Title: TerraTorch: The Geospatial Foundation Models Toolkit
Abstract:
TerraTorch is a fine-tuning and benchmarking toolkit for Geospatial Foundation Models built on PyTorch Lightning and tailored for satellite, weather, and climate data. It integrates domain-specific data modules, pre-defined tasks, and a modular model factory that pairs any backbone with diverse decoder heads. These components allow researchers and practitioners to fine-tune supported models in a no-code fashion by simply editing a training configuration. By consolidating best practices for model development and incorporating the automated hyperparameter optimization extension Iterate, TerraTorch reduces the expertise and time required to fine-tune or benchmark models on new Earth Observation use cases. Furthermore, TerraTorch directly integrates with GEO-Bench, allowing for systematic and reproducible benchmarking of Geospatial Foundation Models. TerraTorch is open sourced under Apache 2.0, available at https://github.com/IBM/terratorch, and can be installed via pip install terratorch.
中文: TerraTorch 是一个基于 PyTorch Lightning 的工具包,专为卫星、气象和气候数据设计,用于微调和基准测试地理空间基础模型,支持无代码定制和高效模型优化。
English: TerraTorch is a PyTorch Lightning-based toolkit designed for fine-tuning and benchmarking geospatial foundation models using satellite, weather, and climate data, enabling no-code customization and efficient model optimization.

Authors:Henrik Christiansen, Takashi Maruyama, Federico Errica, Viktor Zaverkin, Makoto Takamoto, Francesco Alesiani
Title: Fast, Modular, and Differentiable Framework for Machine Learning-Enhanced Molecular Simulations
Abstract:
We present an end-to-end differentiable molecular simulation framework (DIMOS) for molecular dynamics and Monte Carlo simulations. DIMOS easily integrates machine-learning-based interatomic potentials and implements classical force fields including particle-mesh Ewald electrostatics. Thanks to its modularity, both classical and machine-learning-based approaches can be easily combined into a hybrid description of the system (ML/MM). By supporting key molecular dynamics features such as efficient neighborlists and constraint algorithms for larger time steps, the framework bridges the gap between hand-optimized simulation engines and the flexibility of a PyTorch implementation. The superior performance and the high versatility is probed in different benchmarks and applications, with speed-up factors of up to $170\times$. The advantage of differentiability is demonstrated by an end-to-end optimization of the proposal distribution in a Markov Chain Monte Carlo simulation based on Hamiltonian Monte Carlo. Using these optimized simulation parameters a $3\times$ acceleration is observed in comparison to ad-hoc chosen simulation parameters. The code is available at https://github.com/nec-research/DIMOS.
中文: DIMOS是一个端到端可微分分子模拟框架,它融合了机器学习势与经典力场,实现了高达170倍的加速,并通过优化蒙特卡洛模拟参数展示了3倍的效率提升。
English: DIMOS is an end-to-end differentiable molecular simulation framework that integrates machine-learning potentials with classical force fields, achieving up to 170x speed-up and demonstrating a 3x acceleration through optimized parameters in Monte Carlo simulations.

Authors:Yijiong Yu
Title: Accelerate Parallelizable Reasoning via Parallel Decoding within One Sequence
Abstract:
Recent advances in reasoning models have demonstrated significant improvements in accuracy by employing detailed and comprehensive reasoning processes. However, generating these lengthy reasoning sequences is computationally expensive and time-consuming. To address this inefficiency, we leverage the inherent parallelizability of certain tasks to accelerate the reasoning process. Specifically, when multiple parallel reasoning steps exist, we decode multiple tokens per forward pass via a tree-like attention mask within a single sequence, avoiding additional memory usage. Experimental results show that our method achieves up to nearly 100\% speedup in decoding while basically maintaining the answer quality.
Chinese: 近期推理模型虽提升准确性但效率低下,我们通过树状注意力掩码并行化推理步骤,在不影响质量的情况下实现近100%的加速。
English: Recent advances in reasoning models improve accuracy but are inefficient, so we accelerate them by parallelizing steps with a tree-like attention mask, achieving nearly 100% speedup without compromising quality.

Authors:Jinghui Yuan, Fangyuan Xie, Feiping Nie, Xuelong Li
Title: Riemannian Optimization on Relaxed Indicator Matrix Manifold
Abstract:
The indicator matrix plays an important role in machine learning, but optimizing it is an NP-hard problem. We propose a new relaxation of the indicator matrix and prove that this relaxation forms a manifold, which we call the Relaxed Indicator Matrix Manifold (RIM manifold). Based on Riemannian geometry, we develop a Riemannian toolbox for optimization on the RIM manifold. Specifically, we provide several methods of Retraction, including a fast Retraction method to obtain geodesics. We point out that the RIM manifold is a generalization of the double stochastic manifold, and it is much faster than existing methods on the double stochastic manifold, which has a complexity of \( \mathcal{O}(n^3) \), while RIM manifold optimization is \( \mathcal{O}(n) \) and often yields better results. We conducted extensive experiments, including image denoising, with millions of variables to support our conclusion, and applied the RIM manifold to Ratio Cut, we provide a rigorous convergence proof and achieve clustering results that outperform the state-of-the-art methods. Our Code in \href{https://github.com/Yuan-Jinghui/Riemannian-Optimization-on-Relaxed-Indicator-Matrix-Manifold}{here}.
中文摘要:作者提出了一种松弛指示器矩阵流形(RIM流形),实现了线性复杂度的黎曼优化,在聚类和图像去噪应用中超越了现有方法的性能。
English Summary: The authors introduce a relaxed indicator matrix manifold (RIM manifold) that enables efficient Riemannian optimization with linear complexity, outperforming existing methods in clustering and image denoising applications.

Authors:Jiale Cheng, Ruiliang Lyu, Xiaotao Gu, Xiao Liu, Jiazheng Xu, Yida Lu, Jiayan Teng, Zhuoyi Yang, Yuxiao Dong, Jie Tang, Hongning Wang, Minlie Huang
Title: VPO: Aligning Text-to-Video Generation Models with Prompt Optimization
Abstract:
Video generation models have achieved remarkable progress in text-to-video tasks. These models are typically trained on text-video pairs with highly detailed and carefully crafted descriptions, while real-world user inputs during inference are often concise, vague, or poorly structured. This gap makes prompt optimization crucial for generating high-quality videos. Current methods often rely on large language models (LLMs) to refine prompts through in-context learning, but suffer from several limitations: they may distort user intent, omit critical details, or introduce safety risks. Moreover, they optimize prompts without considering the impact on the final video quality, which can lead to suboptimal results. To address these issues, we introduce VPO, a principled framework that optimizes prompts based on three core principles: harmlessness, accuracy, and helpfulness. The generated prompts faithfully preserve user intents and, more importantly, enhance the safety and quality of generated videos. To achieve this, VPO employs a two-stage optimization approach. First, we construct and refine a supervised fine-tuning (SFT) dataset based on principles of safety and alignment. Second, we introduce both text-level and video-level feedback to further optimize the SFT model with preference learning. Our extensive experiments demonstrate that VPO significantly improves safety, alignment, and video quality compared to baseline methods. Moreover, VPO shows strong generalization across video generation models. Furthermore, we demonstrate that VPO could outperform and be combined with RLHF methods on video generation models, underscoring the effectiveness of VPO in aligning video generation models. Our code and data are publicly available at https://github.com/thu-coai/VPO.
中文: VPO是一个基于无害性、准确性和有益性三原则的提示优化框架,能忠实保留用户意图并显著提升生成视频的安全性和质量。
English: VPO is a principled framework that optimizes text prompts for video generation by ensuring harmlessness, accuracy, and helpfulness, significantly improving safety, alignment, and video quality across models.

Authors:Haoran Zheng, Renchi Yang, Jianliang Xu
Title: Adaptive Local Clustering over Attributed Graphs
Abstract:
Given a graph $G$ and a seed node $v_s$, the objective of local graph clustering (LGC) is to identify a subgraph $C_s \in G$ (a.k.a. local cluster) surrounding $v_s$ in time roughly linear with the size of $C_s$. This approach yields personalized clusters without needing to access the entire graph, which makes it highly suitable for numerous applications involving large graphs. However, most existing solutions merely rely on the topological connectivity between nodes in $G$, rendering them vulnerable to missing or noisy links that are commonly present in real-world graphs. To address this issue, this paper resorts to leveraging the complementary nature of graph topology and node attributes to enhance local clustering quality. To effectively exploit the attribute information, we first formulate the LGC as an estimation of the bidirectional diffusion distribution (BDD), which is specialized for capturing the multi-hop affinity between nodes in the presence of attributes. Furthermore, we propose LACA, an efficient and effective approach for LGC that achieves superb empirical performance on multiple real datasets while maintaining strong locality. The core components of LACA include (i) a fast and theoretically-grounded preprocessing technique for node attributes, (ii) an adaptive algorithm for diffusing any vectors over $G$ with rigorous theoretical guarantees and expedited convergence, and (iii) an effective three-step scheme for BDD approximation. Extensive experiments, comparing 17 competitors on 8 real datasets, show that LACA outperforms all competitors in terms of result quality measured against ground truth local clusters, while also being up to orders of magnitude faster. The code is available at https://github.com/HaoranZ99/alac.
中文: 本文提出LACA方法,通过结合图拓扑与节点属性来提升局部图聚类质量,在真实数据集上展现出卓越的聚类效果和运算效率。
English: This paper introduces LACA, a local graph clustering method that integrates both graph topology and node attributes to improve clustering quality and efficiency, demonstrating superior performance and speed on real datasets.

Authors:Vidya Sudevan, Fakhreddine Zayer, Rizwana Kausar, Sajid Javed, Hamad Karki, Giulia De Masi, Jorge Dias
Title: Underwater Image Enhancement by Convolutional Spiking Neural Networks
Abstract:
Underwater image enhancement (UIE) is fundamental for marine applications, including autonomous vision-based navigation. Deep learning methods using convolutional neural networks (CNN) and vision transformers advanced UIE performance. Recently, spiking neural networks (SNN) have gained attention for their lightweight design, energy efficiency, and scalability. This paper introduces UIE-SNN, the first SNN-based UIE algorithm to improve visibility of underwater images. UIE-SNN is a 19- layered convolutional spiking encoder-decoder framework with skip connections, directly trained using surrogate gradient-based backpropagation through time (BPTT) strategy. We explore and validate the influence of training datasets on energy reduction, a unique advantage of UIE-SNN architecture, in contrast to the conventional learning-based architectures, where energy consumption is model-dependent. UIE-SNN optimizes the loss function in latent space representation to reconstruct clear underwater images. Our algorithm performs on par with its non-spiking counterpart methods in terms of PSNR and structural similarity index (SSIM) at reduced timesteps ($T=5$) and energy consumption of $85\%$. The algorithm is trained on two publicly available benchmark datasets, UIEB and EUVP, and tested on unseen images from UIEB, EUVP, LSUI, U45, and our custom UIE dataset. The UIE-SNN algorithm achieves PSNR of \(17.7801~dB\) and SSIM of \(0.7454\) on UIEB, and PSNR of \(23.1725~dB\) and SSIM of \(0.7890\) on EUVP. UIE-SNN achieves this algorithmic performance with fewer operators (\(147.49\) GSOPs) and energy (\(0.1327~J\)) compared to its non-spiking counterpart (GFLOPs = \(218.88\) and Energy=\(1.0068~J\)). Compared with existing SOTA UIE methods, UIE-SNN achieves an average of \(6.5\times\) improvement in energy efficiency. The source code is available at \href{https://github.com/vidya-rejul/UIE-SNN.git}{UIE-SNN}.
Chinese: 本文提出了首个基于脉冲神经网络的水下图像增强算法UIE-SNN,在保持与非脉冲方法相当性能的同时,显著降低了85%的能耗并实现了6.5倍的能效提升。
English: This paper introduces UIE-SNN, the first spiking neural network-based underwater image enhancement algorithm that achieves comparable performance to non-spiking methods while significantly reducing energy consumption by 85% and improving energy efficiency by 6.5 times.

Authors:Hongda Liu, Longguang Wang, Weijun Guan, Ye Zhang, Yulan Guo
Title: Pluggable Style Representation Learning for Multi-Style Transfer
Abstract:
Due to the high diversity of image styles, the scalability to various styles plays a critical role in real-world applications. To accommodate a large amount of styles, previous multi-style transfer approaches rely on enlarging the model size while arbitrary-style transfer methods utilize heavy backbones. However, the additional computational cost introduced by more model parameters hinders these methods to be deployed on resource-limited devices. To address this challenge, in this paper, we develop a style transfer framework by decoupling the style modeling and transferring. Specifically, for style modeling, we propose a style representation learning scheme to encode the style information into a compact representation. Then, for style transferring, we develop a style-aware multi-style transfer network (SaMST) to adapt to diverse styles using pluggable style representations. In this way, our framework is able to accommodate diverse image styles in the learned style representations without introducing additional overhead during inference, thereby maintaining efficiency. Experiments show that our style representation can extract accurate style information. Moreover, qualitative and quantitative results demonstrate that our method achieves state-of-the-art performance in terms of both accuracy and efficiency. The codes are available in https://github.com/The-Learning-And-Vision-Atelier-LAVA/SaMST.
Chinese: 本文提出了一种风格迁移框架,通过解耦风格建模与迁移过程,利用紧凑的风格表示和风格感知网络,在推理时不增加计算开销的前提下高效处理多样化的图像风格。
English: This paper introduces a style transfer framework that separates style modeling and transferring, using a compact style representation and a style-aware network to efficiently handle diverse image styles without increasing computational costs during inference.

Authors:Hao Ai, Kunyi Wang, Zezhou Wang, Hao Lu, Jin Tian, Yaxin Luo, Peng Xing, Jen-Yuan Huang, Huaxia Li, Gen luo
Title: Dynamic Pyramid Network for Efficient Multimodal Large Language Model
Abstract:
Multimodal large language models (MLLMs) have demonstrated impressive performance in various vision-language (VL) tasks, but their expensive computations still limit the real-world application. To address this issue, recent efforts aim to compress the visual features to save the computational costs of MLLMs. However, direct visual compression methods, e.g. efficient projectors, inevitably destroy the visual semantics in MLLM, especially in difficult samples. To overcome this shortcoming, we propose a novel dynamic pyramid network (DPN) for efficient MLLMs. Specifically, DPN formulates MLLM as a hierarchical structure where visual features are gradually compressed with increasing depth. In this case, even with a high compression ratio, fine-grained visual information can still be perceived in shallow layers. To maximize the benefit of DPN, we further propose an innovative Dynamic Pooling Experts (DPE) that can dynamically choose the optimal visual compression rate according to input features. With this design, harder samples will be assigned larger computations, thus preserving the model performance. To validate our approach, we conduct extensive experiments on two popular MLLMs and ten benchmarks. Experimental results show that DPN can save up to 56% average FLOPs on LLaVA while further achieving +0.74% performance gains. Besides, the generalization ability of DPN is also validated on the existing high-resolution MLLM called LLaVA-HR. The source code will be released at https://github.com/aihao2000/DPN-LLaVA.
中文: 本文提出了一种动态金字塔网络(DPN),通过分层压缩多模态大语言模型中的视觉特征,在节省高达56%计算量的同时提升性能,并辅以动态池化专家机制根据输入难度自适应调整压缩率。
English: This paper introduces a dynamic pyramid network (DPN) that hierarchically compresses visual features in multimodal large language models to reduce computational costs by up to 56% while improving performance, complemented by a dynamic pooling experts mechanism that adapts compression rates based on input difficulty.

Authors:Team Wan, Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianxiao Yang, Jianyuan Zeng, Jiayu Wang, Jingfeng Zhang, Jingren Zhou, Jinkai Wang, Jixuan Chen, Kai Zhu, Kang Zhao, Keyu Yan, Lianghua Huang, Mengyang Feng, Ningyi Zhang, Pandeng Li, Pingyu Wu, Ruihang Chu, Ruili Feng, Shiwei Zhang, Siyang Sun, Tao Fang, Tianxing Wang, Tianyi Gui, Tingyu Weng, Tong Shen, Wei Lin, Wei Wang, Wei Wang, Wenmeng Zhou, Wente Wang, Wenting Shen, Wenyuan Yu, Xianzhong Shi, Xiaoming Huang, Xin Xu, Yan Kou, Yangyu Lv, Yifei Li, Yijing Liu, Yiming Wang, Yingya Zhang, Yitong Huang, Yong Li, You Wu, Yu Liu, Yulin Pan, Yun Zheng, Yuntao Hong, Yupeng Shi, Yutong Feng, Zeyinzi Jiang, Zhen Han, Zhi-Fan Wu, Ziyu Liu
Title: Wan: Open and Advanced Large-Scale Video Generative Models
Abstract:
This report presents Wan, a comprehensive and open suite of video foundation models designed to push the boundaries of video generation. Built upon the mainstream diffusion transformer paradigm, Wan achieves significant advancements in generative capabilities through a series of innovations, including our novel VAE, scalable pre-training strategies, large-scale data curation, and automated evaluation metrics. These contributions collectively enhance the model's performance and versatility. Specifically, Wan is characterized by four key features: Leading Performance: The 14B model of Wan, trained on a vast dataset comprising billions of images and videos, demonstrates the scaling laws of video generation with respect to both data and model size. It consistently outperforms the existing open-source models as well as state-of-the-art commercial solutions across multiple internal and external benchmarks, demonstrating a clear and significant performance superiority. Comprehensiveness: Wan offers two capable models, i.e., 1.3B and 14B parameters, for efficiency and effectiveness respectively. It also covers multiple downstream applications, including image-to-video, instruction-guided video editing, and personal video generation, encompassing up to eight tasks. Consumer-Grade Efficiency: The 1.3B model demonstrates exceptional resource efficiency, requiring only 8.19 GB VRAM, making it compatible with a wide range of consumer-grade GPUs. Openness: We open-source the entire series of Wan, including source code and all models, with the goal of fostering the growth of the video generation community. This openness seeks to significantly expand the creative possibilities of video production in the industry and provide academia with high-quality video foundation models. All the code and models are available at https://github.com/Wan-Video/Wan2.1.
中文: Wan是一套开源视频生成模型套件,提供1.3B和14B两种参数规模的模型,在保持消费级效率的同时实现了业界领先的性能表现,并支持涵盖八大下游任务的全面应用场景。
English: Wan is an open-source video generation suite featuring 1.3B and 14B parameter models that achieve leading performance across multiple benchmarks while maintaining consumer-grade efficiency and supporting eight downstream applications.

Authors:Jianyang Zhang, Qianli Luo, Guowu Yang, Wenjing Yang, Weide Liu, Guosheng Lin, Fengmao Lv
Title: Attribute-formed Class-specific Concept Space: Endowing Language Bottleneck Model with Better Interpretability and Scalability
Abstract:
Language Bottleneck Models (LBMs) are proposed to achieve interpretable image recognition by classifying images based on textual concept bottlenecks. However, current LBMs simply list all concepts together as the bottleneck layer, leading to the spurious cue inference problem and cannot generalized to unseen classes. To address these limitations, we propose the Attribute-formed Language Bottleneck Model (ALBM). ALBM organizes concepts in the attribute-formed class-specific space, where concepts are descriptions of specific attributes for specific classes. In this way, ALBM can avoid the spurious cue inference problem by classifying solely based on the essential concepts of each class. In addition, the cross-class unified attribute set also ensures that the concept spaces of different classes have strong correlations, as a result, the learned concept classifier can be easily generalized to unseen classes. Moreover, to further improve interpretability, we propose Visual Attribute Prompt Learning (VAPL) to extract visual features on fine-grained attributes. Furthermore, to avoid labor-intensive concept annotation, we propose the Description, Summary, and Supplement (DSS) strategy to automatically generate high-quality concept sets with a complete and precise attribute. Extensive experiments on 9 widely used few-shot benchmarks demonstrate the interpretability, transferability, and performance of our approach. The code and collected concept sets are available at https://github.com/tiggers23/ALBM.
Chinese: 提出的属性形成语言瓶颈模型(ALBM)将概念组织在特定类别的属性空间中,以避免虚假线索推断并提升对未见类别的泛化能力,同时引入视觉属性提示学习和自动概念生成策略,以增强可解释性并减少标注工作量。
English: The proposed Attribute-formed Language Bottleneck Model (ALBM) organizes concepts in a class-specific attribute space to avoid spurious cue inference and improve generalization to unseen classes, while also introducing Visual Attribute Prompt Learning and an automated concept generation strategy for enhanced interpretability and reduced annotation effort.

Authors:Chenwei Zhang, Khanh Dao Duc
Title: CryoSAMU: Enhancing 3D Cryo-EM Density Maps of Protein Structures at Intermediate Resolution with Structure-Aware Multimodal U-Nets
Abstract:
Enhancing cryogenic electron microscopy (cryo-EM) 3D density maps at intermediate resolution (4-8 Å) is crucial in protein structure determination. Recent advances in deep learning have led to the development of automated approaches for enhancing experimental cryo-EM density maps. Yet, these methods are not optimized for intermediate-resolution maps and rely on map density features alone. To address this, we propose CryoSAMU, a novel method designed to enhance 3D cryo-EM density maps of protein structures using structure-aware multimodal U-Nets and trained on curated intermediate-resolution density maps. We comprehensively evaluate CryoSAMU across various metrics and demonstrate its competitive performance compared to state-of-the-art methods. Notably, CryoSAMU achieves significantly faster processing speed, showing promise for future practical applications. Our code is available at https://github.com/chenwei-zhang/CryoSAMU.
中文: CryoSAMU是一种新型深度学习方法,通过结构感知多模态U-Net增强中分辨率冷冻电镜密度图,相比现有方法具有更优性能和显著更快的处理速度。
English: CryoSAMU is a novel deep learning method that enhances intermediate-resolution cryo-EM density maps using structure-aware multimodal U-Nets, achieving competitive performance and significantly faster processing speeds compared to existing approaches.

Authors:Yuhui Wu, Liyi Chen, Ruibin Li, Shihao Wang, Chenxi Xie, Lei Zhang
Title: InsViE-1M: Effective Instruction-based Video Editing with Elaborate Dataset Construction
Abstract:
Instruction-based video editing allows effective and interactive editing of videos using only instructions without extra inputs such as masks or attributes. However, collecting high-quality training triplets (source video, edited video, instruction) is a challenging task. Existing datasets mostly consist of low-resolution, short duration, and limited amount of source videos with unsatisfactory editing quality, limiting the performance of trained editing models. In this work, we present a high-quality Instruction-based Video Editing dataset with 1M triplets, namely InsViE-1M. We first curate high-resolution and high-quality source videos and images, then design an effective editing-filtering pipeline to construct high-quality editing triplets for model training. For a source video, we generate multiple edited samples of its first frame with different intensities of classifier-free guidance, which are automatically filtered by GPT-4o with carefully crafted guidelines. The edited first frame is propagated to subsequent frames to produce the edited video, followed by another round of filtering for frame quality and motion evaluation. We also generate and filter a variety of video editing triplets from high-quality images. With the InsViE-1M dataset, we propose a multi-stage learning strategy to train our InsViE model, progressively enhancing its instruction following and editing ability. Extensive experiments demonstrate the advantages of our InsViE-1M dataset and the trained model over state-of-the-art works. Codes are available at \href{https://github.com/langmanbusi/InsViE}{InsViE}.
中文:本研究提出了InsViE-1M高质量指令视频编辑数据集,通过精心设计的编辑过滤流程和多阶段训练策略,有效解决了现有数据集质量不足的问题,显著提升了模型的编辑性能。
English: This work introduces InsViE-1M, a high-quality dataset for instruction-based video editing that overcomes limitations of existing datasets through a rigorous editing-filtering pipeline and multi-stage training, significantly improving model performance.

Authors:Zhenyu Liang, Hao Li, Naiwei Yu, Kebin Sun, Ran Cheng
Title: Bridging Evolutionary Multiobjective Optimization and GPU Acceleration via Tensorization
Abstract:
Evolutionary multiobjective optimization (EMO) has made significant strides over the past two decades. However, as problem scales and complexities increase, traditional EMO algorithms face substantial performance limitations due to insufficient parallelism and scalability. While most work has focused on algorithm design to address these challenges, little attention has been given to hardware acceleration, thereby leaving a clear gap between EMO algorithms and advanced computing devices, such as GPUs. To bridge the gap, we propose to parallelize EMO algorithms on GPUs via the tensorization methodology. By employing tensorization, the data structures and operations of EMO algorithms are transformed into concise tensor representations, which seamlessly enables automatic utilization of GPU computing. We demonstrate the effectiveness of our approach by applying it to three representative EMO algorithms: NSGA-III, MOEA/D, and HypE. To comprehensively assess our methodology, we introduce a multiobjective robot control benchmark using a GPU-accelerated physics engine. Our experiments show that the tensorized EMO algorithms achieve speedups of up to 1113x compared to their CPU-based counterparts, while maintaining solution quality and effectively scaling population sizes to hundreds of thousands. Furthermore, the tensorized EMO algorithms efficiently tackle complex multiobjective robot control tasks, producing high-quality solutions with diverse behaviors. Source codes are available at https://github.com/EMI-Group/evomo.
中文摘要:该研究通过张量化方法在GPU上并行化进化多目标优化算法,实现了高达1113倍的加速,同时保持解的质量,并能有效处理大规模多目标机器人控制等复杂任务。
English Summary: The study introduces tensorization to parallelize evolutionary multiobjective optimization algorithms on GPUs, achieving up to 1113x speedup while maintaining solution quality and enabling large-scale applications like robot control.

Authors:Ziran Zhang, Xiaohui Li, Yihao Liu, Yujin Wang, Yueting Chen, Tianfan Xue, Shi Guo
Title: EGVD: Event-Guided Video Diffusion Model for Physically Realistic Large-Motion Frame Interpolation
Abstract:
Video frame interpolation (VFI) in scenarios with large motion remains challenging due to motion ambiguity between frames. While event cameras can capture high temporal resolution motion information, existing event-based VFI methods struggle with limited training data and complex motion patterns. In this paper, we introduce Event-Guided Video Diffusion Model (EGVD), a novel framework that leverages the powerful priors of pre-trained stable video diffusion models alongside the precise temporal information from event cameras. Our approach features a Multi-modal Motion Condition Generator (MMCG) that effectively integrates RGB frames and event signals to guide the diffusion process, producing physically realistic intermediate frames. We employ a selective fine-tuning strategy that preserves spatial modeling capabilities while efficiently incorporating event-guided temporal information. We incorporate input-output normalization techniques inspired by recent advances in diffusion modeling to enhance training stability across varying noise levels. To improve generalization, we construct a comprehensive dataset combining both real and simulated event data across diverse scenarios. Extensive experiments on both real and simulated datasets demonstrate that EGVD significantly outperforms existing methods in handling large motion and challenging lighting conditions, achieving substantial improvements in perceptual quality metrics (27.4% better LPIPS on Prophesee and 24.1% on BSRGB) while maintaining competitive fidelity measures. Code and datasets available at: https://github.com/OpenImagingLab/EGVD.
中文: 事件引导视频扩散模型(EGVD)通过融合预训练视频扩散模型与事件相机数据,有效解决了大运动场景下的视频帧插值难题,在多种场景中实现了卓越的感知质量和泛化能力。
English: The Event-Guided Video Diffusion Model (EGVD) effectively integrates pre-trained video diffusion models with event camera data to address large motion challenges in video frame interpolation, achieving superior perceptual quality and generalization across diverse scenarios.

Authors:Yunrui Zhang, Gustavo Batista, Salil S. Kanhere
Title: Revisit Time Series Classification Benchmark: The Impact of Temporal Information for Classification
Abstract:
Time series classification is usually regarded as a distinct task from tabular data classification due to the importance of temporal information. However, in this paper, by performing permutation tests that disrupt temporal information on the UCR time series classification archive, the most widely used benchmark for time series classification, we identify a significant proportion of datasets where temporal information has little to no impact on classification. Many of these datasets are tabular in nature or rely mainly on tabular features, leading to potentially biased evaluations of time series classifiers focused on temporal information. To address this, we propose UCR Augmented, a benchmark based on the UCR time series classification archive designed to evaluate classifiers' ability to extract and utilize temporal information. Testing classifiers from seven categories on this benchmark revealed notable shifts in performance rankings. Some previously overlooked approaches perform well, while others see their performance decline significantly when temporal information is crucial. UCR Augmented provides a more robust framework for assessing time series classifiers, ensuring fairer evaluations. Our code is available at https://github.com/YunruiZhang/Revisit-Time-Series-Classification-Benchmark.
中文: 本研究揭示UCR时间序列数据集中许多数据集无需依赖时序信息,提出UCR Augmented基准以公平评估分类器的时序特征提取能力,并观察到性能排名的显著变化。
English: This study reveals that temporal information is non-essential for many UCR time series datasets, proposing UCR Augmented to fairly evaluate classifiers' temporal feature utilization and observing significant performance ranking changes.

Authors:Taorui Wang, Zitong Yu, Yong Xu
Title: TC-GS: Tri-plane based compression for 3D Gaussian Splatting
Abstract:
Recently, 3D Gaussian Splatting (3DGS) has emerged as a prominent framework for novel view synthesis, providing high fidelity and rapid rendering speed. However, the substantial data volume of 3DGS and its attributes impede its practical utility, requiring compression techniques for reducing memory cost. Nevertheless, the unorganized shape of 3DGS leads to difficulties in compression. To formulate unstructured attributes into normative distribution, we propose a well-structured tri-plane to encode Gaussian attributes, leveraging the distribution of attributes for compression. To exploit the correlations among adjacent Gaussians, K-Nearest Neighbors (KNN) is used when decoding Gaussian distribution from the Tri-plane. We also introduce Gaussian position information as a prior of the position-sensitive decoder. Additionally, we incorporate an adaptive wavelet loss, aiming to focus on the high-frequency details as iterations increase. Our approach has achieved results that are comparable to or surpass that of SOTA 3D Gaussians Splatting compression work in extensive experiments across multiple datasets. The codes are released at https://github.com/timwang2001/TC-GS.
中文摘要:该研究提出了一种三平面编码方法,通过结构化高斯属性并利用KNN解码和位置先验来压缩3D高斯泼溅数据,在多个数据集上实现了领先的压缩性能。
English Summary: The study introduces a tri-plane encoding method for compressing 3D Gaussian Splatting data by structuring attributes and utilizing KNN decoding with positional priors, achieving state-of-the-art compression performance across multiple datasets.

Authors:Zhouhong Gu, Xingzhou Chen, Xiaoran Shi, Tao Wang, Suhang Zheng, Tianyu Li, Hongwei Feng, Yanghua Xiao
Title: GAPO: Learning Preferential Prompt through Generative Adversarial Policy Optimization
Abstract:
Recent advances in large language models have highlighted the critical need for precise control over model outputs through predefined constraints. While existing methods attempt to achieve this through either direct instruction-response synthesis or preferential response optimization, they often struggle with constraint understanding and adaptation. This limitation becomes particularly evident when handling fine-grained constraints, leading to either hallucination or brittle performance. We introduce Generative Adversarial Policy Optimization (GAPO), a novel framework that combines GAN-based training dynamics with an encoder-only reward model to progressively learn and adapt to increasingly complex constraints. GAPO leverages adversarial training to automatically generate training samples of varying difficulty while utilizing the encoder-only architecture to better capture prompt-response relationships. Extensive experiments demonstrate GAPO's superior performance across multiple benchmarks, particularly in scenarios requiring fine-grained constraint handling, where it significantly outperforms existing methods like PPO, DPO, and KTO. Our results suggest that GAPO's unique approach to preferential prompt learning offers a more robust and effective solution for controlling LLM outputs. Code is avaliable in https://github.com/MikeGu721/GAPO.
Chinese: 摘要介绍了生成对抗策略优化(GAPO),这是一种通过对抗性训练和仅编码器奖励模型来增强大语言模型处理细粒度约束能力的新框架,在多个基准测试中显著优于PPO和DPO等现有方法。
English: The abstract introduces Generative Adversarial Policy Optimization (GAPO), a novel framework that enhances large language models' ability to handle fine-grained constraints through adversarial training and an encoder-only reward model, outperforming existing methods like PPO and DPO in benchmarks.

Authors:Huanhuan Ma, Haisong Gong, Xiaoyuan Yi, Xing Xie, Dongkuan Xu
Title: Leveraging Implicit Sentiments: Enhancing Reliability and Validity in Psychological Trait Evaluation of LLMs
Abstract:
Recent advancements in Large Language Models (LLMs) have led to their increasing integration into human life. With the transition from mere tools to human-like assistants, understanding their psychological aspects-such as emotional tendencies and personalities-becomes essential for ensuring their trustworthiness. However, current psychological evaluations of LLMs, often based on human psychological assessments like the BFI, face significant limitations. The results from these approaches often lack reliability and have limited validity when predicting LLM behavior in real-world scenarios. In this work, we introduce a novel evaluation instrument specifically designed for LLMs, called Core Sentiment Inventory (CSI). CSI is a bilingual tool, covering both English and Chinese, that implicitly evaluates models' sentiment tendencies, providing an insightful psychological portrait of LLM across three dimensions: optimism, pessimism, and neutrality. Through extensive experiments, we demonstrate that: 1) CSI effectively captures nuanced emotional patterns, revealing significant variation in LLMs across languages and contexts; 2) Compared to current approaches, CSI significantly improves reliability, yielding more consistent results; and 3) The correlation between CSI scores and the sentiment of LLM's real-world outputs exceeds 0.85, demonstrating its strong validity in predicting LLM behavior. We make CSI public available via: https://github.com/dependentsign/CSI.
中文摘要:本研究开发了核心情感量表(CSI),这是一种专为大型语言模型设计的双语心理评估工具,能够通过乐观、悲观和中立三个维度可靠评估模型情感倾向,相比现有方法显著提升了评估可靠性,并与实际输出情感呈现超过0.85的相关性。
English Summary: This study introduces the Core Sentiment Inventory (CSI), a bilingual psychological evaluation tool designed to reliably assess Large Language Models' emotional tendencies across optimism, pessimism, and neutrality dimensions, demonstrating superior reliability and over 0.85 correlation with real-world outputs compared to existing methods.

Authors:Naitik Jain, Amogh Joshi, Mason Earles
Title: iNatAg: Multi-Class Classification Models Enabled by a Large-Scale Benchmark Dataset with 4.7M Images of 2,959 Crop and Weed Species
Abstract:
Accurate identification of crop and weed species is critical for precision agriculture and sustainable farming. However, it remains a challenging task due to a variety of factors -- a high degree of visual similarity among species, environmental variability, and a continued lack of large, agriculture-specific image data. We introduce iNatAg, a large-scale image dataset which contains over 4.7 million images of 2,959 distinct crop and weed species, with precise annotations along the taxonomic hierarchy from binary crop/weed labels to specific species labels. Curated from the broader iNaturalist database, iNatAg contains data from every continent and accurately reflects the variability of natural image captures and environments. Enabled by this data, we train benchmark models built upon the Swin Transformer architecture and evaluate the impact of various modifications such as the incorporation of geospatial data and LoRA finetuning. Our best models achieve state-of-the-art performance across all taxonomic classification tasks, achieving 92.38\% on crop and weed classification. Furthermore, the scale of our dataset enables us to explore incorrect misclassifications and unlock new analytic possiblities for plant species. By combining large-scale species coverage, multi-task labels, and geographic diversity, iNatAg provides a new foundation for building robust, geolocation-aware agricultural classification systems. We release the iNatAg dataset publicly through AgML (https://github.com/Project-AgML/AgML), enabling direct access and integration into agricultural machine learning workflows.
中文: iNatAg数据集包含超过470万张图像,涵盖2959种作物和杂草,通过多层次标注和地理多样性,为精准农业提供了强大的分类基础,解决了物种视觉相似性和环境多变性的挑战。
English: The iNatAg dataset, comprising over 4.7 million images of 2,959 crop and weed species with multi-level annotations, enables state-of-the-art classification models and provides a robust foundation for precision agriculture by addressing challenges like visual similarity and environmental variability.

Authors:Yu Xin, Gorkem Can Ates, Kuang Gong, Wei Shao
Title: Med3DVLM: An Efficient Vision-Language Model for 3D Medical Image Analysis
Abstract:
Vision-language models (VLMs) have shown promise in 2D medical image analysis, but extending them to 3D remains challenging due to the high computational demands of volumetric data and the difficulty of aligning 3D spatial features with clinical text. We present Med3DVLM, a 3D VLM designed to address these challenges through three key innovations: (1) DCFormer, an efficient encoder that uses decomposed 3D convolutions to capture fine-grained spatial features at scale; (2) SigLIP, a contrastive learning strategy with pairwise sigmoid loss that improves image-text alignment without relying on large negative batches; and (3) a dual-stream MLP-Mixer projector that fuses low- and high-level image features with text embeddings for richer multi-modal representations. We evaluate our model on the M3D dataset, which includes radiology reports and VQA data for 120,084 3D medical images. Results show that Med3DVLM achieves superior performance across multiple benchmarks. For image-text retrieval, it reaches 61.00% R@1 on 2,000 samples, significantly outperforming the current state-of-the-art M3D model (19.10%). For report generation, it achieves a METEOR score of 36.42% (vs. 14.38%). In open-ended visual question answering (VQA), it scores 36.76% METEOR (vs. 33.58%), and in closed-ended VQA, it achieves 79.95% accuracy (vs. 75.78%). These results highlight Med3DVLM's ability to bridge the gap between 3D imaging and language, enabling scalable, multi-task reasoning across clinical applications. Our code is publicly available at https://github.com/mirthAI/Med3DVLM.
中文:Med3DVLM通过高效编码器、对比学习策略和多模态融合三大创新,解决了三维医学影像与文本对齐的难题,在多项基准测试中实现了领先性能。
English: Med3DVLM overcomes the computational and alignment challenges of 3D vision-language models through innovations in efficient encoding, contrastive learning, and multi-modal fusion, achieving state-of-the-art performance across multiple medical imaging benchmarks.

Authors:Yudong Yang, Jimin Zhuang, Guangzhi Sun, Changli Tang, Yixuan Li, Peihan Li, Yifan Jiang, Wei Li, Zejun Ma, Chao Zhang
Title: Audio-centric Video Understanding Benchmark without Text Shortcut
Abstract:
Audio often serves as an auxiliary modality in video understanding tasks of audio-visual large language models (LLMs), merely assisting in the comprehension of visual information. However, a thorough understanding of videos significantly depends on auditory information, as audio offers critical context, emotional cues, and semantic meaning that visual data alone often lacks. This paper proposes an audio-centric video understanding benchmark (AVUT) to evaluate the video comprehension capabilities of multimodal LLMs with a particular focus on auditory information. AVUT introduces a suite of carefully designed audio-centric tasks, holistically testing the understanding of both audio content and audio-visual interactions in videos. Moreover, this work points out the text shortcut problem that largely exists in other benchmarks where the correct answer can be found from question text alone without needing videos. AVUT addresses this problem by proposing a answer permutation-based filtering mechanism. A thorough evaluation across a diverse range of open-source and proprietary multimodal LLMs is performed, followed by the analyses of deficiencies in audio-visual LLMs. Demos and data are available at https://github.com/lark-png/AVUT.
中文摘要:本文提出以音频为中心的视频理解基准(AVUT),通过设计音频导向任务和答案重排过滤机制,专门评估多模态大语言模型对听觉信息及音视频交互的理解能力。
English Summary: This paper introduces an audio-centric video understanding benchmark (AVUT) to evaluate multimodal LLMs' video comprehension by focusing on auditory information and addressing text shortcut issues through answer permutation filtering.

Authors:Yudong Yang, Jimin Zhuang, Guangzhi Sun, Changli Tang, Yixuan Li, Peihan Li, Yifan Jiang, Wei Li, Zejun Ma, Chao Zhang
Title: Audio-centric Video Understanding Benchmark without Text Shortcut
Abstract:
Audio often serves as an auxiliary modality in video understanding tasks of audio-visual large language models (LLMs), merely assisting in the comprehension of visual information. However, a thorough understanding of videos significantly depends on auditory information, as audio offers critical context, emotional cues, and semantic meaning that visual data alone often lacks. This paper proposes an audio-centric video understanding benchmark (AVUT) to evaluate the video comprehension capabilities of multimodal LLMs with a particular focus on auditory information. AVUT introduces a suite of carefully designed audio-centric tasks, holistically testing the understanding of both audio content and audio-visual interactions in videos. Moreover, this work points out the text shortcut problem that largely exists in other benchmarks where the correct answer can be found from question text alone without needing videos. AVUT addresses this problem by proposing a answer permutation-based filtering mechanism. A thorough evaluation across a diverse range of open-source and proprietary multimodal LLMs is performed, followed by the analyses of deficiencies in audio-visual LLMs. Demos and data are available at https://github.com/lark-png/AVUT.
中文摘要:本文提出以音频为中心的视频理解基准(AVUT),通过设计音频导向任务和答案重排过滤机制,专门评估多模态大语言模型对听觉信息及音视频交互的理解能力。
English Summary: This paper introduces an audio-centric video understanding benchmark (AVUT) to evaluate multimodal LLMs' video comprehension by focusing on auditory information and addressing text shortcut issues through answer permutation filtering.

Authors:Han Chen, Zicong Jiang, Zining Zhang, Bingsheng He, Pingyi Luo, Mian Lu, Yuqiang Chen
Title: LogQuant: Log-Distributed 2-Bit Quantization of KV Cache with Superior Accuracy Preservation
Abstract:
We introduce LogQuant, a groundbreaking 2-bit quantization technique for KV Cache in large language model (LLM) inference, delivering substantial memory savings while preserving superior performance. Previous methods either assume that later tokens are more important or attempt to predict important tokens based on earlier attention patterns. Both approaches, however, can result in performance bottlenecks or frequent mispredictions. LogQuant takes a different approach. By applying a log-based filtering mechanism, it selectively compresses the KV Cache across the entire context, achieving better performance with the same or even reduced memory footprint compared to existing methods. In benchmark tests, it enhances throughput by 25% and boosts batch size by 60% without increasing memory consumption. For challenging tasks such as Math and Code Completion, LogQuant improves accuracy by 40% to 200% at the same compression ratio, outperforming comparable techniques.LogQuant integrates effortlessly with popular inference frameworks like Python's transformers library. Implementation can be available in https://github.com/Concyclics/LogQuantKV.
中文: LogQuant是一种创新的2位KV缓存量化技术,可在保持高性能的同时大幅降低大语言模型推理的内存占用,在同等压缩率下将复杂任务的准确率提升40%至200%,并提高吞吐量25%。
English: LogQuant is an innovative 2-bit KV Cache quantization method that significantly reduces memory usage in LLM inference while maintaining high performance, achieving up to 25% throughput improvement and 40-200% accuracy gains on complex tasks.

Authors:Daniel G. P. Petrini, Hae Yong Kim
Title: Optimizing Breast Cancer Detection in Mammograms: A Comprehensive Study of Transfer Learning, Resolution Reduction, and Multi-View Classification
Abstract:
Mammography, an X-ray-based imaging technique, plays a crucial role in the early detection of breast cancer. Its accuracy heavily depends on expert radiologists, making it essential to minimize interpretation errors. To support radiologists, various computer-aided detection and diagnostic methods have been proposed, increasingly leveraging advancements in artificial intelligence and machine learning. Over recent years, mammogram analysis has evolved significantly - from early patch-based classifiers, which examine only localized regions of images, to full-image classifiers, and later towards multi-view systems that simultaneously integrate multiple perspectives of the mammographic exam for enhanced accuracy. Despite this progression, critical questions remain, such as whether multi-view systems consistently outperform single-view approaches. In this paper, we systematically evaluate and compare the effectiveness of single-view and multi-view mammogram classification techniques. Our research introduces models that achieve superior performance relative to existing state-of-the-art approaches in both single-view and two-view classification scenarios. Furthermore, our findings provide valuable insights into optimal model architectures and effective transfer learning strategies, paving the way for more accurate and efficient mammogram interpretation. The inference code and model are available at https://github.com/dpetrini/multiple-view.
中文: 本研究通过系统评估人工智能驱动的乳腺摄影分析,证明了多视角架构显著优于单视角方法,并通过原则性模型设计确立了新的最先进基准,从而推进了乳腺癌筛查技术。
English: This study advances breast cancer screening by systematically evaluating AI-driven mammography analysis, demonstrating that multi-view architectures significantly outperform single-view methods and establishing new state-of-the-art benchmarks through principled model design.

Authors:Daniel G. P. Petrini, Hae Yong Kim
Title: Optimizing Breast Cancer Detection in Mammograms: A Comprehensive Study of Transfer Learning, Resolution Reduction, and Multi-View Classification
Abstract:
Mammography, an X-ray-based imaging technique, remains central to the early detection of breast cancer. Recent advances in artificial intelligence have enabled increasingly sophisticated computer-aided diagnostic methods, evolving from patch-based classifiers to whole-image approaches and then to multi-view architectures that jointly analyze complementary projections. Despite this progress, several critical questions remain unanswered. In this study, we systematically investigate these issues by addressing five key research questions: (1) the role of patch classifiers in performance, (2) the transferability of natural-image-trained backbones, (3) the advantages of learn-to-resize over conventional downscaling, (4) the contribution of multi-view integration, and (5) the robustness of findings across varying image quality. Beyond benchmarking, our experiments demonstrate clear performance gains over prior work. For the CBIS-DDSM dataset, we improved single-view AUC from 0.8153 to 0.8343, and multiple-view AUC from 0.8483 to 0.8658. Using a new comparative method, we also observed a 0.0217 AUC increase when extending from single to multiple-view analysis. On the complete VinDr-Mammo dataset, the multiple-view approach further improved results, achieving a 0.0492 AUC increase over single view and reaching 0.8511 AUC overall. These results establish new state-of-the-art benchmarks, providing clear evidence of the advantages of multi-view architectures for mammogram interpretation. Beyond performance, our analysis offers principled insights into model design and transfer learning strategies, contributing to the development of more accurate and reliable breast cancer screening tools. The inference code and trained models are publicly available at https://github.com/dpetrini/multiple-view.
中文: 本研究通过系统评估人工智能驱动的乳腺摄影分析,证明了多视角架构显著优于单视角方法,并通过原则性模型设计确立了新的最先进基准,从而推进了乳腺癌筛查技术。
English: This study advances breast cancer screening by systematically evaluating AI-driven mammography analysis, demonstrating that multi-view architectures significantly outperform single-view methods and establishing new state-of-the-art benchmarks through principled model design.

Authors:Mengqi Lou, Kabir Aladin Verchand, Sara Fridovich-Keil, Ashwin Pananjady
Title: Accurate, provable, and fast nonlinear tomographic reconstruction: A variational inequality approach
Abstract:
We consider the problem of signal reconstruction for computed tomography (CT) under a nonlinear forward model that accounts for exponential signal attenuation, a polychromatic X-ray source, general measurement noise (e.g. Poisson shot noise), and observations acquired over multiple wavelength windows. We develop a simple iterative algorithm for single-material reconstruction, which we call EXACT (EXtragradient Algorithm for Computed Tomography), based on formulating our estimate as the fixed point of a monotone variational inequality. We prove guarantees on the statistical and computational performance of EXACT under practical assumptions on the measurement process. We also consider a recently introduced variant of this model with Gaussian measurements, and present sample and iteration complexity bounds for EXACT that improve upon those of existing algorithms. We apply our EXACT algorithm to a CT phantom image recovery task and show that it often requires fewer X-ray projection exposures, lower source intensity, and less computation time to achieve similar reconstruction quality to existing methods.
中文摘要:EXACT算法针对非线性前向模型的计算机断层扫描重建问题,相比现有方法能以更少的X射线投影曝光、更低辐射剂量和更短计算时间实现相近重建质量。
English Summary: The EXACT algorithm is developed for computed tomography reconstruction under a nonlinear forward model, offering improved statistical and computational performance with fewer exposures and less computation time compared to existing methods.

Authors:Xiang Xu, Lingdong Kong, Hui Shuai, Wenwei Zhang, Liang Pan, Kai Chen, Ziwei Liu, Qingshan Liu
Title: SuperFlow++: Enhanced Spatiotemporal Consistency for Cross-Modal Data Pretraining
Abstract:
LiDAR representation learning has emerged as a promising approach to reducing reliance on costly and labor-intensive human annotations. While existing methods primarily focus on spatial alignment between LiDAR and camera sensors, they often overlook the temporal dynamics critical for capturing motion and scene continuity in driving scenarios. To address this limitation, we propose SuperFlow++, a novel framework that integrates spatiotemporal cues in both pretraining and downstream tasks using consecutive LiDAR-camera pairs. SuperFlow++ introduces four key components: (1) a view consistency alignment module to unify semantic information across camera views, (2) a dense-to-sparse consistency regularization mechanism to enhance feature robustness across varying point cloud densities, (3) a flow-based contrastive learning approach that models temporal relationships for improved scene understanding, and (4) a temporal voting strategy that propagates semantic information across LiDAR scans to improve prediction consistency. Extensive evaluations on 11 heterogeneous LiDAR datasets demonstrate that SuperFlow++ outperforms state-of-the-art methods across diverse tasks and driving conditions. Furthermore, by scaling both 2D and 3D backbones during pretraining, we uncover emergent properties that provide deeper insights into developing scalable 3D foundation models. With strong generalizability and computational efficiency, SuperFlow++ establishes a new benchmark for data-efficient LiDAR-based perception in autonomous driving. The code is publicly available at https://github.com/Xiangxu-0103/SuperFlow
中文摘要:SuperFlow++ 是一种新颖的激光雷达表示学习框架,通过整合连续激光雷达-摄像头对的时空线索,在11个数据集上超越现有最优方法,为自动驾驶感知建立了新基准。
English Summary: SuperFlow++ is a novel LiDAR representation learning framework that integrates spatiotemporal cues using consecutive LiDAR-camera pairs, outperforming state-of-the-art methods across 11 datasets while establishing new benchmarks for autonomous driving perception.

Authors:Xinpeng Li, Shijian Deng, Bolin Lai, Weiguo Pian, James M. Rehg, Yapeng Tian
Title: Towards Online Multi-Modal Social Interaction Understanding
Abstract:
Multimodal social interaction understanding (MMSI) is critical in human-robot interaction systems. In real-world scenarios, AI agents are required to provide real-time feedback. However, existing models often depend on both past and future contexts, which hinders them from applying to real-world problems. To bridge this gap, we propose an online MMSI setting, where the model must resolve MMSI tasks using only historical information, such as recorded dialogues and video streams. To address the challenges of missing the useful future context, we develop a novel framework, named Online-MMSI-VLM, that leverages two complementary strategies: multi-party conversation forecasting and social-aware visual prompting with multi-modal large language models. First, to enrich linguistic context, the multi-party conversation forecasting simulates potential future utterances in a coarse-to-fine manner, anticipating upcoming speaker turns and then generating fine-grained conversational details. Second, to effectively incorporate visual social cues like gaze and gesture, social-aware visual prompting highlights the social dynamics in video with bounding boxes and body keypoints for each person and frame. Extensive experiments on three tasks and two datasets demonstrate that our method achieves state-of-the-art performance and significantly outperforms baseline models, indicating its effectiveness on Online-MMSI. The code and pre-trained models will be publicly released at: https://github.com/Sampson-Lee/OnlineMMSI.
Chinese: 本研究提出了Online-MMSI-VLM框架,通过预测未来对话和整合社交感知视觉线索,在无需未来上下文的情况下实现了实时多模态社交交互理解的卓越性能。
English: The study introduces Online-MMSI-VLM, a framework that enhances real-time multimodal social interaction understanding by forecasting future conversations and integrating social-aware visual cues, achieving superior performance without relying on future context.

Authors:Aaron Serianni, Tyler Zhu, Olga Russakovsky, Vikram V. Ramaswamy
Title: Attention IoU: Examining Biases in CelebA using Attention Maps
Abstract:
Computer vision models have been shown to exhibit and amplify biases across a wide array of datasets and tasks. Existing methods for quantifying bias in classification models primarily focus on dataset distribution and model performance on subgroups, overlooking the internal workings of a model. We introduce the Attention-IoU (Attention Intersection over Union) metric and related scores, which use attention maps to reveal biases within a model's internal representations and identify image features potentially causing the biases. First, we validate Attention-IoU on the synthetic Waterbirds dataset, showing that the metric accurately measures model bias. We then analyze the CelebA dataset, finding that Attention-IoU uncovers correlations beyond accuracy disparities. Through an investigation of individual attributes through the protected attribute of Male, we examine the distinct ways biases are represented in CelebA. Lastly, by subsampling the training set to change attribute correlations, we demonstrate that Attention-IoU reveals potential confounding variables not present in dataset labels.
Chinese: Attention-IoU指标通过注意力图揭示模型内部表征中的偏见并识别导致偏见的图像特征,在Waterbirds和CelebA数据集上的验证表明,它能发现超出准确率差异的相关性和数据标签中未体现的潜在混杂变量。
English: The Attention-IoU metric uses attention maps to uncover biases in a model's internal representations and identify image features causing them, validated on datasets like Waterbirds and CelebA to reveal correlations and confounding variables beyond accuracy disparities.

Authors:Matthew Greenig, Haowen Zhao, Vladimir Radenkovic, Aubin Ramon, Pietro Sormanni
Title: IgCraft: A versatile sequence generation framework for antibody discovery and engineering
Abstract:
Designing antibody sequences to better resemble those observed in natural human repertoires is a key challenge in biologics development. We introduce IgCraft: a multi-purpose model for paired human antibody sequence generation, built on Bayesian Flow Networks. IgCraft presents one of the first unified generative modeling frameworks capable of addressing multiple antibody sequence design tasks with a single model, including unconditional sampling, sequence inpainting, inverse folding, and CDR motif scaffolding. Our approach achieves competitive results across the full spectrum of these tasks while constraining generation to the space of human antibody sequences, exhibiting particular strengths in CDR motif scaffolding (grafting) where we achieve state-of-the-art performance in terms of humanness and preservation of structural properties. By integrating previously separate tasks into a single scalable generative model, IgCraft provides a versatile platform for sampling human antibody sequences under a variety of contexts relevant to antibody discovery and engineering. Model code and weights are publicly available at https://github.com/mgreenig/IgCraft.
中文: IgCraft 提出了一种基于贝叶斯流网络的多功能生成模型,能够统一执行多种抗体序列设计任务,尤其在CDR基序支架方面表现出色,确保了序列的人源化和结构特性的保留。
English: IgCraft introduces a unified generative model using Bayesian Flow Networks to design human-like antibody sequences, excelling in tasks like CDR motif scaffolding while ensuring structural integrity and naturalness.

Authors:Zhuoming Liu, Yiquan Li, Khoi Duc Nguyen, Yiwu Zhong, Yin Li
Title: PAVE: Patching and Adapting Video Large Language Models
Abstract:
Pre-trained video large language models (Video LLMs) exhibit remarkable reasoning capabilities, yet adapting these models to new tasks involving additional modalities or data types (e.g., audio or 3D information) remains challenging. In this paper, we present PAVE, a flexible framework for adapting pre-trained Video LLMs to downstream tasks with side-channel signals, such as audio, 3D cues, or multi-view videos. PAVE introduces lightweight adapters, referred to as "patches," which add a small number of parameters and operations to a base model without modifying its architecture or pre-trained weights. In doing so, PAVE can effectively adapt the pre-trained base model to support diverse downstream tasks, including audio-visual question answering, 3D reasoning, multi-view video recognition, and high frame rate video understanding. Across these tasks, PAVE significantly enhances the performance of the base model, surpassing state-of-the-art task-specific models while incurring a minor cost of ~0.1% additional FLOPs and parameters. Further, PAVE supports multi-task learning and generalizes well across different Video LLMs. Our code is available at https://github.com/dragonlzm/PAVE.
Chinese: PAVE是一种灵活的框架,通过为预训练的视频大语言模型添加轻量级适配器,使其能够处理包含音频和3D线索等辅助信号的任务,在显著提升性能的同时仅增加极少的计算开销。
English: PAVE is a flexible framework that enhances pre-trained Video LLMs by adding lightweight adapters to handle tasks with side-channel signals like audio and 3D cues, significantly improving performance with minimal computational overhead.

Authors:Jingdan Kang, Haoxin Yang, Yan Cai, Huaidong Zhang, Xuemiao Xu, Yong Du, Shengfeng He
Title: SITA: Structurally Imperceptible and Transferable Adversarial Attacks for Stylized Image Generation
Abstract:
Image generation technology has brought significant advancements across various fields but has also raised concerns about data misuse and potential rights infringements, particularly with respect to creating visual artworks. Current methods aimed at safeguarding artworks often employ adversarial attacks. However, these methods face challenges such as poor transferability, high computational costs, and the introduction of noticeable noise, which compromises the aesthetic quality of the original artwork. To address these limitations, we propose a Structurally Imperceptible and Transferable Adversarial (SITA) attacks. SITA leverages a CLIP-based destylization loss, which decouples and disrupts the robust style representation of the image. This disruption hinders style extraction during stylized image generation, thereby impairing the overall stylization process. Importantly, SITA eliminates the need for a surrogate diffusion model, leading to significantly reduced computational overhead. The method's robust style feature disruption ensures high transferability across diverse models. Moreover, SITA introduces perturbations by embedding noise within the imperceptible structural details of the image. This approach effectively protects against style extraction without compromising the visual quality of the artwork. Extensive experiments demonstrate that SITA offers superior protection for artworks against unauthorized use in stylized generation. It significantly outperforms existing methods in terms of transferability, computational efficiency, and noise imperceptibility. Code is available at https://github.com/A-raniy-day/SITA.
中文: 提出的SITA方法通过破坏图像风格表征并在结构细节中嵌入噪声,有效保护艺术作品免遭未经授权的风格化生成使用,相比现有方法具有更高的可迁移性、计算效率和视觉不可感知性。
English: The proposed SITA method protects artworks from unauthorized use in stylized image generation by disrupting style representations through structurally embedded noise, offering enhanced transferability, computational efficiency, and visual imperceptibility compared to existing approaches.

Authors:Vladan Stojnić, Yannis Kalantidis, Jiří Matas, Giorgos Tolias
Title: LPOSS: Label Propagation Over Patches and Pixels for Open-vocabulary Semantic Segmentation
Abstract:
We propose a training-free method for open-vocabulary semantic segmentation using Vision-and-Language Models (VLMs). Our approach enhances the initial per-patch predictions of VLMs through label propagation, which jointly optimizes predictions by incorporating patch-to-patch relationships. Since VLMs are primarily optimized for cross-modal alignment and not for intra-modal similarity, we use a Vision Model (VM) that is observed to better capture these relationships. We address resolution limitations inherent to patch-based encoders by applying label propagation at the pixel level as a refinement step, significantly improving segmentation accuracy near class boundaries. Our method, called LPOSS+, performs inference over the entire image, avoiding window-based processing and thereby capturing contextual interactions across the full image. LPOSS+ achieves state-of-the-art performance among training-free methods, across a diverse set of datasets. Code: https://github.com/vladan-stojnic/LPOSS
Chinese: LPOSS+ 提出了一种无需训练的开集词汇语义分割方法,通过像素级标签传播优化视觉语言模型的预测结果,并利用视觉模型捕捉模态内关系,在不依赖窗口处理的情况下实现了最先进的性能。
English: LPOSS+ introduces a training-free method for open-vocabulary semantic segmentation by refining Vision-and-Language Model predictions through pixel-level label propagation and leveraging Vision Models to capture intra-modal relationships, achieving state-of-the-art accuracy without window-based processing.

Authors:Pihai Sun, Junjun Jiang, Yuanqi Yao, Youyu Chen, Wenbo Zhao, Kui Jiang, Xianming Liu
Title: FUSE: Label-Free Image-Event Joint Monocular Depth Estimation via Frequency-Decoupled Alignment and Degradation-Robust Fusion
Abstract:
Image-event joint depth estimation methods leverage complementary modalities for robust perception, yet face challenges in generalizability stemming from two factors: 1) limited annotated image-event-depth datasets causing insufficient cross-modal supervision, and 2) inherent frequency mismatches between static images and dynamic event streams with distinct spatiotemporal patterns, leading to ineffective feature fusion. To address this dual challenge, we propose Frequency-decoupled Unified Self-supervised Encoder (FUSE) with two synergistic components: The Parameter-efficient Self-supervised Transfer (PST) establishes cross-modal knowledge transfer through latent space alignment with image foundation models, effectively mitigating data scarcity by enabling joint encoding without depth ground truth. Complementing this, we propose the Frequency-Decoupled Fusion module (FreDFuse) to explicitly decouple high-frequency edge features from low-frequency structural components, resolving modality-specific frequency mismatches through physics-aware fusion. This combined approach enables FUSE to construct a universal image-event encoder that only requires lightweight decoder adaptation for target datasets. Extensive experiments demonstrate state-of-the-art performance with 14% and 24.9% improvements in Abs.Rel on MVSEC and DENSE datasets. The framework exhibits remarkable zero-shot adaptability to challenging scenarios including extreme lighting and motion blur, significantly advancing real-world deployment capabilities. The source code for our method is publicly available at: https://github.com/sunpihai-up/FUSE
中文摘要:FUSE框架通过频率解耦融合的自监督编码器,解决了图像-事件联合深度估计中数据稀缺和模态失配的双重挑战,在实现最优性能的同时展现出卓越的零样本适应能力。
English Summary: The FUSE framework introduces a self-supervised encoder with frequency-decoupled fusion to overcome data scarcity and modality mismatch in image-event depth estimation, achieving state-of-the-art performance and strong zero-shot adaptability.

Authors:Yuli Zhou, Guolei Sun, Yawei Li, Yuqian Fu, Luca Benini, Ender Konukoglu
Title: CamSAM2: Segment Anything Accurately in Camouflaged Videos
Abstract:
Video camouflaged object segmentation (VCOS), aiming at segmenting camouflaged objects that seamlessly blend into their environment, is a fundamental vision task with various real-world applications. With the release of SAM2, video segmentation has witnessed significant progress. However, SAM2's capability of segmenting camouflaged videos is suboptimal, especially when given simple prompts such as point and box. To address the problem, we propose Camouflaged SAM2 (CamSAM2), which enhances SAM2's ability to handle camouflaged scenes without modifying SAM2's parameters. Specifically, we introduce a decamouflaged token to provide the flexibility of feature adjustment for VCOS. To make full use of fine-grained and high-resolution features from the current frame and previous frames, we propose implicit object-aware fusion (IOF) and explicit object-aware fusion (EOF) modules, respectively. Object prototype generation (OPG) is introduced to abstract and memorize object prototypes with informative details using high-quality features from previous frames. Extensive experiments are conducted to validate the effectiveness of our approach. While CamSAM2 only adds negligible learnable parameters to SAM2, it substantially outperforms SAM2 on three VCOS datasets, especially achieving 12.2 mDice gains with click prompt on MoCA-Mask and 19.6 mDice gains with mask prompt on SUN-SEG-Hard, with Hiera-T as the backbone. The code will be available at https://github.com/zhoustan/CamSAM2.
中文: 本文提出Camouflaged SAM2 (CamSAM2)方法,通过引入少量可学习参数增强SAM2在视频中分割伪装物体的能力,在多个VCOS数据集上实现了显著性能提升。
English: This paper introduces Camouflaged SAM2 (CamSAM2), a method that enhances SAM2's ability to segment camouflaged objects in videos by adding minimal learnable parameters, achieving significant performance improvements on VCOS datasets.

Authors:Yusen Xie, Zhengmin Huang, Shaojie Shen, Jun Ma
Title: Semi-SMD: Semi-Supervised Metric Depth Estimation via Surrounding Cameras for Autonomous Driving
Abstract:
In this paper, we introduce Semi-SMD, a novel metric depth estimation framework tailored for surrounding cameras equipment in autonomous driving. In this work, the input data consists of adjacent surrounding frames and camera parameters. We propose a unified spatial-temporal-semantic fusion module to construct the visual fused features. Cross-attention components for surrounding cameras and adjacent frames are utilized to focus on metric scale information refinement and temporal feature matching. Building on this, we propose a pose estimation framework using surrounding cameras, their corresponding estimated depths, and extrinsic parameters, which effectively address the scale ambiguity in multi-camera setups. Moreover, semantic world model and monocular depth estimation world model are integrated to supervised the depth estimation, which improve the quality of depth estimation. We evaluate our algorithm on DDAD and nuScenes datasets, and the results demonstrate that our method achieves state-of-the-art performance in terms of surrounding camera based depth estimation quality. The source code will be available on https://github.com/xieyuser/Semi-SMD.
Chinese: 本文提出Semi-SMD,一种面向自动驾驶环视相机的创新深度估计框架,通过时空语义融合和跨注意力机制,在基准数据集上实现了最先进的性能表现。
English: This paper presents Semi-SMD, a novel framework for metric depth estimation in autonomous driving that integrates spatial-temporal-semantic fusion and cross-attention mechanisms to achieve state-of-the-art performance on benchmark datasets.

Authors:Ilias Stogiannidis, Steven McDonagh, Sotirios A. Tsaftaris
Title: Mind the Gap: Benchmarking Spatial Reasoning in Vision-Language Models
Abstract:
Vision-Language Models (VLMs) have recently emerged as powerful tools, excelling in tasks that integrate visual and textual comprehension, such as image captioning, visual question answering, and image-text retrieval. However, existing benchmarks for VLMs include spatial components, which often fail to isolate spatial reasoning from related tasks such as object detection or semantic comprehension. In this paper, we address these deficiencies with a multi-faceted approach towards understanding spatial reasoning. Informed by the diverse and multi-dimensional nature of human spatial reasoning abilities, we present a detailed analysis that first delineates the core elements of spatial reasoning: spatial relations, orientation and navigation, mental rotation, and spatial visualization, and then assesses the performance of these models in both synthetic and real-world images, bridging controlled and naturalistic contexts. We analyze 13 state-of-the-art Vision-Language Models, uncovering pivotal insights into their spatial reasoning performance. Our results reveal profound shortcomings in current VLMs, with average accuracy across the 13 models approximating random chance, highlighting spatial reasoning as a persistent obstacle. This work not only exposes the pressing need to advance spatial reasoning within VLMs but also establishes a solid platform for future exploration. Code available on GitHub (https://github.com/stogiannidis/srbench) and dataset available on HuggingFace (https://huggingface.co/datasets/stogiannidis/srbench).
Chinese: 视觉语言模型在空间推理方面存在显著不足,其表现接近随机猜测水平,凸显了推进该领域发展的迫切需求。
English: Vision-Language Models show significant limitations in spatial reasoning, with their performance averaging near random chance, highlighting an urgent need for advancements in this area.

Authors:Jungin Park, Jiyoung Lee, Kwanghoon Sohn
Title: Bootstrap Your Own Views: Masked Ego-Exo Modeling for Fine-grained View-invariant Video Representations
Abstract:
View-invariant representation learning from egocentric (first-person, ego) and exocentric (third-person, exo) videos is a promising approach toward generalizing video understanding systems across multiple viewpoints. However, this area has been underexplored due to the substantial differences in perspective, motion patterns, and context between ego and exo views. In this paper, we propose a novel masked ego-exo modeling that promotes both causal temporal dynamics and cross-view alignment, called Bootstrap Your Own Views (BYOV), for fine-grained view-invariant video representation learning from unpaired ego-exo videos. We highlight the importance of capturing the compositional nature of human actions as a basis for robust cross-view understanding. Specifically, self-view masking and cross-view masking predictions are designed to learn view-invariant and powerful representations concurrently. Experimental results demonstrate that our BYOV significantly surpasses existing approaches with notable gains across all metrics in four downstream ego-exo video tasks. The code is available at https://github.com/park-jungin/byov.
Chinese: 提出的BYOV方法通过掩码式自-他视角建模,从未配对的第一人称和第三人称视频中学习细粒度的视角不变视频表征,在多项下游任务中均显著超越现有方法。
English: The proposed Bootstrap Your Own Views (BYOV) method uses masked ego-exo modeling to learn fine-grained view-invariant video representations from unpaired first- and third-person videos, achieving superior performance across multiple downstream tasks.

Authors:Andrii Yermakov, Jan Cech, Jiri Matas
Title: Unlocking the Hidden Potential of CLIP in Generalizable Deepfake Detection
Abstract:
This paper tackles the challenge of detecting partially manipulated facial deepfakes, which involve subtle alterations to specific facial features while retaining the overall context, posing a greater detection difficulty than fully synthetic faces. We leverage the Contrastive Language-Image Pre-training (CLIP) model, specifically its ViT-L/14 visual encoder, to develop a generalizable detection method that performs robustly across diverse datasets and unknown forgery techniques with minimal modifications to the original model. The proposed approach utilizes parameter-efficient fine-tuning (PEFT) techniques, such as LN-tuning, to adjust a small subset of the model's parameters, preserving CLIP's pre-trained knowledge and reducing overfitting. A tailored preprocessing pipeline optimizes the method for facial images, while regularization strategies, including L2 normalization and metric learning on a hyperspherical manifold, enhance generalization. Trained on the FaceForensics++ dataset and evaluated in a cross-dataset fashion on Celeb-DF-v2, DFDC, FFIW, and others, the proposed method achieves competitive detection accuracy comparable to or outperforming much more complex state-of-the-art techniques. This work highlights the efficacy of CLIP's visual encoder in facial deepfake detection and establishes a simple, powerful baseline for future research, advancing the field of generalizable deepfake detection. The code is available at: https://github.com/yermandy/deepfake-detection
中文: 本文提出了一种基于CLIP视觉编码器的深度伪造检测方法,通过最小化模型调整实现了跨数据集的优异检测性能,为可泛化检测领域建立了简单而强大的基准。
English: This paper introduces a robust deepfake detection method using CLIP's visual encoder with minimal modifications, achieving competitive accuracy across diverse datasets and establishing a simple yet effective baseline for generalizable detection.

Authors:Jan Kohút, Martin Dočekal, Michal Hradiš, Marek Vaško
Title: BiblioPage: A Dataset of Scanned Title Pages for Bibliographic Metadata Extraction
Abstract:
Manual digitization of bibliographic metadata is time consuming and labor intensive, especially for historical and real-world archives with highly variable formatting across documents. Despite advances in machine learning, the absence of dedicated datasets for metadata extraction hinders automation. To address this gap, we introduce BiblioPage, a dataset of scanned title pages annotated with structured bibliographic metadata. The dataset consists of approximately 2,000 monograph title pages collected from 14 Czech libraries, spanning a wide range of publication periods, typographic styles, and layout structures. Each title page is annotated with 16 bibliographic attributes, including title, contributors, and publication metadata, along with precise positional information in the form of bounding boxes. To extract structured information from this dataset, we valuated object detection models such as YOLO and DETR combined with transformer-based OCR, achieving a maximum mAP of 52 and an F1 score of 59. Additionally, we assess the performance of various visual large language models, including LlamA 3.2-Vision and GPT-4o, with the best model reaching an F1 score of 67. BiblioPage serves as a real-world benchmark for bibliographic metadata extraction, contributing to document understanding, document question answering, and document information extraction. Dataset and evaluation scripts are availible at: https://github.com/DCGM/biblio-dataset
Chinese: BiblioPage数据集通过提供2000份带注释的专著扉页,解决了书目元数据提取领域专用资源匮乏的问题,评估显示基于Transformer的OCR和视觉大模型最高F1值达67%,为文档理解任务提供了基准。
English: The BiblioPage dataset addresses the lack of dedicated resources for bibliographic metadata extraction by providing 2,000 annotated monograph title pages, with evaluations showing transformer-based OCR and visual language models achieving up to 67 F1 score, serving as a benchmark for document understanding tasks.

Authors:Yabin Wang, Zhiwu Huang, Xiaopeng Hong
Title: OpenSDI: Spotting Diffusion-Generated Images in the Open World
Abstract:
This paper identifies OpenSDI, a challenge for spotting diffusion-generated images in open-world settings. In response to this challenge, we define a new benchmark, the OpenSDI dataset (OpenSDID), which stands out from existing datasets due to its diverse use of large vision-language models that simulate open-world diffusion-based manipulations. Another outstanding feature of OpenSDID is its inclusion of both detection and localization tasks for images manipulated globally and locally by diffusion models. To address the OpenSDI challenge, we propose a Synergizing Pretrained Models (SPM) scheme to build up a mixture of foundation models. This approach exploits a collaboration mechanism with multiple pretrained foundation models to enhance generalization in the OpenSDI context, moving beyond traditional training by synergizing multiple pretrained models through prompting and attending strategies. Building on this scheme, we introduce MaskCLIP, an SPM-based model that aligns Contrastive Language-Image Pre-Training (CLIP) with Masked Autoencoder (MAE). Extensive evaluations on OpenSDID show that MaskCLIP significantly outperforms current state-of-the-art methods for the OpenSDI challenge, achieving remarkable relative improvements of 14.23% in IoU (14.11% in F1) and 2.05% in accuracy (2.38% in F1) compared to the second-best model in localization and detection tasks, respectively. Our dataset and code are available at https://github.com/iamwangyabin/OpenSDI.
中文摘要:本文提出了检测扩散生成图像的OpenSDI挑战,并创建了OpenSDID基准数据集,通过协同预训练模型(SPM)方案开发出MaskCLIP模型,在定位和检测任务中均取得了最先进的性能表现。
English Summary: This paper introduces the OpenSDI challenge for detecting diffusion-generated images and proposes the OpenSDID benchmark with a novel Synergizing Pretrained Models (SPM) scheme, leading to the MaskCLIP model that achieves state-of-the-art performance in both localization and detection tasks.

Authors:Hongcheng Gao, Jiashu Qu, Jingyi Tang, Baolong Bi, Yue Liu, Hongyu Chen, Li Liang, Li Su, Qingming Huang
Title: Exploring Hallucination of Large Multimodal Models in Video Understanding: Benchmark, Analysis and Mitigation
Abstract:
The hallucination of large multimodal models (LMMs), providing responses that appear correct but are actually incorrect, limits their reliability and applicability. This paper aims to study the hallucination problem of LMMs in video modality, which is dynamic and more challenging compared to static modalities like images and text. From this motivation, we first present a comprehensive benchmark termed HAVEN for evaluating hallucinations of LMMs in video understanding tasks. It is built upon three dimensions, i.e., hallucination causes, hallucination aspects, and question formats, resulting in 6K questions. Then, we quantitatively study 7 influential factors on hallucinations, e.g., duration time of videos, model sizes, and model reasoning, via experiments of 16 LMMs on the presented benchmark. In addition, inspired by recent thinking models like OpenAI o1, we propose a video-thinking model to mitigate the hallucinations of LMMs via supervised reasoning fine-tuning (SRFT) and direct preference optimization (TDPO)-- where SRFT enhances reasoning capabilities while TDPO reduces hallucinations in the thinking process. Extensive experiments and analyses demonstrate the effectiveness. Remarkably, it improves the baseline by 7.65% in accuracy on hallucination evaluation and reduces the bias score by 4.5%. The code and data are public at https://github.com/Hongcheng-Gao/HAVEN.
中文: 本文提出了HAVEN基准来评估大模型在视频理解中的幻觉问题,并通过监督推理微调和直接偏好优化的视频思维模型有效降低了幻觉,显著提升了准确率并减少了偏差。
English: This paper introduces HAVEN, a benchmark to evaluate hallucinations in large multimodal models for video understanding, and proposes a video-thinking model that significantly reduces hallucinations through supervised reasoning fine-tuning and direct preference optimization.

Authors:Xinxing Cheng, Tianyang Zhang, Wenqi Lu, Qingjie Meng, Alejandro F. Frangi, Jinming Duan
Title: SACB-Net: Spatial-awareness Convolutions for Medical Image Registration
Abstract:
Deep learning-based image registration methods have shown state-of-the-art performance and rapid inference speeds. Despite these advances, many existing approaches fall short in capturing spatially varying information in non-local regions of feature maps due to the reliance on spatially-shared convolution kernels. This limitation leads to suboptimal estimation of deformation fields. In this paper, we propose a 3D Spatial-Awareness Convolution Block (SACB) to enhance the spatial information within feature representations. Our SACB estimates the spatial clusters within feature maps by leveraging feature similarity and subsequently parameterizes the adaptive convolution kernels across diverse regions. This adaptive mechanism generates the convolution kernels (weights and biases) tailored to spatial variations, thereby enabling the network to effectively capture spatially varying information. Building on SACB, we introduce a pyramid flow estimator (named SACB-Net) that integrates SACBs to facilitate multi-scale flow composition, particularly addressing large deformations. Experimental results on the brain IXI and LPBA datasets as well as Abdomen CT datasets demonstrate the effectiveness of SACB and the superiority of SACB-Net over the state-of-the-art learning-based registration methods. The code is available at https://github.com/x-xc/SACB_Net .
Chinese: 本文提出3D空间感知卷积块(SACB)和SACB-Net,通过自适应生成不同区域的卷积核来解决图像配准中空间变化信息捕捉不足的问题,在脑部和腹部数据集上展现了优越性能。
English: This paper introduces a 3D Spatial-Awareness Convolution Block (SACB) and SACB-Net to address limitations in capturing spatially varying information in image registration by adaptively generating convolution kernels for different regions, demonstrating superior performance on brain and abdomen datasets.

Authors:Mohammad Daffa Robani, Paul Saves, Pramudita Satria Palar, Lavi Rizki Zuhal, oseph Morlier
Title: SMT-EX: An Explainable Surrogate Modeling Toolbox for Mixed-Variables Design Exploration
Abstract:
Surrogate models are of high interest for many engineering applications, serving as cheap-to-evaluate time-efficient approximations of black-box functions to help engineers and practitioners make decisions and understand complex systems. As such, the need for explainability methods is rising and many studies have been performed to facilitate knowledge discovery from surrogate models. To respond to these enquiries, this paper introduces SMT-EX, an enhancement of the open-source Python Surrogate Modeling Toolbox (SMT) that integrates explainability techniques into a state-of-the-art surrogate modelling framework. More precisely, SMT-EX includes three key explainability methods: Shapley Additive Explanations, Partial Dependence Plot, and Individual Conditional Expectations. A peculiar explainability dependency of SMT has been developed for such purpose that can be easily activated once the surrogate model is built, offering a user-friendly and efficient tool for swift insight extraction. The effectiveness of SMT-EX is showcased through two test cases. The first case is a 10-variable wing weight problem with purely continuous variables and the second one is a 3-variable mixed-categorical cantilever beam bending problem. Relying on SMT-EX analyses for these problems, we demonstrate its versatility in addressing a diverse range of problem characteristics. SMT-Explainability is freely available on Github: https://github.com/SMTorg/smt-explainability .
中文: 本文介绍了SMT-EX,作为Python代理建模工具箱的增强版本,它集成了三种关键可解释性方法,通过用户友好的分析工具帮助工程师理解复杂系统。
English: This paper introduces SMT-EX, an enhanced version of the Python Surrogate Modeling Toolbox that integrates three key explainability methods to help engineers understand complex systems through user-friendly analysis tools.

Authors:Jiaxin Zhang, Junjun Jiang, Youyu Chen, Kui Jiang, Xianming Liu
Title: COB-GS: Clear Object Boundaries in 3DGS Segmentation Based on Boundary-Adaptive Gaussian Splitting
Abstract:
Accurate object segmentation is crucial for high-quality scene understanding in the 3D vision domain. However, 3D segmentation based on 3D Gaussian Splatting (3DGS) struggles with accurately delineating object boundaries, as Gaussian primitives often span across object edges due to their inherent volume and the lack of semantic guidance during training. In order to tackle these challenges, we introduce Clear Object Boundaries for 3DGS Segmentation (COB-GS), which aims to improve segmentation accuracy by clearly delineating blurry boundaries of interwoven Gaussian primitives within the scene. Unlike existing approaches that remove ambiguous Gaussians and sacrifice visual quality, COB-GS, as a 3DGS refinement method, jointly optimizes semantic and visual information, allowing the two different levels to cooperate with each other effectively. Specifically, for the semantic guidance, we introduce a boundary-adaptive Gaussian splitting technique that leverages semantic gradient statistics to identify and split ambiguous Gaussians, aligning them closely with object boundaries. For the visual optimization, we rectify the degraded suboptimal texture of the 3DGS scene, particularly along the refined boundary structures. Experimental results show that COB-GS substantially improves segmentation accuracy and robustness against inaccurate masks from pre-trained model, yielding clear boundaries while preserving high visual quality. Code is available at https://github.com/ZestfulJX/COB-GS.
中文: COB-GS通过联合优化语义与视觉信息,采用边界自适应高斯分割和纹理校正技术,在保持高质量视觉效果的同时,显著提升了3D高斯溅射分割中对象边界的精确度。
English: COB-GS enhances 3D Gaussian Splatting segmentation by jointly optimizing semantic and visual information, using boundary-adaptive Gaussian splitting and texture rectification to achieve precise object boundaries and maintain high visual quality.

Authors:Muyi Bao, Shuchang Lyu, Zhaoyang Xu, Qi Zhao, Changyu Zeng, Wenpei Bai, Guangliang Cheng
Title: ASP-VMUNet: Atrous Shifted Parallel Vision Mamba U-Net for Skin Lesion Segmentation
Abstract:
Skin lesion segmentation is a critical challenge in computer vision, and it is essential to separate pathological features from healthy skin for diagnostics accurately. Traditional Convolutional Neural Networks (CNNs) are limited by narrow receptive fields, and Transformers face significant computational burdens. This paper presents a novel skin lesion segmentation framework, the Atrous Shifted Parallel Vision Mamba UNet (ASP-VMUNet), which integrates the efficient and scalable Mamba architecture to overcome limitations in traditional CNNs and computationally demanding Transformers. The framework introduces an atrous scan technique that minimizes background interference and expands the receptive field, enhancing Mamba's scanning capabilities. Additionally, the inclusion of a Parallel Vision Mamba (PVM) layer and a shift round operation optimizes feature segmentation and fosters rich inter-segment information exchange. A supplementary CNN branch with a Selective-Kernel (SK) Block further refines the segmentation by blending local and global contextual information. Tested on four benchmark datasets (ISIC16/17/18 and PH2), ASP-VMUNet demonstrates superior performance in skin lesion segmentation, validated by comprehensive ablation studies. This approach not only advances medical image segmentation but also highlights the benefits of hybrid architectures in medical imaging technology. Our code is available at https://github.com/BaoBao0926/ASP-VMUNet/tree/main.
中文: 本文提出ASP-VMUNet新型皮肤病变分割框架,通过结合高效Mamba架构与空洞扫描及并行视觉层,克服传统CNN和Transformer的局限性,在多个基准数据集上实现了卓越性能。
English: This paper introduces ASP-VMUNet, a novel skin lesion segmentation framework that combines the efficient Mamba architecture with atrous scanning and parallel vision layers to overcome limitations of CNNs and Transformers, achieving superior performance on benchmark datasets.

Authors:Kian Kai Ang, Damith C. Ranasinghe
Title: QUIC-Fuzz: An Effective Greybox Fuzzer For The QUIC Protocol
Abstract:
Network applications are routinely under attack. We consider the problem of developing an effective and efficient fuzzer for the recently ratified QUIC network protocol to uncover security vulnerabilities. QUIC offers a unified transport layer for low latency, reliable transport streams that is inherently secure, ultimately representing a complex protocol design characterised by new features and capabilities for the Internet. Fuzzing a secure transport layer protocol is not trivial. The interactive, strict, rule-based, asynchronous nature of communications with a target, the stateful nature of interactions, security mechanisms to protect communications (such as integrity checks and encryption), and inherent overheads (such as target initialisation) challenge generic network protocol fuzzers. We discuss and address the challenges pertinent to fuzzing transport layer protocols (like QUIC), developing mechanisms that enable fast, effective fuzz testing of QUIC implementations to build a prototype grey-box mutation-based fuzzer; QUIC-Fuzz. We test 6, well-maintained server-side implementations, including from Google and Alibaba with QUIC-Fuzz. The results demonstrate the fuzzer is both highly effective and generalisable. Our testing uncovered 10 new security vulnerabilities, precipitating 2 CVE assignments thus far. In code coverage, QUIC-Fuzz outperforms other existing state-of-the-art network protocol fuzzers (Fuzztruction-Net, ChatAFL, and ALFNet) with up to an 84% increase in code coverage where QUIC-Fuzz outperformed statistically significantly across all targets and with a majority of bugs only discoverable by QUIC-Fuzz. We open-source QUIC-Fuzz on GitHub.
中文: 研究人员开发了QUIC-Fuzz,这是一种专门针对QUIC协议的灰盒变异模糊测试工具,能高效发现实现中的安全漏洞,其代码覆盖率显著优于现有工具,并已发现10个新漏洞。
English: Researchers developed QUIC-Fuzz, a specialized grey-box mutation-based fuzzer that effectively uncovers security vulnerabilities in QUIC protocol implementations, outperforming existing fuzzers with significantly higher code coverage and discovering 10 new vulnerabilities.

Authors:Chenghao Li, Razvan Beuran, Nak Young Chong
Title: Quality-focused Active Adversarial Policy for Safe Grasping in Human-Robot Interaction
Abstract:
Vision-guided robot grasping methods based on Deep Neural Networks (DNNs) have achieved remarkable success in handling unknown objects, attributable to their powerful generalizability. However, these methods with this generalizability tend to recognize the human hand and its adjacent objects as graspable targets, compromising safety during Human-Robot Interaction (HRI). In this work, we propose the Quality-focused Active Adversarial Policy (QFAAP) to solve this problem. Specifically, the first part is the Adversarial Quality Patch (AQP), wherein we design the adversarial quality patch loss and leverage the grasp dataset to optimize a patch with high quality scores. Next, we construct the Projected Quality Gradient Descent (PQGD) and integrate it with the AQP, which contains only the hand region within each real-time frame, endowing the AQP with fast adaptability to the human hand shape. Through AQP and PQGD, the hand can be actively adversarial with the surrounding objects, lowering their quality scores. Therefore, further setting the quality score of the hand to zero will reduce the grasping priority of both the hand and its adjacent objects, enabling the robot to grasp other objects away from the hand without emergency stops. We conduct extensive experiments on the benchmark datasets and a cobot, showing the effectiveness of QFAAP. Our code and demo videos are available here: https://github.com/clee-jaist/QFAAP.
中文摘要:本研究提出的质量导向主动对抗策略(QFAAP)通过对抗质量补丁和投影质量梯度下降法,有效降低机器人对人类手部及邻近物体的抓取优先级,从而保障人机交互安全。
English Summary: The proposed Quality-focused Active Adversarial Policy (QFAAP) addresses safety risks in human-robot interaction by using an adversarial quality patch and gradient descent method to reduce grasping priority for human hands and nearby objects.

Authors:Yufei Cai, Hu Han, Yuxiang Wei, Shiguang Shan, Xilin Chen
Title: EfficientMT: Efficient Temporal Adaptation for Motion Transfer in Text-to-Video Diffusion Models
Abstract:
The progress on generative models has led to significant advances on text-to-video (T2V) generation, yet the motion controllability of generated videos remains limited. Existing motion transfer methods explored the motion representations of reference videos to guide generation. Nevertheless, these methods typically rely on sample-specific optimization strategy, resulting in high computational burdens. In this paper, we propose EfficientMT, a novel and efficient end-to-end framework for video motion transfer. By leveraging a small set of synthetic paired motion transfer samples, EfficientMT effectively adapts a pretrained T2V model into a general motion transfer framework that can accurately capture and reproduce diverse motion patterns. Specifically, we repurpose the backbone of the T2V model to extract temporal information from reference videos, and further propose a scaler module to distill motion-related information. Subsequently, we introduce a temporal integration mechanism that seamlessly incorporates reference motion features into the video generation process. After training on our self-collected synthetic paired samples, EfficientMT enables general video motion transfer without requiring test-time optimization. Extensive experiments demonstrate that our EfficientMT outperforms existing methods in efficiency while maintaining flexible motion controllability. Our code will be available https://github.com/PrototypeNx/EfficientMT.
中文:提出的EfficientMT框架通过利用合成配对样本和新颖模块,高效地将预训练文本到视频模型适配为通用视频运动迁移框架,无需测试时优化,在效率和运动可控性方面优于现有方法。
English: The proposed EfficientMT framework efficiently adapts a pretrained text-to-video model for general video motion transfer by leveraging synthetic paired samples and novel modules, eliminating the need for test-time optimization while outperforming existing methods in efficiency and motion controllability.

Authors:Zizhi Chen, Minghao Han, Xukun Zhang, Shuwei Ma, Tao Liu, Xing Wei, Lihua Zhang
Title: VGAT: A Cancer Survival Analysis Framework Transitioning from Generative Visual Question Answering to Genomic Reconstruction
Abstract:
Multimodal learning combining pathology images and genomic sequences enhances cancer survival analysis but faces clinical implementation barriers due to limited access to genomic sequencing in under-resourced regions. To enable survival prediction using only whole-slide images (WSI), we propose the Visual-Genomic Answering-Guided Transformer (VGAT), a framework integrating Visual Question Answering (VQA) techniques for genomic modality reconstruction. By adapting VQA's text feature extraction approach, we derive stable genomic representations that circumvent dimensionality challenges in raw genomic data. Simultaneously, a cluster-based visual prompt module selectively enhances discriminative WSI patches, addressing noise from unfiltered image regions. Evaluated across five TCGA datasets, VGAT outperforms existing WSI-only methods, demonstrating the viability of genomic-informed inference without sequencing. This approach bridges multimodal research and clinical feasibility in resource-constrained settings. The code link is https://github.com/CZZZZZZZZZZZZZZZZZ/VGAT.
中文总结:VGAT框架通过视觉问答技术重构基因组信息并增强病理图像的判别区域,仅利用全切片图像即可实现精准的癌症生存预测,为资源受限环境提供了无需基因组测序的临床解决方案。
English Summary: The VGAT framework enables accurate cancer survival prediction using only pathology images by reconstructing genomic information through visual question answering techniques and enhancing discriminative image regions, eliminating the need for genomic sequencing in resource-limited settings.

Authors:Farzad Beizaee, Gregory A. Lodygensky, Christian Desrosiers, Jose Dolz
Title: Correcting Deviations from Normality: A Reformulated Diffusion Model for Multi-Class Unsupervised Anomaly Detection
Abstract:
Recent advances in diffusion models have spurred research into their application for Reconstruction-based unsupervised anomaly detection. However, these methods may struggle with maintaining structural integrity and recovering the anomaly-free content of abnormal regions, especially in multi-class scenarios. Furthermore, diffusion models are inherently designed to generate images from pure noise and struggle to selectively alter anomalous regions of an image while preserving normal ones. This leads to potential degradation of normal regions during reconstruction, hampering the effectiveness of anomaly detection. This paper introduces a reformulation of the standard diffusion model geared toward selective region alteration, allowing the accurate identification of anomalies. By modeling anomalies as noise in the latent space, our proposed Deviation correction diffusion (DeCo-Diff) model preserves the normal regions and encourages transformations exclusively on anomalous areas. This selective approach enhances the reconstruction quality, facilitating effective unsupervised detection and localization of anomaly regions. Comprehensive evaluations demonstrate the superiority of our method in accurately identifying and localizing anomalies in complex images, with pixel-level AUPRC improvements of 11-14% over state-of-the-art models on well known anomaly detection datasets. The code is available at https://github.com/farzad-bz/DeCo-Diff
中文摘要:本文提出的DeCo-Diff模型通过将异常建模为潜在空间噪声,实现了对异常区域的精准选择性重建,在无监督异常检测与定位任务中取得了显著性能提升。
English Summary: This paper introduces DeCo-Diff, a novel diffusion model that selectively reconstructs only anomalous image regions by treating anomalies as latent noise, achieving significant improvements in unsupervised anomaly detection and localization.

Authors:Yuxuan Hu, Xiaodong Chen, Cuiping Li, Hong Chen, Jing Zhang
Title: QUAD: Quantization and Parameter-Efficient Tuning of LLM with Activation Decomposition
Abstract:
Large Language Models (LLMs) excel in diverse applications but suffer inefficiency due to massive scale. While quantization reduces computational costs, existing methods degrade accuracy in medium-sized LLMs (e.g., Llama-3-8B) due to activation outliers. To address this, we propose QUAD (Quantization with Activation Decomposition), a framework leveraging Singular Value Decomposition (SVD) to suppress activation outliers for effective 4-bit quantization. QUAD estimates activation singular vectors offline using calibration data to construct an orthogonal transformation matrix P, shifting outliers to additional dimensions in full precision while quantizing rest components to 4-bit. Additionally, QUAD enables parameter-efficient fine-tuning via adaptable full-precision outlier weights, narrowing the accuracy gap between quantized and full-precision models. Experiments demonstrate that QUAD achieves 94% ~ 96% accuracy under W4A4 quantization and 98% accuracy with W4A4/A8 and parameter-efficient fine-tuning for Llama-3 and Qwen-2.5 models. Our code is available at \href{https://github.com/hyx1999/Quad}{repository}.
中文摘要:QUAD框架利用奇异值分解抑制激活异常值,通过参数高效微调实现中型大语言模型的高效4位量化,同时保持高精度。
English Summary: The QUAD framework uses Singular Value Decomposition to suppress activation outliers, enabling efficient 4-bit quantization of medium-sized LLMs while maintaining high accuracy through parameter-efficient fine-tuning.

Authors:Yongting Hu, Yuxin Lin, Chengliang Liu, Xiaoling Luo, Xiaoyan Dou, Qihao Xu, Yong Xu
Title: Wavelet-based Global-Local Interaction Network with Cross-Attention for Multi-View Diabetic Retinopathy Detection
Abstract:
Multi-view diabetic retinopathy (DR) detection has recently emerged as a promising method to address the issue of incomplete lesions faced by single-view DR. However, it is still challenging due to the variable sizes and scattered locations of lesions. Furthermore, existing multi-view DR methods typically merge multiple views without considering the correlations and redundancies of lesion information across them. Therefore, we propose a novel method to overcome the challenges of difficult lesion information learning and inadequate multi-view fusion. Specifically, we introduce a two-branch network to obtain both local lesion features and their global dependencies. The high-frequency component of the wavelet transform is used to exploit lesion edge information, which is then enhanced by global semantic to facilitate difficult lesion learning. Additionally, we present a cross-view fusion module to improve multi-view fusion and reduce redundancy. Experimental results on large public datasets demonstrate the effectiveness of our method. The code is open sourced on https://github.com/HuYongting/WGLIN.
中文: 本研究提出了一种新颖的双分支网络,结合小波变换和跨视图融合技术,以增强多视角糖尿病视网膜病变检测中的病灶特征学习并减少冗余,在公开数据集上验证了其优越性能。
English: This study introduces a novel two-branch network with wavelet transform and cross-view fusion to enhance lesion feature learning and reduce redundancy in multi-view diabetic retinopathy detection, demonstrating superior performance on public datasets.

Authors:Weizhi Chen, Jingbo Chen, Yupeng Deng, Jiansheng Chen, Yuman Feng, Zhihao Xi, Diyou Liu, Kai Li, Yu Meng
Title: LRSCLIP: A Vision-Language Foundation Model for Aligning Remote Sensing Image with Longer Text
Abstract:
This study addresses the technical bottlenecks in handling long text and the "hallucination" issue caused by insufficient short text information in remote sensing vision-language foundation models (VLFM). We propose a novel vision-language foundation model, LRSCLIP, and a multimodal dataset, LRS2M. The main contributions are as follows: (1) By integrating multi-source remote sensing data and adopting a large language model labeling strategy, we construct the LRS2M dataset, which contains 2 million image-text pairs, providing both short and long texts for the first time, thus solving the problem of semantic granularity limitations in existing datasets; (2) The design of the LRSCLIP architecture based on Long-CLIP's KPS module, which extends CLIP's text processing capacity and achieves fine-grained cross-modal feature alignment through a dual-text loss weighting mechanism. Experimental results show that LRSCLIP improves retrieval accuracy by 10\%-20\% over the Long-CLIP baseline in the zero-shot long-text cross-modal retrieval task. For the zero-shot short-text cross-modal retrieval task, LRSCLIP achieves improvements over the current best model, GeoRSCLIP, with increases of 0.17\%, 0.67\%, and 0.92\% in Text to Image R@1, Image to Text R@1, and mR on RSITMD, respectively, and 0.04\%, 2.93\%, and 1.28\% on RSICD. In the zero-shot image classification task (average accuracy=75.75\%) and semantic localization task (Rmi=0.7653), LRSCLIP achieves state-of-the-art performance. These results validate the dual advantages of fine-grained semantic understanding and global feature matching in LRSCLIP. This work provides a new benchmark model and data support for remote sensing multimodal learning. The related code has been open source and is available at https://github.com/MitsuiChen14/LRSCLIP.
中文: 本研究提出LRSCLIP新型视觉语言基础模型和LRS2M多模态数据集,解决了遥感领域长文本处理瓶颈和语义粒度限制问题,在跨模态检索与分类任务中实现了最优性能。
English: This study introduces LRSCLIP, a novel vision-language foundation model, and the LRS2M dataset to overcome long text processing bottlenecks and reduce hallucinations in remote sensing, achieving state-of-the-art performance in cross-modal retrieval and classification tasks.

Authors:Zhuoran Zhao, Linlin Yang, Pengzhan Sun, Pan Hui, Angela Yao
Title: Analyzing the Synthetic-to-Real Domain Gap in 3D Hand Pose Estimation
Abstract:
Recent synthetic 3D human datasets for the face, body, and hands have pushed the limits on photorealism. Face recognition and body pose estimation have achieved state-of-the-art performance using synthetic training data alone, but for the hand, there is still a large synthetic-to-real gap. This paper presents the first systematic study of the synthetic-to-real gap of 3D hand pose estimation. We analyze the gap and identify key components such as the forearm, image frequency statistics, hand pose, and object occlusions. To facilitate our analysis, we propose a data synthesis pipeline to synthesize high-quality data. We demonstrate that synthetic hand data can achieve the same level of accuracy as real data when integrating our identified components, paving the path to use synthetic data alone for hand pose estimation. Code and data are available at: https://github.com/delaprada/HandSynthesis.git.
Chinese: 本文系统研究了3D手部姿态估计中的合成与真实数据差距,识别关键因素并提出数据合成流程,使合成数据达到真实数据精度,为仅使用合成数据铺平道路。
English: This paper systematically investigates the synthetic-to-real gap in 3D hand pose estimation, identifying key factors and proposing a data synthesis pipeline that enables synthetic data to match real data accuracy, paving the way for using synthetic data alone.

Authors:Yang Ren, Hai Jiang, Menglong Yang, Wei Li, Shuaicheng Liu
Title: ISPDiffuser: Learning RAW-to-sRGB Mappings with Texture-Aware Diffusion Models and Histogram-Guided Color Consistency
Abstract:
RAW-to-sRGB mapping, or the simulation of the traditional camera image signal processor (ISP), aims to generate DSLR-quality sRGB images from raw data captured by smartphone sensors. Despite achieving comparable results to sophisticated handcrafted camera ISP solutions, existing learning-based methods still struggle with detail disparity and color distortion. In this paper, we present ISPDiffuser, a diffusion-based decoupled framework that separates the RAW-to-sRGB mapping into detail reconstruction in grayscale space and color consistency mapping from grayscale to sRGB. Specifically, we propose a texture-aware diffusion model that leverages the generative ability of diffusion models to focus on local detail recovery, in which a texture enrichment loss is further proposed to prompt the diffusion model to generate more intricate texture details. Subsequently, we introduce a histogram-guided color consistency module that utilizes color histogram as guidance to learn precise color information for grayscale to sRGB color consistency mapping, with a color consistency loss designed to constrain the learned color information. Extensive experimental results show that the proposed ISPDiffuser outperforms state-of-the-art competitors both quantitatively and visually. The code is available at https://github.com/RenYangSCU/ISPDiffuser.
中文: 本文提出ISPDiffuser框架,通过将RAW到sRGB转换解耦为灰度空间细节重建和色彩一致性映射,采用纹理感知扩散模型和直方图引导的色彩模块,在细节恢复和色彩保真度上均优于现有方法。
English: This paper introduces ISPDiffuser, a diffusion-based framework that decouples RAW-to-sRGB conversion into grayscale detail reconstruction and color mapping, utilizing a texture-aware diffusion model and histogram-guided color module to outperform existing methods in both detail recovery and color accuracy.

Authors:Songyi Gao, Zuolin Tu, Rong-Jun Qin, Yi-Hao Sun, Xiong-Hui Chen, Yang Yu
Title: NeoRL-2: Near Real-World Benchmarks for Offline Reinforcement Learning with Extended Realistic Scenarios
Abstract:
Offline reinforcement learning (RL) aims to learn from historical data without requiring (costly) access to the environment. To facilitate offline RL research, we previously introduced NeoRL, which highlighted that datasets from real-world tasks are often conservative and limited. With years of experience applying offline RL to various domains, we have identified additional real-world challenges. These include extremely conservative data distributions produced by deployed control systems, delayed action effects caused by high-latency transitions, external factors arising from the uncontrollable variance of transitions, and global safety constraints that are difficult to evaluate during the decision-making process. These challenges are underrepresented in previous benchmarks but frequently occur in real-world tasks. To address this, we constructed the extended Near Real-World Offline RL Benchmark (NeoRL-2), which consists of 7 datasets from 7 simulated tasks along with their corresponding evaluation simulators. Benchmarking results from state-of-the-art offline RL approaches demonstrate that current methods often struggle to outperform the data-collection behavior policy, highlighting the need for more effective methods. We hope NeoRL-2 will accelerate the development of reinforcement learning algorithms for real-world applications. The benchmark project page is available at https://github.com/polixir/NeoRL2.
中文摘要:NeoRL-2作为扩展基准,针对现实世界中保守数据分布和延迟动作效应等挑战而构建,现有方法常难以超越原始行为策略,旨在推动实用强化学习算法的发展。
English Summary: NeoRL-2 is an extended benchmark addressing real-world offline RL challenges like conservative data distributions and delayed action effects, where current methods often fail to surpass the original behavior policies, aiming to advance practical algorithm development.

Authors:Ruiyi Wang, Yushuo Zheng, Zicheng Zhang, Chunyi Li, Shuaicheng Liu, Guangtao Zhai, Xiaohong Liu
Title: Learning Hazing to Dehazing: Towards Realistic Haze Generation for Real-World Image Dehazing
Abstract:
Existing real-world image dehazing methods primarily attempt to fine-tune pre-trained models or adapt their inference procedures, thus heavily relying on the pre-trained models and associated training data. Moreover, restoring heavily distorted information under dense haze requires generative diffusion models, whose potential in dehazing remains underutilized partly due to their lengthy sampling processes. To address these limitations, we introduce a novel hazing-dehazing pipeline consisting of a Realistic Hazy Image Generation framework (HazeGen) and a Diffusion-based Dehazing framework (DiffDehaze). Specifically, HazeGen harnesses robust generative diffusion priors of real-world hazy images embedded in a pre-trained text-to-image diffusion model. By employing specialized hybrid training and blended sampling strategies, HazeGen produces realistic and diverse hazy images as high-quality training data for DiffDehaze. To alleviate the inefficiency and fidelity concerns associated with diffusion-based methods, DiffDehaze adopts an Accelerated Fidelity-Preserving Sampling process (AccSamp). The core of AccSamp is the Tiled Statistical Alignment Operation (AlignOp), which can provide a clean and faithful dehazing estimate within a small fraction of sampling steps to reduce complexity and enable effective fidelity guidance. Extensive experiments demonstrate the superior dehazing performance and visual quality of our approach over existing methods. The code is available at https://github.com/ruiyi-w/Learning-Hazing-to-Dehazing.
中文摘要:作者提出了一种新颖的雾化-去雾化流程,包含真实雾图生成框架和加速扩散去雾方法,在去雾性能和视觉质量上均优于现有方法。
English Summary: The authors propose a novel hazing-dehazing pipeline featuring a realistic hazy image generation framework and an accelerated diffusion-based dehazing method that outperforms existing approaches in both performance and visual quality.

Authors:Zhen Zhang, Ignavier Ng, Dong Gong, Yuhang Liu, Mingming Gong, Biwei Huang, Kun Zhang, Anton van den Hengel, Javen Qinfeng Shi
Title: Analytic DAG Constraints for Differentiable DAG Learning
Abstract:
Recovering the underlying Directed Acyclic Graph (DAG) structures from observational data presents a formidable challenge, partly due to the combinatorial nature of the DAG-constrained optimization problem. Recently, researchers have identified gradient vanishing as one of the primary obstacles in differentiable DAG learning and have proposed several DAG constraints to mitigate this issue. By developing the necessary theory to establish a connection between analytic functions and DAG constraints, we demonstrate that analytic functions from the set $\{f(x) = c_0 + \sum_{i=1}^{\infty}c_ix^i | \forall i > 0, c_i > 0; r = \lim_{i\rightarrow \infty}c_{i}/c_{i+1} > 0\}$ can be employed to formulate effective DAG constraints. Furthermore, we establish that this set of functions is closed under several functional operators, including differentiation, summation, and multiplication. Consequently, these operators can be leveraged to create novel DAG constraints based on existing ones. Using these properties, we design a series of DAG constraints and develop an efficient algorithm to evaluate them. Experiments in various settings demonstrate that our DAG constraints outperform previous state-of-the-art comparators. Our implementation is available at https://github.com/zzhang1987/AnalyticDAGLearning.
中文: 本研究提出了一类新的解析函数来构建有效的有向无环图约束,以解决可微分DAG学习中的梯度消失问题,并通过实验验证和高效评估算法证明了其优于现有方法的性能。
English: This study introduces a novel class of analytic functions to formulate effective DAG constraints that address gradient vanishing in differentiable DAG learning, demonstrating superior performance over existing methods through experimental validation and an efficient evaluation algorithm.

Authors:Rong Wang, Fabian Prada, Ziyan Wang, Zhongshi Jiang, Chengxiang Yin, Junxuan Li, Shunsuke Saito, Igor Santesteban, Javier Romero, Rohan Joshi, Hongdong Li, Jason Saragih, Yaser Sheikh
Title: FRESA: Feedforward Reconstruction of Personalized Skinned Avatars from Few Images
Abstract:
We present a novel method for reconstructing personalized 3D human avatars with realistic animation from only a few images. Due to the large variations in body shapes, poses, and cloth types, existing methods mostly require hours of per-subject optimization during inference, which limits their practical applications. In contrast, we learn a universal prior from over a thousand clothed humans to achieve instant feedforward generation and zero-shot generalization. Specifically, instead of rigging the avatar with shared skinning weights, we jointly infer personalized avatar shape, skinning weights, and pose-dependent deformations, which effectively improves overall geometric fidelity and reduces deformation artifacts. Moreover, to normalize pose variations and resolve coupled ambiguity between canonical shapes and skinning weights, we design a 3D canonicalization process to produce pixel-aligned initial conditions, which helps to reconstruct fine-grained geometric details. We then propose a multi-frame feature aggregation to robustly reduce artifacts introduced in canonicalization and fuse a plausible avatar preserving person-specific identities. Finally, we train the model in an end-to-end framework on a large-scale capture dataset, which contains diverse human subjects paired with high-quality 3D scans. Extensive experiments show that our method generates more authentic reconstruction and animation than state-of-the-arts, and can be directly generalized to inputs from casually taken phone photos. Project page and code is available at https://github.com/rongakowang/FRESA.
中文: 本文提出了一种从少量图像重建个性化3D人体化身的创新方法,通过从多样化人体数据学习通用先验,实现即时生成和零样本泛化,并通过联合推断化身形状、蒙皮权重和变形来提高几何保真度并减少伪影。
English: This paper introduces a novel method for reconstructing personalized 3D human avatars from few images using a universal prior learned from diverse human data, enabling instant generation and zero-shot generalization while improving geometric fidelity and reducing artifacts through joint inference of avatar shape, skinning weights, and deformations.

Authors:Sara Al-Emadi, Yin Yang, Ferda Ofli
Title: Benchmarking Object Detectors under Real-World Distribution Shifts in Satellite Imagery
Abstract:
Object detectors have achieved remarkable performance in many applications; however, these deep learning models are typically designed under the i.i.d. assumption, meaning they are trained and evaluated on data sampled from the same (source) distribution. In real-world deployment, however, target distributions often differ from source data, leading to substantial performance degradation. Domain Generalisation (DG) seeks to bridge this gap by enabling models to generalise to Out-Of-Distribution (OOD) data without access to target distributions during training, enhancing robustness to unseen conditions. In this work, we examine the generalisability and robustness of state-of-the-art object detectors under real-world distribution shifts, focusing particularly on spatial domain shifts. Despite the need, a standardised benchmark dataset specifically designed for assessing object detection under realistic DG scenarios is currently lacking. To address this, we introduce Real-World Distribution Shifts (RWDS), a suite of three novel DG benchmarking datasets that focus on humanitarian and climate change applications. These datasets enable the investigation of domain shifts across (i) climate zones and (ii) various disasters and geographic regions. To our knowledge, these are the first DG benchmarking datasets tailored for object detection in real-world, high-impact contexts. We aim for these datasets to serve as valuable resources for evaluating the robustness and generalisation of future object detection models. Our datasets and code are available at https://github.com/RWGAI/RWDS.
中文: 本文提出了Real-World Distribution Shifts (RWDS)基准数据集,用于评估物体检测器在空间域偏移下的泛化能力,填补了在现实世界人道主义和气候变化场景中缺乏标准化领域泛化评估资源的空白。
English: This paper introduces the Real-World Distribution Shifts (RWDS) benchmark datasets to evaluate object detectors' generalization under spatial domain shifts, addressing the lack of standardized resources for real-world domain generalization scenarios in humanitarian and climate change contexts.

Authors:Maria Larchenko, Alexander Lobashev, Dmitry Guskov, Vladimir Vladimirovich Palyulin
Title: Color Transfer with Modulated Flows
Abstract:
In this work, we introduce Modulated Flows (ModFlows), a novel approach for color transfer between images based on rectified flows. The primary goal of the color transfer is to adjust the colors of a target image to match the color distribution of a reference image. Our technique is based on optimal transport and executes color transfer as an invertible transformation within the RGB color space. The ModFlows utilizes the bijective property of flows, enabling us to introduce a common intermediate color distribution and build a dataset of rectified flows. We train an encoder on this dataset to predict the weights of a rectified model for new images. After training on a set of optimal transport plans, our approach can generate plans for new pairs of distributions without additional fine-tuning. We additionally show that the trained encoder provides an image embedding, associated only with its color style. The presented method is capable of processing 4K images and achieves the state-of-the-art performance in terms of content and style similarity. Our source code is available at https://github.com/maria-larchenko/modflows
中文: 本文提出调制流(ModFlows),一种基于整流流和最优传输的可逆色彩迁移方法,能匹配图像间的色彩分布,无需对新图像对微调即可实现最先进的性能。
English: This paper presents Modulated Flows (ModFlows), an invertible color transfer method using rectified flows and optimal transport to match color distributions between images, achieving state-of-the-art performance without fine-tuning for new image pairs.

Authors:Alexander Lobashev, Maria Larchenko, Dmitry Guskov
Title: Color Conditional Generation with Sliced Wasserstein Guidance
Abstract:
We propose SW-Guidance, a training-free approach for image generation conditioned on the color distribution of a reference image. While it is possible to generate an image with fixed colors by first creating an image from a text prompt and then applying a color style transfer method, this approach often results in semantically meaningless colors in the generated image. Our method solves this problem by modifying the sampling process of a diffusion model to incorporate the differentiable Sliced 1-Wasserstein distance between the color distribution of the generated image and the reference palette. Our method outperforms state-of-the-art techniques for color-conditional generation in terms of color similarity to the reference, producing images that not only match the reference colors but also maintain semantic coherence with the original text prompt. Our source code is available at https://github.com/alobashev/sw-guidance/.
中文: SW-Guidance是一种无需训练的生成方法,通过改进扩散模型的采样过程来匹配参考图像的色彩分布,在保持语义连贯性的同时实现了比现有技术更优的色彩还原效果。
English: SW-Guidance is a training-free method that enhances image generation by aligning the color distribution of generated images with a reference palette through modified diffusion sampling, achieving superior color fidelity and semantic coherence compared to existing techniques.

Authors:Lingyan Ran, Lidong Wang, Guangcong Wang, Peng Wang, Yanning Zhang
Title: DiffV2IR: Visible-to-Infrared Diffusion Model via Vision-Language Understanding
Abstract:
The task of translating visible-to-infrared images (V2IR) is inherently challenging due to three main obstacles: 1) achieving semantic-aware translation, 2) managing the diverse wavelength spectrum in infrared imagery, and 3) the scarcity of comprehensive infrared datasets. Current leading methods tend to treat V2IR as a conventional image-to-image synthesis challenge, often overlooking these specific issues. To address this, we introduce DiffV2IR, a novel framework for image translation comprising two key elements: a Progressive Learning Module (PLM) and a Vision-Language Understanding Module (VLUM). PLM features an adaptive diffusion model architecture that leverages multi-stage knowledge learning to infrared transition from full-range to target wavelength. To improve V2IR translation, VLUM incorporates unified Vision-Language Understanding. We also collected a large infrared dataset, IR-500K, which includes 500,000 infrared images compiled by various scenes and objects under various environmental conditions. Through the combination of PLM, VLUM, and the extensive IR-500K dataset, DiffV2IR markedly improves the performance of V2IR. Experiments validate DiffV2IR's excellence in producing high-quality translations, establishing its efficacy and broad applicability. The code, dataset, and DiffV2IR model will be available at https://github.com/LidongWang-26/DiffV2IR.
中文:DiffV2IR框架通过渐进学习模块和视觉语言理解模块解决了可见光到红外图像转换的难题,并借助IR-500K数据集显著提升了转换质量和应用范围。
English: The DiffV2IR framework introduces a Progressive Learning Module and Vision-Language Understanding Module to overcome challenges in visible-to-infrared image translation, supported by the new IR-500K dataset, significantly enhancing translation quality and applicability.

Authors:Haoliang Shang, Hanyu Wu, Guangyao Zhai, Boyang Sun, Fangjinhua Wang, Federico Tombari, Marc Pollefeys
Title: SG-Tailor: Inter-Object Commonsense Relationship Reasoning for Scene Graph Manipulation
Abstract:
Scene graphs capture complex relationships among objects, serving as strong priors for content generation and manipulation. Yet, reasonably manipulating scene graphs -- whether by adding nodes or modifying edges -- remains a challenging and untouched task. Tasks such as adding a node to the graph or reasoning about a node's relationships with all others are computationally intractable, as even a single edge modification can trigger conflicts due to the intricate interdependencies within the graph. To address these challenges, we introduce SG-Tailor, an autoregressive model that predicts the conflict-free relationship between any two nodes. SG-Tailor not only infers inter-object relationships, including generating commonsense edges for newly added nodes but also resolves conflicts arising from edge modifications to produce coherent, manipulated graphs for downstream tasks. For node addition, the model queries the target node and other nodes from the graph to predict the appropriate relationships. For edge modification, SG-Tailor employs a Cut-And-Stitch strategy to solve the conflicts and globally adjust the graph. Extensive experiments demonstrate that SG-Tailor outperforms competing methods by a large margin and can be seamlessly integrated as a plug-in module for scene generation and robotic manipulation tasks.
中文: SG-Tailor是一种自回归模型,能预测场景图中节点间无冲突的关系,通过节点添加和边修改实现连贯的图操作,并在性能上大幅超越现有方法。
English: SG-Tailor is an autoregressive model that predicts conflict-free relationships between nodes in scene graphs, enabling coherent graph manipulation through node addition and edge modification while outperforming other methods significantly.

Authors:Ziyue Wang, Junde Wu, Linghan Cai, Chang Han Low, Xihong Yang, Qiaxuan Li, Yueming Jin
Title: MedAgent-Pro: Towards Evidence-based Multi-modal Medical Diagnosis via Reasoning Agentic Workflow
Abstract:
In modern medicine, clinical diagnosis relies on the comprehensive analysis of primarily textual and visual data, drawing on medical expertise to ensure systematic and rigorous reasoning. Recent advances in large Vision-Language Models (VLMs) and agent-based methods hold great potential for medical diagnosis, thanks to the ability to effectively integrate multi-modal patient data. However, they often provide direct answers and draw empirical-driven conclusions without quantitative analysis, which reduces their reliability and clinical usability. We propose MedAgent-Pro, a new agentic reasoning paradigm that follows the diagnosis principle in modern medicine, to decouple the process into sequential components for step-by-step, evidence-based reasoning. Our MedAgent-Pro workflow presents a hierarchical diagnostic structure to mirror this principle, consisting of disease-level standardized plan generation and patient-level personalized step-by-step reasoning. To support disease-level planning, an RAG-based agent is designed to retrieve medical guidelines to ensure alignment with clinical standards. For patient-level reasoning, we propose to integrate professional tools such as visual models to enable quantitative assessments. Meanwhile, we propose to verify the reliability of each step to achieve evidence-based diagnosis, enforcing rigorous logical reasoning and a well-founded conclusion. Extensive experiments across a wide range of anatomical regions, imaging modalities, and diseases demonstrate the superiority of MedAgent-Pro to mainstream VLMs, agentic systems and state-of-the-art expert models. Ablation studies and human evaluation by clinical experts further validate its robustness and clinical relevance. Code is available at https://github.com/jinlab-imvr/MedAgent-Pro.
中文摘要:MedAgent-Pro提出了一种新型智能体推理范式,通过将医疗诊断解构为顺序组件实现循证推理,融合临床指南与量化工具提升诊断可靠性,在多样化医疗场景中优于现有主流模型。
English Summary: MedAgent-Pro introduces a novel agentic reasoning framework that decouples medical diagnosis into sequential components for evidence-based reasoning, integrating clinical guidelines and quantitative tools to enhance reliability and outperform existing models across diverse medical scenarios.

Authors:Ruixiao Dong, Mengde Xu, Zigang Geng, Li Li, Han Hu, Shuyang Gu
Title: Equivariant Image Modeling
Abstract:
Current generative models, such as autoregressive and diffusion approaches, decompose high-dimensional data distribution learning into a series of simpler subtasks. However, inherent conflicts arise during the joint optimization of these subtasks, and existing solutions fail to resolve such conflicts without sacrificing efficiency or scalability. We propose a novel equivariant image modeling framework that inherently aligns optimization targets across subtasks by leveraging the translation invariance of natural visual signals. Our method introduces (1) column-wise tokenization which enhances translational symmetry along the horizontal axis, and (2) windowed causal attention which enforces consistent contextual relationships across positions. Evaluated on class-conditioned ImageNet generation at 256x256 resolution, our approach achieves performance comparable to state-of-the-art AR models while using fewer computational resources. Systematic analysis demonstrates that enhanced equivariance reduces inter-task conflicts, significantly improving zero-shot generalization and enabling ultra-long image synthesis. This work establishes the first framework for task-aligned decomposition in generative modeling, offering insights into efficient parameter sharing and conflict-free optimization. The code and models are publicly available at https://github.com/drx-code/EquivariantModeling.
中文摘要:本文提出了一种等变图像建模框架,通过利用平移不变性解决生成模型中子任务的优化冲突,在ImageNet图像生成任务上以更少计算资源实现了与先进自回归模型相当的性能。
English Summary: This paper introduces an equivariant image modeling framework that resolves optimization conflicts in generative models by leveraging translational symmetry, achieving state-of-the-art performance on ImageNet generation with greater computational efficiency.

Authors:Shenyuan Gao, Siyuan Zhou, Yilun Du, Jun Zhang, Chuang Gan
Title: AdaWorld: Learning Adaptable World Models with Latent Actions
Abstract:
World models aim to learn action-controlled future prediction and have proven essential for the development of intelligent agents. However, most existing world models rely heavily on substantial action-labeled data and costly training, making it challenging to adapt to novel environments with heterogeneous actions through limited interactions. This limitation can hinder their applicability across broader domains. To overcome this limitation, we propose AdaWorld, an innovative world model learning approach that enables efficient adaptation. The key idea is to incorporate action information during the pretraining of world models. This is achieved by extracting latent actions from videos in a self-supervised manner, capturing the most critical transitions between frames. We then develop an autoregressive world model that conditions on these latent actions. This learning paradigm enables highly adaptable world models, facilitating efficient transfer and learning of new actions even with limited interactions and finetuning. Our comprehensive experiments across multiple environments demonstrate that AdaWorld achieves superior performance in both simulation quality and visual planning.
中文摘要:AdaWorld是一种创新的世界模型,通过自监督方式从视频中提取潜在动作,能够在有限数据和微调下高效适应新环境。
English Summary: AdaWorld is an innovative world model that learns latent actions from videos in a self-supervised way, enabling efficient adaptation to new environments with limited data and fine-tuning.

Authors:Yitong Chen, Lingchen Meng, Wujian Peng, Zuxuan Wu, Yu-Gang Jiang
Title: CoMP: Continual Multimodal Pre-training for Vision Foundation Models
Abstract:
Pre-trained Vision Foundation Models (VFMs) provide strong visual representations for a wide range of applications. In this paper, we continually pre-train prevailing VFMs in a multimodal manner such that they can effortlessly process visual inputs of varying sizes and produce visual representations that are more aligned with language representations, regardless of their original pre-training process. To this end, we introduce CoMP, a carefully designed multimodal pre-training pipeline. CoMP uses a Continual Rotary Position Embedding to accommodate visual inputs with different resolutions, and an Alignment Loss between visual and textual features for better cross-modal alignment. After continual pre-training, leading VFMs like DINOv2, SigLIP and AIMv2 achieve remarkable improvements not only in multimodal understanding tasks but also in generic classification and segmentation tasks. Remarkably, CoMP-AIMv2 achieves scores of 64.9 on ChartQA with a 0.5B LLM, while maintaining an 87.3% accuracy on ImageNet-1K and a 51.8 mIoU on ADE20K under frozen chunk evaluation.
Chinese: 本文提出CoMP多模态持续预训练框架,通过持续旋转位置嵌入和跨模态对齐损失增强视觉基础模型,使其能够灵活处理不同分辨率输入并实现视觉-语言表征对齐,在多类任务中取得显著性能提升。
English: This paper introduces CoMP, a multimodal continual pre-training pipeline that enhances Vision Foundation Models by enabling flexible input resolutions and better visual-language alignment, leading to significant performance gains across various tasks.

Authors:Chi-Chih Chang, Chien-Yu Lin, Yash Akhauri, Wei-Cheng Lin, Kai-Chiang Wu, Luis Ceze, Mohamed S. Abdelfattah
Title: xKV: Cross-Layer SVD for KV-Cache Compression
Abstract:
Large Language Models (LLMs) with long context windows enable powerful applications but come at the cost of high memory consumption to store the Key and Value states (KV-Cache). Recent studies attempted to merge KV-cache from multiple layers into shared representations, yet these approaches either require expensive pretraining or rely on assumptions of high per-token cosine similarity across layers which generally does not hold in practice. We find that the dominant singular vectors are remarkably well-aligned across multiple layers of the KV-Cache. Exploiting this insight, we propose xKV, a simple post-training method that applies Singular Value Decomposition (SVD) on the KV-Cache of grouped layers. xKV consolidates the KV-Cache of multiple layers into a shared low-rank subspace, significantly reducing KV-Cache sizes. Through extensive evaluations on the RULER long-context benchmark with widely-used LLMs (e.g., Llama-3.1 and Qwen2.5), xKV achieves up to 6.8x higher compression rates than state-of-the-art inter-layer technique while improving accuracy by 2.7%. Moreover, xKV is compatible with the emerging Multi-Head Latent Attention (MLA) (e.g., DeepSeek-Coder-V2), yielding a notable 3x compression rates on coding tasks without performance degradation. These results highlight xKV's strong capability and versatility in addressing memory bottlenecks for long-context LLM inference. Our code is publicly available at: https://github.com/abdelfattah-lab/xKV.
Chinese: xKV是一种后训练方法,通过奇异值分解将多层KV缓存压缩至共享低秩子空间,在长上下文基准测试中实现6.8倍压缩率提升,同时准确率提高2.7%。
English: xKV is a post-training method that uses Singular Value Decomposition to compress the KV-Cache across multiple layers, achieving up to 6.8x higher compression than existing techniques while improving accuracy by 2.7% on long-context benchmarks.

Authors:Zhexuan Wang, Yutong Wang, Xuebo Liu, Liang Ding, Miao Zhang, Jie Liu, Min Zhang
Title: AgentDropout: Dynamic Agent Elimination for Token-Efficient and High-Performance LLM-Based Multi-Agent Collaboration
Abstract:
Multi-agent systems (MAS) based on large language models (LLMs) have demonstrated significant potential in collaborative problem-solving. However, they still face substantial challenges of low communication efficiency and suboptimal task performance, making the careful design of the agents' communication topologies particularly important. Inspired by the management theory that roles in an efficient team are often dynamically adjusted, we propose AgentDropout, which identifies redundant agents and communication across different communication rounds by optimizing the adjacency matrices of the communication graphs and eliminates them to enhance both token efficiency and task performance. Compared to state-of-the-art methods, AgentDropout achieves an average reduction of 21.6% in prompt token consumption and 18.4% in completion token consumption, along with a performance improvement of 1.14 on the tasks. Furthermore, the extended experiments demonstrate that AgentDropout achieves notable domain transferability and structure robustness, revealing its reliability and effectiveness. We release our code at https://github.com/wangzx1219/AgentDropout.
中文:AgentDropout通过动态剔除冗余智能体与通信链路,有效提升多智能体系统的令牌效率与任务表现,并展现出优异的领域迁移性和结构鲁棒性。
English: AgentDropout enhances multi-agent systems by dynamically eliminating redundant agents and communication, achieving significant reductions in token usage and improved task performance.

Authors:Weichen Fan, Amber Yijia Zheng, Raymond A. Yeh, Ziwei Liu
Title: CFG-Zero*: Improved Classifier-Free Guidance for Flow Matching Models
Abstract:
Classifier-Free Guidance (CFG) is a widely adopted technique in diffusion/flow models to improve image fidelity and controllability. In this work, we first analytically study the effect of CFG on flow matching models trained on Gaussian mixtures where the ground-truth flow can be derived. We observe that in the early stages of training, when the flow estimation is inaccurate, CFG directs samples toward incorrect trajectories. Building on this observation, we propose CFG-Zero*, an improved CFG with two contributions: (a) optimized scale, where a scalar is optimized to correct for the inaccuracies in the estimated velocity, hence the * in the name; and (b) zero-init, which involves zeroing out the first few steps of the ODE solver. Experiments on both text-to-image (Lumina-Next, Stable Diffusion 3, and Flux) and text-to-video (Wan-2.1) generation demonstrate that CFG-Zero* consistently outperforms CFG, highlighting its effectiveness in guiding Flow Matching models. (Code is available at github.com/WeichenFan/CFG-Zero-star)
中文: 本研究分析了流匹配模型中的无分类器引导技术,发现其在训练初期会误导样本轨迹,并提出CFG-Zero*方法,通过优化缩放和零初始化来修正估计误差,在图像和视频生成任务中均优于原方法。
English: This study analyzes Classifier-Free Guidance (CFG) in flow matching models, revealing its tendency to misdirect samples during early training and proposing CFG-Zero* with optimized scaling and zero-initialization to correct inaccuracies, consistently outperforming CFG in image and video generation tasks.

Authors:Andrey Galichin, Alexey Dontsov, Polina Druzhinina, Anton Razzhigaev, Oleg Y. Rogov, Elena Tutubalina, Ivan Oseledets
Title: I Have Covered All the Bases Here: Interpreting Reasoning Features in Large Language Models via Sparse Autoencoders
Abstract:
Recent LLMs like DeepSeek-R1 have demonstrated state-of-the-art performance by integrating deep thinking and complex reasoning during generation. However, the internal mechanisms behind these reasoning processes remain unexplored. We observe reasoning LLMs consistently use vocabulary associated with human reasoning processes. We hypothesize these words correspond to specific reasoning moments within the models' internal mechanisms. To test this hypothesis, we employ Sparse Autoencoders (SAEs), a technique for sparse decomposition of neural network activations into human-interpretable features. We introduce ReasonScore, an automatic metric to identify active SAE features during these reasoning moments. We perform manual and automatic interpretation of the features detected by our metric, and find those with activation patterns matching uncertainty, exploratory thinking, and reflection. Through steering experiments, we demonstrate that amplifying these features increases performance on reasoning-intensive benchmarks (+2.2%) while producing longer reasoning traces (+20.5%). Using the model diffing technique, we provide evidence that these features are present only in models with reasoning capabilities. Our work provides the first step towards a mechanistic understanding of reasoning in LLMs. Code available at https://github.com/AIRI-Institute/SAE-Reasoning
中文: 本研究通过稀疏自编码器引入ReasonScore识别推理大模型中的可解释特征,揭示了与不确定性和反思相关的内部机制,增强这些特征可提升推理性能。
English: This study introduces ReasonScore to identify interpretable features in reasoning LLMs using Sparse Autoencoders, revealing mechanisms for uncertainty and reflection that enhance reasoning performance when amplified.

Authors:Yanda Chen, Gongwei Chen, Miao Zhang, Weili Guan, Liqiang Nie
Title: Curriculum Coarse-to-Fine Selection for High-IPC Dataset Distillation
Abstract:
Dataset distillation (DD) excels in synthesizing a small number of images per class (IPC) but struggles to maintain its effectiveness in high-IPC settings. Recent works on dataset distillation demonstrate that combining distilled and real data can mitigate the effectiveness decay. However, our analysis of the combination paradigm reveals that the current one-shot and independent selection mechanism induces an incompatibility issue between distilled and real images. To address this issue, we introduce a novel curriculum coarse-to-fine selection (CCFS) method for efficient high-IPC dataset distillation. CCFS employs a curriculum selection framework for real data selection, where we leverage a coarse-to-fine strategy to select appropriate real data based on the current synthetic dataset in each curriculum. Extensive experiments validate CCFS, surpassing the state-of-the-art by +6.6\% on CIFAR-10, +5.8\% on CIFAR-100, and +3.4\% on Tiny-ImageNet under high-IPC settings. Notably, CCFS achieves 60.2\% test accuracy on ResNet-18 with a 20\% compression ratio of Tiny-ImageNet, closely matching full-dataset training with only 0.3\% degradation. Code: https://github.com/CYDaaa30/CCFS.
中文摘要:数据集蒸馏在高IPC设置下面临合成与真实图像不兼容的问题,而提出的CCFS方法通过课程式由粗到细的选择策略有效解决了该问题,在多个数据集上实现了最先进的性能。
English Summary: Dataset distillation faces challenges in high-IPC settings due to incompatibility between distilled and real images, which the proposed CCFS method addresses through a curriculum coarse-to-fine selection strategy, achieving state-of-the-art performance across multiple datasets.

Authors:Yuhang Wang, Hanwei Guo, Sizhe Wang, Long Qian, Xuguang Lan
Title: Bootstrapped Model Predictive Control
Abstract:
Model Predictive Control (MPC) has been demonstrated to be effective in continuous control tasks. When a world model and a value function are available, planning a sequence of actions ahead of time leads to a better policy. Existing methods typically obtain the value function and the corresponding policy in a model-free manner. However, we find that such an approach struggles with complex tasks, resulting in poor policy learning and inaccurate value estimation. To address this problem, we leverage the strengths of MPC itself. In this work, we introduce Bootstrapped Model Predictive Control (BMPC), a novel algorithm that performs policy learning in a bootstrapped manner. BMPC learns a network policy by imitating an MPC expert, and in turn, uses this policy to guide the MPC process. Combined with model-based TD-learning, our policy learning yields better value estimation and further boosts the efficiency of MPC. We also introduce a lazy reanalyze mechanism, which enables computationally efficient imitation learning. Our method achieves superior performance over prior works on diverse continuous control tasks. In particular, on challenging high-dimensional locomotion tasks, BMPC significantly improves data efficiency while also enhancing asymptotic performance and training stability, with comparable training time and smaller network sizes. Code is available at https://github.com/wertyuilife2/bmpc.
Chinese: 自举模型预测控制(BMPC)通过模仿MPC专家策略并结合基于模型的时序差分学习,在复杂任务中显著提升了数据效率和性能表现。
English: Bootstrapped Model Predictive Control (BMPC) enhances continuous control by integrating policy imitation of an MPC expert with model-based TD-learning, achieving superior data efficiency and performance on complex tasks.

Authors:Daniel Lepe-Soltero, Thierry Artières, Anaïs Baudot, Paul Villoutreix
Title: MODIS: Multi-Omics Data Integration for Small and unpaired datasets
Abstract:
An important objective in computational biology is the efficient integration of multi-omics data. The task of integration comes with challenges: multi-omics data are most often unpaired (requiring diagonal integration), partially labeled with information about biological conditions, and in some situations such as rare diseases, only very small datasets are available. We present MODIS, a semi supervised framework designed to account for these particular challenges. To address the challenge of very small datasets, we propose to exploit information contained in larger multi-omics databases by training our model on a large reference database and a small target dataset simultaneously, effectively turning the problem of transfer learning into a problem of learning with class imbalance. MODIS performs diagonal integration on unpaired samples, leveraging class-labels to align modalities despite class imbalance and data scarcity. The architecture combines multiple variational auto-encoders, a class classifier and an adversarially trained modality classifier. To ensure training stability, we adapted a regularized relativistic GAN loss to this setting. We first validate MODIS on a synthetic dataset to assess the level of supervision needed for accurate alignment and to quantify the impact of class imbalance on predictive performance. We then apply our approach to the large public TCGA database, considering between 10 and 34 classes (cancer types and normal tissue). MODIS demonstrates high prediction accuracy, robust performance with limited supervision, and stability to class imbalance. These results position MODIS as a promising solution for challenging integration scenarios, particularly diagonal integration with a small number of samples, typical of rare diseases studies. The code is available at https://github.com/VILLOUTREIXLab/MODIS.
中文: MODIS是一个半监督框架,能有效整合未配对的多组学数据,通过利用大型参考数据库和小型目标数据集,在类别不平衡和数据稀缺的情况下仍能实现稳健的对角整合性能。
English: MODIS is a semi-supervised framework that effectively integrates unpaired multi-omics data by leveraging large reference databases and small target datasets, demonstrating robust performance in diagonal integration despite class imbalance and data scarcity.

Authors:Ruichuan An, Sihan Yang, Ming Lu, Renrui Zhang, Kai Zeng, Yulin Luo, Jiajun Cao, Hao Liang, Ying Chen, Qi She, Shanghang Zhang, Wentao Zhang
Title: MC-LLaVA: Multi-Concept Personalized Vision-Language Model
Abstract:
Current vision-language models (VLMs) show exceptional abilities across diverse tasks, such as visual question answering. To enhance user experience, recent studies investigate VLM personalization to understand user-provided concepts. However, they mainly focus on single-concept personalization, neglecting the existence and interplay of multiple concepts, which limits real-world applicability. This paper proposes the first multi-concept personalization paradigm, MC-LLaVA. Specifically, MC-LLaVA employs a multi-concept instruction tuning strategy, effectively integrating multiple concepts in a single training step. To reduce the costs related to joint training, we propose a personalized textual prompt that uses visual token information to initialize concept tokens. Additionally, we introduce a personalized visual prompt during inference, aggregating location confidence maps for enhanced recognition and grounding capabilities. To advance multi-concept personalization research, we further contribute a high-quality instruction tuning dataset. We carefully collect images with multiple characters and objects from movies and manually generate question-answer samples for multi-concept scenarios, featuring superior diversity. Comprehensive qualitative and quantitative experiments demonstrate that MC-LLaVA can achieve impressive multi-concept personalized responses, paving the way for VLMs to become better user-specific assistants. The code and dataset will be publicly available at https://github.com/arctanxarc/MC-LLaVA}.
中文: 本文提出首个多概念个性化框架MC-LLaVA,通过多概念指令微调和个性化提示策略有效整合多个用户概念,并构建高质量数据集显著提升了视觉语言模型在现实场景中的适用性。
English: This paper introduces MC-LLaVA, the first multi-concept personalization framework for vision-language models that integrates multiple user concepts through innovative instruction tuning and personalized prompts, significantly enhancing real-world applicability with a new high-quality dataset.

Authors:Edoardo Debenedetti, Ilia Shumailov, Tianqi Fan, Jamie Hayes, Nicholas Carlini, Daniel Fabian, Christoph Kern, Chongyang Shi, Andreas Terzis, Florian Tramèr
Title: Defeating Prompt Injections by Design
Abstract:
Large Language Models (LLMs) are increasingly deployed in agentic systems that interact with an untrusted environment. However, LLM agents are vulnerable to prompt injection attacks when handling untrusted data. In this paper we propose CaMeL, a robust defense that creates a protective system layer around the LLM, securing it even when underlying models are susceptible to attacks. To operate, CaMeL explicitly extracts the control and data flows from the (trusted) query; therefore, the untrusted data retrieved by the LLM can never impact the program flow. To further improve security, CaMeL uses a notion of a capability to prevent the exfiltration of private data over unauthorized data flows by enforcing security policies when tools are called. We demonstrate effectiveness of CaMeL by solving $77\%$ of tasks with provable security (compared to $84\%$ with an undefended system) in AgentDojo. We release CaMeL at https://github.com/google-research/camel-prompt-injection.
中文:大型语言模型在处理不可信数据时易受提示注入攻击,而提出的CaMeL防御机制通过分离控制流和数据流,并利用能力概念执行安全策略,为模型构建保护层以确保安全。
English: Large Language Models (LLMs) face risks from prompt injection attacks when processing untrusted data, and the proposed CaMeL defense establishes a protective layer to secure LLMs by separating control and data flows while enforcing security policies via capabilities.

Authors:Linwei Chen, Lin Gu, Liang Li, Chenggang Yan, Ying Fu
Title: Frequency Dynamic Convolution for Dense Image Prediction
Abstract:
While Dynamic Convolution (DY-Conv) has shown promising performance by enabling adaptive weight selection through multiple parallel weights combined with an attention mechanism, the frequency response of these weights tends to exhibit high similarity, resulting in high parameter costs but limited adaptability. In this work, we introduce Frequency Dynamic Convolution (FDConv), a novel approach that mitigates these limitations by learning a fixed parameter budget in the Fourier domain. FDConv divides this budget into frequency-based groups with disjoint Fourier indices, enabling the construction of frequency-diverse weights without increasing the parameter cost. To further enhance adaptability, we propose Kernel Spatial Modulation (KSM) and Frequency Band Modulation (FBM). KSM dynamically adjusts the frequency response of each filter at the spatial level, while FBM decomposes weights into distinct frequency bands in the frequency domain and modulates them dynamically based on local content. Extensive experiments on object detection, segmentation, and classification validate the effectiveness of FDConv. We demonstrate that when applied to ResNet-50, FDConv achieves superior performance with a modest increase of +3.6M parameters, outperforming previous methods that require substantial increases in parameter budgets (e.g., CondConv +90M, KW +76.5M). Moreover, FDConv seamlessly integrates into a variety of architectures, including ConvNeXt, Swin-Transformer, offering a flexible and efficient solution for modern vision tasks. The code is made publicly available at https://github.com/Linwei-Chen/FDConv.
中文: FDConv通过频域分组和动态调制技术,以少量参数增长实现了卷积权重的频率多样性,在多种视觉任务中性能卓越,显著优于需大幅增加参数的传统方法。
English: FDConv introduces frequency-based grouping and dynamic modulation techniques to create diverse convolutional weights efficiently, achieving superior performance in vision tasks with minimal parameter increase compared to previous methods.

Authors:Dayou Du, Shijie Cao, Jianyi Cheng, Luo Mai, Ting Cao, Mao Yang
Title: BitDecoding: Unlocking Tensor Cores for Long-Context LLMs with Low-Bit KV Cache
Abstract:
The rise of long-context Large Language Models (LLMs) amplifies memory and bandwidth demands during autoregressive decoding, as the Key-Value (KV) cache grows with each generated token. Low-bit KV-cache quantization (e.g., 4-bit or 2-bit) can reduce memory footprint while preserving accuracy, but existing systems suffer from slow decoding due to their exclusive reliance on CUDA cores, neglecting Tensor Cores (the primary source of compute on modern GPUs). We present BitDecoding, a new long-context LLM inference system with a low-bit KV cache. BitDecoding enables efficient low-bit KV-cache decoding by cooperatively leveraging CUDA cores and Tensor Cores. It introduces methods for automatically inducing optimized layouts to exploit Tensor Cores, along with warp-level parallelization strategies for dequantization. For unified system support, BitDecoding includes a query transformation module supporting diverse attention variants, a quantization kernel that supports both tensor-wise and channel-wise scaling used in various quantization algorithms with high performance, and a dequantization kernel with a software-defined pipeline to coordinate CUDA and Tensor Cores execution for mixed-precision operations. Evaluated on RTX 4090, A100, and H100, BitDecoding accelerates decoding by up to 7.5x, 4.8x, and 8.9x, respectively, over FP16 FlashDecoding-v2, and surpasses the state-of-the-art low-bit system QServe by up to 4.3x. On LLaMA-3.1-8B with a 128K context, BitDecoding reduces single-batch decoding latency by 3x, showing substantial improvements for long-context generation. The code is available at https://github.com/DD-DuDa/BitDecoding.
中文: BitDecoding是一种创新的长上下文LLM推理系统,通过协同利用CUDA和Tensor核心优化低比特KV缓存解码,在保持精度的同时大幅提升了现有方法的解码速度。
English: BitDecoding is a novel long-context LLM inference system that optimizes low-bit KV-cache decoding by leveraging both CUDA and Tensor Cores, achieving significant speed improvements over existing methods while maintaining accuracy.

Authors:Nathan Darjana, Ryo Fujii, Hideo Saito, Hiroki Kajita
Title: EgoSurgery-HTS: A Dataset for Egocentric Hand-Tool Segmentation in Open Surgery Videos
Abstract:
Egocentric open-surgery videos capture rich, fine-grained details essential for accurately modeling surgical procedures and human behavior in the operating room. A detailed, pixel-level understanding of hands and surgical tools is crucial for interpreting a surgeon's actions and intentions. We introduce EgoSurgery-HTS, a new dataset with pixel-wise annotations and a benchmark suite for segmenting surgical tools, hands, and interacting tools in egocentric open-surgery videos. Specifically, we provide a labeled dataset for (1) tool instance segmentation of 14 distinct surgical tools, (2) hand instance segmentation, and (3) hand-tool segmentation to label hands and the tools they manipulate. Using EgoSurgery-HTS, we conduct extensive evaluations of state-of-the-art segmentation methods and demonstrate significant improvements in the accuracy of hand and hand-tool segmentation in egocentric open-surgery videos compared to existing datasets. The dataset will be released at https://github.com/Fujiry0/EgoSurgery.
Chinese: EgoSurgery-HTS数据集为开放手术视频中的手术工具和手部提供了像素级标注,显著提升了分割精度,有助于深入分析手术操作。
English: The EgoSurgery-HTS dataset provides pixel-level annotations for surgical tools and hands in egocentric open-surgery videos, significantly improving segmentation accuracy and enabling detailed analysis of surgical actions.

Authors:Sebastian Tewes, Yufan Chen, Omar Moured, Jiaming Zhang, Rainer Stiefelhagen
Title: SFDLA: Source-Free Document Layout Analysis
Abstract:
Document Layout Analysis (DLA) is a fundamental task in document understanding. However, existing DLA and adaptation methods often require access to large-scale source data and target labels. This requirements severely limiting their real-world applicability, particularly in privacy-sensitive and resource-constrained domains, such as financial statements, medical records, and proprietary business documents. According to our observation, directly transferring source-domain fine-tuned models on target domains often results in a significant performance drop (Avg. -32.64%). In this work, we introduce Source-Free Document Layout Analysis (SFDLA), aiming for adapting a pre-trained source DLA models to an unlabeled target domain, without access to any source data. To address this challenge, we establish the first SFDLA benchmark, covering three major DLA datasets for geometric- and content-aware adaptation. Furthermore, we propose Document Layout Analysis Adapter (DLAdapter), a novel framework that is designed to improve source-free adaptation across document domains. Our method achieves a +4.21% improvement over the source-only baseline and a +2.26% gain over existing source-free methods from PubLayNet to DocLayNet. We believe this work will inspire the DLA community to further investigate source-free document understanding. To support future research of the community, the benchmark, models, and code will be publicly available at https://github.com/s3setewe/sfdla-DLAdapter.
Chinese: 本文提出了源自由文档布局分析(SFDLA)方法,能够在无需源数据的情况下将预训练模型适配到未标注目标域,相比现有方法取得了性能提升。
English: This paper introduces Source-Free Document Layout Analysis (SFDLA), a method that adapts pre-trained models to unlabeled target domains without requiring source data, achieving performance improvements over existing approaches.

Authors:Shaokai Ye, Haozhe Qi, Alexander Mathis, Mackenzie W. Mathis
Title: LLaVAction: evaluating and training multi-modal large language models for action recognition
Abstract:
Understanding human behavior requires measuring behavioral actions. Due to its complexity, behavior is best mapped onto a rich, semantic structure such as language. The recent development of multi-modal large language models (MLLMs) is a promising candidate for a wide range of action understanding tasks. In this work, we focus on evaluating and then improving MLLMs to perform action recognition. We reformulate EPIC-KITCHENS-100, one of the largest and most challenging egocentric action datasets, to the form of video multiple question answering (EPIC-KITCHENS-100-MQA). We show that when we sample difficult incorrect answers as distractors, leading MLLMs struggle to recognize the correct actions. We propose a series of methods that greatly improve the MLLMs' ability to perform action recognition, achieving state-of-the-art on both the EPIC-KITCHENS-100 validation set, as well as outperforming GPT-4o by 21 points in accuracy on EPIC-KITCHENS-100-MQA. Lastly, we show improvements on other action-related video benchmarks such as EgoSchema, PerceptionTest, LongVideoBench, VideoMME and MVBench, suggesting that MLLMs are a promising path forward for complex action tasks. Code and models are available at: https://github.com/AdaptiveMotorControlLab/LLaVAction.
Chinese: 本研究通过将EPIC-KITCHENS-100数据集重构为视频多选题形式,提出了一系列改进多模态大语言模型的方法,在多个行为理解基准测试中实现了最先进的性能表现。
English: This study enhances multi-modal large language models (MLLMs) for action recognition by reformulating the EPIC-KITCHENS-100 dataset into a video multiple-choice format and introducing methods that achieve state-of-the-art performance across multiple benchmarks.

Authors:Danrui Li, Yichao Shi, Yaluo Wang, Ziying Shi, Mubbasir Kapadia
Title: ArchSeek: Retrieving Architectural Case Studies Using Vision-Language Models
Abstract:
Efficiently searching for relevant case studies is critical in architectural design, as designers rely on precedent examples to guide or inspire their ongoing projects. However, traditional text-based search tools struggle to capture the inherently visual and complex nature of architectural knowledge, often leading to time-consuming and imprecise exploration. This paper introduces ArchSeek, an innovative case study search system with recommendation capability, tailored for architecture design professionals. Powered by the visual understanding capabilities from vision-language models and cross-modal embeddings, it enables text and image queries with fine-grained control, and interaction-based design case recommendations. It offers architects a more efficient, personalized way to discover design inspirations, with potential applications across other visually driven design fields. The source code is available at https://github.com/danruili/ArchSeek.
中文摘要:ArchSeek是一种创新的建筑案例搜索系统,通过视觉语言模型实现精确的图文查询和个性化设计推荐,有效解决了传统文本搜索工具在捕捉建筑知识视觉复杂性方面的不足。
English Summary: ArchSeek is an innovative visual search system for architects that uses vision-language models to enable precise text and image queries, offering personalized design case recommendations to overcome the limitations of traditional text-based tools.

Authors:Bingchen Miao, Yang Wu, Minghe Gao, Qifan Yu, Wendong Bu, Wenqiao Zhang, Yunfei Li, Siliang Tang, Tat-Seng Chua, Juncheng Li
Title: Boosting Virtual Agent Learning and Reasoning: A Step-Wise, Multi-Dimensional, and Generalist Reward Model with Benchmark
Abstract:
The development of Generalist Virtual Agents (GVAs) has shown significant promise in autonomous task execution. However, current training paradigms face critical limitations, including reliance on outcome supervision and labor-intensive human annotations. To address these challenges, we propose Similar, a Step-Wise Multi-Dimensional Generalist Reward Model, which offers fine-grained signals for agent training and can choose better action for inference-time scaling. Specifically, we begin by systematically defining five dimensions for evaluating agent actions. Building on this framework, we design an MCTS-P algorithm to automatically collect and annotate step-wise, five-dimensional agent execution data. Using this data, we train Similar with the Triple-M strategy. Furthermore, we introduce the first benchmark in the virtual agent domain for step-wise, multi-dimensional reward model training and evaluation, named SRM. This benchmark consists of two components: SRMTrain, which serves as the training set for Similar, and SRMEval, a manually selected test set for evaluating the reward model. Experimental results demonstrate that Similar, through its step-wise, multi-dimensional assessment and synergistic gain, provides GVAs with effective intermediate signals during both training and inference-time scaling. The project is available at https://github.com/antgroup/Similar.
中文:提出的Similar模型通过提供细粒度、多维度的奖励信号,改进了通用虚拟代理的训练与推理,利用自动数据标注和新评估基准克服了现有方法的局限。
English: The proposed Similar model provides fine-grained, multi-dimensional rewards to enhance Generalist Virtual Agents' training and inference, overcoming limitations of current methods through automated data annotation and a new benchmark for evaluation.

Authors:Xingxing Zou, Wen Zhang, Nanxuan Zhao
Title: From Fragment to One Piece: A Survey on AI-Driven Graphic Design
Abstract:
This survey provides a comprehensive overview of the advancements in Artificial Intelligence in Graphic Design (AIGD), focusing on integrating AI techniques to support design interpretation and enhance the creative process. We categorize the field into two primary directions: perception tasks, which involve understanding and analyzing design elements, and generation tasks, which focus on creating new design elements and layouts. The survey covers various subtasks, including visual element perception and generation, aesthetic and semantic understanding, layout analysis, and generation. We highlight the role of large language models and multimodal approaches in bridging the gap between localized visual features and global design intent. Despite significant progress, challenges remain to understanding human intent, ensuring interpretability, and maintaining control over multilayered compositions. This survey serves as a guide for researchers, providing information on the current state of AIGD and potential future directions\footnote{https://github.com/zhangtianer521/excellent\_Intelligent\_graphic\_design}.
本调查全面综述了人工智能在平面设计中的应用,将其分为感知与生成两大方向,强调了大语言模型的作用,并指出理解人类意图和确保可解释性等持续挑战。
This survey comprehensively reviews AI in Graphic Design, categorizing it into perception and generation tasks while highlighting the role of large language models and addressing ongoing challenges like human intent understanding and interpretability.

Authors:Arne Grobrügge, Niklas Kühl, Gerhard Satzger, Philipp Spitzer
Title: Towards Human-Understandable Multi-Dimensional Concept Discovery
Abstract:
Concept-based eXplainable AI (C-XAI) aims to overcome the limitations of traditional saliency maps by converting pixels into human-understandable concepts that are consistent across an entire dataset. A crucial aspect of C-XAI is completeness, which measures how well a set of concepts explains a model's decisions. Among C-XAI methods, Multi-Dimensional Concept Discovery (MCD) effectively improves completeness by breaking down the CNN latent space into distinct and interpretable concept subspaces. However, MCD's explanations can be difficult for humans to understand, raising concerns about their practical utility. To address this, we propose Human-Understandable Multi-dimensional Concept Discovery (HU-MCD). HU-MCD uses the Segment Anything Model for concept identification and implements a CNN-specific input masking technique to reduce noise introduced by traditional masking methods. These changes to MCD, paired with the completeness relation, enable HU-MCD to enhance concept understandability while maintaining explanation faithfulness. Our experiments, including human subject studies, show that HU-MCD provides more precise and reliable explanations than existing C-XAI methods. The code is available at https://github.com/grobruegge/hu-mcd.
中文:HU-MCD通过基于SAM的概念识别和针对CNN的掩码技术,在保持解释忠实度的同时提升了概念可理解性,实验证明其比现有方法提供更精确可靠的可解释AI方案。
English: HU-MCD enhances concept-based explainable AI by improving human understandability through SAM-based concept identification and CNN-specific masking, while maintaining faithfulness and outperforming existing methods in precision and reliability.

Authors:Yihan Wang, Peiyu Liu, Xin Yang
Title: LinkAlign: Scalable Schema Linking for Real-World Large-Scale Multi-Database Text-to-SQL
Abstract:
Schema linking is a critical bottleneck in applying existing Text-to-SQL models to real-world, large-scale, multi-database environments. Through error analysis, we identify two major challenges in schema linking: (1) Database Retrieval: accurately selecting the target database from a large schema pool, while effectively filtering out irrelevant ones; and (2) Schema Item Grounding: precisely identifying the relevant tables and columns within complex and often redundant schemas for SQL generation. Based on these, we introduce LinkAlign, a novel framework tailored for large-scale databases with thousands of fields. LinkAlign comprises three key steps: multi-round semantic enhanced retrieval and irrelevant information isolation for Challenge 1, and schema extraction enhancement for Challenge 2. Each stage supports both Agent and Pipeline execution modes, enabling balancing efficiency and performance via modular design. To enable more realistic evaluation, we construct AmbiDB, a synthetic dataset designed to reflect the ambiguity of real-world schema linking. Experiments on widely-used Text-to-SQL benchmarks demonstrate that LinkAlign consistently outperforms existing baselines on all schema linking metrics. Notably, it improves the overall Text-to-SQL pipeline and achieves a new state-of-the-art score of 33.09% on the Spider 2.0-Lite benchmark using only open-source LLMs, ranking first on the leaderboard at the time of submission. The codes are available at https://github.com/Satissss/LinkAlign
中文: 模式链接是文本到SQL模型在大规模数据库中的关键挑战,LinkAlign框架通过改进数据库检索和模式项定位来解决这一问题,在基准测试中达到了最优性能。
English: Schema linking is a key challenge in Text-to-SQL models for large-scale databases, addressed by the LinkAlign framework which enhances database retrieval and schema item grounding, achieving state-of-the-art performance on benchmarks.

Authors:Chengxiang Huang, Yake Wei, Zequn Yang, Di Hu
Title: Adaptive Unimodal Regulation for Balanced Multimodal Information Acquisition
Abstract:
Sensory training during the early ages is vital for human development. Inspired by this cognitive phenomenon, we observe that the early training stage is also important for the multimodal learning process, where dataset information is rapidly acquired. We refer to this stage as the prime learning window. However, based on our observation, this prime learning window in multimodal learning is often dominated by information-sufficient modalities, which in turn suppresses the information acquisition of information-insufficient modalities. To address this issue, we propose Information Acquisition Regulation (InfoReg), a method designed to balance information acquisition among modalities. Specifically, InfoReg slows down the information acquisition process of information-sufficient modalities during the prime learning window, which could promote information acquisition of information-insufficient modalities. This regulation enables a more balanced learning process and improves the overall performance of the multimodal network. Experiments show that InfoReg outperforms related multimodal imbalanced methods across various datasets, achieving superior model performance. The code is available at https://github.com/GeWu-Lab/InfoReg_CVPR2025.
Chinese: 本研究提出InfoReg方法,通过在关键学习窗口减缓信息充足模态的学习速度来促进信息不足模态的信息获取,从而平衡多模态学习过程并提升整体性能。
English: The study introduces InfoReg, a method that balances multimodal learning by slowing dominant modalities during the prime learning window to enhance weaker ones, improving overall performance across datasets.

Authors:Takashi Isobe, He Cui, Dong Zhou, Mengmeng Ge, Dong Li, Emad Barsoum
Title: AMD-Hummingbird: Towards an Efficient Text-to-Video Model
Abstract:
Text-to-Video (T2V) generation has attracted significant attention for its ability to synthesize realistic videos from textual descriptions. However, existing models struggle to balance computational efficiency and high visual quality, particularly on resource-limited devices, e.g.,iGPUs and mobile phones. Most prior work prioritizes visual fidelity while overlooking the need for smaller, more efficient models suitable for real-world deployment. To address this challenge, we propose a lightweight T2V framework, termed Hummingbird, which prunes existing models and enhances visual quality through visual feedback learning. Our approach reduces the size of the U-Net from 1.4 billion to 0.7 billion parameters, significantly improving efficiency while preserving high-quality video generation. Additionally, we introduce a novel data processing pipeline that leverages Large Language Models (LLMs) and Video Quality Assessment (VQA) models to enhance the quality of both text prompts and video data. To support user-driven training and style customization, we publicly release the full training code, including data processing and model training. Extensive experiments show that our method achieves a 31X speedup compared to state-of-the-art models such as VideoCrafter2, while also attaining the highest overall score on VBench. Moreover, our method supports the generation of videos with up to 26 frames, addressing the limitations of existing U-Net-based methods in long video generation. Notably, the entire training process requires only four GPUs, yet delivers performance competitive with existing leading methods. Hummingbird presents a practical and efficient solution for T2V generation, combining high performance, scalability, and flexibility for real-world applications.
中文:提出的Hummingbird框架通过精简模型参数和引入视觉反馈学习,显著提升了文本到视频生成的效率,在保持高质量输出的同时实现31倍加速,并能支持更长视频序列的生成。
English: The proposed Hummingbird framework significantly enhances text-to-video generation efficiency by reducing model parameters and incorporating visual feedback learning, achieving 31x faster processing while maintaining high-quality output and supporting longer video sequences.

Authors:Bin Li, Dehong Gao, Yeyuan Wang, Linbo Jin, Shanqing Yu, Xiaoyan Cai, Libin Yang
Title: Instruction-Aligned Visual Attention for Mitigating Hallucinations in Large Vision-Language Models
Abstract:
Despite the significant success of Large Vision-Language models(LVLMs), these models still suffer hallucinations when describing images, generating answers that include non-existent objects. It is reported that these models tend to over-focus on certain irrelevant image tokens that do not contain critical information for answering the question and distort the output. To address this, we propose an Instruction-Aligned Visual Attention(IAVA) approach, which identifies irrelevant tokens by comparing changes in attention weights under two different instructions. By applying contrastive decoding, we dynamically adjust the logits generated from original image tokens and irrelevant image tokens, reducing the model's over-attention to irrelevant information. The experimental results demonstrate that IAVA consistently outperforms existing decoding techniques on benchmarks such as MME, POPE, and TextVQA in mitigating object hallucinations. Our IAVA approach is available online at https://github.com/Lee-lab558/IAVA.
中文摘要:提出的指令对齐视觉注意力(IAVA)方法通过对比解码动态调整对无关图像标记的关注,有效缓解大型视觉语言模型中的物体幻觉问题,在多项基准测试中均展现出优越性能。
English Summary: The proposed Instruction-Aligned Visual Attention (IAVA) method mitigates object hallucinations in Large Vision-Language Models by dynamically adjusting attention to irrelevant image tokens through contrastive decoding, demonstrating superior performance across multiple benchmarks.

Authors:Zihao Chen, Hsuanyu Wu, Chi-Hsi Kung, Yi-Ting Chen, Yan-Tsung Peng
Title: ATARS: An Aerial Traffic Atomic Activity Recognition and Temporal Segmentation Dataset
Abstract:
Traffic Atomic Activity which describes traffic patterns for topological intersection dynamics is a crucial topic for the advancement of intelligent driving systems. However, existing atomic activity datasets are collected from an egocentric view, which cannot support the scenarios where traffic activities in an entire intersection are required. Moreover, existing datasets only provide video-level atomic activity annotations, which require exhausting efforts to manually trim the videos for recognition and limit their applications to untrimmed videos. To bridge this gap, we introduce the Aerial Traffic Atomic Activity Recognition and Segmentation (ATARS) dataset, the first aerial dataset designed for multi-label atomic activity analysis. We offer atomic activity labels for each frame, which accurately record the intervals for traffic activities. Moreover, we propose a novel task, Multi-label Temporal Atomic Activity Recognition, enabling the study of accurate temporal localization for atomic activity and easing the burden of manual video trimming for recognition. We conduct extensive experiments to evaluate existing state-of-the-art models on both atomic activity recognition and temporal atomic activity segmentation. The results highlight the unique challenges of our ATARS dataset, such as recognizing extremely small objects' activities. We further provide comprehensive discussion analyzing these challenges and offer valuable insights for future direction to improve recognizing atomic activity in aerial view. Our source code and dataset are available at https://github.com/magecliff96/ATARS/
中文摘要:ATARS数据集作为首个用于多标签原子活动分析的航拍数据集,通过提供逐帧标注实现了精确的时间定位并减少了人工视频剪辑负担,实验揭示了识别微小物体活动等独特挑战。
English Summary: The ATARS dataset is introduced as the first aerial dataset for multi-label atomic activity analysis, providing frame-level annotations to enable precise temporal localization and reduce manual video trimming, while experiments reveal challenges like recognizing small object activities.

Authors:Soulaimene Turki, Daniel Panangian, Houda Chaabouni-Chouayakh, Ksenia Bittner
Title: AIM2PC: Aerial Image to 3D Building Point Cloud Reconstruction
Abstract:
Three-dimensional urban reconstruction of buildings from single-view images has attracted significant attention over the past two decades. However, recent methods primarily focus on rooftops from aerial images, often overlooking essential geometrical details. Additionally, there is a notable lack of datasets containing complete 3D point clouds for entire buildings, along with challenges in obtaining reliable camera pose information for aerial images. This paper addresses these challenges by presenting a novel methodology, AIM2PC , which utilizes our generated dataset that includes complete 3D point clouds and determined camera poses. Our approach takes features from a single aerial image as input and concatenates them with essential additional conditions, such as binary masks and Sobel edge maps, to enable more edge-aware reconstruction. By incorporating a point cloud diffusion model based on Centered denoising Diffusion Probabilistic Models (CDPM), we project these concatenated features onto the partially denoised point cloud using our camera poses at each diffusion step. The proposed method is able to reconstruct the complete 3D building point cloud, including wall information and demonstrates superior performance compared to existing baseline techniques. To allow further comparisons with our methodology the dataset has been made available at https://github.com/Soulaimene/AIM2PCDataset
中文: 本文提出AIM2PC新方法,通过结合图像特征与边缘感知输入,并利用配备相机位姿的扩散模型,从单张航拍图像重建完整的三维建筑点云,其性能优于现有技术且附带公开数据集。
English: This paper introduces AIM2PC, a novel method that reconstructs complete 3D building point clouds from single aerial images by integrating image features with edge-aware inputs and utilizing a diffusion model with provided camera poses, outperforming existing techniques and accompanied by a publicly available dataset.

Authors:Junyuan Gao, Jiahe Song, Jiang Wu, Runchuan Zhu, Guanlin Shen, Shasha Wang, Xingjian Wei, Haote Yang, Songyang Zhang, Weijia Li, Bin Wang, Dahua Lin, Lijun Wu, Conghui He
Title: PM4Bench: A Parallel Multilingual Multi-Modal Multi-task Benchmark for Large Vision Language Model
Abstract:
Existing multilingual benchmarks for Large Vision Language Models (LVLMs) suffer from limitations including language-specific content biases, disjointed multimodal input formats, and a lack of safety evaluation. To address these gaps, we propose PM4Bench, the first Parallel Multilingual Multi-Modal Multi-task Benchmark for LVLMs. PM4Bench features a parallel corpus design across 10 languages, enabling fair and accurate cross-lingual comparisons. It includes the vision setting where text and queries are embedded in images, requiring LVLMs to simultaneously "see", "read", and "think", aligning with real-world applications. Additionally, PM\textsuperscript{4}Bench incorporates safety evaluations, addressing critical oversight in existing multilingual benchmarks. Using PM4Bench, we evaluate 11 mainstream LVLMs, revealing significant cross-linguistic performance disparities, particularly in vision settings, and identifying OCR capability as a key determinant of these imbalances. We will release PM4Bench at https://github.com/opendatalab/PM4Bench .
中文: PM4Bench作为首个并行多语言多模态基准,通过包含10种语言的平行语料库、融合视觉与文本的任务及安全性评估,解决了现有大型视觉语言模型评测的不足,并揭示了与OCR能力相关的性能差异。
English: PM4Bench is introduced as the first parallel multilingual multi-modal benchmark addressing limitations in existing LVLM evaluations by featuring a 10-language parallel corpus, integrated vision-text tasks, and safety assessments, revealing performance disparities tied to OCR capabilities.

Authors:Zequn Zeng, Yudi Su, Jianqiao Sun, Tiansheng Wen, Hao Zhang, Zhengjue Wang, Bo Chen, Hongwei Liu, Jiawei Ma
Title: Explaining Domain Shifts in Language: Concept erasing for Interpretable Image Classification
Abstract:
Concept-based models can map black-box representations to human-understandable concepts, which makes the decision-making process more transparent and then allows users to understand the reason behind predictions. However, domain-specific concepts often impact the final predictions, which subsequently undermine the model generalization capabilities, and prevent the model from being used in high-stake applications. In this paper, we propose a novel Language-guided Concept-Erasing (LanCE) framework. In particular, we empirically demonstrate that pre-trained vision-language models (VLMs) can approximate distinct visual domain shifts via domain descriptors while prompting large Language Models (LLMs) can easily simulate a wide range of descriptors of unseen visual domains. Then, we introduce a novel plug-in domain descriptor orthogonality (DDO) regularizer to mitigate the impact of these domain-specific concepts on the final predictions. Notably, the DDO regularizer is agnostic to the design of concept-based models and we integrate it into several prevailing models. Through evaluation of domain generalization on four standard benchmarks and three newly introduced benchmarks, we demonstrate that DDO can significantly improve the out-of-distribution (OOD) generalization over the previous state-of-the-art concept-based models.Our code is available at https://github.com/joeyz0z/LanCE.
中文:提出的LanCE框架利用视觉语言模型和大语言模型识别领域特定概念,并引入描述符正交性正则化器,通过减少这些概念对预测的影响来提升模型的泛化能力。
English: The proposed LanCE framework uses vision-language models and large language models to identify domain-specific concepts and introduces a descriptor orthogonality regularizer to enhance model generalization by reducing their impact on predictions.

Authors:Wei Deng, Mengshi Qi, Huadong Ma
Title: Global-Local Tree Search in VLMs for 3D Indoor Scene Generation
Abstract:
Large Vision-Language Models (VLMs), such as GPT-4, have achieved remarkable success across various fields. However, there are few studies on 3D indoor scene generation with VLMs. This paper considers this task as a planning problem subject to spatial and layout common sense constraints. To solve the problem with a VLM, we propose a new global-local tree search algorithm. Globally, the method places each object sequentially and explores multiple placements during each placement process, where the problem space is represented as a tree. To reduce the depth of the tree, we decompose the scene structure hierarchically, i.e. room level, region level, floor object level, and supported object level. The algorithm independently generates the floor objects in different regions and supported objects placed on different floor objects. Locally, we also decompose the sub-task, the placement of each object, into multiple steps. The algorithm searches the tree of problem space. To leverage the VLM model to produce positions of objects, we discretize the top-down view space as a dense grid and fill each cell with diverse emojis to make to cells distinct. We prompt the VLM with the emoji grid and the VLM produces a reasonable location for the object by describing the position with the name of emojis. The quantitative and qualitative experimental results illustrate our approach generates more plausible 3D scenes than state-of-the-art approaches. Our source code is available at https://github.com/dw-dengwei/TreeSearchGen .
中文: 本文提出一种新颖的全局-局部树搜索算法,通过将空间规划分解为层次化结构并利用表情符号网格提示,借助视觉语言模型生成合理的3D室内场景。
English: This paper introduces a novel global-local tree search algorithm that leverages Vision-Language Models to generate plausible 3D indoor scenes by decomposing spatial planning into hierarchical levels and using emoji-grid prompts for object placement.

Authors:Zhenyu Pan, Han Liu
Title: MetaSpatial: Reinforcing 3D Spatial Reasoning in VLMs for the Metaverse
Abstract:
We present MetaSpatial, the first reinforcement learning (RL)-based framework designed to enhance 3D spatial reasoning in vision-language models (VLMs), enabling real-time 3D scene generation without the need for hard-coded optimizations. MetaSpatial addresses two core challenges: (i) the lack of internalized 3D spatial reasoning in VLMs, which limits their ability to generate realistic layouts, and (ii) the inefficiency of traditional supervised fine-tuning (SFT) for layout generation tasks, as perfect ground truth annotations are unavailable. Our key innovation is a multi-turn RL-based optimization mechanism that integrates physics-aware constraints and rendered image evaluations, ensuring generated 3D layouts are coherent, physically plausible, and aesthetically consistent. Methodologically, MetaSpatial introduces an adaptive, iterative reasoning process, where the VLM refines spatial arrangements over multiple turns by analyzing rendered outputs, improving scene coherence progressively. Empirical evaluations demonstrate that MetaSpatial significantly enhances the spatial consistency and formatting stability of various scale models. Post-training, object placements are more realistic, aligned, and functionally coherent, validating the effectiveness of RL for 3D spatial reasoning in metaverse, AR/VR, digital twins, and game development applications. Our code, data, and training pipeline are publicly available at https://github.com/PzySeere/MetaSpatial.
Chinese: MetaSpatial提出了首个基于强化学习的框架,通过多轮迭代优化增强视觉语言模型的3D空间推理能力,无需真实标注即可实时生成物理合理且美观的3D场景布局。
English: MetaSpatial introduces the first reinforcement learning framework that enhances 3D spatial reasoning in vision-language models, enabling real-time generation of physically coherent 3D scenes through iterative optimization without requiring ground truth annotations.

Authors:Zhenyu Pan, Han Liu
Title: MetaSpatial: Reinforcing 3D Spatial Reasoning in VLMs for the Metaverse
Abstract:
We present MetaSpatial, the first reinforcement learning (RL)-based framework designed to enhance 3D spatial reasoning in vision-language models (VLMs), enabling real-time 3D scene generation without the need for hard-coded optimizations. MetaSpatial addresses two core challenges: (i) the lack of internalized 3D spatial reasoning in VLMs, which limits their ability to generate realistic layouts, and (ii) the inefficiency of traditional supervised fine-tuning (SFT) for layout generation tasks, as perfect ground truth annotations are unavailable. Our key innovation is a multi-turn RL-based optimization mechanism that integrates physics-aware constraints and rendered image evaluations, ensuring generated 3D layouts are coherent, physically plausible, and aesthetically consistent. Methodologically, MetaSpatial introduces an adaptive, iterative reasoning process, where the VLM refines spatial arrangements over multiple turns by analyzing rendered outputs, improving scene coherence progressively. Empirical evaluations demonstrate that MetaSpatial significantly enhances the spatial consistency and formatting stability of various scale models. Post-training, object placements are more realistic, aligned, and functionally coherent, validating the effectiveness of RL for 3D spatial reasoning in metaverse, AR/VR, digital twins, and game development applications. Our code, data, and training pipeline are publicly available at https://github.com/PzySeere/MetaSpatial.
Chinese: MetaSpatial提出了首个基于强化学习的框架,通过多轮迭代优化增强视觉语言模型的3D空间推理能力,无需真实标注即可实时生成物理合理且美观的3D场景布局。
English: MetaSpatial introduces the first reinforcement learning framework that enhances 3D spatial reasoning in vision-language models, enabling real-time generation of physically coherent 3D scenes through iterative optimization without requiring ground truth annotations.

Authors:Sixian Ding, Xu Jiang, Zhongjing Du, Jiaqi Cui, Xinyi Zeng, Yan Wang
Title: SIT-FER: Integration of Semantic-, Instance-, Text-level Information for Semi-supervised Facial Expression Recognition
Abstract:
Semi-supervised deep facial expression recognition (SS-DFER) has gained increasingly research interest due to the difficulty in accessing sufficient labeled data in practical settings. However, existing SS-DFER methods mainly utilize generated semantic-level pseudo-labels for supervised learning, the unreliability of which compromises their performance and undermines the practical utility. In this paper, we propose a novel SS-DFER framework that simultaneously incorporates semantic, instance, and text-level information to generate high-quality pseudo-labels. Specifically, for the unlabeled data, considering the comprehensive knowledge within the textual descriptions and instance representations, we respectively calculate the similarities between the facial vision features and the corresponding textual and instance features to obtain the probabilities at the text- and instance-level. Combining with the semantic-level probability, these three-level probabilities are elaborately aggregated to gain the final pseudo-labels. Furthermore, to enhance the utilization of one-hot labels for the labeled data, we also incorporate text embeddings excavated from textual descriptions to co-supervise model training, enabling facial visual features to exhibit semantic correlations in the text space. Experiments on three datasets demonstrate that our method significantly outperforms current state-of-the-art SS-DFER methods and even exceeds fully supervised baselines. The code will be available at https://github.com/PatrickStarL/SIT-FER.
中文: 本文提出了一种新颖的半监督深度面部表情识别框架,通过融合语义、实例和文本级信息生成高质量伪标签,并利用文本嵌入增强训练效果,在多个数据集上显著优于现有方法甚至全监督基线。
English: This paper introduces a novel semi-supervised deep facial expression recognition framework that integrates semantic, instance, and text-level information to generate reliable pseudo-labels and leverages text embeddings to enhance training, achieving superior performance over existing methods and fully supervised baselines.

Authors:Jinho Jeong, Sangmin Han, Jinwoo Kim, Seon Joo Kim
Title: Latent Space Super-Resolution for Higher-Resolution Image Generation with Diffusion Models
Abstract:
In this paper, we propose LSRNA, a novel framework for higher-resolution (exceeding 1K) image generation using diffusion models by leveraging super-resolution directly in the latent space. Existing diffusion models struggle with scaling beyond their training resolutions, often leading to structural distortions or content repetition. Reference-based methods address the issues by upsampling a low-resolution reference to guide higher-resolution generation. However, they face significant challenges: upsampling in latent space often causes manifold deviation, which degrades output quality. On the other hand, upsampling in RGB space tends to produce overly smoothed outputs. To overcome these limitations, LSRNA combines Latent space Super-Resolution (LSR) for manifold alignment and Region-wise Noise Addition (RNA) to enhance high-frequency details. Our extensive experiments demonstrate that integrating LSRNA outperforms state-of-the-art reference-based methods across various resolutions and metrics, while showing the critical role of latent space upsampling in preserving detail and sharpness. The code is available at https://github.com/3587jjh/LSRNA.
中文: LSRNA提出了一种结合潜在空间超分辨率和区域噪声添加的新框架,能够在超过1K分辨率下生成高质量图像,同时保持结构完整性并增强细节。
English: LSRNA introduces a latent space super-resolution framework combined with region-wise noise addition to enable high-quality image generation beyond 1K resolution while preserving structural integrity and enhancing details.

Authors:Chenfei Liao, Kaiyu Lei, Xu Zheng, Junha Moon, Zhixiong Wang, Yixuan Wang, Danda Pani Paudel, Luc Van Gool, Xuming Hu
Title: Benchmarking Multi-modal Semantic Segmentation under Sensor Failures: Missing and Noisy Modality Robustness
Abstract:
Multi-modal semantic segmentation (MMSS) addresses the limitations of single-modality data by integrating complementary information across modalities. Despite notable progress, a significant gap persists between research and real-world deployment due to variability and uncertainty in multi-modal data quality. Robustness has thus become essential for practical MMSS applications. However, the absence of standardized benchmarks for evaluating robustness hinders further advancement. To address this, we first survey existing MMSS literature and categorize representative methods to provide a structured overview. We then introduce a robustness benchmark that evaluates MMSS models under three scenarios: Entire-Missing Modality (EMM), Random-Missing Modality (RMM), and Noisy Modality (NM). From a probabilistic standpoint, we model modality failure under two conditions: (1) all damaged combinations are equally probable; (2) each modality fails independently following a Bernoulli distribution. Based on these, we propose four metrics-$mIoU^{Avg}_{EMM}$, $mIoU^{E}_{EMM}$, $mIoU^{Avg}_{RMM}$, and $mIoU^{E}_{RMM}$-to assess model robustness under EMM and RMM. This work provides the first dedicated benchmark for MMSS robustness, offering new insights and tools to advance the field. Source code is available at https://github.com/Chenfei-Liao/Multi-Modal-Semantic-Segmentation-Robustness-Benchmark.
Chinese: 本文首次为多模态语义分割(MMSS)建立了专门的鲁棒性基准,通过评估模型在模态缺失和噪声情况下的表现,旨在缩小研究与实际应用之间的差距。
English: This paper introduces the first dedicated robustness benchmark for multi-modal semantic segmentation (MMSS), evaluating models under missing and noisy modality scenarios to bridge the gap between research and real-world deployment.

Authors:Junteng Liu, Weihao Zeng, Xiwen Zhang, Yijun Wang, Zifei Shan, Junxian He
Title: On the Perception Bottleneck of VLMs for Chart Understanding
Abstract:
Chart understanding requires models to effectively analyze and reason about numerical data, textual elements, and complex visual components. Our observations reveal that the perception capabilities of existing large vision-language models (LVLMs) constitute a critical bottleneck in this process. In this study, we delve into this perception bottleneck by decomposing it into two components: the vision encoder bottleneck, where the visual representation may fail to encapsulate the correct information, and the extraction bottleneck, where the language model struggles to extract the necessary information from the provided visual representations. Through comprehensive experiments, we find that (1) the information embedded within visual representations is substantially richer than what is typically captured by linear extractors, such as the widely used retrieval accuracy metric; (2) While instruction tuning effectively enhances the extraction capability of LVLMs, the vision encoder remains a critical bottleneck, demanding focused attention and improvement. Therefore, we further enhance the visual encoder to mitigate the vision encoder bottleneck under a contrastive learning framework. Empirical results demonstrate that our approach significantly mitigates the perception bottleneck and improves the ability of LVLMs to comprehend charts. Code is publicly available at https://github.com/hkust-nlp/Vision4Chart.
Chinese: 本研究揭示了大型视觉语言模型在图表理解中的感知瓶颈,归因于视觉编码器和信息提取的双重限制,并通过对比学习增强视觉编码器的方法,显著提升了模型性能。
English: This study identifies the perception bottleneck in large vision-language models for chart understanding, attributing it to limitations in both the vision encoder and information extraction, and proposes a contrastive learning-enhanced visual encoder that significantly improves model performance.

Authors:Zhichao Sun, Huazhang Hu, Yidong Ma, Gang Liu, Nemo Chen, Xu Tang, Yao Hu, Yongchao Xu
Title: CQ-DINO: Mitigating Gradient Dilution via Category Queries for Vast Vocabulary Object Detection
Abstract:
With the exponential growth of data, traditional object detection methods are increasingly struggling to handle vast vocabulary object detection tasks effectively. We analyze two key limitations of classification-based detectors: positive gradient dilution, where rare positive categories receive insufficient learning signals, and hard negative gradient dilution, where discriminative gradients are overwhelmed by numerous easy negatives. To address these challenges, we propose CQ-DINO, a category query-based object detection framework that reformulates classification as a contrastive task between object queries and learnable category queries. Our method introduces image-guided query selection, which reduces the negative space by adaptively retrieving top-K relevant categories per image via cross-attention, thereby rebalancing gradient distributions and facilitating implicit hard example mining. Furthermore, CQ-DINO flexibly integrates explicit hierarchical category relationships in structured datasets (e.g., V3Det) or learns implicit category correlations via self-attention in generic datasets (e.g., COCO). Experiments demonstrate that CQ-DINO achieves superior performance on the challenging V3Det benchmark (surpassing previous methods by 2.1% AP) while maintaining competitiveness in COCO. Our work provides a scalable solution for real-world detection systems requiring wide category coverage. The code is publicly at https://github.com/RedAIGC/CQ-DINO.
中文摘要:CQ-DINO提出了一种基于类别查询的目标检测框架,通过将分类重构为对象查询与类别查询间的对比任务,有效解决了传统分类器中梯度稀释问题,在大型词汇数据集上实现了最优性能,并在通用数据集上保持竞争力。
English Summary: CQ-DINO is a novel object detection framework that addresses gradient dilution issues in traditional classifiers by reformulating classification as a contrastive task between object queries and category queries, achieving state-of-the-art performance on large-vocabulary datasets while maintaining competitive results on standard benchmarks.

Authors:Dian Zheng, Cheng Zhang, Xiao-Ming Wu, Cao Li, Chengfei Lv, Jian-Fang Hu, Wei-Shi Zheng
Title: Panorama Generation From NFoV Image Done Right
Abstract:
Generating 360-degree panoramas from narrow field of view (NFoV) image is a promising computer vision task for Virtual Reality (VR) applications. Existing methods mostly assess the generated panoramas with InceptionNet or CLIP based metrics, which tend to perceive the image quality and is \textbf{not suitable for evaluating the distortion}. In this work, we first propose a distortion-specific CLIP, named Distort-CLIP to accurately evaluate the panorama distortion and discover the \textbf{``visual cheating''} phenomenon in previous works (\ie, tending to improve the visual results by sacrificing distortion accuracy). This phenomenon arises because prior methods employ a single network to learn the distinct panorama distortion and content completion at once, which leads the model to prioritize optimizing the latter. To address the phenomenon, we propose \textbf{PanoDecouple}, a decoupled diffusion model framework, which decouples the panorama generation into distortion guidance and content completion, aiming to generate panoramas with both accurate distortion and visual appeal. Specifically, we design a DistortNet for distortion guidance by imposing panorama-specific distortion prior and a modified condition registration mechanism; and a ContentNet for content completion by imposing perspective image information. Additionally, a distortion correction loss function with Distort-CLIP is introduced to constrain the distortion explicitly. The extensive experiments validate that PanoDecouple surpasses existing methods both in distortion and visual metrics.
中文摘要:本文提出Distort-CLIP评估全景图畸变,并设计PanoDecouple解耦扩散框架,通过分离畸变引导与内容补全来生成兼具视觉吸引力和几何准确性的全景图像。
English Summary: This paper introduces Distort-CLIP for evaluating panorama distortion and PanoDecouple, a decoupled diffusion framework that separates distortion guidance from content completion to generate visually appealing panoramas with accurate geometry.

Authors:Yuchuan Tian, Hanting Chen, Mengyu Zheng, Yuchen Liang, Chao Xu, Yunhe Wang
Title: U-REPA: Aligning Diffusion U-Nets to ViTs
Abstract:
Representation Alignment (REPA) that aligns Diffusion Transformer (DiT) hidden-states with ViT visual encoders has proven highly effective in DiT training, demonstrating superior convergence properties, but it has not been validated on the canonical diffusion U-Net architecture that shows faster convergence compared to DiTs. However, adapting REPA to U-Net architectures presents unique challenges: (1) different block functionalities necessitate revised alignment strategies; (2) spatial-dimension inconsistencies emerge from U-Net's spatial downsampling operations; (3) space gaps between U-Net and ViT hinder the effectiveness of tokenwise alignment. To encounter these challenges, we propose U-REPA, a representation alignment paradigm that bridges U-Net hidden states and ViT features as follows: Firstly, we propose via observation that due to skip connection, the middle stage of U-Net is the best alignment option. Secondly, we propose upsampling of U-Net features after passing them through MLPs. Thirdly, we observe difficulty when performing tokenwise similarity alignment, and further introduces a manifold loss that regularizes the relative similarity between samples. Experiments indicate that the resulting U-REPA could achieve excellent generation quality and greatly accelerates the convergence speed. With CFG guidance interval, U-REPA could reach $FID<1.5$ in 200 epochs or 1M iterations on ImageNet 256 $\times$ 256, and needs only half the total epochs to perform better than REPA. Codes are available at https://github.com/YuchuanTian/U-REPA.
中文:U-REPA是一种创新的表征对齐范式,通过解决空间维度不一致性并引入流形损失,成功将REPA适配到U-Net架构,在图像生成质量和收敛速度上均显著优于现有方法。
English: U-REPA is a novel representation alignment paradigm that adapts REPA to U-Net architectures by addressing spatial inconsistencies and introducing manifold loss, achieving superior image generation quality and significantly faster convergence than previous methods.

Authors:Wencheng Zhu, Yuexin Wang, Hongxuan Li, Pengfei Zhu, Qinghua Hu
Title: VTD-CLIP: Video-to-Text Discretization via Prompting CLIP
Abstract:
Vision-language models bridge visual and linguistic understanding and have proven to be powerful for video recognition tasks. Existing approaches primarily rely on parameter-efficient fine-tuning of image-text pre-trained models, yet they often suffer from limited interpretability and poor generalization due to inadequate temporal modeling. To address these, we propose a simple yet effective video-to-text discretization framework. Our method repurposes the frozen text encoder to construct a visual codebook from video class labels due to the many-to-one contrastive alignment between visual and textual embeddings in multimodal pretraining. This codebook effectively transforms temporal visual data into textual tokens via feature lookups and offers interpretable video representations through explicit video modeling. Then, to enhance robustness against irrelevant or noisy frames, we introduce a confidence-aware fusion module that dynamically weights keyframes by assessing their semantic relevance via the codebook. Furthermore, our method incorporates learnable text prompts to conduct adaptive codebook updates. Extensive experiments on HMDB-51, UCF-101, SSv2, and Kinetics-400 have validated the superiority of our approach, achieving more competitive improvements over state-of-the-art methods. The code will be publicly available at https://github.com/isxinxin/VTD-CLIP.
Chinese: 本文提出了一种视频到文本的离散化框架,通过利用类别标签构建的视觉码本将时序视觉数据转化为文本标记,结合置信度感知融合模块和可学习文本提示,在多个基准测试中实现了更强的鲁棒性和性能提升。
English: This paper introduces a video-to-text discretization framework that enhances temporal modeling and interpretability by converting visual data into textual tokens using a codebook derived from class labels, incorporating confidence-aware fusion and adaptive prompts to improve robustness and performance across multiple benchmarks.

Authors:Sherry X. Chen, Misha Sra, Pradeep Sen
Title: Instruct-CLIP: Improving Instruction-Guided Image Editing with Automated Data Refinement Using Contrastive Learning
Abstract:
Although natural language instructions offer an intuitive way to guide automated image editing, deep-learning models often struggle to achieve high-quality results, largely due to the difficulty of creating large, high-quality training datasets. To do this, previous approaches have typically relied on text-to-image (T2I) generative models to produce pairs of original and edited images that simulate the input/output of an instruction-guided image-editing model. However, these image pairs often fail to align with the specified edit instructions due to the limitations of T2I models, which negatively impacts models trained on such datasets. To address this, we present Instruct-CLIP (I-CLIP), a selfsupervised method that learns the semantic changes between original and edited images to refine and better align the instructions in existing datasets. Furthermore, we adapt Instruct-CLIP to handle noisy latent images and diffusion timesteps so that it can be used to train latent diffusion models (LDMs) and efficiently enforce alignment between the edit instruction and the image changes in latent space at any step of the diffusion pipeline. We use Instruct-CLIP to correct the InstructPix2Pix dataset and get over 120K refined samples we then use to fine-tune their model, guided by our novel I-CLIP-based loss function. The resulting model can produce edits that are more aligned with the given instructions. Our code and dataset are available at https://github.com/SherryXTChen/Instruct-CLIP.git.
中文: Instruct-CLIP是一种自监督方法,通过优化图像编辑数据集中指令与图像变化的对齐,利用校正后的数据集和新颖的损失函数提升如InstructPix2Pix等模型的编辑效果。
English: Instruct-CLIP is a self-supervised method that refines image-editing datasets by aligning instructions with image changes, enhancing the performance of models like InstructPix2Pix through a corrected dataset and novel loss function.

Authors:Xudong Mou, Rui Wang, Bo Li, Tianyu Wo, Jie Sun, Hui Wang, Xudong Liu
Title: RoCA: Robust Contrastive One-class Time Series Anomaly Detection with Contaminated Data
Abstract:
The accumulation of time-series signals and the absence of labels make time-series Anomaly Detection (AD) a self-supervised task of deep learning. Methods based on normality assumptions face the following three limitations: (1) A single assumption could hardly characterize the whole normality or lead to some deviation. (2) Some assumptions may go against the principle of AD. (3) Their basic assumption is that the training data is uncontaminated (free of anomalies), which is unrealistic in practice, leading to a decline in robustness. This paper proposes a novel robust approach, RoCA, which is the first to address all of the above three challenges, as far as we are aware. It fuses the separated assumptions of one-class classification and contrastive learning in a single training process to characterize a more complete so-called normality. Additionally, it monitors the training data and computes a carefully designed anomaly score throughout the training process. This score helps identify latent anomalies, which are then used to define the classification boundary, inspired by the concept of outlier exposure. The performance on AIOps datasets improved by 6% compared to when contamination was not considered (COCA). On two large and high-dimensional multivariate datasets, the performance increased by 5% to 10%. RoCA achieves the highest average performance on both univariate and multivariate datasets. The source code is available at https://github.com/ruiking04/RoCA.
中文摘要:时序异常检测作为自监督学习任务,传统基于正态性假设的方法存在三大局限导致鲁棒性下降;本文提出的RoCA方法通过融合单分类与对比学习假设并识别潜在异常,在各类数据集上实现性能提升5-10%。
English Summary: Time-series anomaly detection is a self-supervised learning task where traditional methods relying on normality assumptions face three key limitations, leading to reduced robustness; the proposed RoCA method addresses these by integrating one-class classification and contrastive learning while identifying latent anomalies to improve performance by 5-10% across datasets.

Authors:Hongen Liu, Cheng Cui, Yuning Du, Yi Liu, Gang Pan
Title: PP-FormulaNet: Bridging Accuracy and Efficiency in Advanced Formula Recognition
Abstract:
Formula recognition is an important task in document intelligence. It involves converting mathematical expressions from document images into structured symbolic formats that computers can easily work with. LaTeX is the most common format used for this purpose. In this work, we present PP-FormulaNet, a state-of-the-art formula recognition model that excels in both accuracy and efficiency. To meet the diverse needs of applications, we have developed two specialized models: PP-FormulaNet-L, tailored for high-accuracy scenarios, and PP-FormulaNet-S, optimized for high-efficiency contexts. Our extensive evaluations reveal that PP-FormulaNet-L attains accuracy levels that surpass those of prominent models such as UniMERNet by a significant 6%. Conversely, PP-FormulaNet-S operates at speeds that are over 16 times faster. These advancements facilitate seamless integration of PP-FormulaNet into a broad spectrum of document processing environments that involve intricate mathematical formulas. Furthermore, we introduce a Formula Mining System, which is capable of extracting a vast amount of high-quality formula data. This system further enhances the robustness and applicability of our formula recognition model. Code and models are publicly available at PaddleOCR(https://github.com/PaddlePaddle/PaddleOCR) and PaddleX(https://github.com/PaddlePaddle/PaddleX).
中文: PP-FormulaNet推出了两个专用公式识别模型,其中PP-FormulaNet-L的准确率比UniMERNet高6%,PP-FormulaNet-S的运行速度快16倍以上,并配有公式挖掘系统以提升数据质量和模型鲁棒性。
English: PP-FormulaNet introduces two specialized models for formula recognition, with PP-FormulaNet-L achieving 6% higher accuracy than UniMERNet and PP-FormulaNet-S running over 16 times faster, alongside a Formula Mining System to enhance data quality and model robustness.

Authors:Tianpei Zhang, Yiming Zhu, Jufeng Zhao, Guangmang Cui, Yuchen Zheng
Title: Exploring State Space Model in Wavelet Domain: An Infrared and Visible Image Fusion Network via Wavelet Transform and State Space Model
Abstract:
Deep learning techniques have revolutionized the infrared and visible image fusion (IVIF), showing remarkable efficacy on complex scenarios. However, current methods do not fully combine frequency domain features with global semantic information, which will result in suboptimal extraction of global features across modalities and insufficient preservation of local texture details. To address these issues, we propose Wavelet-Mamba (W-Mamba), which integrates wavelet transform with the state-space model (SSM). Specifically, we introduce Wavelet-SSM module, which incorporates wavelet-based frequency domain feature extraction and global information extraction through SSM, thereby effectively capturing both global and local features. Additionally, we propose a cross-modal feature attention modulation, which facilitates efficient interaction and fusion between different modalities. The experimental results indicate that our method achieves both visually compelling results and superior performance compared to current state-of-the-art methods. Our code is available at https://github.com/Lmmh058/W-Mamba.
中文: 提出的Wavelet-Mamba(W-Mamba)模型将小波变换与状态空间模型相结合,在红外与可见光图像融合中有效捕捉全局和局部特征,相比现有方法获得了更优的视觉效果和性能表现。
English: The proposed Wavelet-Mamba (W-Mamba) model integrates wavelet transform with state-space models to effectively capture global and local features in infrared and visible image fusion, achieving superior visual and performance results compared to existing methods.

Authors:Hankyul Kang, Gregor Seifer, Donghyun Lee, Jongbin Ryu
Title: Do Your Best and Get Enough Rest for Continual Learning
Abstract:
According to the forgetting curve theory, we can enhance memory retention by learning extensive data and taking adequate rest. This means that in order to effectively retain new knowledge, it is essential to learn it thoroughly and ensure sufficient rest so that our brain can memorize without forgetting. The main takeaway from this theory is that learning extensive data at once necessitates sufficient rest before learning the same data again. This aspect of human long-term memory retention can be effectively utilized to address the continual learning of neural networks. Retaining new knowledge for a long period of time without catastrophic forgetting is the critical problem of continual learning. Therefore, based on Ebbinghaus' theory, we introduce the view-batch model that adjusts the learning schedules to optimize the recall interval between retraining the same samples. The proposed view-batch model allows the network to get enough rest to learn extensive knowledge from the same samples with a recall interval of sufficient length. To this end, we specifically present two approaches: 1) a replay method that guarantees the optimal recall interval, and 2) a self-supervised learning that acquires extensive knowledge from a single training sample at a time. We empirically show that these approaches of our method are aligned with the forgetting curve theory, which can enhance long-term memory. In our experiments, we also demonstrate that our method significantly improves many state-of-the-art continual learning methods in various protocols and scenarios. We open-source this project at https://github.com/hankyul2/ViewBatchModel.
中文: 本研究基于遗忘曲线理论,提出视图批次模型,通过优化回忆间隔并结合回放方法与自监督学习,有效增强持续学习系统中的长期记忆保持能力。
English: Based on the forgetting curve theory, this study proposes a view-batch model that optimizes recall intervals and incorporates replay methods with self-supervised learning to enhance long-term memory retention in continual learning systems.

Authors:Xu Han, Yuan Tang, Jinfeng Xu, Xianzhi Li
Title: MoST: Efficient Monarch Sparse Tuning for 3D Representation Learning
Abstract:
We introduce Monarch Sparse Tuning (MoST), the first reparameterization-based parameter-efficient fine-tuning (PEFT) method tailored for 3D representation learning. Unlike existing adapter-based and prompt-tuning 3D PEFT methods, MoST introduces no additional inference overhead and is compatible with many 3D representation learning backbones. At its core, we present a new family of structured matrices for 3D point clouds, Point Monarch, which can capture local geometric features of irregular points while offering high expressiveness. MoST reparameterizes the dense update weight matrices as our sparse Point Monarch matrices, significantly reducing parameters while retaining strong performance. Experiments on various backbones show that MoST is simple, effective, and highly generalizable. It captures local features in point clouds, achieving state-of-the-art results on multiple benchmarks, e.g., 97.5% acc. on ScanObjectNN (PB_50_RS) and 96.2% on ModelNet40 classification, while it can also combine with other matrix decompositions (e.g., Low-rank, Kronecker) to further reduce parameters.
中文: MoST是一种专为三维表示学习设计的参数高效微调方法,通过稀疏Point Monarch矩阵捕捉不规则点的局部几何特征,无需额外推理开销,在ScanObjectNN和ModelNet40等基准测试中达到了最先进性能。
English: MoST is a novel parameter-efficient fine-tuning method for 3D representation learning that uses sparse Point Monarch matrices to capture local geometric features without inference overhead, achieving state-of-the-art performance on benchmarks like ScanObjectNN and ModelNet40.

Authors:Chenxi Xie, Minghan Li, Hui Zeng, Jun Luo, Lei Zhang
Title: MaSS13K: A Matting-level Semantic Segmentation Benchmark
Abstract:
High-resolution semantic segmentation is essential for applications such as image editing, bokeh imaging, AR/VR, etc. Unfortunately, existing datasets often have limited resolution and lack precise mask details and boundaries. In this work, we build a large-scale, matting-level semantic segmentation dataset, named MaSS13K, which consists of 13,348 real-world images, all at 4K resolution. MaSS13K provides high-quality mask annotations of a number of objects, which are categorized into seven categories: human, vegetation, ground, sky, water, building, and others. MaSS13K features precise masks, with an average mask complexity 20-50 times higher than existing semantic segmentation datasets. We consequently present a method specifically designed for high-resolution semantic segmentation, namely MaSSFormer, which employs an efficient pixel decoder that aggregates high-level semantic features and low-level texture features across three stages, aiming to produce high-resolution masks with minimal computational cost. Finally, we propose a new learning paradigm, which integrates the high-quality masks of the seven given categories with pseudo labels from new classes, enabling MaSSFormer to transfer its accurate segmentation capability to other classes of objects. Our proposed MaSSFormer is comprehensively evaluated on the MaSS13K benchmark together with 14 representative segmentation models. We expect that our meticulously annotated MaSS13K dataset and the MaSSFormer model can facilitate the research of high-resolution and high-quality semantic segmentation. Datasets and codes can be found at https://github.com/xiechenxi99/MaSS13K.
中文:本研究提出了MaSS13K高分辨率精确标注数据集和MaSSFormer高效分割模型,能够将其精确分割能力迁移至新物体类别。
English: This work introduces MaSS13K, a high-resolution dataset with precise mask annotations, and MaSSFormer, a model designed for efficient semantic segmentation that can transfer its capabilities to new object classes.

Authors:Inpyo Hong, Youngwan Jo, Hyojeong Lee, Sunghyun Ahn, Kijung Lee, Sanghyun Park
Title: GranQ: Granular Zero-Shot Quantization with Channel-Wise Activation Scaling in QAT
Abstract:
Zero-shot quantization (ZSQ) enables neural network compression without original training data, making it a promising solution for restricted data access scenarios. To compensate for the lack of data, recent ZSQ methods typically rely on synthetic inputs generated from the full-precision model. However, these synthetic inputs often lead to activation distortion, especially under low-bit settings. To mitigate this, existing methods typically employ per-channel scaling, but they still struggle due to the severe computational overhead during the accumulation process. To overcome this critical bottleneck, we propose GranQ, a novel activation quantization framework that introduces an efficient pre-scaling strategy. Unlike conventional channel-wise methods that repeatedly perform scaling operations during accumulation, GranQ applies scaling factors in a pre-scaling step through fully vectorized computation, eliminating runtime scaling overhead. This design enables GranQ to maintain fine-grained quantization accuracy while significantly reducing computational burden, particularly in low-bit quantization settings. Extensive experiments under quantization-aware training (QAT) settings demonstrate that GranQ consistently outperforms state-of-the-art ZSQ methods across CIFAR and ImageNet. In particular, our method achieves up to 5.45% higher accuracy in the 3-bit setting on CIFAR-100 and even surpasses the full-precision baseline on CIFAR-10. Furthermore, GranQ achieves significant speedup in quantization latency over conventional per-channel methods, demonstrating improved efficiency. With these findings, we anticipate that GranQ will inspire future research beyond conventional ZSQ approaches centered on data generation and model fine-tuning. The official code is available at https://github.com/anonymus-orange/GranQ.
中文: GranQ提出了一种高效的预缩放策略,通过全向量化计算消除运行时缩放开销,在低比特量化中实现了更高的精度和速度,显著优于现有零样本量化方法。
English: GranQ introduces an efficient pre-scaling strategy for zero-shot quantization, eliminating runtime scaling overhead to achieve superior accuracy and speedup in low-bit settings while outperforming state-of-the-art methods.

Authors:Wenrui Cai, Qingjie Liu, Yunhong Wang
Title: SPMTrack: Spatio-Temporal Parameter-Efficient Fine-Tuning with Mixture of Experts for Scalable Visual Tracking
Abstract:
Most state-of-the-art trackers adopt one-stream paradigm, using a single Vision Transformer for joint feature extraction and relation modeling of template and search region images. However, relation modeling between different image patches exhibits significant variations. For instance, background regions dominated by target-irrelevant information require reduced attention allocation, while foreground, particularly boundary areas, need to be be emphasized. A single model may not effectively handle all kinds of relation modeling simultaneously. In this paper, we propose a novel tracker called SPMTrack based on mixture-of-experts tailored for visual tracking task (TMoE), combining the capability of multiple experts to handle diverse relation modeling more flexibly. Benefiting from TMoE, we extend relation modeling from image pairs to spatio-temporal context, further improving tracking accuracy with minimal increase in model parameters. Moreover, we employ TMoE as a parameter-efficient fine-tuning method, substantially reducing trainable parameters, which enables us to train SPMTrack of varying scales efficiently and preserve the generalization ability of pretrained models to achieve superior performance. We conduct experiments on seven datasets, and experimental results demonstrate that our method significantly outperforms current state-of-the-art trackers. The source code is available at https://github.com/WenRuiCai/SPMTrack.
中文: SPMTrack提出了一种专为视觉跟踪设计的专家混合模型,能灵活处理多样化的关系建模,并将其扩展至时空上下文,以最小的参数增长提升跟踪精度,在多个数据集上显著优于当前最先进的跟踪器。
English: SPMTrack introduces a mixture-of-experts model tailored for visual tracking to flexibly handle diverse relation modeling, extending it to spatio-temporal context for improved accuracy with minimal parameter increase, and outperforms state-of-the-art trackers across multiple datasets.

Authors:Chun Gu, Xiaofei Wei, Li Zhang, Xiatian Zhu
Title: TensoFlow: Tensorial Flow-based Sampler for Inverse Rendering
Abstract:
Inverse rendering aims to recover scene geometry, material properties, and lighting from multi-view images. Given the complexity of light-surface interactions, importance sampling is essential for the evaluation of the rendering equation, as it reduces variance and enhances the efficiency of Monte Carlo sampling. Existing inverse rendering methods typically use pre-defined non-learnable importance samplers in prior manually, struggling to effectively match the spatially and directionally varied integrand and resulting in high variance and suboptimal performance. To address this limitation, we propose the concept of learning a spatially and directionally aware importance sampler for the rendering equation to accurately and flexibly capture the unconstrained complexity of a typical scene. We further formulate TensoFlow, a generic approach for sampler learning in inverse rendering, enabling to closely match the integrand of the rendering equation spatially and directionally. Concretely, our sampler is parameterized by normalizing flows, allowing both directional sampling of incident light and probability density function (PDF) inference. To capture the characteristics of the sampler spatially, we learn a tensorial representation of the scene space, which imposes spatial conditions, together with reflected direction, leading to spatially and directionally aware sampling distributions. Our model can be optimized by minimizing the difference between the integrand and our normalizing flow. Extensive experiments validate the superiority of TensoFlow over prior alternatives on both synthetic and real-world benchmarks.
Chinese: 本文提出TensoFlow方法,通过归一化流学习具有空间和方向感知的重要性采样器,能更精准匹配渲染方程的积分项,从而在逆向渲染中有效降低方差并提升计算效率。
English: This paper introduces TensoFlow, a novel method that learns a spatially and directionally aware importance sampler using normalizing flows to reduce variance and improve efficiency in inverse rendering by better matching the rendering equation's integrand.

Authors:Jinjin Zhang, Guodong Wang, Yizhou Jin, Di Huang
Title: Towards Training-free Anomaly Detection with Vision and Language Foundation Models
Abstract:
Anomaly detection is valuable for real-world applications, such as industrial quality inspection. However, most approaches focus on detecting local structural anomalies while neglecting compositional anomalies incorporating logical constraints. In this paper, we introduce LogSAD, a novel multi-modal framework that requires no training for both Logical and Structural Anomaly Detection. First, we propose a match-of-thought architecture that employs advanced large multi-modal models (i.e. GPT-4V) to generate matching proposals, formulating interests and compositional rules of thought for anomaly detection. Second, we elaborate on multi-granularity anomaly detection, consisting of patch tokens, sets of interests, and composition matching with vision and language foundation models. Subsequently, we present a calibration module to align anomaly scores from different detectors, followed by integration strategies for the final decision. Consequently, our approach addresses both logical and structural anomaly detection within a unified framework and achieves state-of-the-art results without the need for training, even when compared to supervised approaches, highlighting its robustness and effectiveness. Code is available at https://github.com/zhang0jhon/LogSAD.
Chinese: LogSAD是一种无需训练的新型多模态框架,通过思维匹配架构和多粒度检测,能同时识别逻辑和结构异常,并取得了最先进的性能。
English: LogSAD is a novel multi-modal framework that detects both logical and structural anomalies without training, achieving state-of-the-art results through a match-of-thought architecture and multi-granularity detection.

Authors:Christoforos N. Spartalis, Theodoros Semertzidis, Efstratios Gavves, Petros Daras
Title: LoTUS: Large-Scale Machine Unlearning with a Taste of Uncertainty
Abstract:
We present LoTUS, a novel Machine Unlearning (MU) method that eliminates the influence of training samples from pre-trained models, avoiding retraining from scratch. LoTUS smooths the prediction probabilities of the model up to an information-theoretic bound, mitigating its over-confidence stemming from data memorization. We evaluate LoTUS on Transformer and ResNet18 models against eight baselines across five public datasets. Beyond established MU benchmarks, we evaluate unlearning on ImageNet1k, a large-scale dataset, where retraining is impractical, simulating real-world conditions. Moreover, we introduce the novel Retrain-Free Jensen-Shannon Divergence (RF-JSD) metric to enable evaluation under real-world conditions. The experimental results show that LoTUS outperforms state-of-the-art methods in terms of both efficiency and effectiveness. Code: https://github.com/cspartalis/LoTUS.
中文: LoTUS是一种新颖的机器遗忘方法,无需重新训练即可高效消除预训练模型中训练数据的影响,在多个基准测试中效率和效果均优于现有方法。
English: LoTUS is a novel Machine Unlearning method that efficiently removes the influence of training data from pre-trained models without retraining, outperforming existing methods in both efficiency and effectiveness across multiple benchmarks.

Authors:Changlun Li, Yao Shi, Yuyu Luo, Nan Tang
Title: Will LLMs be Professional at Fund Investment? DeepFund: A Live Arena Perspective
Abstract:
Large Language Models (LLMs) have demonstrated impressive capabilities across various domains, but their effectiveness in financial decision-making remains inadequately evaluated. Current benchmarks primarily assess LLMs' understanding on financial documents rather than the ability to manage assets or dig out trading opportunities in dynamic market conditions. Despite the release of new benchmarks for evaluating diversified tasks on the financial domain, we identified four major problems in these benchmarks, which are data leakage, navel-gazing, over-intervention, and maintenance-hard. To pave the research gap, we introduce DeepFund, a comprehensive arena platform for evaluating LLM-based trading strategies in a live environment. Our approach implements a multi-agent framework where they serve as multiple key roles that realize the real-world investment decision processes. Moreover, we provide a web interface that visualizes LLMs' performance with fund investment metrics across different market conditions, enabling detailed comparative analysis. Through DeepFund, we aim to provide a more realistic and fair assessment on LLM's capabilities in fund investment, offering diversified insights and revealing their potential applications in real-world financial markets. Our code is publicly available at https://github.com/HKUSTDial/DeepFund.
中文摘要:大语言模型在金融决策中的有效性评估不足,为此我们推出DeepFund——一个基于多智能体框架的实时评估平台,通过模拟真实投资流程和可视化绩效指标来全面检验LLM在基金投资中的实际应用能力。
English Summary: Large Language Models (LLMs) lack robust evaluation in financial decision-making, prompting the introduction of DeepFund, a live multi-agent platform that assesses LLM-based trading strategies through real-world simulations and performance visualization.

Authors:Fiseha B. Tesema, Alejandro Guerra Manzanares, Tianxiang Cui, Qian Zhang, Moses Solomon, Sean He
Title: LGPS: A Lightweight GAN-Based Approach for Polyp Segmentation in Colonoscopy Images
Abstract:
Colorectal cancer (CRC) is a major global cause of cancer-related deaths, with early polyp detection and removal during colonoscopy being crucial for prevention. While deep learning methods have shown promise in polyp segmentation, challenges such as high computational costs, difficulty in segmenting small or low-contrast polyps, and limited generalizability across datasets persist. To address these issues, we propose LGPS, a lightweight GAN-based framework for polyp segmentation. LGPS incorporates three key innovations: (1) a MobileNetV2 backbone enhanced with modified residual blocks and Squeeze-and-Excitation (ResE) modules for efficient feature extraction; (2) Convolutional Conditional Random Fields (ConvCRF) for precise boundary refinement; and (3) a hybrid loss function combining Binary Cross-Entropy, Weighted IoU Loss, and Dice Loss to address class imbalance and enhance segmentation accuracy. LGPS is validated on five benchmark datasets and compared with state-of-the-art(SOTA) methods. On the largest and challenging PolypGen test dataset, LGPS achieves a Dice of 0.7299 and an IoU of 0.7867, outperformed all SOTA works and demonstrating robust generalization. With only 1.07 million parameters, LGPS is 17 times smaller than the smallest existing model, making it highly suitable for real-time clinical applications. Its lightweight design and strong performance underscore its potential for improving early CRC diagnosis. Code is available at https://github.com/Falmi/LGPS/.
Chinese: 提出的LGPS框架采用基于GAN的轻量级架构,结合MobileNetV2增强模块、ConvCRF边界优化和混合损失函数,在五个基准数据集上实现了最优的息肉分割性能,其模型参数量仅为107万,比现有最小模型小17倍,非常适合实时临床诊断应用。
English: The proposed LGPS framework utilizes a lightweight GAN-based architecture with MobileNetV2 enhancements, ConvCRF boundary refinement, and a hybrid loss function to achieve superior polyp segmentation performance, outperforming state-of-the-art methods while being 17 times smaller for real-time clinical use.

Authors:Cheng Huang, Fan Gao, Yutong Liu, Nyima Tashi, Xiangxiang Wang, Thupten Tsering, Ban Ma-bao, Renzeg Duojie, Gadeng Luosang, Rinchen Dongrub, Dorje Tashi, Xiao Feng, Hao Wang, Yongbin Yu
Title: TIB-STC: A Large-Scale Structured Tibetan Benchmark for Low-Resource Language Modeling
Abstract:
Advancement of large language models (LLMs) has brought transformative capabilities to NLP, but such progress remains unevenly distributed, especially for low-resource and culturally rich languages like Tibetan. In this paper, we present TIB-STC, the first large-scale, expert-curated, and multi-domain dataset specifically designed to support the development and evaluation of LLMs for the Tibetan language. Spanning over 11 billion tokens across literature, religion, medicine, law, and daily communication, TIB-STC preserves traditional grammar and stylistic richness. To validate its utility, we train a reference model, Sun-Shine, on TIB-STC through a three-stage pipeline involving pretraining, supervised fine-tuning, and preference optimization. Evaluation on TLUE Benchmark for Tibetan-specific tasks, including Ti-MMLU and Ti-SafetyBench, demonstrates the TIB-STC's effectiveness in enabling robust instruction-following and culturally aligned generation. We release TIB-STC to advance research in low-resource language modeling and promote inclusivity in multilingual NLP. All data are available: https://github.com/Vicentvankor/sun-shine.
中文:TIB-STC数据集作为首个大规模、专家标注的多领域藏语资源,通过Sun-Shine模型验证了其在藏语特定任务中实现文化对齐生成与精准指令跟随的有效性,推动低资源语言建模发展。
English: The TIB-STC dataset, comprising over 11 billion tokens across diverse domains, is introduced to advance Tibetan language modeling, with the Sun-Shine model demonstrating its effectiveness in culturally aligned generation and robust instruction-following on Tibetan-specific benchmarks.

Authors:Siyuan Cheng, Lingjuan Lyu, Zhenting Wang, Xiangyu Zhang, Vikash Sehwag
Title: CO-SPY: Combining Semantic and Pixel Features to Detect Synthetic Images by AI
Abstract:
With the rapid advancement of generative AI, it is now possible to synthesize high-quality images in a few seconds. Despite the power of these technologies, they raise significant concerns regarding misuse. Current efforts to distinguish between real and AI-generated images may lack generalization, being effective for only certain types of generative models and susceptible to post-processing techniques like JPEG compression. To overcome these limitations, we propose a novel framework, Co-Spy, that first enhances existing semantic features (e.g., the number of fingers in a hand) and artifact features (e.g., pixel value differences), and then adaptively integrates them to achieve more general and robust synthetic image detection. Additionally, we create Co-Spy-Bench, a comprehensive dataset comprising 5 real image datasets and 22 state-of-the-art generative models, including the latest models like FLUX. We also collect 50k synthetic images in the wild from the Internet to enable evaluation in a more practical setting. Our extensive evaluations demonstrate that our detector outperforms existing methods under identical training conditions, achieving an average accuracy improvement of approximately 11% to 34%. The code is available at https://github.com/Megum1/Co-Spy.
中文:提出的Co-Spy框架通过增强语义和伪影特征,实现了对AI生成图像的鲁棒检测,在全面评估中验证其以显著准确率提升超越现有方法。
English: The proposed Co-Spy framework enhances semantic and artifact features to achieve robust detection of AI-generated images, outperforming existing methods with significant accuracy improvements as validated through comprehensive evaluations.

Authors:Kazuhiro Yamada, Li Yin, Qingrui Hu, Ning Ding, Shunsuke Iwashita, Jun Ichikawa, Kiwamu Kotani, Calvin Yeung, Keisuke Fujii
Title: TrackID3x3: A Dataset and Algorithm for Multi-Player Tracking with Identification and Pose Estimation in 3x3 Basketball Full-court Videos
Abstract:
Multi-object tracking, player identification, and pose estimation are fundamental components of sports analytics, essential for analyzing player movements, performance, and tactical strategies. However, existing datasets and methodologies primarily target mainstream team sports such as soccer and conventional 5-on-5 basketball, often overlooking scenarios involving fixed-camera setups commonly used at amateur levels, less mainstream sports, or datasets that explicitly incorporate pose annotations. In this paper, we propose the TrackID3x3 dataset, the first publicly available comprehensive dataset specifically designed for multi-player tracking, player identification, and pose estimation in 3x3 basketball scenarios. The dataset comprises three distinct subsets (Indoor fixed-camera, Outdoor fixed-camera, and Drone camera footage), capturing diverse full-court camera perspectives and environments. We also introduce the Track-ID task, a simplified variant of the game state reconstruction task that excludes field detection and focuses exclusively on fixed-camera scenarios. To evaluate performance, we propose a baseline algorithm called Track-ID algorithm, tailored to assess tracking and identification quality. Furthermore, our benchmark experiments, utilizing recent multi-object tracking algorithms (e.g., BoT-SORT-ReID) and top-down pose estimation methods (HRNet, RTMPose, and SwinPose), demonstrate robust results and highlight remaining challenges. Our dataset and evaluation benchmarks provide a solid foundation for advancing automated analytics in 3x3 basketball. Dataset and code will be available at https://github.com/open-starlab/TrackID3x3.
中文: 本文提出了首个针对3x3篮球场景的公开综合数据集TrackID3x3,涵盖多视角拍摄环境并引入专用基线算法,填补了非主流体育项目中多目标追踪与姿态分析的研究空白。
English: This paper introduces the TrackID3x3 dataset, the first comprehensive public dataset for multi-player tracking, identification, and pose estimation in 3x3 basketball, featuring diverse camera setups and a baseline algorithm to address gaps in sports analytics for non-mainstream scenarios.

Authors:Yuming Huang, Wei Gao, Zhiyuan Zhang, Maani Ghaffari, Dezhen Song, Cheng-Zhong Xu, Hui Kong
Title: Learning Orientation Field for OSM-Guided Autonomous Navigation
Abstract:
OpenStreetMap (OSM) has gained popularity recently in autonomous navigation due to its public accessibility, lower maintenance costs, and broader geographical coverage. However, existing methods often struggle with noisy OSM data and incomplete sensor observations, leading to inaccuracies in trajectory planning. These challenges are particularly evident in complex driving scenarios, such as at intersections or facing occlusions. To address these challenges, we propose a robust and explainable two-stage framework to learn an Orientation Field (OrField) for robot navigation by integrating LiDAR scans and OSM routes. In the first stage, we introduce the novel representation, OrField, which can provide orientations for each grid on the map, reasoning jointly from noisy LiDAR scans and OSM routes. To generate a robust OrField, we train a deep neural network by encoding a versatile initial OrField and output an optimized OrField. Based on OrField, we propose two trajectory planners for OSM-guided robot navigation, called Field-RRT* and Field-Bezier, respectively, in the second stage by improving the Rapidly Exploring Random Tree (RRT) algorithm and Bezier curve to estimate the trajectories. Thanks to the robustness of OrField which captures both global and local information, Field-RRT* and Field-Bezier can generate accurate and reliable trajectories even in challenging conditions. We validate our approach through experiments on the SemanticKITTI dataset and our own campus dataset. The results demonstrate the effectiveness of our method, achieving superior performance in complex and noisy conditions. Our code for network training and real-world deployment is available at https://github.com/IMRL/OriField.
中文摘要:该研究提出了一种两阶段框架,通过整合激光雷达和OpenStreetMap数据生成方向场(OrField),利用改进的路径规划算法在复杂环境中实现可靠的机器人导航。
English Summary: The study introduces a two-stage framework using an Orientation Field (OrField) that integrates LiDAR and OpenStreetMap data to enhance robot navigation, demonstrating robust trajectory planning in complex environments through improved algorithms.

Authors:Minh-Tuan Tran, Trung Le, Xuan-May Le, Thanh-Toan Do, Dinh Phung
Title: Enhancing Dataset Distillation via Non-Critical Region Refinement
Abstract:
Dataset distillation has become a popular method for compressing large datasets into smaller, more efficient representations while preserving critical information for model training. Data features are broadly categorized into two types: instance-specific features, which capture unique, fine-grained details of individual examples, and class-general features, which represent shared, broad patterns across a class. However, previous approaches often struggle to balance these features-some focus solely on class-general patterns, neglecting finer instance details, while others prioritize instance-specific features, overlooking the shared characteristics essential for class-level understanding. In this paper, we introduce the Non-Critical Region Refinement Dataset Distillation (NRR-DD) method, which preserves instance-specific details and fine-grained regions in synthetic data while enriching non-critical regions with class-general information. This approach enables models to leverage all pixel information, capturing both feature types and enhancing overall performance. Additionally, we present Distance-Based Representative (DBR) knowledge transfer, which eliminates the need for soft labels in training by relying on the distance between synthetic data predictions and one-hot encoded labels. Experimental results show that NRR-DD achieves state-of-the-art performance on both small- and large-scale datasets. Furthermore, by storing only two distances per instance, our method delivers comparable results across various settings. The code is available at https://github.com/tmtuan1307/NRR-DD.
中文: 提出的NRR-DD方法通过在非关键区域融入类别共性信息同时保留细粒度细节,有效平衡了数据集蒸馏中的实例特定特征和类别通用特征,无需依赖软标签即可通过基于距离的知识迁移实现最优性能。
English: The proposed NRR-DD method effectively balances instance-specific and class-general features in dataset distillation by refining non-critical regions with shared class information while preserving fine-grained details, achieving state-of-the-art performance without requiring soft labels through distance-based knowledge transfer.

Authors:Feiran Wang, Bin Duan, Jiachen Tao, Nikhil Sharma, Dawen Cai, Yan Yan
Title: ZECO: ZeroFusion Guided 3D MRI Conditional Generation
Abstract:
Medical image segmentation is crucial for enhancing diagnostic accuracy and treatment planning in Magnetic Resonance Imaging (MRI). However, acquiring precise lesion masks for segmentation model training demands specialized expertise and significant time investment, leading to a small dataset scale in clinical practice. In this paper, we present ZECO, a ZeroFusion guided 3D MRI conditional generation framework that extracts, compresses, and generates high-fidelity MRI images with corresponding 3D segmentation masks to mitigate data scarcity. To effectively capture inter-slice relationships within volumes, we introduce a Spatial Transformation Module that encodes MRI images into a compact latent space for the diffusion process. Moving beyond unconditional generation, our novel ZeroFusion method progressively maps 3D masks to MRI images in latent space, enabling robust training on limited datasets while avoiding overfitting. ZECO outperforms state-of-the-art models in both quantitative and qualitative evaluations on Brain MRI datasets across various modalities, showcasing its exceptional capability in synthesizing high-quality MRI images conditioned on segmentation masks.
中文: ZECO是一种创新的三维MRI生成框架,通过ZeroFusion和空间变换模块合成具有对应分割掩模的高保真医学图像以缓解数据稀缺问题,在各项评估中均优于现有模型。
English: ZECO is a novel 3D MRI generation framework that synthesizes high-fidelity medical images with corresponding segmentation masks to address data scarcity, outperforming existing models through its innovative ZeroFusion and Spatial Transformation Module.

Authors:Yiheng Zhong, Zihong Luo, Chengzhi Liu, Feilong Tang, Zelin Peng, Ming Hu, Yingzhen Hu, Jionglong Su, Zongyuan Ge, Imran Razzak
Title: PG-SAM: Prior-Guided SAM with Medical for Multi-organ Segmentation
Abstract:
Segment Anything Model (SAM) demonstrates powerful zero-shot capabilities; however, its accuracy and robustness significantly decrease when applied to medical image segmentation. Existing methods address this issue through modality fusion, integrating textual and image information to provide more detailed priors. In this study, we argue that the granularity of text and the domain gap affect the accuracy of the priors. Furthermore, the discrepancy between high-level abstract semantics and pixel-level boundary details in images can introduce noise into the fusion process. To address this, we propose Prior-Guided SAM (PG-SAM), which employs a fine-grained modality prior aligner to leverage specialized medical knowledge for better modality alignment. The core of our method lies in efficiently addressing the domain gap with fine-grained text from a medical LLM. Meanwhile, it also enhances the priors' quality after modality alignment, ensuring more accurate segmentation. In addition, our decoder enhances the model's expressive capabilities through multi-level feature fusion and iterative mask optimizer operations, supporting unprompted learning. We also propose a unified pipeline that effectively supplies high-quality semantic information to SAM. Extensive experiments on the Synapse dataset demonstrate that the proposed PG-SAM achieves state-of-the-art performance. Our code is released at https://github.com/logan-0623/PG-SAM.
Chinese: 针对SAM模型在医学图像分割中因领域差异和语义鸿沟导致的精度不足问题,PG-SAM通过细粒度医学文本先验对齐和多级特征融合解码器,在Synapse数据集上实现了最先进的性能。
English: The Segment Anything Model (SAM) struggles with accuracy and robustness in medical image segmentation due to domain gaps and semantic discrepancies, prompting the development of PG-SAM, which integrates fine-grained medical text priors and a multi-feature decoder to achieve state-of-the-art performance on the Synapse dataset.

Authors:Massimo Bini, Leander Girrbach, Zeynep Akata
Title: DeLoRA: Decoupling Angles and Strength in Low-rank Adaptation
Abstract:
Parameter-Efficient FineTuning (PEFT) methods have recently gained significant popularity thanks to the widespread availability of large-scale pretrained models. These methods allow for quick adaptation to downstream tasks with minimal computational cost. However, popular finetuning methods such as LoRA exhibit limited robustness when it comes to hyperparameter choices or extended training regimes, preventing optimal out-of-the-box performance. In contrast, bounded approaches, such as ETHER, provide greater robustness but are limited to extremely low-rank adaptations and fixed-strength transformations, reducing their adaptation expressive power. In this work, we propose Decoupled Low-rank Adaptation (DeLoRA), a novel finetuning method that normalizes and scales learnable low-rank matrices. By bounding the distance of the transformation, DeLoRA effectively decouples the angular learning from the adaptation strength, enhancing robustness without compromising performance. Through evaluations on subject-driven image generation, natural language understanding, and instruction tuning, we show that DeLoRA matches or surpasses performance of competing PEFT methods, while exhibiting stronger robustness. Code is available at https://github.com/ExplainableML/DeLoRA.
中文: DeLoRA是一种新颖的参数高效微调方法,通过将角度学习与适应强度解耦来增强鲁棒性,在多项任务中实现优于或持平现有方法的性能,同时保持更高的稳定性。
English: DeLoRA is a novel parameter-efficient fine-tuning method that enhances robustness by decoupling angular learning from adaptation strength, achieving superior or comparable performance across multiple tasks while maintaining greater stability than existing approaches.

Authors:Valentin Gabeff, Haozhe Qi, Brendan Flaherty, Gencer Sumbül, Alexander Mathis, Devis Tuia
Title: MammAlps: A multi-view video behavior monitoring dataset of wild mammals in the Swiss Alps
Abstract:
Monitoring wildlife is essential for ecology and ethology, especially in light of the increasing human impact on ecosystems. Camera traps have emerged as habitat-centric sensors enabling the study of wildlife populations at scale with minimal disturbance. However, the lack of annotated video datasets limits the development of powerful video understanding models needed to process the vast amount of fieldwork data collected. To advance research in wild animal behavior monitoring we present MammAlps, a multimodal and multi-view dataset of wildlife behavior monitoring from 9 camera-traps in the Swiss National Park. MammAlps contains over 14 hours of video with audio, 2D segmentation maps and 8.5 hours of individual tracks densely labeled for species and behavior. Based on 6135 single animal clips, we propose the first hierarchical and multimodal animal behavior recognition benchmark using audio, video and reference scene segmentation maps as inputs. Furthermore, we also propose a second ecology-oriented benchmark aiming at identifying activities, species, number of individuals and meteorological conditions from 397 multi-view and long-term ecological events, including false positive triggers. We advocate that both tasks are complementary and contribute to bridging the gap between machine learning and ecology. Code and data are available at: https://github.com/eceo-epfl/MammAlps
中文:MammAlps数据集通过提供来自瑞士国家公园相机陷阱的多模态、多视角野生动物视频,解决了标注数据匮乏的问题,并提出了行为识别和生态监测的双重基准,以弥合机器学习与生态学之间的鸿沟。
English: The MammAlps dataset addresses the scarcity of annotated wildlife videos by providing multimodal, multi-view footage from Swiss camera traps, enabling hierarchical behavior recognition and ecological monitoring benchmarks to bridge machine learning and ecology.

Authors:Zhengyuan Li, Kai Cheng, Anindita Ghosh, Uttaran Bhattacharya, Liangyan Gui, Aniket Bera
Title: SimMotionEdit: Text-Based Human Motion Editing with Motion Similarity Prediction
Abstract:
Text-based 3D human motion editing is a critical yet challenging task in computer vision and graphics. While training-free approaches have been explored, the recent release of the MotionFix dataset, which includes source-text-motion triplets, has opened new avenues for training, yielding promising results. However, existing methods struggle with precise control, often leading to misalignment between motion semantics and language instructions. In this paper, we introduce a related task, motion similarity prediction, and propose a multi-task training paradigm, where we train the model jointly on motion editing and motion similarity prediction to foster the learning of semantically meaningful representations. To complement this task, we design an advanced Diffusion-Transformer-based architecture that separately handles motion similarity prediction and motion editing. Extensive experiments demonstrate the state-of-the-art performance of our approach in both editing alignment and fidelity.
Chinese: 本文提出了一种结合运动编辑与运动相似性预测的多任务训练范式,并采用先进的扩散变换器架构,在编辑对齐度和保真度方面实现了最先进的性能。
English: This paper introduces a multi-task training paradigm combining motion editing and motion similarity prediction, supported by an advanced Diffusion-Transformer architecture, achieving state-of-the-art performance in editing alignment and fidelity.

Authors:Suman Adhya, Avishek Lahiri, Debarshi Kumar Sanyal, Partha Pratim Das
Title: Evaluating Negative Sampling Approaches for Neural Topic Models
Abstract:
Negative sampling has emerged as an effective technique that enables deep learning models to learn better representations by introducing the paradigm of learn-to-compare. The goal of this approach is to add robustness to deep learning models to learn better representation by comparing the positive samples against the negative ones. Despite its numerous demonstrations in various areas of computer vision and natural language processing, a comprehensive study of the effect of negative sampling in an unsupervised domain like topic modeling has not been well explored. In this paper, we present a comprehensive analysis of the impact of different negative sampling strategies on neural topic models. We compare the performance of several popular neural topic models by incorporating a negative sampling technique in the decoder of variational autoencoder-based neural topic models. Experiments on four publicly available datasets demonstrate that integrating negative sampling into topic models results in significant enhancements across multiple aspects, including improved topic coherence, richer topic diversity, and more accurate document classification. Manual evaluations also indicate that the inclusion of negative sampling into neural topic models enhances the quality of the generated topics. These findings highlight the potential of negative sampling as a valuable tool for advancing the effectiveness of neural topic models.
中文摘要:负采样通过对比正负样本提升神经主题模型的性能,实验证明其能显著增强主题连贯性、多样性及分类准确性,推动模型效果进步。
English Summary: Negative sampling enhances neural topic models by comparing positive and negative samples, significantly improving topic coherence, diversity, and classification accuracy as demonstrated in comprehensive experiments.

Authors:Haoyang Li, Siyu Zhou, Liang Wang, Guodong Long
Title: MAO: Efficient Model-Agnostic Optimization of Prompt Tuning for Vision-Language Models
Abstract:
Though CLIP-based prompt tuning significantly enhances pre-trained Vision-Language Models, existing research focuses on reconstructing the model architecture, e.g., additional loss calculation and meta-networks. These approaches generally lead to increased complexity and extended training cost. To maintain the efficiency of the tuning process, we propose plug-and-play Model-Agnostic Optimization (MAO) for prompt tuning. Without altering any components of the prompt tuning backbone, we introduce a Data-Driven Enhancement framework to optimize the distribution of the initial data, and incorporate an Alterable Regularization module to boost the task-specific feature processing pipeline, thereby improving overall performance while maintaining low computational cost. Extensive experiments on MAO demonstrate its outstanding performance and efficiency. The code of MAO is available at: https://github.com/JREion/M.A.O .
中文: 本研究提出即插即用的模型无关优化方法(MAO),通过数据驱动增强和可变正则化模块优化初始数据分布与特征处理流程,在保持低计算成本的同时显著提升了CLIP提示调优的性能。
English: To enhance CLIP-based prompt tuning without increasing complexity, this study introduces a plug-and-play Model-Agnostic Optimization (MAO) that improves data distribution and feature processing while maintaining low computational costs.

Authors:Peng Chen, Xiaobao Wei, Ming Lu, Hui Chen, Feng Tian
Title: DiffusionTalker: Efficient and Compact Speech-Driven 3D Talking Head via Personalizer-Guided Distillation
Abstract:
Real-time speech-driven 3D facial animation has been attractive in academia and industry. Traditional methods mainly focus on learning a deterministic mapping from speech to animation. Recent approaches start to consider the nondeterministic fact of speech-driven 3D face animation and employ the diffusion model for the task. Existing diffusion-based methods can improve the diversity of facial animation. However, personalized speaking styles conveying accurate lip language is still lacking, besides, efficiency and compactness still need to be improved. In this work, we propose DiffusionTalker to address the above limitations via personalizer-guided distillation. In terms of personalization, we introduce a contrastive personalizer that learns identity and emotion embeddings to capture speaking styles from audio. We further propose a personalizer enhancer during distillation to enhance the influence of embeddings on facial animation. For efficiency, we use iterative distillation to reduce the steps required for animation generation and achieve more than 8x speedup in inference. To achieve compactness, we distill the large teacher model into a smaller student model, reducing our model's storage by 86.4\% while minimizing performance loss. After distillation, users can derive their identity and emotion embeddings from audio to quickly create personalized animations that reflect specific speaking styles. Extensive experiments are conducted to demonstrate that our method outperforms state-of-the-art methods. The code will be released at: https://github.com/ChenVoid/DiffusionTalker.
中文: 本研究提出DiffusionTalker方法,通过个性化引导蒸馏技术捕捉说话风格并提升效率和紧凑性,显著改进了实时语音驱动的3D面部动画效果,优于现有方法。
English: This study introduces DiffusionTalker, a method that enhances real-time speech-driven 3D facial animation by using personalizer-guided distillation to capture personalized speaking styles and improve efficiency and compactness, outperforming existing approaches.

Authors:Varvara Krechetova, Denis Kochedykov
Title: GeoBenchX: Benchmarking LLMs for Multistep Geospatial Tasks
Abstract:
In this paper, we establish a benchmark for evaluating large language models (LLMs) on multi-step geospatial tasks relevant to commercial GIS practitioners. We assess seven leading commercial LLMs (Sonnet 3.5 and 3.7, Haiku 3.5, Gemini 2.0, GPT-4o, GPT-4o mini, and o3-mini) using a simple tool-calling agent equipped with 23 geospatial functions. Our benchmark comprises tasks across four categories of increasing complexity, with both solvable and intentionally unsolvable tasks to test hallucination rejection. We develop an LLM-as-Judge evaluation framework to compare agent solutions against reference implementations. Results show Sonnet 3.5 and GPT-4o achieve the best overall performance, with Claude models excelling on solvable tasks while OpenAI models better identify unsolvable scenarios. We observe significant differences in token usage, with Anthropic models consuming substantially more tokens than competitors. Common errors include misunderstanding geometrical relationships, relying on outdated knowledge, and inefficient data manipulation. The resulting benchmark set, evaluation framework, and data generation pipeline are released as open-source resources, providing one more standardized method for ongoing evaluation of LLMs for GeoAI.
中文摘要:本文针对商业地理信息系统实践中的多步骤空间任务建立了大语言模型评估基准,结果显示Sonnet 3.5和GPT-4o综合表现最佳,同时发现不同模型在令牌使用效率和常见错误模式上存在显著差异。
English Summary: This paper establishes a benchmark for evaluating large language models on multi-step geospatial tasks, finding Sonnet 3.5 and GPT-4o deliver the best overall performance while revealing significant differences in token efficiency and common error patterns across models.

Authors:Alexander Gielisse, Jan van Gemert
Title: End-to-End Implicit Neural Representations for Classification
Abstract:
Implicit neural representations (INRs) such as NeRF and SIREN encode a signal in neural network parameters and show excellent results for signal reconstruction. Using INRs for downstream tasks, such as classification, is however not straightforward. Inherent symmetries in the parameters pose challenges and current works primarily focus on designing architectures that are equivariant to these symmetries. However, INR-based classification still significantly under-performs compared to pixel-based methods like CNNs. This work presents an end-to-end strategy for initializing SIRENs together with a learned learning-rate scheme, to yield representations that improve classification accuracy. We show that a simple, straightforward, Transformer model applied to a meta-learned SIREN, without incorporating explicit symmetry equivariances, outperforms the current state-of-the-art. On the CIFAR-10 SIREN classification task, we improve the state-of-the-art without augmentations from 38.8% to 59.6%, and from 63.4% to 64.7% with augmentations. We demonstrate scalability on the high-resolution Imagenette dataset achieving reasonable reconstruction quality with a classification accuracy of 60.8% and are the first to do INR classification on the full ImageNet-1K dataset where we achieve a SIREN classification performance of 23.6%. To the best of our knowledge, no other SIREN classification approach has managed to set a classification baseline for any high-resolution image dataset. Our code is available at https://github.com/SanderGielisse/MWT
Chinese: 本研究提出了一种端到端的策略,通过初始化SIREN并结合学习率优化方案来提高分类精度,在多个数据集上无需引入显式对称等变性即实现了最先进的性能。
English: This work introduces an end-to-end strategy that initializes SIRENs with a learned learning-rate scheme to enhance classification accuracy, achieving state-of-the-art performance on multiple datasets without incorporating explicit symmetry equivariances.

Authors:Ze Zhang, Enyuan Zhao, Yi Jiang, Jie Nie, Xinyue Liang
Title: Challenging Dataset and Multi-modal Gated Mixture of Experts Model for Remote Sensing Copy-Move Forgery Understanding
Abstract:
The Remote Sensing Copy-Move Question Answering (RSCMQA) task focuses on interpreting complex tampering scenarios and inferring the relationships between objects. Currently, publicly available datasets often use randomly generated tampered images, which lack spatial logic and do not meet the practical needs of defense security and land resource monitoring. To address this, we propose a high-quality manually annotated RSCMQA dataset, Real-RSCM, which provides more realistic evaluation metrics for the identification and understanding of remote sensing image tampering. The tampered images in the Real-RSCM dataset are subtle, authentic, and challenging, posing significant difficulties for model discrimination capabilities. To overcome these challenges, we introduce a multimodal gated mixture of experts model (CM-MMoE), which guides multi-expert models to discern tampered information in images through multi-level visual semantics and textual joint modeling. Extensive experiments demonstrate that CM-MMoE provides a stronger benchmark for the RSCMQA task compared to general VQA and CMQA models. Our dataset and code are available at https://github.com/shenyedepisa/CM-MMoE.
中文: Real-RSCM数据集通过提供真实的人工标注图像解决了现有遥感篡改数据集的不足,而提出的CM-MMoE模型通过视觉与文本数据的多模态分析显著提升了篡改检测能力。
English: The Real-RSCM dataset addresses the limitations of existing remote sensing tampering datasets by providing realistic, manually annotated images, while the proposed CM-MMoE model enhances tampering detection through multimodal analysis of visual and textual data.

Authors:Zeyuan Ma, Hongqiao Lian, Wenjie Qiu, Yue-Jiao Gong
Title: Accurate Peak Detection in Multimodal Optimization via Approximated Landscape Learning
Abstract:
Detecting potential optimal peak areas and locating the accurate peaks in these areas are two major challenges in Multimodal Optimization problems (MMOPs). To address them, much efforts have been spent on developing novel searching operators, niching strategies and multi-objective problem transformation pipelines. Though promising, existing approaches more or less overlook the potential usage of landscape knowledge. In this paper, we propose a novel optimization framework tailored for MMOPs, termed as APDMMO, which facilitates peak detection via fully leveraging the landscape knowledge and hence capable of providing strong optimization performance on MMOPs. Specifically, we first design a novel surrogate landscape model which ensembles a group of non-linear activation units to improve the regression accuracy on diverse MMOPs. Then we propose a free-of-trial peak detection method which efficiently locates potential peak areas through back-propagation on the learned surrogate landscape model. Based on the detected peak areas, we employ SEP-CMAES for local search within these areas in parallel to further improve the accuracy of the found optima. Extensive benchmarking results demonstrate that APDMMO outperforms several up-to-date baselines. Further ablation studies verify the effectiveness of the proposed novel designs. The source-code is available at ~\href{}{https://github.com/GMC-DRL/APDMMO}.
中文: APDMMO框架通过利用景观知识,采用创新的代理模型和无试错峰值检测方法,有效解决多模态优化问题,其性能优于现有先进基准方法。
English: The APDMMO framework addresses multimodal optimization challenges by leveraging landscape knowledge through a novel surrogate model and a trial-free peak detection method, achieving superior performance compared to existing approaches.

Authors:Ziming Wei, Bingqian Lin, Yunshuang Nie, Jiaqi Chen, Shikui Ma, Hang Xu, Xiaodan Liang
Title: Unseen from Seen: Rewriting Observation-Instruction Using Foundation Models for Augmenting Vision-Language Navigation
Abstract:
Data scarcity is a long-standing challenge in the Vision-Language Navigation (VLN) field, which extremely hinders the generalization of agents to unseen environments. Previous works primarily rely on additional simulator data or web-collected images/videos to improve the generalization. However, the simulator environments still face limited diversity, and the web-collected data often requires extensive labor to remove the noise. In this paper, we propose a Rewriting-driven AugMentation (RAM) paradigm for VLN, which directly creates the unseen observation-instruction pairs via rewriting human-annotated training data. Benefiting from our rewriting mechanism, new observation-instruction can be obtained in both simulator-free and labor-saving manners to promote generalization. Specifically, we first introduce Object-Enriched Observation Rewriting, where we combine Vision-Language Models (VLMs) and Large Language Models (LLMs) to derive rewritten object-enriched scene descriptions, enabling observation synthesis with diverse objects and spatial layouts via Text-to-Image Generation Models (T2IMs). Then, we propose Observation-Contrast Instruction Rewriting, which generates observation-aligned rewritten instructions by requiring LLMs to reason the difference between original and new observations. We further develop a mixing-then-focusing training strategy with a random observation cropping scheme, effectively enhancing data distribution diversity while suppressing augmentation data noise during training. Experiments on both the discrete environments (R2R, REVERIE, and R4R datasets) and continuous environments (R2R-CE dataset) show the superior performance and impressive generalization ability of our method. Code is available at https://github.com/SaDil13/VLN-RAM.
中文摘要:本文提出了一种重写驱动的增强范式(RAM),通过改写人类标注的训练数据直接生成未见过的观察-指令对,以无模拟器和省人工的方式解决了视觉语言导航领域的数据稀缺问题,在多个数据集上展现出卓越的泛化性能。
English Summary: This paper introduces a Rewriting-driven Augmentation (RAM) paradigm that addresses data scarcity in Vision-Language Navigation by generating unseen observation-instruction pairs through rewriting mechanisms, achieving superior generalization across multiple datasets without simulators or manual data cleaning.

Authors:Xiaoming Qi, Jingyang Zhang, Huazhu Fu, Guanyu Yang, Shuo Li, Yueming Jin
Title: Dynamic Allocation Hypernetwork with Adaptive Model Recalibration for FCL
Abstract:
Federated continual learning (FCL) offers an emerging pattern to facilitate the applicability of federated learning (FL) in real-world scenarios, where tasks evolve dynamically and asynchronously across clients, especially in medical scenario. Existing server-side FCL methods in nature domain construct a continually learnable server model by client aggregation on all-involved tasks. However, they are challenged by: (1) Catastrophic forgetting for previously learned tasks, leading to error accumulation in server model, making it difficult to sustain comprehensive knowledge across all tasks. (2) Biased optimization due to asynchronous tasks handled across different clients, leading to the collision of optimization targets of different clients at the same time steps. In this work, we take the first step to propose a novel server-side FCL pattern in medical domain, Dynamic Allocation Hypernetwork with adaptive model recalibration (\textbf{FedDAH}). It is to facilitate collaborative learning under the distinct and dynamic task streams across clients. To alleviate the catastrophic forgetting, we propose a dynamic allocation hypernetwork (DAHyper) where a continually updated hypernetwork is designed to manage the mapping between task identities and their associated model parameters, enabling the dynamic allocation of the model across clients. For the biased optimization, we introduce a novel adaptive model recalibration (AMR) to incorporate the candidate changes of historical models into current server updates, and assign weights to identical tasks across different time steps based on the similarity for continual optimization. Extensive experiments on the AMOS dataset demonstrate the superiority of our FedDAH to other FCL methods on sites with different task streams. The code is available:https://github.com/jinlab-imvr/FedDAH.
中文: 联邦持续学习在动态医疗场景下面临灾难性遗忘和优化偏差的挑战,FedDAH方法通过动态分配超网络和自适应模型重校准技术,有效提升了跨客户端的协同学习能力。
English: Federated continual learning (FCL) faces challenges of catastrophic forgetting and biased optimization in dynamic medical scenarios, which the proposed FedDAH method addresses through a dynamic allocation hypernetwork and adaptive model recalibration to enhance collaborative learning across clients.

Authors:Hongshu Guo, Sijie Ma, Zechuan Huang, Yuzhi Hu, Zeyuan Ma, Xinglin Zhang, Yue-Jiao Gong
Title: Reinforcement Learning-based Self-adaptive Differential Evolution through Automated Landscape Feature Learning
Abstract:
Recently, Meta-Black-Box-Optimization (MetaBBO) methods significantly enhance the performance of traditional black-box optimizers through meta-learning flexible and generalizable meta-level policies that excel in dynamic algorithm configuration (DAC) tasks within the low-level optimization, reducing the expertise required to adapt optimizers for novel optimization tasks. Though promising, existing MetaBBO methods heavily rely on human-crafted feature extraction approach to secure learning effectiveness. To address this issue, this paper introduces a novel MetaBBO method that supports automated feature learning during the meta-learning process, termed as RLDE-AFL, which integrates a learnable feature extraction module into a reinforcement learning-based DE method to learn both the feature encoding and meta-level policy. Specifically, we design an attention-based neural network with mantissa-exponent based embedding to transform the solution populations and corresponding objective values during the low-level optimization into expressive landscape features. We further incorporate a comprehensive algorithm configuration space including diverse DE operators into a reinforcement learning-aided DAC paradigm to unleash the behavior diversity and performance of the proposed RLDE-AFL. Extensive benchmark results show that co-training the proposed feature learning module and DAC policy contributes to the superior optimization performance of RLDE-AFL to several advanced DE methods and recent MetaBBO baselines over both synthetic and realistic BBO scenarios. The source codes of RLDE-AFL are available at https://github.com/GMC-DRL/RLDE-AFL.
中文摘要:本文提出RLDE-AFL方法,通过基于注意力的神经网络和强化学习实现特征自动提取,在多种优化场景中显著优于现有元黑盒优化方法。
English Summary: This paper introduces RLDE-AFL, a Meta-Black-Box-Optimization method that automates feature learning through an attention-based neural network and reinforcement learning, outperforming existing methods in diverse optimization scenarios.

Authors:Zeyuan Ma, Zhiyang Huang, Jiacheng Chen, Zhiguang Cao, Yue-Jiao Gong
Title: Surrogate Learning in Meta-Black-Box Optimization: A Preliminary Study
Abstract:
Recent Meta-Black-Box Optimization (MetaBBO) approaches have shown possibility of enhancing the optimization performance through learning meta-level policies to dynamically configure low-level optimizers. However, existing MetaBBO approaches potentially consume massive function evaluations to train their meta-level policies. Inspired by the recent trend of using surrogate models for cost-friendly evaluation of expensive optimization problems, in this paper, we propose a novel MetaBBO framework which combines surrogate learning process and reinforcement learning-aided Differential Evolution algorithm, namely Surr-RLDE, to address the intensive function evaluation in MetaBBO. Surr-RLDE comprises two learning stages: surrogate learning and policy learning. In surrogate learning, we train a Kolmogorov-Arnold Networks (KAN) with a novel relative-order-aware loss to accurately approximate the objective functions of the problem instances used for subsequent policy learning. In policy learning, we employ reinforcement learning (RL) to dynamically configure the mutation operator in DE. The learned surrogate model is integrated into the training of the RL-based policy to substitute for the original objective function, which effectively reduces consumed evaluations during policy learning. Extensive benchmark results demonstrate that Surr-RLDE not only shows competitive performance to recent baselines, but also shows compelling generalization for higher-dimensional problems. Further ablation studies underscore the effectiveness of each technical components in Surr-RLDE. We open-source Surr-RLDE at https://github.com/GMC-DRL/Surr-RLDE.
中文: 提出的Surr-RLDE框架结合了基于科尔莫戈罗夫-阿诺德网络的代理学习和强化学习来动态配置差分进化算子,在保持竞争优势和强大泛化能力的同时,显著减少了函数评估次数。
English: The proposed Surr-RLDE framework combines surrogate learning using Kolmogorov-Arnold Networks with reinforcement learning to dynamically configure Differential Evolution operators, significantly reducing function evaluations while maintaining competitive optimization performance and strong generalization capabilities.

Authors:Mingde Yao, Menglu Wang, King-Man Tam, Lingen Li, Tianfan Xue, Jinwei Gu
Title: PolarFree: Polarization-based Reflection-free Imaging
Abstract:
Reflection removal is challenging due to complex light interactions, where reflections obscure important details and hinder scene understanding. Polarization naturally provides a powerful cue to distinguish between reflected and transmitted light, enabling more accurate reflection removal. However, existing methods often rely on small-scale or synthetic datasets, which fail to capture the diversity and complexity of real-world scenarios. To this end, we construct a large-scale dataset, PolaRGB, for Polarization-based reflection removal of RGB images, which enables us to train models that generalize effectively across a wide range of real-world scenarios. The PolaRGB dataset contains 6,500 well-aligned mixed-transmission image pairs, 8x larger than existing polarization datasets, and is the first to include both RGB and polarization images captured across diverse indoor and outdoor environments with varying lighting conditions. Besides, to fully exploit the potential of polarization cues for reflection removal, we introduce PolarFree, which leverages diffusion process to generate reflection-free cues for accurate reflection removal. Extensive experiments show that PolarFree significantly enhances image clarity in challenging reflective scenarios, setting a new benchmark for polarized imaging and reflection removal. Code and dataset are available at https://github.com/mdyao/PolarFree.
中文摘要:PolaRGB数据集和PolarFree方法通过利用偏振线索和扩散过程,有效解决了复杂场景下的反射去除难题,显著提升了各类现实环境中的图像清晰度。
English Summary: The PolaRGB dataset and PolarFree method address reflection removal challenges by leveraging polarization cues and diffusion processes, significantly improving image clarity in diverse real-world scenarios.

Authors:Aabid Karim, Abdul Karim, Bhoomika Lohana, Matt Keon, Jaswinder Singh, Abdul Sattar
Title: Lost in Cultural Translation: Do LLMs Struggle with Math Across Cultural Contexts?
Abstract:
Large Language Models (LLMs) have significantly advanced various fields, particularly coding, mathematical reasoning, and logical problem solving. However, a critical question remains: Do these mathematical reasoning abilities persist when LLMs are presented with culturally adapted math problems? Specifically, how do LLMs perform when faced with math problems embedded in cultural contexts that have no significant representation in main stream web-scale AI training data? To explore this, we generated six synthetic cultural datasets from GSM8K, a widely used benchmark for assessing LLMs' mathematical reasoning skills. While preserving the mathematical logic and numerical values of the original GSM8K test set, we modify cultural elements such as personal names, food items, place names, etc. These culturally adapted datasets provide a more reliable framework for evaluating LLMs' mathematical reasoning under shifting cultural contexts. Our findings reveal that LLMs struggle with math problems when cultural references change, even though the underlying mathematical structure remains constant. Smaller models exhibit greater performance drops compared to larger models. Interestingly, our results also suggest that cultural familiarity can enhance mathematical reasoning. Even models with no explicit mathematical training but exposure to relevant cultural contexts sometimes outperform larger, mathematically proficient models on culturally embedded math problems. This study highlights the impact of cultural context on the mathematical reasoning abilities of LLMs, underscoring the need for more diverse and representative training data to improve robustness in real-world applications. The benchmark data sets and script for reproducing the results are available at https://github.com/akarim23131/Lost_in_Cultural_Translation
大型语言模型在文化适应数学问题上表现不佳,尽管数学逻辑未变,尤其小型模型性能下降更明显,表明文化熟悉度可提升推理能力。
Large Language Models struggle with culturally adapted math problems despite unchanged mathematical logic, revealing performance drops especially in smaller models and suggesting cultural familiarity can enhance reasoning.

Authors:Yufei Zhan, Yousong Zhu, Shurong Zheng, Hongyin Zhao, Fan Yang, Ming Tang, Jinqiao Wang
Title: Vision-R1: Evolving Human-Free Alignment in Large Vision-Language Models via Vision-Guided Reinforcement Learning
Abstract:
Large Vision-Language Models (LVLMs) typically follow a two-stage training paradigm-pretraining and supervised fine-tuning. Recently, preference optimization, derived from the language domain, has emerged as an effective post-training reinforcement strategy to enhance capabilities of LVLMs. However, constructing high-quality human-annotated preference data and developing robust reward models to mimic these preferences are both costly and challenging. Motivated by this observation, we propose Vision-R1, a novel vision-guided R1-like reinforcement learning algorithm for LVLMs that rewards models with definitive vision feedback. It only leverages curated instruction data, eliminating the need for specialized reward models and handcrafted preference datasets. We incorporate a criterion-driven reward function that further integrates multi-dimensional feedback to evaluate model completions comprehensively based on the vision task logic. Furthermore, we introduce a progressive rule refinement strategy that dynamically adjusts the reward criteria during training, enabling continuous model improvement and mitigating reward hacking. Extensive experiments on both in-distribution and out-of-distribution benchmarks demonstrate that fine-tuning the 7B LVLMs with Vision-R1 achieves consistent performance gains, with even up to 50% improvement and surpassing the state-of-the-art 10x size model.
Chinese: Vision-R1是一种新颖的大规模视觉语言模型强化学习算法,通过视觉引导反馈来提升模型性能,无需人工标注的偏好数据或专用奖励模型,在基准测试中实现了显著改进。
English: Vision-R1 is a novel reinforcement learning algorithm for Large Vision-Language Models that utilizes vision-guided feedback to enhance performance without requiring human-annotated preference data or specialized reward models, achieving significant improvements in benchmarks.

Authors:Hongyu Yan, Zijun Li, Kunming Luo, Li Lu, Ping Tan
Title: SymmCompletion: High-Fidelity and High-Consistency Point Cloud Completion with Symmetry Guidance
Abstract:
Point cloud completion aims to recover a complete point shape from a partial point cloud. Although existing methods can form satisfactory point clouds in global completeness, they often lose the original geometry details and face the problem of geometric inconsistency between existing point clouds and reconstructed missing parts. To tackle this problem, we introduce SymmCompletion, a highly effective completion method based on symmetry guidance. Our method comprises two primary components: a Local Symmetry Transformation Network (LSTNet) and a Symmetry-Guidance Transformer (SGFormer). First, LSTNet efficiently estimates point-wise local symmetry transformation to transform key geometries of partial inputs into missing regions, thereby generating geometry-align partial-missing pairs and initial point clouds. Second, SGFormer leverages the geometric features of partial-missing pairs as the explicit symmetric guidance that can constrain the refinement process for initial point clouds. As a result, SGFormer can exploit provided priors to form high-fidelity and geometry-consistency final point clouds. Qualitative and quantitative evaluations on several benchmark datasets demonstrate that our method outperforms state-of-the-art completion networks.
中文: SymmCompletion提出了一种基于对称性指导的点云补全方法,通过局部对称变换网络和对称指导变换器,将部分输入的关键几何特征转换并优化,生成高保真且几何一致的完整点云。
English: SymmCompletion introduces a novel point cloud completion method using symmetry guidance through LSTNet and SGFormer to generate high-fidelity, geometry-consistent results by transforming and refining partial inputs.

Authors:Maochen Yang, Zekun Li, Jian Zhang, Lei Qi, Yinghuan Shi
Title: Taste More, Taste Better: Diverse Data and Strong Model Boost Semi-Supervised Crowd Counting
Abstract:
Semi-supervised crowd counting is crucial for addressing the high annotation costs of densely populated scenes. Although several methods based on pseudo-labeling have been proposed, it remains challenging to effectively and accurately utilize unlabeled data. In this paper, we propose a novel framework called Taste More Taste Better (TMTB), which emphasizes both data and model aspects. Firstly, we explore a data augmentation technique well-suited for the crowd counting task. By inpainting the background regions, this technique can effectively enhance data diversity while preserving the fidelity of the entire scenes. Secondly, we introduce the Visual State Space Model as backbone to capture the global context information from crowd scenes, which is crucial for extremely crowded, low-light, and adverse weather scenarios. In addition to the traditional regression head for exact prediction, we employ an Anti-Noise classification head to provide less exact but more accurate supervision, since the regression head is sensitive to noise in manual annotations. We conduct extensive experiments on four benchmark datasets and show that our method outperforms state-of-the-art methods by a large margin. Code is publicly available on https://github.com/syhien/taste_more_taste_better.
Chinese: TMTB框架通过背景修复的数据增强和视觉状态空间模型捕捉全局上下文,结合抗噪分类头减少标注噪声影响,在半监督人群计数中显著超越现有方法。
English: The TMTB framework enhances semi-supervised crowd counting by employing background inpainting for data augmentation and a Visual State Space Model for global context, achieving superior performance with an Anti-Noise classification head to mitigate annotation noise.

Authors:Baizhi Wang, Rui Yan, Wenxin Ma, Xu Zhang, Yuhao Wang, Xiaolong Li, Yunjie Gu, Zihang Jiang, S. Kevin Zhou
Title: Histomorphology-driven multi-instance learning for breast cancer WSI classification
Abstract:
Histomorphology is crucial in breast cancer diagnosis. However, existing whole slide image (WSI) classification methods struggle to effectively incorporate histomorphology information, limiting their ability to capture key and fine-grained pathological features. To address this limitation, we propose a novel framework that explicitly incorporates histomorphology (tumor cellularity, cellular morphology, and tissue architecture) into WSI classification. Specifically, our approach consists of three key components: (1) estimating the importance of tumor-related histomorphology information at the patch level based on medical prior knowledge; (2) generating representative cluster-level features through histomorphology-driven cluster pooling; and (3) enabling WSI-level classification through histomorphology-driven multi-instance aggregation. With the incorporation of histomorphological information, our framework strengthens the model's ability to capture key and fine-grained pathological patterns, thereby enhancing WSI classification performance. Experimental results demonstrate its effectiveness, achieving high diagnostic accuracy for molecular subtyping and cancer subtyping. The code will be made available at https://github.com/Badgewho/HMDMIL.
中文摘要:本文提出了一种新颖框架,通过显式整合组织形态学特征到全切片图像分类中,增强了模型捕捉关键病理模式的能力,在乳腺癌分子分型和亚型分类方面展现出卓越的诊断准确性。
English Summary: This paper introduces a novel framework that explicitly integrates histomorphological features into whole slide image classification, enhancing the model's ability to capture fine-grained pathological patterns and demonstrating superior diagnostic accuracy for breast cancer subtyping.

Authors:Yara AlaaEldin, Francesca Odone
Title: Co-SemDepth: Fast Joint Semantic Segmentation and Depth Estimation on Aerial Images
Abstract:
Understanding the geometric and semantic properties of the scene is crucial in autonomous navigation and particularly challenging in the case of Unmanned Aerial Vehicle (UAV) navigation. Such information may be by obtained by estimating depth and semantic segmentation maps of the surrounding environment and for their practical use in autonomous navigation, the procedure must be performed as close to real-time as possible. In this paper, we leverage monocular cameras on aerial robots to predict depth and semantic maps in low-altitude unstructured environments. We propose a joint deep-learning architecture that can perform the two tasks accurately and rapidly, and validate its effectiveness on MidAir and Aeroscapes benchmark datasets. Our joint-architecture proves to be competitive or superior to the other single and joint architecture methods while performing its task fast predicting 20.2 FPS on a single NVIDIA quadro p5000 GPU and it has a low memory footprint. All codes for training and prediction can be found on this link: https://github.com/Malga-Vision/Co-SemDepth
中文: 本文提出一种联合深度学习架构,利用无人机单目相机在非结构化环境中高效预测深度和语义地图,在基准数据集上展现出优越性能并实现实时运算。
English: This paper introduces a joint deep-learning architecture using monocular cameras on UAVs to efficiently predict depth and semantic maps in unstructured environments, demonstrating competitive performance and real-time speed on benchmark datasets.

Authors:Yuzhi Li, Haojun Xu, Feng Tian
Title: Shot Sequence Ordering for Video Editing: Benchmarks, Metrics, and Cinematology-Inspired Computing Methods
Abstract:
With the rising popularity of short video platforms, the demand for video production has increased substantially. However, high-quality video creation continues to rely heavily on professional editing skills and a nuanced understanding of visual language. To address this challenge, the Shot Sequence Ordering (SSO) task in AI-assisted video editing has emerged as a pivotal approach for enhancing video storytelling and the overall viewing experience. Nevertheless, the progress in this field has been impeded by a lack of publicly available benchmark datasets. In response, this paper introduces two novel benchmark datasets, AVE-Order and ActivityNet-Order. Additionally, we employ the Kendall Tau distance as an evaluation metric for the SSO task and propose the Kendall Tau Distance-Cross Entropy Loss. We further introduce the concept of Cinematology Embedding, which incorporates movie metadata and shot labels as prior knowledge into the SSO model, and constructs the AVE-Meta dataset to validate the method's effectiveness. Experimental results indicate that the proposed loss function and method substantially enhance SSO task accuracy. All datasets are publicly accessible at https://github.com/litchiar/ShotSeqBench.
中文摘要:本文针对AI辅助视频编辑中镜头序列排序任务缺乏公开基准数据集的问题,提出了两个新数据集和新型评估指标,显著提升了排序准确性。
English Summary: This paper addresses the shortage of benchmark datasets for AI-assisted video editing's Shot Sequence Ordering task by introducing two new datasets and a novel evaluation metric that significantly improves ordering accuracy.

Authors:Yang Luo, Shiru Wang, Jun Liu, Jiaxuan Xiao, Rundong Xue, Zeyu Zhang, Hao Zhang, Yu Lu, Yang Zhao, Yutong Xie
Title: PathoHR: Breast Cancer Survival Prediction on High-Resolution Pathological Images
Abstract:
Breast cancer survival prediction in computational pathology presents a remarkable challenge due to tumor heterogeneity. For instance, different regions of the same tumor in the pathology image can show distinct morphological and molecular characteristics. This makes it difficult to extract representative features from whole slide images (WSIs) that truly reflect the tumor's aggressive potential and likely survival outcomes. In this paper, we present PathoHR, a novel pipeline for accurate breast cancer survival prediction that enhances any size of pathological images to enable more effective feature learning. Our approach entails (1) the incorporation of a plug-and-play high-resolution Vision Transformer (ViT) to enhance patch-wise WSI representation, enabling more detailed and comprehensive feature extraction, (2) the systematic evaluation of multiple advanced similarity metrics for comparing WSI-extracted features, optimizing the representation learning process to better capture tumor characteristics, (3) the demonstration that smaller image patches enhanced follow the proposed pipeline can achieve equivalent or superior prediction accuracy compared to raw larger patches, while significantly reducing computational overhead. Experimental findings valid that PathoHR provides the potential way of integrating enhanced image resolution with optimized feature learning to advance computational pathology, offering a promising direction for more accurate and efficient breast cancer survival prediction. Code will be available at https://github.com/AIGeeksGroup/PathoHR.
中文:PathoHR通过集成视觉Transformer增强病理图像分辨率,并采用相似性度量优化特征学习,提出了一种乳腺癌生存预测新方法,在保证高精度的同时显著降低了计算开销。
English: PathoHR introduces a novel pipeline for breast cancer survival prediction by enhancing pathological image resolution with a Vision Transformer and optimizing feature learning through similarity metrics, achieving high accuracy with reduced computational costs.

Authors:Zeng-Hui Zhu, Wei Lu, Si-Bao Chen, Chris H. Q. Ding, Jin Tang, Bin Luo
Title: Real-World Remote Sensing Image Dehazing: Benchmark and Baseline
Abstract:
Remote Sensing Image Dehazing (RSID) poses significant challenges in real-world scenarios due to the complex atmospheric conditions and severe color distortions that degrade image quality. The scarcity of real-world remote sensing hazy image pairs has compelled existing methods to rely primarily on synthetic datasets. However, these methods struggle with real-world applications due to the inherent domain gap between synthetic and real data. To address this, we introduce Real-World Remote Sensing Hazy Image Dataset (RRSHID), the first large-scale dataset featuring real-world hazy and dehazed image pairs across diverse atmospheric conditions. Based on this, we propose MCAF-Net, a novel framework tailored for real-world RSID. Its effectiveness arises from three innovative components: Multi-branch Feature Integration Block Aggregator (MFIBA), which enables robust feature extraction through cascaded integration blocks and parallel multi-branch processing; Color-Calibrated Self-Supervised Attention Module (CSAM), which mitigates complex color distortions via self-supervised learning and attention-guided refinement; and Multi-Scale Feature Adaptive Fusion Module (MFAFM), which integrates features effectively while preserving local details and global context. Extensive experiments validate that MCAF-Net demonstrates state-of-the-art performance in real-world RSID, while maintaining competitive performance on synthetic datasets. The introduction of RRSHID and MCAF-Net sets new benchmarks for real-world RSID research, advancing practical solutions for this complex task. The code and dataset are publicly available at https://github.com/lwCVer/RRSHID.
中文摘要:作者提出了首个大规模真实世界遥感雾霾图像数据集RRSHID,并开发了MCAF-Net框架,通过三个创新模块在真实场景遥感图像去雾中实现最优性能,同时在合成数据上保持竞争力。
English Summary: The authors introduce RRSHID, the first large-scale real-world hazy/dehazed remote sensing image dataset, and propose MCAF-Net with three novel modules that achieves state-of-the-art performance in real-world remote sensing image dehazing while maintaining competitive results on synthetic data.

Authors:Jianjian Yin, Tao Chen, Gensheng Pei, Yazhou Yao, Liqiang Nie, Xiansheng Hua
Title: Semi-supervised Semantic Segmentation with Multi-Constraint Consistency Learning
Abstract:
Consistency regularization has prevailed in semi-supervised semantic segmentation and achieved promising performance. However, existing methods typically concentrate on enhancing the Image-augmentation based Prediction consistency and optimizing the segmentation network as a whole, resulting in insufficient utilization of potential supervisory information. In this paper, we propose a Multi-Constraint Consistency Learning (MCCL) approach to facilitate the staged enhancement of the encoder and decoder. Specifically, we first design a feature knowledge alignment (FKA) strategy to promote the feature consistency learning of the encoder from image-augmentation. Our FKA encourages the encoder to derive consistent features for strongly and weakly augmented views from the perspectives of point-to-point alignment and prototype-based intra-class compactness. Moreover, we propose a self-adaptive intervention (SAI) module to increase the discrepancy of aligned intermediate feature representations, promoting Feature-perturbation based Prediction consistency learning. Self-adaptive feature masking and noise injection are designed in an instance-specific manner to perturb the features for robust learning of the decoder. Experimental results on Pascal VOC2012 and Cityscapes datasets demonstrate that our proposed MCCL achieves new state-of-the-art performance. The source code and models are made available at https://github.com/NUST-Machine-Intelligence-Laboratory/MCCL.
中文: 本文提出多约束一致性学习方法,通过特征知识对齐策略和自适应干预模块分别增强编码器和解码器的性能,在标准数据集上实现了最先进的半监督语义分割效果。
English: The paper introduces a Multi-Constraint Consistency Learning (MCCL) approach that enhances semi-supervised semantic segmentation by separately improving encoder and decoder performance through feature knowledge alignment and self-adaptive intervention, achieving state-of-the-art results on benchmark datasets.

Authors:Xiaoyao Zhong, Haotian Li, Jiabao Jin, Mingyu Yang, Deming Chu, Xiangyu Wang, Zhitao Shen, Wei Jia, George Gu, Yi Xie, Xuemin Lin, Heng Tao Shen, Jingkuan Song, Peng Cheng
Title: VSAG: An Optimized Search Framework for Graph-based Approximate Nearest Neighbor Search
Abstract:
Approximate nearest neighbor search (ANNS) is a fundamental problem in vector databases and AI infrastructures. Recent graph-based ANNS algorithms have achieved high search accuracy with practical efficiency. Despite the advancements, these algorithms still face performance bottlenecks in production, due to the random memory access patterns of graph-based search and the high computational overheads of vector distance. In addition, the performance of a graph-based ANNS algorithm is highly sensitive to parameters, while selecting the optimal parameters is cost-prohibitive, e.g., manual tuning requires repeatedly re-building the index. This paper introduces VSAG, an open-source framework that aims to enhance the in production performance of graph-based ANNS algorithms. VSAG has been deployed at scale in the services of Ant Group, and it incorporates three key optimizations: (i) efficient memory access: it reduces L3 cache misses with pre-fetching and cache-friendly vector organization; (ii) automated parameter tuning: it automatically selects performance-optimal parameters without requiring index rebuilding; (iii) efficient distance computation: it leverages modern hardware, scalar quantization, and smartly switches to low-precision representation to dramatically reduce the distance computation costs. We evaluate VSAG on real-world datasets. The experimental results show that VSAG achieves the state-of-the-art performance and provides up to 4x speedup over HNSWlib (an industry-standard library) while ensuring the same accuracy.
中文: VSAG是一个开源框架,通过优化内存访问、自动化参数调优和高效距离计算,显著提升了基于图的近似最近邻搜索算法在生产环境中的性能,在保证精度的同时比行业标准库快达4倍。
English: VSAG is an open-source framework that enhances the production performance of graph-based approximate nearest neighbor search algorithms through optimized memory access, automated parameter tuning, and efficient distance computation, achieving up to 4x speedup over industry standards while maintaining accuracy.

Authors:Yali Fu, Jindong Li, Qi Wang, Qianli Xing
Title: GLADMamba: Unsupervised Graph-Level Anomaly Detection Powered by Selective State Space Model
Abstract:
Unsupervised graph-level anomaly detection (UGLAD) is a critical and challenging task across various domains, such as social network analysis, anti-cancer drug discovery, and toxic molecule identification. However, existing methods often struggle to capture the long-range dependencies efficiently and neglect the spectral information. Recently, selective State Space Models (SSMs), particularly Mamba, have demonstrated remarkable advantages in capturing long-range dependencies with linear complexity and a selection mechanism. Motivated by their success across various domains, we propose GLADMamba, a novel framework that adapts the selective state space model into UGLAD field. We design View-Fused Mamba (VFM) with a Mamba-Transformer-style architecture to efficiently fuse information from different views with a selective state mechanism. We also design Spectrum-Guided Mamba (SGM) with a Mamba-Transformer-style architecture to leverage the Rayleigh quotient to guide the embedding refining process. GLADMamba can dynamically focus on anomaly-related information while discarding irrelevant information for anomaly detection. To the best of our knowledge, this is the first work to introduce Mamba and explicit spectral information to UGLAD. Extensive experiments on 12 real-world datasets demonstrate that GLADMamba outperforms existing state-of-the-art methods, achieving superior performance in UGLAD. The code is available at https://github.com/Yali-F/GLADMamba.
Chinese: GLADMamba提出了一种新颖框架,将选择性状态空间模型(特别是Mamba)引入无监督图级异常检测领域,通过有效捕获长程依赖并结合频谱信息,在多个数据集上实现了卓越性能。
English: GLADMamba introduces a novel framework that adapts selective state space models, specifically Mamba, into unsupervised graph-level anomaly detection by efficiently capturing long-range dependencies and incorporating spectral information, achieving superior performance across multiple datasets.

Authors:Adriano del Río, Christoph Stoeffler
Title: Adaptive Koopman Model Predictive Control of Simple Serial Robots
Abstract:
Approximating nonlinear systems as linear ones is a common workaround to apply control tools tailored for linear systems. This motivates our present work where we developed a data-driven model predictive controller (MPC) based on the Koopman operator framework, allowing the embedding of nonlinear dynamics in a higher dimensional, but linear function space. The controller, termed adaptive Koopman model predictive control (KMPC), uses online closed-loop feedback to learn and incrementally update a linear representation of nonlinear system dynamics, without the prior knowledge of a model. Adaptive KMPC differs from most other Koopman-based control frameworks that aim to identify high-validity-range models in advance and then enter closed-loop control without further model adaptations. To validate the controller, trajectory tracking experiments are conducted with 1R and 2R robots under force disturbances and changing model parameters. We compare the controller to classical linearization MPC and Koopman-based MPC without model updates, denoted static KMPC. The results show that adaptive KMPC can, opposed to static KMPC, generalize over unforeseen force disturbances and can, opposed to linearization MPC, handle varying dynamic parameters, while using a small set of basis functions to approximate the Koopman operator.
中文: 本文提出了一种自适应Koopman模型预测控制器,通过在线闭环反馈学习并更新非线性系统的线性表示,无需先验模型即可在干扰和参数变化下实现鲁棒的轨迹跟踪。
English: This paper introduces an adaptive Koopman model predictive controller that learns and updates linear representations of nonlinear systems online using closed-loop feedback, enabling robust trajectory tracking under disturbances and parameter changes without prior model knowledge.

Authors:Arastoo Zibaeirad, Marco Vieira
Title: Reasoning with LLMs for Zero-Shot Vulnerability Detection
Abstract:
Automating software vulnerability detection (SVD) remains a critical challenge in an era of increasingly complex and interdependent software systems. Despite significant advances in Large Language Models (LLMs) for code analysis, prevailing evaluation methodologies often lack the \textbf{context-aware robustness} necessary to capture real-world intricacies and cross-component interactions. To address these limitations, we present \textbf{VulnSage}, a comprehensive evaluation framework and a dataset curated from diverse, large-scale open-source system software projects developed in C/C++. Unlike prior datasets, it leverages a heuristic noise pre-filtering approach combined with LLM-based reasoning to ensure a representative and minimally noisy spectrum of vulnerabilities. The framework supports multi-granular analysis across function, file, and inter-function levels and employs four diverse zero-shot prompt strategies: Baseline, Chain-of-Thought, Think, and Think & Verify. Through this evaluation, we uncover that structured reasoning prompts substantially improve LLM performance, with Think & Verify reducing ambiguous responses from 20.3% to 9.1% while increasing accuracy. We further demonstrate that code-specialized models consistently outperform general-purpose alternatives, with performance varying significantly across vulnerability types, revealing that no single approach universally excels across all security contexts. Link to dataset and codes: https://github.com/Erroristotle/VulnSage.git
中文摘要:该研究提出了VulnSage框架和数据集,通过结合上下文感知评估与结构化推理提示,显著提升了大型语言模型在软件漏洞检测中的准确性并降低了模糊响应。
English Summary: The study introduces VulnSage, a framework and dataset that enhances software vulnerability detection by using context-aware evaluation and structured reasoning prompts, which significantly improve accuracy and reduce ambiguity in large language models.

Authors:Yongyi Zang, Qiuqiang Kong
Title: GSound-SIR: A Spatial Impulse Response Ray-Tracing and High-order Ambisonic Auralization Python Toolkit
Abstract:
Accurate and efficient simulation of room impulse responses is crucial for spatial audio applications. However, existing acoustic ray-tracing tools often operate as black boxes and only output impulse responses (IRs), providing limited access to intermediate data or spatial fidelity. To address those problems, this paper presents GSound-SIR, a novel Python-based toolkit for room acoustics simulation that addresses these limitations. The contribution of this paper includes the follows. First, GSound-SIR provides direct access to up to millions of raw ray data points from simulations, enabling in-depth analysis of sound propagation paths that was not possible with previous solutions. Second, we introduce a tool to convert acoustic rays into high-order Ambisonic impulse response synthesis, capturing spatial audio cues with greater fidelity than standard techniques. Third, to enhance efficiency, the toolkit implements an energy-based filtering algorithm and can export only the top-X or top-X-% rays. Fourth, we propose to store the simulation results into Parquet formats, facilitating fast data I/O and seamless integration with data analysis workflows. Together, these features make GSound-SIR an advanced, efficient, and modern foundation for room acoustics research, providing researchers and developers with a powerful new tool for spatial audio exploration. We release the library under Apache 2.0 License at https://github.com/yongyizang/GSound-SIR.
中文: 本文提出GSound-SIR这一基于Python的新型声学仿真工具包,通过提供原始射线数据访问、实现高保真空间音频合成及高效数据处理功能,解决了现有工具在中间数据获取和空间保真度方面的局限。
English: This paper introduces GSound-SIR, a Python-based toolkit that overcomes limitations of existing acoustic simulators by providing direct access to raw ray data, enabling high-fidelity spatial audio synthesis, and implementing efficient data handling for advanced room acoustics research.

Authors:Wen Li, Chen Liu, Shangshu Yu, Dunqiang Liu, Yin Zhou, Siqi Shen, Chenglu Wen, Cheng Wang
Title: LightLoc: Learning Outdoor LiDAR Localization at Light Speed
Abstract:
Scene coordinate regression achieves impressive results in outdoor LiDAR localization but requires days of training. Since training needs to be repeated for each new scene, long training times make these methods impractical for time-sensitive applications, such as autonomous driving, drones, and robotics. We identify large coverage areas and vast data in large-scale outdoor scenes as key challenges that limit fast training. In this paper, we propose LightLoc, the first method capable of efficiently learning localization in a new scene at light speed. LightLoc introduces two novel techniques to address these challenges. First, we introduce sample classification guidance to assist regression learning, reducing ambiguity from similar samples and improving training efficiency. Second, we propose redundant sample downsampling to remove well-learned frames during training, reducing training time without compromising accuracy. Additionally, the fast training and confidence estimation capabilities of sample classification enable its integration into SLAM, effectively eliminating error accumulation. Extensive experiments on large-scale outdoor datasets demonstrate that LightLoc achieves state-of-the-art performance with a 50x reduction in training time than existing methods. Our code is available at https://github.com/liw95/LightLoc.
Chinese: LightLoc通过引入样本分类指导和冗余样本下采样技术,实现了50倍训练速度提升的户外激光雷达定位,为自动驾驶等时效性应用提供了实用解决方案。
English: LightLoc introduces sample classification guidance and redundant sample downsampling to achieve state-of-the-art outdoor LiDAR localization with 50x faster training, making it practical for time-sensitive applications like autonomous driving.

Authors:Rodrigo San-José
Title: An algorithm for computing generalized Hamming weights and the Sage package GHWs
Abstract:
We generalize the Brouwer-Zimmermann algorithm, which is the most efficient general algorithm for computing the minimum distance of a random linear code, to the case of generalized Hamming weights. We also adapt this algorithm to compute the relative generalized Hamming weights of a nested pair of linear codes. In the package GHWs we provide an implementation of this algorithm in Sage, as well as several other utilities for working with generalized Hamming weights. With this implementation, we show that the proposed algorithm is faster than the naive approach of computing the generalized Hamming weights using the definition.
中文: 本文推广了高效的Brouwer-Zimmermann算法以计算广义汉明重量,并使其适用于嵌套线性码的相对广义汉明重量计算,通过Sage软件实现证明了该算法比传统方法更具速度优势。
English: This paper extends the efficient Brouwer-Zimmermann algorithm to compute generalized Hamming weights and adapts it for relative generalized Hamming weights in nested linear codes, with a Sage implementation demonstrating superior speed over naive methods.

Authors:R. D. Lin, Pengcheng Weng, Yinqiao Wang, Han Ding, Jinsong Han, Fei Wang
Title: HiLoTs: High-Low Temporal Sensitive Representation Learning for Semi-Supervised LiDAR Segmentation in Autonomous Driving
Abstract:
LiDAR point cloud semantic segmentation plays a crucial role in autonomous driving. In recent years, semi-supervised methods have gained popularity due to their significant reduction in annotation labor and time costs. Current semi-supervised methods typically focus on point cloud spatial distribution or consider short-term temporal representations, e.g., only two adjacent frames, often overlooking the rich long-term temporal properties inherent in autonomous driving scenarios. In driving experience, we observe that nearby objects, such as roads and vehicles, remain stable while driving, whereas distant objects exhibit greater variability in category and shape. This natural phenomenon is also captured by LiDAR, which reflects lower temporal sensitivity for nearby objects and higher sensitivity for distant ones. To leverage these characteristics, we propose HiLoTs, which learns high-temporal sensitivity and low-temporal sensitivity representations from continuous LiDAR frames. These representations are further enhanced and fused using a cross-attention mechanism. Additionally, we employ a teacher-student framework to align the representations learned by the labeled and unlabeled branches, effectively utilizing the large amounts of unlabeled data. Experimental results on the SemanticKITTI and nuScenes datasets demonstrate that our proposed HiLoTs outperforms state-of-the-art semi-supervised methods, and achieves performance close to LiDAR+Camera multimodal approaches. Code is available on https://github.com/rdlin118/HiLoTs
中文: 提出的HiLoTs方法通过交叉注意力和师生框架学习并融合激光雷达数据中的高低时间敏感性表征,利用长期时序特性,在自动驾驶数据集上超越了当前最先进的半监督方法。
English: The proposed HiLoTs method leverages long-term temporal properties in LiDAR data by learning and fusing high- and low-temporal sensitivity representations through cross-attention and a teacher-student framework, outperforming state-of-the-art semi-supervised approaches on autonomous driving datasets.

Authors:Yiming Zhao, Yu Zeng, Yukun Qi, YaoYang Liu, Lin Chen, Zehui Chen, Xikun Bao, Jie Zhao, Feng Zhao
Title: V2P-Bench: Evaluating Video-Language Understanding with Visual Prompts for Better Human-Model Interaction
Abstract:
Large Vision-Language Models (LVLMs) have made significant progress in the field of video understanding recently. However, current benchmarks uniformly lean on text prompts for evaluation, which often necessitate complex referential language and fail to provide precise spatial and temporal references. This limitation diminishes the experience and efficiency of human-model interaction. To address this limitation, we propose the Video Visual Prompt Benchmark(V2P-Bench), a comprehensive benchmark specifically designed to evaluate LVLMs' video understanding capabilities in multimodal human-model interaction scenarios. V2P-Bench includes 980 unique videos and 1,172 QA pairs, covering 5 main tasks and 12 dimensions, facilitating instance-level fine-grained understanding aligned with human cognition. Benchmarking results reveal that even the most powerful models perform poorly on V2P-Bench (65.4% for GPT-4o and 67.9% for Gemini-1.5-Pro), significantly lower than the human experts' 88.3%, highlighting the current shortcomings of LVLMs in understanding video visual prompts. We hope V2P-Bench will serve as a foundation for advancing multimodal human-model interaction and video understanding evaluation. Project page: https://github.com/gaotiexinqu/V2P-Bench.
中文摘要:V2P-Bench基准通过多模态提示评估大视觉语言模型的视频理解能力,弥补了纯文本评估的不足,结果显示现有模型与人类专家表现存在显著差距。
English Summary: The V2P-Bench is introduced to address the limitations of text-only evaluations by assessing Large Vision-Language Models' video understanding through multimodal prompts, revealing significant performance gaps compared to human experts.

Authors:Jie Zhang, Zhongqi Wang, Shiguang Shan, Xilin Chen
Title: Trigger without Trace: Towards Stealthy Backdoor Attack on Text-to-Image Diffusion Models
Abstract:
Backdoor attacks targeting text-to-image diffusion models have advanced rapidly. However, current backdoor samples often exhibit two key abnormalities compared to benign samples: 1) Semantic Consistency, where backdoor prompts tend to generate images with similar semantic content even with significant textual variations to the prompts; 2) Attention Consistency, where the trigger induces consistent structural responses in the cross-attention maps. These consistencies leave detectable traces for defenders, making backdoors easier to identify. In this paper, toward stealthy backdoor samples, we propose Trigger without Trace (TwT) by explicitly mitigating these consistencies. Specifically, our approach leverages syntactic structures as backdoor triggers to amplify the sensitivity to textual variations, effectively breaking down the semantic consistency. Besides, a regularization method based on Kernel Maximum Mean Discrepancy (KMMD) is proposed to align the distribution of cross-attention responses between backdoor and benign samples, thereby disrupting attention consistency. Extensive experiments demonstrate that our method achieves a 97.5% attack success rate while exhibiting stronger resistance to defenses. It achieves an average of over 98% backdoor samples bypassing three state-of-the-art detection mechanisms, revealing the vulnerabilities of current backdoor defense methods. The code is available at https://github.com/Robin-WZQ/TwT.
中文: 本文提出无痕触发器(TwT)方法,通过破坏语义一致性和注意力一致性,在保持高攻击成功率的同时实现更隐蔽的文本到图像扩散模型后门攻击,有效规避现有防御机制。
English: This paper introduces Trigger without Trace (TwT), a stealthy backdoor attack method for text-to-image diffusion models that disrupts semantic and attention consistencies to evade detection while maintaining high attack success rates.

Authors:Yu Wang, Junxian Mu, Hongzhi Huang, Qilong Wang, Pengfei Zhu, Qinghua Hu
Title: BackMix: Regularizing Open Set Recognition by Removing Underlying Fore-Background Priors
Abstract:
Open set recognition (OSR) requires models to classify known samples while detecting unknown samples for real-world applications. Existing studies show impressive progress using unknown samples from auxiliary datasets to regularize OSR models, but they have proved to be sensitive to selecting such known outliers. In this paper, we discuss the aforementioned problem from a new perspective: Can we regularize OSR models without elaborately selecting auxiliary known outliers? We first empirically and theoretically explore the role of foregrounds and backgrounds in open set recognition and disclose that: 1) backgrounds that correlate with foregrounds would mislead the model and cause failures when encounters 'partially' known images; 2) Backgrounds unrelated to foregrounds can serve as auxiliary known outliers and provide regularization via global average pooling. Based on the above insights, we propose a new method, Background Mix (BackMix), that mixes the foreground of an image with different backgrounds to remove the underlying fore-background priors. Specifically, BackMix first estimates the foreground with class activation maps (CAMs), then randomly replaces image patches with backgrounds from other images to obtain mixed images for training. With backgrounds de-correlated from foregrounds, the open set recognition performance is significantly improved. The proposed method is quite simple to implement, requires no extra operation for inferences, and can be seamlessly integrated into almost all of the existing frameworks. The code is released on https://github.com/Vanixxz/BackMix.
中文总结:本文提出BackMix方法,通过将图像前景与随机背景混合来消除前景与背景之间的关联性,从而无需精心选择辅助异常样本即可显著提升开放集识别的性能。
English Summary: This paper introduces BackMix, a method that enhances open set recognition by mixing image foregrounds with random backgrounds to eliminate fore-background correlations, improving model performance without needing carefully selected auxiliary outliers.

Authors:Heng Gao, Zhuolin He, Shoumeng Qiu, Xiangyang Xue, Jian Pu
Title: Multi-modality Anomaly Segmentation on the Road
Abstract:
Semantic segmentation allows autonomous driving cars to understand the surroundings of the vehicle comprehensively. However, it is also crucial for the model to detect obstacles that may jeopardize the safety of autonomous driving systems. Based on our experiments, we find that current uni-modal anomaly segmentation frameworks tend to produce high anomaly scores for non-anomalous regions in images. Motivated by this empirical finding, we develop a multi-modal uncertainty-based anomaly segmentation framework, named MMRAS+, for autonomous driving systems. MMRAS+ effectively reduces the high anomaly outputs of non-anomalous classes by introducing text-modal using the CLIP text encoder. Indeed, MMRAS+ is the first multi-modal anomaly segmentation solution for autonomous driving. Moreover, we develop an ensemble module to further boost the anomaly segmentation performance. Experiments on RoadAnomaly, SMIYC, and Fishyscapes validation datasets demonstrate the superior performance of our method. The code is available in https://github.com/HengGao12/MMRAS_plus.
中文:MMRAS+是一种创新的多模态异常分割框架,通过引入CLIP文本模态有效降低非异常区域的高异常输出,在多个验证数据集上展现出卓越的自动驾驶障碍物检测性能。
English: MMRAS+ is a novel multi-modal anomaly segmentation framework that enhances autonomous driving safety by integrating text modality via CLIP to accurately identify obstacles while reducing false positives in non-anomalous regions.

Authors:Haolin Qin, Tingfa Xu, Tianhao Li, Zhenxiang Chen, Tao Feng, Jianan Li
Title: MUST: The First Dataset and Unified Framework for Multispectral UAV Single Object Tracking
Abstract:
UAV tracking faces significant challenges in real-world scenarios, such as small-size targets and occlusions, which limit the performance of RGB-based trackers. Multispectral images (MSI), which capture additional spectral information, offer a promising solution to these challenges. However, progress in this field has been hindered by the lack of relevant datasets. To address this gap, we introduce the first large-scale Multispectral UAV Single Object Tracking dataset (MUST), which includes 250 video sequences spanning diverse environments and challenges, providing a comprehensive data foundation for multispectral UAV tracking. We also propose a novel tracking framework, UNTrack, which encodes unified spectral, spatial, and temporal features from spectrum prompts, initial templates, and sequential searches. UNTrack employs an asymmetric transformer with a spectral background eliminate mechanism for optimal relationship modeling and an encoder that continuously updates the spectrum prompt to refine tracking, improving both accuracy and efficiency. Extensive experiments show that our proposed UNTrack outperforms state-of-the-art UAV trackers. We believe our dataset and framework will drive future research in this area. The dataset is available on https://github.com/q2479036243/MUST-Multispectral-UAV-Single-Object-Tracking.
中文摘要:本文提出了首个大规模多光谱无人机单目标跟踪数据集MUST,并开发了UNTrack追踪框架,通过融合光谱、空间和时间特征,有效提升了复杂场景下的追踪性能。
English Summary: This paper introduces MUST, the first large-scale multispectral UAV tracking dataset, and proposes UNTrack, a novel framework that integrates spectral, spatial, and temporal features to significantly enhance tracking performance in challenging scenarios.

Authors:Jinyuan Liu, Bowei Zhang, Qingyun Mei, Xingyuan Li, Yang Zou, Zhiying Jiang, Long Ma, Risheng Liu, Xin Fan
Title: DCEvo: Discriminative Cross-Dimensional Evolutionary Learning for Infrared and Visible Image Fusion
Abstract:
Infrared and visible image fusion integrates information from distinct spectral bands to enhance image quality by leveraging the strengths and mitigating the limitations of each modality. Existing approaches typically treat image fusion and subsequent high-level tasks as separate processes, resulting in fused images that offer only marginal gains in task performance and fail to provide constructive feedback for optimizing the fusion process. To overcome these limitations, we propose a Discriminative Cross-Dimension Evolutionary Learning Framework, termed DCEvo, which simultaneously enhances visual quality and perception accuracy. Leveraging the robust search capabilities of Evolutionary Learning, our approach formulates the optimization of dual tasks as a multi-objective problem by employing an Evolutionary Algorithm (EA) to dynamically balance loss function parameters. Inspired by visual neuroscience, we integrate a Discriminative Enhancer (DE) within both the encoder and decoder, enabling the effective learning of complementary features from different modalities. Additionally, our Cross-Dimensional Embedding (CDE) block facilitates mutual enhancement between high-dimensional task features and low-dimensional fusion features, ensuring a cohesive and efficient feature integration process. Experimental results on three benchmarks demonstrate that our method significantly outperforms state-of-the-art approaches, achieving an average improvement of 9.32% in visual quality while also enhancing subsequent high-level tasks. The code is available at https://github.com/Beate-Suy-Zhang/DCEvo.
中文摘要:提出的DCEvo框架通过进化学习和跨维度特征增强,将红外与可见光图像融合与高级任务相结合,在视觉质量和任务性能上显著优于现有方法。
English Summary: The proposed DCEvo framework integrates infrared and visible image fusion with high-level tasks through evolutionary learning and cross-dimensional feature enhancement, achieving superior visual quality and task performance improvements over existing methods.

Authors:Oucheng Huang, Yuhang Ma, Zeng Zhao, Mingrui Wu, Jiayi Ji, Rongsheng Zhang, Zhipeng Hu, Xiaoshuai Sun, Rongrong Ji
Title: ComfyGPT: A Self-Optimizing Multi-Agent System for Comprehensive ComfyUI Workflow Generation
Abstract:
ComfyUI is a popular workflow-based interface that allows users to customize image generation tasks through an intuitive node-based system. However, the complexity of managing node connections and diverse modules can be challenging for users. In this paper, we introduce ComfyGPT, a self-optimizing multi-agent system designed to generate ComfyUI workflows based on task descriptions automatically. The key innovations of ComfyGPT include: (1) consisting of four specialized agents to build a multi-agent workflow generation system: ReformatAgent, FlowAgent, RefineAgent, and ExecuteAgent; (2) focusing on generating precise node connections instead of entire workflows, improving generation accuracy; and (3) enhancing workflow generation through reinforcement learning. Moreover, we introduce FlowDataset, a large-scale dataset containing 13,571 workflow-description pairs, and FlowBench, a comprehensive benchmark for evaluating workflow generation systems. Additionally, we propose four novel evaluation metrics: Format Validation (FV), Pass Accuracy (PA), Pass Instruct Alignment (PIA), and Pass Node Diversity (PND). Experimental results demonstrate that ComfyGPT significantly outperforms existing LLM-based methods in workflow generation, making it a significant step forward in this field. Code is avaliable at https://github.com/comfygpt/comfygpt.
Chinese: ComfyGPT是一种自优化的多智能体系统,能根据任务描述自动生成ComfyUI工作流,通过专业智能体分工和强化学习技术,在生成准确性和效率上显著超越现有方法。
English: ComfyGPT is a self-optimizing multi-agent system that automatically generates ComfyUI workflows from task descriptions, utilizing specialized agents and reinforcement learning to significantly outperform existing methods in accuracy and efficiency.

Authors:Peijin Guo, Minghui Li, Hewen Pan, Ruixiang Huang, Lulu Xue, Shengqing Hu, Zikang Guo, Wei Wan, Shengshan Hu
Title: Multi-Modality Representation Learning for Antibody-Antigen Interactions Prediction
Abstract:
While deep learning models play a crucial role in predicting antibody-antigen interactions (AAI), the scarcity of publicly available sequence-structure pairings constrains their generalization. Current AAI methods often focus on residue-level static details, overlooking fine-grained structural representations of antibodies and their inter-antibody similarities. To tackle this challenge, we introduce a multi-modality representation approach that integates 3D structural and 1D sequence data to unravel intricate intra-antibody hierarchical relationships. By harnessing these representations, we present MuLAAIP, an AAI prediction framework that utilizes graph attention networks to illuminate graph-level structural features and normalized adaptive graph convolution networks to capture inter-antibody sequence associations. Furthermore, we have curated an AAI benchmark dataset comprising both structural and sequence information along with interaction labels. Through extensive experiments on this benchmark, our results demonstrate that MuLAAIP outperforms current state-of-the-art methods in terms of predictive performance. The implementation code and dataset are publicly available at https://github.com/trashTian/MuLAAIP for reproducibility.
Chinese: 针对现有抗体-抗原相互作用预测方法的不足,我们开发了MuLAAIP多模态框架,通过图网络整合三维结构和一维序列数据,在我们新构建的基准数据集上展现出卓越的预测性能。
English: To address the limitations of existing antibody-antigen interaction prediction methods, we developed MuLAAIP, a multi-modality framework that integrates 3D structural and 1D sequence data using graph networks, which demonstrates superior predictive performance on our newly curated benchmark dataset.

Authors:Jaeyeon Lee, Guantong Qi, Matthew Brady Neeley, Zhandong Liu, Hyun-Hwan Jeong
Title: ConSol: Sequential Probability Ratio Testing to Find Consistent LLM Reasoning Paths Efficiently
Abstract:
Recent advancements in large language models (LLMs) integrating explicit reasoning, such as OpenAI's o3-mini, DeepSeek-R1, and QWQ-32B, enable smaller models to solve complex tasks by generating intermediate reasoning steps prior to providing answers. However, this approach significantly increases computational costs, both monetarily and environmentally. The widely-used self-consistency method further exacerbates these costs by aggregating multiple reasoning paths to improve accuracy, often requiring between 40 to 64 samples per task. Although aggregation effectively reduces variance and bias, additional sampling can lead to diminishing returns when early samples yield consistent results. To address inefficiencies, we propose leveraging Sequential Probability Ratio Testing (SPRT) to dynamically terminate sampling once sufficient consistency is achieved. We calibrate SPRT parameters specifically for LLM applications, accounting for sensitivity to detect the mode of the distribution. Our experiments demonstrate that incorporating SPRT significantly enhances token efficiency, achieving comparable accuracy to self-consistency methods but at a substantially reduced computational cost. To promote transparency and facilitate reproducibility, we have made the source code and datasets used in our experiments publicly available at our GitHub repository: https://github.com/LiuzLab/consol, or available as a PyPI package: pip install consol. We hope that this resource will support further research and encourage the development of new methods building upon our work.
中文摘要:本研究提出采用序贯概率比检验(SPRT)动态终止大语言模型采样,在保持与传统自洽方法相当准确度的同时,显著降低了计算成本。
English Summary: This study introduces Sequential Probability Ratio Testing (SPRT) to dynamically halt sampling in large language models once sufficient consistency is reached, significantly reducing computational costs while maintaining accuracy comparable to traditional self-consistency methods.

Authors:Suet-Ying Lam, Qingcheng Zeng, Jingyi Wu, Rob Voigt
Title: Leveraging Human Production-Interpretation Asymmetries to Test LLM Cognitive Plausibility
Abstract:
Whether large language models (LLMs) process language similarly to humans has been the subject of much theoretical and practical debate. We examine this question through the lens of the production-interpretation distinction found in human sentence processing and evaluate the extent to which instruction-tuned LLMs replicate this distinction. Using an empirically documented asymmetry between pronoun production and interpretation in humans for implicit causality verbs as a testbed, we find that some LLMs do quantitatively and qualitatively reflect human-like asymmetries between production and interpretation. We demonstrate that whether this behavior holds depends upon both model size-with larger models more likely to reflect human-like patterns and the choice of meta-linguistic prompts used to elicit the behavior. Our codes and results are available at https://github.com/LingMechLab/Production-Interpretation_Asymmetries_ACL2025.
中文:研究发现,某些大型语言模型在处理隐含因果动词的代词时,表现出与人类相似的语言生成与理解不对称性,这种特性受模型规模和提示方式影响。
English: Some large language models exhibit human-like asymmetries between language production and interpretation, influenced by model size and specific prompts, as demonstrated through pronoun processing with implicit causality verbs.

Authors:Moein Heidari, Afshin Bozorgpour, AmirHossein Zarif-Fakharnia, Dorit Merhof, Ilker Hacihaliloglu
Title: Echo-E$^3$Net: Efficient Endo-Epi Spatio-Temporal Network for Ejection Fraction Estimation
Abstract:
Left ventricular ejection fraction (LVEF) is a critical metric for assessing cardiac function, widely used in diagnosing heart failure and guiding clinical decisions. Despite its importance, conventional LVEF estimation remains time-consuming and operator-dependent. Recent deep learning advancements have enhanced automation, yet many existing models are computationally demanding, hindering their feasibility for real-time clinical applications. Additionally, the interplay between spatial and temporal features is crucial for accurate estimation but is often overlooked. In this work, we propose Echo-E$^3$Net, an efficient Endo-Epi spatio-temporal network tailored for LVEF estimation. Our method introduces the Endo-Epi Cardial Border Detector (E$^2$CBD) module, which enhances feature extraction by leveraging spatial and temporal landmark cues. Complementing this, the Endo-Epi Feature Aggregator (E$^2$FA) distills statistical descriptors from backbone feature maps, refining the final EF prediction. These modules, along with a multi-component loss function tailored to align with the clinical definition of EF, collectively enhance spatial-temporal representation learning, ensuring robust and efficient EF estimation. We evaluate Echo-E$^3$Net on the EchoNet-Dynamic dataset, achieving a RMSE of 5.15 and an R$^2$ score of 0.82, setting a new benchmark in efficiency with 6.8 million parameters and only 8.49G Flops. Our model operates without pre-training, data augmentation, or ensemble methods, making it well-suited for real-time point-of-care ultrasound (PoCUS) applications. Our Code is publicly available on~\href{https://github.com/moeinheidari7829/Echo-E3Net}{\textcolor{magenta}{GitHub}}.
Chinese: Echo-E$^3$Net是一种高效的深度学习模型,通过整合心脏的空间和时间特征来准确估算左心室射血分数,以低计算成本实现高性能,适用于实时临床诊断。
English: Echo-E$^3$Net is an efficient deep learning model that accurately estimates left ventricular ejection fraction by integrating spatial and temporal cardiac features, achieving high performance with minimal computational demands for real-time clinical use.

Authors:Nusrat Munia, Abdullah-Al-Zubaer Imran
Title: DermDiff: Generative Diffusion Model for Mitigating Racial Biases in Dermatology Diagnosis
Abstract:
Skin diseases, such as skin cancer, are a significant public health issue, and early diagnosis is crucial for effective treatment. Artificial intelligence (AI) algorithms have the potential to assist in triaging benign vs malignant skin lesions and improve diagnostic accuracy. However, existing AI models for skin disease diagnosis are often developed and tested on limited and biased datasets, leading to poor performance on certain skin tones. To address this problem, we propose a novel generative model, named DermDiff, that can generate diverse and representative dermoscopic image data for skin disease diagnosis. Leveraging text prompting and multimodal image-text learning, DermDiff improves the representation of underrepresented groups (patients, diseases, etc.) in highly imbalanced datasets. Our extensive experimentation showcases the effectiveness of DermDiff in terms of high fidelity and diversity. Furthermore, downstream evaluation suggests the potential of DermDiff in mitigating racial biases for dermatology diagnosis. Our code is available at https://github.com/Munia03/DermDiff
中文: 针对皮肤疾病AI诊断中因数据有限导致的偏见问题,DermDiff模型通过文本提示和多模态学习生成多样化皮肤镜图像,有效提升数据代表性并减少种族偏差,同时保证图像的高保真度和多样性。
English: To address biases in AI-based skin disease diagnosis from limited datasets, DermDiff, a novel generative model, creates diverse dermoscopic images through text prompting and multimodal learning, enhancing representation and reducing racial bias while maintaining high fidelity and diversity.

Authors:Ayberk Acar, Jumanh Atoum, Peter S. Connor, Clifford Pierre, Carisa N. Lynch, Nicholas L. Kavoussi, Jie Ying Wu
Title: NAVIUS: Navigated Augmented Reality Visualization for Ureteroscopic Surgery
Abstract:
Ureteroscopy is the standard of care for diagnosing and treating kidney stones and tumors. However, current ureteroscopes have a limited field of view, requiring significant experience to adequately navigate the renal collecting system. This is evidenced by the fact that inexperienced surgeons have higher rates of missed stones. One-third of patients with residual stones require re-operation within 20 months. In order to aid surgeons to fully explore the kidney, this study presents the Navigated Augmented Reality Visualization for Ureteroscopic Surgery (NAVIUS) system. NAVIUS assists surgeons by providing 3D maps of the target anatomy, real-time scope positions, and preoperative imaging overlays. To enable real-time navigation and visualization, we integrate an electromagnetic tracker-based navigation pipeline with augmented reality visualizations. NAVIUS connects to 3D Slicer and Unity with OpenIGTLink, and uses HoloLens 2 as a holographic interface. We evaluate NAVIUS through a user study where surgeons conducted ureteroscopy on kidney phantoms with and without visual guidance. With our proposed system, we observed that surgeons explored more areas within the collecting system with NAVIUS (average 23.73% increase), and NASA-TLX metrics were improved (up to 27.27%). NAVIUS acts as a step towards better surgical outcomes and surgeons' experience. The codebase for the system will be available at: https://github.com/vu-maple-lab/NAVIUS.
Chinese: NAVIUS系统通过提供3D解剖图谱和实时导航,显著提升了输尿管镜手术的探查范围并减轻了医生的操作负担。
English: The NAVIUS system enhances ureteroscopic surgery by providing 3D anatomical maps and real-time navigation, significantly improving surgical exploration and reducing cognitive load for surgeons.

Authors:Louis Owen, Abhay Kumar, Nilabhra Roy Chowdhury, Fabian Güra
Title: Variance Control via Weight Rescaling in LLM Pre-training
Abstract:
The outcome of Large Language Model (LLM) pre-training strongly depends on weight initialization and variance control strategies. Although the importance of initial variance control has been well documented in neural networks in general, the literature on initialization and management of its growth during LLM pre-training, specifically, is somewhat sparse. In this paper, we introduce the Layer Index Rescaling (LIR) weight initialization scheme, and the Target Variance Rescaling (TVR) variance control strategy. Experiments on a 1B parameter LLaMA model demonstrate that better variance management using these techniques yields substantial improvements in downstream task performance (up to 4.6% on common pre-training benchmarks) and reduces extreme activation values, thus mitigating challenges associated with quantization and low-precision training. Our code is available at: https://github.com/bluorion-com/weight_rescaling.
中文: 层索引重缩放(LIR)和目标方差重缩放(TVR)技术通过优化权重初始化和方差控制,显著提升大语言模型预训练效果,改善下游任务表现并缓解激活异常问题。
English: The Layer Index Rescaling (LIR) and Target Variance Rescaling (TVR) techniques improve LLM pre-training by optimizing weight initialization and variance control, leading to enhanced downstream task performance and reduced activation issues.

Authors:Tianwen Zhou, Jing Wang, Songtao Wu, Kuanhong Xu
Title: ProDehaze: Prompting Diffusion Models Toward Faithful Image Dehazing
Abstract:
Recent approaches using large-scale pretrained diffusion models for image dehazing improve perceptual quality but often suffer from hallucination issues, producing unfaithful dehazed image to the original one. To mitigate this, we propose ProDehaze, a framework that employs internal image priors to direct external priors encoded in pretrained models. We introduce two types of \textit{selective} internal priors that prompt the model to concentrate on critical image areas: a Structure-Prompted Restorer in the latent space that emphasizes structure-rich regions, and a Haze-Aware Self-Correcting Refiner in the decoding process to align distributions between clearer input regions and the output. Extensive experiments on real-world datasets demonstrate that ProDehaze achieves high-fidelity results in image dehazing, particularly in reducing color shifts. Our code is at https://github.com/TianwenZhou/ProDehaze.
中文: ProDehaze是一种创新框架,通过选择性内部图像先验引导预训练扩散模型,在图像去雾中有效减少幻觉和色彩偏移,同时保持高保真度。
English: ProDehaze is a novel framework that leverages selective internal image priors to guide pretrained diffusion models, effectively reducing hallucination and color shifts in image dehazing while maintaining high fidelity.

Authors:Ran Liu, Fengyu Zhang, Cong Yu, Longjiang Yang, Zhuofan Wen, Siyuan Zhang, Hailiang Yao, Shun Chen, Zheng Lian, Bin Liu
Title: Feature-Based Dual Visual Feature Extraction Model for Compound Multimodal Emotion Recognition
Abstract:
This article presents our results for the eighth Affective Behavior Analysis in-the-wild (ABAW) competition.Multimodal emotion recognition (ER) has important applications in affective computing and human-computer interaction. However, in the real world, compound emotion recognition faces greater issues of uncertainty and modal conflicts. For the Compound Expression (CE) Recognition Challenge,this paper proposes a multimodal emotion recognition method that fuses the features of Vision Transformer (ViT) and Residual Network (ResNet). We conducted experiments on the C-EXPR-DB and MELD datasets. The results show that in scenarios with complex visual and audio cues (such as C-EXPR-DB), the model that fuses the features of ViT and ResNet exhibits superior performance.Our code are avalible on https://github.com/MyGitHub-ax/8th_ABAW
中文: 本文提出融合视觉变换器和残差网络特征的多模态情绪识别方法,在C-EXPR-DB等复杂场景数据集上展现出优越性能。
English: This paper introduces a multimodal emotion recognition method combining Vision Transformer and ResNet features, demonstrating superior performance on complex datasets like C-EXPR-DB in the ABAW competition.

Authors:Zhuoshi Pan, Yu Li, Honglin Lin, Qizhi Pei, Zinan Tang, Wei Wu, Chenlin Ming, H. Vicky Zhao, Conghui He, Lijun Wu
Title: LEMMA: Learning from Errors for MatheMatical Advancement in LLMs
Abstract:
Large language models (LLMs) have demonstrated remarkable reasoning capability in solving mathematical problems. However, existing approaches primarily focus on improving the quality of correct training data, e.g., distilling high-quality correct solutions from advanced models, neglecting the value contained in error data, potentially hindering the model's reflective ability. Though some studies attempt to leverage error data, they often involve complex mechanisms, such as Monte Carlo Tree Search (MCTS) to explore error nodes. In this work, we propose to enhance LLMs' reasoning ability by Learning from Errors for Mathematical Advancement (LEMMA). LEMMA constructs data consisting of an incorrect solution with an erroneous step and a reflection connection to a correct solution for fine-tuning. Specifically, we systematically analyze the model-generated error types and introduce an error-type grounded mistake augmentation method to collect diverse and representative errors. Correct solutions are either from fixing the errors or generating a fresh start. Through a model-aware smooth reflection connection, the erroneous solution is transferred to the correct one. By fine-tuning on the constructed dataset, the model is able to self-correct errors autonomously within the generation process without relying on external critique models. Experimental results demonstrate that LEMMA achieves significant performance improvements over other strong baselines.
Chinese: LEMMA方法通过利用包含错误步骤及其反思后修正为正确答案的数据进行微调,提升了大型语言模型的数学推理能力,使其能在生成过程中自主纠错,无需依赖外部评判模型,并实现了显著的性能提升。
English: The LEMMA approach enhances large language models' mathematical reasoning by fine-tuning them on data that includes incorrect solutions with errors and reflections leading to correct answers, enabling self-correction without external models and achieving superior performance.

Authors:Jiaheng Liu, Dawei Zhu, Zhiqi Bai, Yancheng He, Huanxuan Liao, Haoran Que, Zekun Wang, Chenchen Zhang, Ge Zhang, Jiebin Zhang, Yuanxing Zhang, Zhuo Chen, Hangyu Guo, Shilong Li, Ziqiang Liu, Yong Shan, Yifan Song, Jiayi Tian, Wenhao Wu, Zhejian Zhou, Ruijie Zhu, Junlan Feng, Yang Gao, Shizhu He, Zhoujun Li, Tianyu Liu, Fanyu Meng, Wenbo Su, Yingshui Tan, Zili Wang, Jian Yang, Wei Ye, Bo Zheng, Wangchunshu Zhou, Wenhao Huang, Sujian Li, Zhaoxiang Zhang
Title: A Comprehensive Survey on Long Context Language Modeling
Abstract:
Efficient processing of long contexts has been a persistent pursuit in Natural Language Processing. With the growing number of long documents, dialogues, and other textual data, it is important to develop Long Context Language Models (LCLMs) that can process and analyze extensive inputs in an effective and efficient way. In this paper, we present a comprehensive survey on recent advances in long-context modeling for large language models. Our survey is structured around three key aspects: how to obtain effective and efficient LCLMs, how to train and deploy LCLMs efficiently, and how to evaluate and analyze LCLMs comprehensively. For the first aspect, we discuss data strategies, architectural designs, and workflow approaches oriented with long context processing. For the second aspect, we provide a detailed examination of the infrastructure required for LCLM training and inference. For the third aspect, we present evaluation paradigms for long-context comprehension and long-form generation, as well as behavioral analysis and mechanism interpretability of LCLMs. Beyond these three key aspects, we thoroughly explore the diverse application scenarios where existing LCLMs have been deployed and outline promising future development directions. This survey provides an up-to-date review of the literature on long-context LLMs, which we wish to serve as a valuable resource for both researchers and engineers. An associated GitHub repository collecting the latest papers and repos is available at: \href{https://github.com/LCLM-Horizon/A-Comprehensive-Survey-For-Long-Context-Language-Modeling}{\color[RGB]{175,36,67}{LCLM-Horizon}}.
中文: 本文对长上下文语言模型的最新进展进行了全面综述,涵盖其开发、训练、部署和评估,并探讨了应用场景与未来发展方向。
English: This paper presents a comprehensive survey on recent advances in long-context language models, covering their development, training, deployment, and evaluation while exploring applications and future directions.

Authors:Haochen Zhang, Nader Zantout, Pujith Kachana, Ji Zhang, Wenshan Wang
Title: IRef-VLA: A Benchmark for Interactive Referential Grounding with Imperfect Language in 3D Scenes
Abstract:
With the recent rise of large language models, vision-language models, and other general foundation models, there is growing potential for multimodal, multi-task robotics that can operate in diverse environments given natural language input. One such application is indoor navigation using natural language instructions. However, despite recent progress, this problem remains challenging due to the 3D spatial reasoning and semantic understanding required. Additionally, the language used may be imperfect or misaligned with the scene, further complicating the task. To address this challenge, we curate a benchmark dataset, IRef-VLA, for Interactive Referential Vision and Language-guided Action in 3D Scenes with imperfect references. IRef-VLA is the largest real-world dataset for the referential grounding task, consisting of over 11.5K scanned 3D rooms from existing datasets, 7.6M heuristically generated semantic relations, and 4.7M referential statements. Our dataset also contains semantic object and room annotations, scene graphs, navigable free space annotations, and is augmented with statements where the language has imperfections or ambiguities. We verify the generalizability of our dataset by evaluating with state-of-the-art models to obtain a performance baseline and also develop a graph-search baseline to demonstrate the performance bound and generation of alternatives using scene-graph knowledge. With this benchmark, we aim to provide a resource for 3D scene understanding that aids the development of robust, interactive navigation systems. The dataset and all source code is publicly released at https://github.com/HaochenZ11/IRef-VLA.
中文: IRef-VLA基准数据集通过提供包含不完善参考信息的海量3D场景数据,旨在解决自然语言室内导航的挑战,推动稳健交互式导航系统的发展。
English: The IRef-VLA benchmark dataset is introduced to address challenges in indoor navigation using natural language by providing extensive 3D scene data with imperfect references, aiming to advance robust, interactive navigation systems.

Authors:Yansi Li, Jiahao Xu, Tian Liang, Xingyu Chen, Zhiwei He, Qiuzhi Liu, Rui Wang, Zhuosheng Zhang, Zhaopeng Tu, Haitao Mi, Dong Yu
Title: Dancing with Critiques: Enhancing LLM Reasoning with Stepwise Natural Language Self-Critique
Abstract:
Enhancing the reasoning capabilities of large language models (LLMs), particularly for complex tasks requiring multi-step logical deductions, remains a significant challenge. Traditional inference time scaling methods utilize scalar reward signals from process reward models to evaluate candidate reasoning steps, but these scalar rewards lack the nuanced qualitative information essential for understanding and justifying each step. In this paper, we propose a novel inference-time scaling approach -- stepwise natural language self-critique (PANEL), which employs self-generated natural language critiques as feedback to guide the step-level search process. By generating rich, human-readable critiques for each candidate reasoning step, PANEL retains essential qualitative information, facilitating better-informed decision-making during inference. This approach bypasses the need for task-specific verifiers and the associated training overhead, making it broadly applicable across diverse tasks. Experimental results on challenging reasoning benchmarks, including AIME and GPQA, demonstrate that PANEL significantly enhances reasoning performance, outperforming traditional scalar reward-based methods. Our code is available at https://github.com/puddingyeah/PANEL to support and encourage future research in this promising field.
中文摘要:本文提出PANEL方法,通过自我生成的自然语言评语替代传统标量奖励来指导推理步骤,无需特定任务验证器即可显著提升大语言模型在复杂推理任务中的表现。
English Summary: This paper introduces PANEL, a novel inference-time scaling method that uses self-generated natural language critiques instead of scalar rewards to guide reasoning steps, significantly improving LLMs' performance on complex reasoning tasks without requiring task-specific verifiers.

Authors:Alex Reneau, Jerry Yao-Chieh Hu, Zhongfang Zhuang, Ting-Chun Liu, Xiang He, Judah Goldfeder, Nadav Timor, Allen G Roush, Ravid Shwartz-Ziv
Title: NdLinear: Don't Flatten! Building Superior Neural Architectures by Preserving N-D Structure
Abstract:
Many high-impact machine learning tasks involve multi-dimensional data such as images, volumetric medical scans, and multivariate time-series. Yet, most neural architectures flatten these inputs, discarding critical cross-dimension information. We introduce $\textbf{NdLinear}$, a novel linear transformation that circumvents this destructive flattening by operating directly on tensors. NdLinear applies transformations separately along each data dimension, thereby preserving the native data structure. Extensive experiments demonstrate NdLinear's capacity to significantly enhance representational power, achieve dramatic parameter reductions (often by orders of magnitude), and maintain a favorable computational profile. For instance, when applied to Large Language Model finetuning, our $\textbf{NdLinear-LoRA}$ delivers comparable or improved accuracy on reasoning tasks using up to $9\times$ fewer trainable parameters than standard LoRA. These broad advantages of NdLinear are consistently validated across diverse neural architectures (CNNs, RNNs, Transformers, MLPs) and data domains, including vision, language, time-series, and tabular tasks. As a versatile, drop-in replacement for standard linear layers, NdLinear processes data in its original N-dimensional form, offering a foundational component for developing more efficient and powerful next-generation neural architectures.
中文: NdLinear是一种基于张量的线性层,无需对输入进行扁平化处理,能在保持数据结构的同时大幅减少参数并提升计算效率,适用于多种深度学习任务。
English: NdLinear is a tensor-based linear layer that eliminates the need for input flattening, achieving significant parameter reduction and computational efficiency while preserving data structure across various deep learning tasks.

Authors:Alex Reneau, Jerry Yao-Chieh Hu, Zhongfang Zhuang, Ting-Chun Liu, Xiang He, Judah Goldfeder, Nadav Timor, Allen G Roush, Ravid Shwartz-Ziv
Title: NdLinear: Preserving Multi-Dimensional Structure for Parameter-Efficient Neural Networks
Abstract:
In deep learning, processing multidimensional inputs (e.g., images, medical scans, and time series) is an important task that often requires flattening the inputs. We introduce $\mathit{NdLinear}$, a drop-in replacement for linear layers that operates directly on tensors, requiring no flattening. By applying transformations separately along each dimension, NdLinear preserves native data structure while achieving dramatic parameter reductions, often by orders of magnitude, with minimal memory overhead. We prove NdLinear maintains expressivity through structured Tucker decomposition while preserving VC-dimension scaling. Extensive experiments demonstrate NdLinear's capacity to achieve significant parameter reductions with substantial wall-clock efficiency gains and minimal memory overhead. For instance, our $\mathit{NdLinear-LoRA}$ matches or exceeds standard LoRA on language reasoning tasks using up to $9\times$ fewer parameters. Experiments across CNNs, RNNs, Transformers, and MLPs on vision, language, time-series, and tabular tasks consistently demonstrate NdLinear's efficiency gains. While excelling at axis-separable tasks, NdLinear has limitations with entangled spatial interactions. By processing data in its original N-dimensional form, NdLinear provides a theoretically grounded, practical component for building more efficient neural architectures.
中文: NdLinear是一种基于张量的线性层,无需对输入进行扁平化处理,能在保持数据结构的同时大幅减少参数并提升计算效率,适用于多种深度学习任务。
English: NdLinear is a tensor-based linear layer that eliminates the need for input flattening, achieving significant parameter reduction and computational efficiency while preserving data structure across various deep learning tasks.

Authors:Yihe Deng, Hritik Bansal, Fan Yin, Nanyun Peng, Wei Wang, Kai-Wei Chang
Title: OpenVLThinker: Complex Vision-Language Reasoning via Iterative SFT-RL Cycles
Abstract:
We introduce OpenVLThinker, one of the first open-source large vision-language models (LVLMs) to exhibit sophisticated chain-of-thought reasoning, achieving notable performance gains on challenging visual reasoning tasks. While text-based reasoning models (e.g., Deepseek R1) show promising results in text-only tasks, distilling their reasoning into LVLMs via supervised fine-tuning (SFT) often results in performance degradation due to imprecise visual grounding. Conversely, purely reinforcement learning (RL)-based methods face a large search space, hindering the emergence of reflective behaviors in smaller models (e.g., 7B LVLMs). Surprisingly, alternating between SFT and RL ultimately results in significant performance improvements after a few iterations. Our analysis reveals that the base model rarely exhibits reasoning behaviors initially, but SFT effectively surfaces these latent actions and narrows the RL search space, accelerating the development of reasoning capabilities. Each subsequent RL stage further refines the model's reasoning skills, producing higher-quality SFT data for continued self-improvement. OpenVLThinker-7B consistently advances performance across six benchmarks demanding mathematical and general reasoning, notably improving MathVista by 3.8%, EMMA by 2.4%, and HallusionBench by 1.6%. Beyond demonstrating the synergy between SFT and RL for complex reasoning tasks, our findings provide early evidence towards achieving R1-style reasoning in multimodal contexts. The code, model and data are held at https://github.com/yihedeng9/OpenVLThinker.
中文摘要:OpenVLThinker作为首个开源视觉语言大模型,通过交替使用监督微调与强化学习实现了卓越的链式推理能力,在多项视觉推理基准测试中取得显著性能突破。
English Summary: OpenVLThinker is an open-source vision-language model that achieves superior visual reasoning through alternating supervised fine-tuning and reinforcement learning, significantly advancing performance across multiple benchmarks.

Authors:Kun Chu, Xufeng Zhao, Cornelius Weber, Stefan Wermter
Title: LLM+MAP: Bimanual Robot Task Planning using Large Language Models and Planning Domain Definition Language
Abstract:
Bimanual robotic manipulation provides significant versatility, but also presents an inherent challenge due to the complexity involved in the spatial and temporal coordination between two hands. Existing works predominantly focus on attaining human-level manipulation skills for robotic hands, yet little attention has been paid to task planning on long-horizon timescales. With their outstanding in-context learning and zero-shot generation abilities, Large Language Models (LLMs) have been applied and grounded in diverse robotic embodiments to facilitate task planning. However, LLMs still suffer from errors in long-horizon reasoning and from hallucinations in complex robotic tasks, lacking a guarantee of logical correctness when generating the plan. Previous works, such as LLM+P, extended LLMs with symbolic planners. However, none have been successfully applied to bimanual robots. New challenges inevitably arise in bimanual manipulation, necessitating not only effective task decomposition but also efficient task allocation. To address these challenges, this paper introduces LLM+MAP, a bimanual planning framework that integrates LLM reasoning and multi-agent planning, automating effective and efficient bimanual task planning. We conduct simulated experiments on various long-horizon manipulation tasks of differing complexity. Our method is built using GPT-4o as the backend, and we compare its performance against plans generated directly by LLMs, including GPT-4o, V3 and also recent strong reasoning models o1 and R1. By analyzing metrics such as planning time, success rate, group debits, and planning-step reduction rate, we demonstrate the superior performance of LLM+MAP, while also providing insights into robotic reasoning. Code is available at https://github.com/Kchu/LLM-MAP.
中文: 本文提出LLM+MAP框架,通过结合大语言模型与多智能体规划,实现双手机器人任务的高效分解与分配,在复杂操作任务中展现出优于直接大语言模型规划的性能。
English: This paper introduces LLM+MAP, a framework that integrates large language models with multi-agent planning to automate efficient task decomposition and allocation for bimanual robots, demonstrating superior performance over direct LLM planning in complex manipulation tasks.

Authors:Xianghan Meng, Zhiyuan Huang, Wei He, Xianbiao Qi, Rong Xiao, Chun-Guang Li
Title: Exploring a Principled Framework for Deep Subspace Clustering
Abstract:
Subspace clustering is a classical unsupervised learning task, built on a basic assumption that high-dimensional data can be approximated by a union of subspaces (UoS). Nevertheless, the real-world data are often deviating from the UoS assumption. To address this challenge, state-of-the-art deep subspace clustering algorithms attempt to jointly learn UoS representations and self-expressive coefficients. However, the general framework of the existing algorithms suffers from a catastrophic feature collapse and lacks a theoretical guarantee to learn desired UoS representation. In this paper, we present a Principled fRamewOrk for Deep Subspace Clustering (PRO-DSC), which is designed to learn structured representations and self-expressive coefficients in a unified manner. Specifically, in PRO-DSC, we incorporate an effective regularization on the learned representations into the self-expressive model, prove that the regularized self-expressive model is able to prevent feature space collapse, and demonstrate that the learned optimal representations under certain condition lie on a union of orthogonal subspaces. Moreover, we provide a scalable and efficient approach to implement our PRO-DSC and conduct extensive experiments to verify our theoretical findings and demonstrate the superior performance of our proposed deep subspace clustering approach. The code is available at https://github.com/mengxianghan123/PRO-DSC.
中文:本文提出了PRO-DSC原则性深度子空间聚类框架,通过有效正则化防止特征坍缩,理论上保证学习到的表示位于正交子空间上,实验证明其具有优越性能。
English: This paper introduces PRO-DSC, a principled deep subspace clustering framework that prevents feature collapse through effective regularization and theoretically guarantees learned representations lie on orthogonal subspaces, demonstrating superior performance in experiments.

Authors:Hiromu Taketsugu, Takeru Oba, Takahiro Maeda, Shohei Nobuhara, Norimichi Ukita
Title: Physical Plausibility-aware Trajectory Prediction via Locomotion Embodiment
Abstract:
Humans can predict future human trajectories even from momentary observations by using human pose-related cues. However, previous Human Trajectory Prediction (HTP) methods leverage the pose cues implicitly, resulting in implausible predictions. To address this, we propose Locomotion Embodiment, a framework that explicitly evaluates the physical plausibility of the predicted trajectory by locomotion generation under the laws of physics. While the plausibility of locomotion is learned with an indifferentiable physics simulator, it is replaced by our differentiable Locomotion Value function to train an HTP network in a data-driven manner. In particular, our proposed Embodied Locomotion loss is beneficial for efficiently training a stochastic HTP network using multiple heads. Furthermore, the Locomotion Value filter is proposed to filter out implausible trajectories at inference. Experiments demonstrate that our method enhances even the state-of-the-art HTP methods across diverse datasets and problem settings. Our code is available at: https://github.com/ImIntheMiddle/EmLoco.
中文摘要:提出的运动体现框架通过基于物理定律的运动生成来显式评估轨迹合理性,借助可微分运动值和过滤机制有效提升了多种数据集中的轨迹预测精度。
English Summary: The proposed Locomotion Embodiment framework explicitly evaluates trajectory plausibility through physics-based locomotion generation, enhancing prediction accuracy across diverse datasets by integrating differentiable locomotion values and filtering mechanisms.

Authors:Shuang Guo, Friedhelm Hamann, Guillermo Gallego
Title: Unsupervised Joint Learning of Optical Flow and Intensity with Event Cameras
Abstract:
Event cameras rely on motion to obtain information about scene appearance. This means that appearance and motion are inherently linked: either both are present and recorded in the event data, or neither is captured. Previous works treat the recovery of these two visual quantities as separate tasks, which does not fit with the above-mentioned nature of event cameras and overlooks the inherent relations between them. We propose an unsupervised learning framework that jointly estimates optical flow (motion) and image intensity (appearance) using a single network. From the data generation model, we newly derive the event-based photometric error as a function of optical flow and image intensity. This error is further combined with the contrast maximization framework to form a comprehensive loss function that provides proper constraints for both flow and intensity estimation. Exhaustive experiments show our method's state-of-the-art performance: in optical flow estimation, it reduces EPE by 20% and AE by 25% compared to unsupervised approaches, while delivering competitive intensity estimation results, particularly in high dynamic range scenarios. Our method also achieves shorter inference time than all other optical flow methods and many of the image reconstruction methods, while they output only one quantity. Project page: https://github.com/tub-rip/E2FAI
Chinese: 我们提出了一种无监督学习框架,通过单一网络联合估计光流和图像强度,在精度和效率上均实现了显著提升,达到了领先性能。
English: We introduce an unsupervised learning framework that jointly estimates optical flow and image intensity using a single network, achieving state-of-the-art performance with significant improvements in accuracy and efficiency.

Authors:Jie Mei, Chenyu Lin, Yu Qiu, Yaonan Wang, Hui Zhang, Ziyang Wang, Dong Dai
Title: Cross-Modal Interactive Perception Network with Mamba for Lung Tumor Segmentation in PET-CT Images
Abstract:
Lung cancer is a leading cause of cancer-related deaths globally. PET-CT is crucial for imaging lung tumors, providing essential metabolic and anatomical information, while it faces challenges such as poor image quality, motion artifacts, and complex tumor morphology. Deep learning-based models are expected to address these problems, however, existing small-scale and private datasets limit significant performance improvements for these methods. Hence, we introduce a large-scale PET-CT lung tumor segmentation dataset, termed PCLT20K, which comprises 21,930 pairs of PET-CT images from 605 patients. Furthermore, we propose a cross-modal interactive perception network with Mamba (CIPA) for lung tumor segmentation in PET-CT images. Specifically, we design a channel-wise rectification module (CRM) that implements a channel state space block across multi-modal features to learn correlated representations and helps filter out modality-specific noise. A dynamic cross-modality interaction module (DCIM) is designed to effectively integrate position and context information, which employs PET images to learn regional position information and serves as a bridge to assist in modeling the relationships between local features of CT images. Extensive experiments on a comprehensive benchmark demonstrate the effectiveness of our CIPA compared to the current state-of-the-art segmentation methods. We hope our research can provide more exploration opportunities for medical image segmentation. The dataset and code are available at https://github.com/mj129/CIPA.
Chinese: 本研究提出了用于肺肿瘤分割的大规模PET-CT数据集PCLT20K,并设计了一种跨模态交互感知网络(CIPA),通过有效整合多模态特征来提升分割精度,其性能优于现有最先进方法。
English: The study introduces PCLT20K, a large-scale PET-CT dataset for lung tumor segmentation, and proposes a cross-modal interactive perception network (CIPA) that effectively integrates multi-modal features to improve segmentation accuracy, outperforming current state-of-the-art methods.

Authors:Michael J Bommarito, Daniel Martin Katz, Jillian Bommarito
Title: KL3M Tokenizers: A Family of Domain-Specific and Character-Level Tokenizers for Legal, Financial, and Preprocessing Applications
Abstract:
We present the KL3M tokenizers, a family of specialized tokenizers for legal, financial, and governmental text. Despite established work on tokenization, specialized tokenizers for professional domains remain understudied. Our paper offers two main contributions to this area. First, we introduce domain-specific BPE tokenizers for legal, financial, and governmental text. Our kl3m-004-128k-cased tokenizer uses 9-17% fewer tokens than GPT-4o and Llama3 for domain-specific documents, despite having a smaller vocabulary. For specialized terminology, our cased tokenizer is even more efficient, using up to 83% fewer tokens for legal terms and 39% fewer tokens for financial terms. Second, we develop character-level BPE tokenizers (4K, 8K, and 16K vocabulary sizes) for text correction tasks like OCR post-processing. These tokenizers keep consistent token boundaries between error-containing and correct text, making it easier for models to learn correction patterns. These tokenizers help professional applications by fitting more text in context windows, reducing computational needs, and preserving the meaning of domain-specific terms. Our analysis shows these efficiency gains directly benefit the processing of long legal and financial documents. We release all tokenizers and code through GitHub and Hugging Face to support further research in specialized tokenization.
中文: KL3M分词器专为法律、金融和政府文本设计,对专业术语可减少高达83%的标记使用量,并提供字符级版本用于文本纠错,有效提升专业领域处理效率并降低计算需求。
English: The KL3M tokenizers are specialized for legal, financial, and governmental text, offering up to 83% fewer tokens for domain terms and character-level versions for text correction tasks, enhancing efficiency and computational savings in professional applications.

Authors:Devavrat Tomar, Guillaume Vray, Dwarikanath Mahapatra, Sudipta Roy, Jean-Philippe Thiran, Behzad Bozorgtabar
Title: Slide-Level Prompt Learning with Vision Language Models for Few-Shot Multiple Instance Learning in Histopathology
Abstract:
In this paper, we address the challenge of few-shot classification in histopathology whole slide images (WSIs) by utilizing foundational vision-language models (VLMs) and slide-level prompt learning. Given the gigapixel scale of WSIs, conventional multiple instance learning (MIL) methods rely on aggregation functions to derive slide-level (bag-level) predictions from patch representations, which require extensive bag-level labels for training. In contrast, VLM-based approaches excel at aligning visual embeddings of patches with candidate class text prompts but lack essential pathological prior knowledge. Our method distinguishes itself by utilizing pathological prior knowledge from language models to identify crucial local tissue types (patches) for WSI classification, integrating this within a VLM-based MIL framework. Our approach effectively aligns patch images with tissue types, and we fine-tune our model via prompt learning using only a few labeled WSIs per category. Experimentation on real-world pathological WSI datasets and ablation studies highlight our method's superior performance over existing MIL- and VLM-based methods in few-shot WSI classification tasks. Our code is publicly available at https://github.com/LTS5/SLIP.
中文: 本研究提出一种新颖的病理切片图像少样本分类方法,通过将语言模型的病理先验知识融入视觉语言模型框架,并利用提示学习仅需少量标注数据即可实现优于现有方法的性能。
English: This study introduces a novel method for few-shot classification of histopathology whole slide images by integrating pathological prior knowledge from language models into a vision-language model framework, achieving superior performance with minimal labeled data through prompt learning.

Authors:Yu-Hsi Chen
Title: Strong Baseline: Multi-UAV Tracking via YOLOv12 with BoT-SORT-ReID
Abstract:
Detecting and tracking multiple unmanned aerial vehicles (UAVs) in thermal infrared video is inherently challenging due to low contrast, environmental noise, and small target sizes. This paper provides a straightforward approach to address multi-UAV tracking in thermal infrared video, leveraging recent advances in detection and tracking. Instead of relying on the well-established YOLOv5 with DeepSORT combination, we present a tracking framework built on YOLOv12 and BoT-SORT, enhanced with tailored training and inference strategies. We evaluate our approach following the 4th Anti-UAV Challenge metrics and reach competitive performance. Notably, we achieved strong results without using contrast enhancement or temporal information fusion to enrich UAV features, highlighting our approach as a "Strong Baseline" for multi-UAV tracking tasks. We provide implementation details, in-depth experimental analysis, and a discussion of potential improvements. The code is available at https://github.com/wish44165/YOLOv12-BoT-SORT-ReID .
Chinese: 本文提出了一种基于YOLOv12和BoT-SORT的鲁棒跟踪框架,通过针对性训练策略,在不使用对比度增强或时序融合的情况下,实现了热红外视频中多无人机跟踪的优异性能。
English: This paper introduces a robust tracking framework using YOLOv12 and BoT-SORT with specialized training strategies, achieving competitive multi-UAV tracking in thermal infrared video without relying on contrast enhancement or temporal fusion.

Authors:Aryan Yazdan Parast, Basim Azam, Naveed Akhtar
Title: DDB: Diffusion Driven Balancing to Address Spurious Correlations
Abstract:
Deep neural networks trained with Empirical Risk Minimization (ERM) perform well when both training and test data come from the same domain, but they often fail to generalize to out-of-distribution samples. In image classification, these models may rely on spurious correlations that often exist between labels and irrelevant features of images, making predictions unreliable when those features do not exist. We propose a Diffusion Driven Balancing (DDB) technique to generate training samples with text-to-image diffusion models for addressing the spurious correlation problem. First, we compute the best describing token for the visual features pertaining to the causal components of samples by a textual inversion mechanism. Then, leveraging a language segmentation method and a diffusion model, we generate new samples by combining the causal component with the elements from other classes. We also meticulously prune the generated samples based on the prediction probabilities and attribution scores of the ERM model to ensure their correct composition for our objective. Finally, we retrain the ERM model on our augmented dataset. This process reduces the model's reliance on spurious correlations by learning from carefully crafted samples in which this correlation does not exist. Our experiments show that across different benchmarks, our technique achieves better worst-group accuracy than the existing state-of-the-art methods. Our code is available at https://github.com/ArianYp/DDB.
中文: 提出的扩散驱动平衡技术利用文本到图像扩散模型生成打破伪相关性的训练样本,从而提升模型泛化能力,并在多个基准测试中实现了最优的最差组准确率。
English: The proposed Diffusion Driven Balancing technique uses text-to-image diffusion models to generate training samples that break spurious correlations, improving model generalization and achieving superior worst-group accuracy across benchmarks.

Authors:Ting Sun, Cheng Cui, Yuning Du, Yi Liu
Title: PP-DocLayout: A Unified Document Layout Detection Model to Accelerate Large-Scale Data Construction
Abstract:
Document layout analysis is a critical preprocessing step in document intelligence, enabling the detection and localization of structural elements such as titles, text blocks, tables, and formulas. Despite its importance, existing layout detection models face significant challenges in generalizing across diverse document types, handling complex layouts, and achieving real-time performance for large-scale data processing. To address these limitations, we present PP-DocLayout, which achieves high precision and efficiency in recognizing 23 types of layout regions across diverse document formats. To meet different needs, we offer three models of varying scales. PP-DocLayout-L is a high-precision model based on the RT-DETR-L detector, achieving 90.4% mAP@0.5 and an end-to-end inference time of 13.4 ms per page on a T4 GPU. PP-DocLayout-M is a balanced model, offering 75.2% mAP@0.5 with an inference time of 12.7 ms per page on a T4 GPU. PP-DocLayout-S is a high-efficiency model designed for resource-constrained environments and real-time applications, with an inference time of 8.1 ms per page on a T4 GPU and 14.5 ms on a CPU. This work not only advances the state of the art in document layout analysis but also provides a robust solution for constructing high-quality training data, enabling advancements in document intelligence and multimodal AI systems. Code and models are available at https://github.com/PaddlePaddle/PaddleX .
中文: PP-DocLayout通过三种不同规模的模型解决了文档布局分析中的泛化与效率难题,在多种格式下实现了高精度和实时处理能力。
English: PP-DocLayout introduces three scalable models that overcome generalization and efficiency challenges in document layout analysis, achieving high precision and real-time performance across diverse formats.

Authors:Yongli Xiang, Ziming Hong, Lina Yao, Dadong Wang, Tongliang Liu
Title: Jailbreaking the Non-Transferable Barrier via Test-Time Data Disguising
Abstract:
Non-transferable learning (NTL) has been proposed to protect model intellectual property (IP) by creating a "non-transferable barrier" to restrict generalization from authorized to unauthorized domains. Recently, well-designed attack, which restores the unauthorized-domain performance by fine-tuning NTL models on few authorized samples, highlights the security risks of NTL-based applications. However, such attack requires modifying model weights, thus being invalid in the black-box scenario. This raises a critical question: can we trust the security of NTL models deployed as black-box systems? In this work, we reveal the first loophole of black-box NTL models by proposing a novel attack method (dubbed as JailNTL) to jailbreak the non-transferable barrier through test-time data disguising. The main idea of JailNTL is to disguise unauthorized data so it can be identified as authorized by the NTL model, thereby bypassing the non-transferable barrier without modifying the NTL model weights. Specifically, JailNTL encourages unauthorized-domain disguising in two levels, including: (i) data-intrinsic disguising (DID) for eliminating domain discrepancy and preserving class-related content at the input-level, and (ii) model-guided disguising (MGD) for mitigating output-level statistics difference of the NTL model. Empirically, when attacking state-of-the-art (SOTA) NTL models in the black-box scenario, JailNTL achieves an accuracy increase of up to 55.7% in the unauthorized domain by using only 1% authorized samples, largely exceeding existing SOTA white-box attacks.
中文: 本研究提出JailNTL攻击方法,通过数据伪装在测试阶段绕过黑盒非可转移学习(NTL)的安全屏障,无需修改模型权重即可将未授权域准确率最高提升55.7%,显著超越现有白盒攻击效果。
English: This study introduces JailNTL, a novel black-box attack that bypasses non-transferable learning (NTL) security by disguising unauthorized data through input-level and output-level transformations, achieving up to 55.7% accuracy improvement without modifying model weights.

Authors:Sheng Wang, Pengan Chen, Jingqi Zhou, Qintong Li, Jingwei Dong, Jiahui Gao, Boyang Xue, Jiyue Jiang, Lingpeng Kong, Chuan Wu
Title: TreeSynth: Synthesizing Diverse Data from Scratch via Tree-Guided Subspace Partitioning
Abstract:
Model customization necessitates high-quality and diverse datasets, but acquiring such data remains time-consuming and labor-intensive. Despite the great potential of large language models (LLMs) for data synthesis, current approaches are constrained by limited seed data, model biases, and low-variation prompts, resulting in limited diversity and biased distributions with the increase of data scales. To tackle this challenge, we introduce TREESYNTH, a tree-guided subspace-based data synthesis approach inspired by decision trees. It constructs a spatial partitioning tree to recursively divide a task-specific full data space (i.e., root node) into numerous atomic subspaces (i.e., leaf nodes) with mutually exclusive and exhaustive attributes to ensure both distinctiveness and comprehensiveness before synthesizing samples within each atomic subspace. This globally dividing-and-synthesizing method finally collects subspace samples into a comprehensive dataset, effectively circumventing repetition and space collapse to ensure the diversity of large-scale data synthesis. Furthermore, the spatial partitioning tree enables sample allocation into atomic subspaces, allowing the rebalancing of existing datasets for more balanced and comprehensive distributions. Empirically, extensive experiments across diverse benchmarks consistently demonstrate the superior data diversity, model performance, and robust scalability of TREESYNTH compared to both human-crafted datasets and peer data synthesis methods, with an average performance gain reaching 10%. Besides, the consistent improvements of TREESYNTH-balanced datasets highlight its efficacious application to redistribute existing datasets for more comprehensive coverage and the induced performance enhancement. The code is available at https://github.com/cpa2001/TreeSynth.
Chinese: TREESYNTH是一种基于树引导空间划分的数据合成方法,通过将数据空间递归分割为原子子空间来生成多样平衡的数据集,其性能平均提升10%,显著优于现有方法。
English: TREESYNTH is a novel tree-guided data synthesis method that partitions the data space into atomic subspaces to generate diverse and balanced datasets, significantly outperforming existing approaches with a 10% average performance gain.

Authors:Davide Berasi, Matteo Farina, Massimiliano Mancini, Elisa Ricci, Nicola Strisciuglio
Title: Not Only Text: Exploring Compositionality of Visual Representations in Vision-Language Models
Abstract:
Vision-Language Models (VLMs) learn a shared feature space for text and images, enabling the comparison of inputs of different modalities. While prior works demonstrated that VLMs organize natural language representations into regular structures encoding composite meanings, it remains unclear if compositional patterns also emerge in the visual embedding space. In this work, we investigate compositionality in the image domain, where the analysis of compositional properties is challenged by noise and sparsity of visual data. We address these problems and propose a framework, called Geodesically Decomposable Embeddings (GDE), that approximates image representations with geometry-aware compositional structures in the latent space. We demonstrate that visual embeddings of pre-trained VLMs exhibit a compositional arrangement, and evaluate the effectiveness of this property in the tasks of compositional classification and group robustness. GDE achieves stronger performance in compositional classification compared to its counterpart method that assumes linear geometry of the latent space. Notably, it is particularly effective for group robustness, where we achieve higher results than task-specific solutions. Our results indicate that VLMs can automatically develop a human-like form of compositional reasoning in the visual domain, making their underlying processes more interpretable. Code is available at https://github.com/BerasiDavide/vlm_image_compositionality.
中文摘要:视觉语言模型通过提出的"测地可分解嵌入"框架,在视觉表征中自动形成了组合推理能力,并在组合分类和群体鲁棒性任务中展现出卓越性能。
English Summary: Vision-Language Models automatically develop compositional reasoning in visual representations through the proposed Geodesically Decomposable Embeddings framework, which demonstrates superior performance in compositional classification and group robustness tasks.

Authors:Robin Hesse, Doğukan Bağcı, Bernt Schiele, Simone Schaub-Meyer, Stefan Roth
Title: Beyond Accuracy: What Matters in Designing Well-Behaved Models?
Abstract:
Deep learning has become an essential part of computer vision, with deep neural networks (DNNs) excelling in predictive performance. However, they often fall short in other critical quality dimensions, such as robustness, calibration, or fairness. While existing studies have focused on a subset of these quality dimensions, none have explored a more general form of "well-behavedness" of DNNs. With this work, we address this gap by simultaneously studying nine different quality dimensions for image classification. Through a large-scale study, we provide a bird's-eye view by analyzing 326 backbone models and how different training paradigms and model architectures affect the quality dimensions. We reveal various new insights such that (i) vision-language models exhibit high fairness on ImageNet-1k classification and strong robustness against domain changes; (ii) self-supervised learning is an effective training paradigm to improve almost all considered quality dimensions; and (iii) the training dataset size is a major driver for most of the quality dimensions. We conclude our study by introducing the QUBA score (Quality Understanding Beyond Accuracy), a novel metric that ranks models across multiple dimensions of quality, enabling tailored recommendations based on specific user needs.
中文: 本研究通过分析326个模型的九个质量维度填补了深度神经网络综合评估的空白,发现视觉语言模型和自监督学习能提升鲁棒性与公平性,并推出QUBA评分体系实现定制化模型推荐。
English: This study addresses the gap in evaluating deep neural networks' overall quality by analyzing nine dimensions across 326 models, revealing that vision-language models and self-supervised learning enhance robustness and fairness, and introduces the QUBA score for tailored model recommendations.

Authors:Yuanmin Tang, Jing Yu, Keke Gai, Jiamin Zhuang, Gang Xiong, Gaopeng Gou, Qi Wu
Title: Missing Target-Relevant Information Prediction with World Model for Accurate Zero-Shot Composed Image Retrieval
Abstract:
Zero-Shot Composed Image Retrieval (ZS-CIR) involves diverse tasks with a broad range of visual content manipulation intent across domain, scene, object, and attribute. The key challenge for ZS-CIR tasks is to modify a reference image according to manipulation text to accurately retrieve a target image, especially when the reference image is missing essential target content. In this paper, we propose a novel prediction-based mapping network, named PrediCIR, to adaptively predict the missing target visual content in reference images in the latent space before mapping for accurate ZS-CIR. Specifically, a world view generation module first constructs a source view by omitting certain visual content of a target view, coupled with an action that includes the manipulation intent derived from existing image-caption pairs. Then, a target content prediction module trains a world model as a predictor to adaptively predict the missing visual information guided by user intention in manipulating text at the latent space. The two modules map an image with the predicted relevant information to a pseudo-word token without extra supervision. Our model shows strong generalization ability on six ZS-CIR tasks. It obtains consistent and significant performance boosts ranging from 1.73% to 4.45% over the best methods and achieves new state-of-the-art results on ZS-CIR. Our code is available at https://github.com/Pter61/predicir.
中文: PrediCIR提出了一种基于预测的映射网络,通过自适应推断参考图像中缺失的视觉内容来提升零样本组合图像检索效果,在六项任务中均实现了最先进的性能。
English: PrediCIR introduces a prediction-based mapping network that adaptively infers missing visual content in reference images to enhance zero-shot composed image retrieval, achieving state-of-the-art performance across six tasks.

Authors:Johan Edstedt, André Mateus, Alberto Jaenal
Title: ColabSfM: Collaborative Structure-from-Motion by Point Cloud Registration
Abstract:
Structure-from-Motion (SfM) is the task of estimating 3D structure and camera poses from images. We define Collaborative SfM (ColabSfM) as sharing distributed SfM reconstructions. Sharing maps requires estimating a joint reference frame, which is typically referred to as registration. However, there is a lack of scalable methods and training datasets for registering SfM reconstructions. In this paper, we tackle this challenge by proposing the scalable task of point cloud registration for SfM reconstructions. We find that current registration methods cannot register SfM point clouds when trained on existing datasets. To this end, we propose a SfM registration dataset generation pipeline, leveraging partial reconstructions from synthetically generated camera trajectories for each scene. Finally, we propose a simple but impactful neural refiner on top of the SotA registration method RoITr that yields significant improvements, which we call RefineRoITr. Our extensive experimental evaluation shows that our proposed pipeline and model enables ColabSfM. Code is available at https://github.com/EricssonResearch/ColabSfM
中文: 本文提出ColabSfM解决分布式运动恢复结构重建的配准难题,通过建立可扩展的点云配准任务、开发新型数据集生成流程以及设计名为RefineRoITr的神经优化器,显著提升了配准性能。
English: This paper introduces ColabSfM to address the challenge of registering distributed SfM reconstructions by proposing a scalable point cloud registration task, a novel dataset generation pipeline, and a neural refiner called RefineRoITr that significantly enhances registration performance.

Authors:Victor Besnier, Mickael Chen, David Hurych, Eduardo Valle, Matthieu Cord
Title: Halton Scheduler For Masked Generative Image Transformer
Abstract:
Masked Generative Image Transformers (MaskGIT) have emerged as a scalable and efficient image generation framework, able to deliver high-quality visuals with low inference costs. However, MaskGIT's token unmasking scheduler, an essential component of the framework, has not received the attention it deserves. We analyze the sampling objective in MaskGIT, based on the mutual information between tokens, and elucidate its shortcomings. We then propose a new sampling strategy based on our Halton scheduler instead of the original Confidence scheduler. More precisely, our method selects the token's position according to a quasi-random, low-discrepancy Halton sequence. Intuitively, that method spreads the tokens spatially, progressively covering the image uniformly at each step. Our analysis shows that it allows reducing non-recoverable sampling errors, leading to simpler hyper-parameters tuning and better quality images. Our scheduler does not require retraining or noise injection and may serve as a simple drop-in replacement for the original sampling strategy. Evaluation of both class-to-image synthesis on ImageNet and text-to-image generation on the COCO dataset demonstrates that the Halton scheduler outperforms the Confidence scheduler quantitatively by reducing the FID and qualitatively by generating more diverse and more detailed images. Our code is at https://github.com/valeoai/Halton-MaskGIT.
中文: MaskGIT的原始置信度调度器被基于准随机哈尔顿序列的调度器取代,通过均匀选择标记位置来减少采样误差,无需重新训练即可提升图像质量与多样性并简化超参数调整。
English: MaskGIT's original Confidence scheduler is replaced by a Halton scheduler that uses quasi-random sequences to uniformly select tokens, improving image quality and diversity while simplifying hyper-parameter tuning without retraining.

Authors:Pablo Garcia-Fernandez, Lorenzo Vaquero, Mingxuan Liu, Feng Xue, Daniel Cores, Nicu Sebe, Manuel Mucientes, Elisa Ricci
Title: Superpowering Open-Vocabulary Object Detectors for X-ray Vision
Abstract:
Open-vocabulary object detection (OvOD) is set to revolutionize security screening by enabling systems to recognize any item in X-ray scans. However, developing effective OvOD models for X-ray imaging presents unique challenges due to data scarcity and the modality gap that prevents direct adoption of RGB-based solutions. To overcome these limitations, we propose RAXO, a training-free framework that repurposes off-the-shelf RGB OvOD detectors for robust X-ray detection. RAXO builds high-quality X-ray class descriptors using a dual-source retrieval strategy. It gathers relevant RGB images from the web and enriches them via a novel X-ray material transfer mechanism, eliminating the need for labeled databases. These visual descriptors replace text-based classification in OvOD, leveraging intra-modal feature distances for robust detection. Extensive experiments demonstrate that RAXO consistently improves OvOD performance, providing an average mAP increase of up to 17.0 points over base detectors. To further support research in this emerging field, we also introduce DET-COMPASS, a new benchmark featuring bounding box annotations for over 300 object categories, enabling large-scale evaluation of OvOD in X-ray. Code and dataset available at: https://github.com/PAGF188/RAXO.
中文:RAXO是一种无需训练的框架,通过构建增强的视觉描述符将RGB物体检测器应用于X射线安检,无需标注数据即可显著提升检测性能。
English: RAXO is a training-free framework that adapts RGB object detectors for X-ray security screening by creating enriched visual descriptors, achieving significant performance improvements without labeled data.

Authors:Fangyijie Wang, Kathleen M. Curran, Guénolé Silvestre
Title: Semi-supervised Cervical Segmentation on Ultrasound by A Dual Framework for Neural Networks
Abstract:
Accurate segmentation of ultrasound (US) images of the cervical muscles is crucial for precision healthcare. The demand for automatic computer-assisted methods is high. However, the scarcity of labeled data hinders the development of these methods. Advanced semi-supervised learning approaches have displayed promise in overcoming this challenge by utilizing labeled and unlabeled data. This study introduces a novel semi-supervised learning (SSL) framework that integrates dual neural networks. This SSL framework utilizes both networks to generate pseudo-labels and cross-supervise each other at the pixel level. Additionally, a self-supervised contrastive learning strategy is introduced, which employs a pair of deep representations to enhance feature learning capabilities, particularly on unlabeled data. Our framework demonstrates competitive performance in cervical segmentation tasks. Our codes are publicly available on https://github.com/13204942/SSL\_Cervical\_Segmentation.
Chinese: 本研究提出了一种新颖的半监督学习框架,通过整合双神经网络进行像素级交叉监督和自监督对比学习,在宫颈肌肉超声图像分割中展现出优异性能,有效解决了标注数据稀缺的问题。
English: This study introduces a novel semi-supervised learning framework that combines dual neural networks for pixel-level cross-supervision and self-supervised contrastive learning to achieve competitive performance in cervical muscle ultrasound image segmentation, addressing the challenge of limited labeled data.

Authors:Qinghe Ma, Jian Zhang, Zekun Li, Lei Qi, Qian Yu, Yinghuan Shi
Title: Steady Progress Beats Stagnation: Mutual Aid of Foundation and Conventional Models in Mixed Domain Semi-Supervised Medical Image Segmentation
Abstract:
Large pretrained visual foundation models exhibit impressive general capabilities. However, the extensive prior knowledge inherent in these models can sometimes be a double-edged sword when adapting them to downstream tasks in specific domains. In the context of semi-supervised medical image segmentation with domain shift, foundation models like MedSAM tend to make overconfident predictions, some of which are incorrect. The error accumulation hinders the effective utilization of unlabeled data and limits further improvements. In this paper, we introduce a Synergistic training framework for Foundation and Conventional models (SynFoC) to address the issue. We observe that a conventional model trained from scratch has the ability to correct the high-confidence mispredictions of the foundation model, while the foundation model can supervise it with high-quality pseudo-labels in the early training stages. Furthermore, to enhance the collaborative training effectiveness of both models and promote reliable convergence towards optimization, the consensus-divergence consistency regularization is proposed. We demonstrate the superiority of our method across four public multi-domain datasets. In particular, our method improves the Dice score by 10.31\% on the Prostate dataset. Our code is available at https://github.com/MQinghe/SynFoC .
中文:SynFoC框架通过协同训练基础模型和传统模型,纠正过度自信的错误预测,并利用一致性-分歧正则化提升半监督医学图像分割的效果,在多个数据集上实现了显著性能提升。
English: The SynFoC framework synergistically combines a foundation model and a conventional model to correct overconfident mispredictions and enhance semi-supervised medical image segmentation, achieving significant performance improvements across multiple datasets.

Authors:Tobias Brudermueller, Elgar Fleisch, Marina González Vayá, Thorsten Staake
Title: HEAPO -- An Open Dataset for Heat Pump Optimization with Smart Electricity Meter Data and On-Site Inspection Protocols
Abstract:
Heat pumps are essential for decarbonizing residential heating but consume substantial electrical energy, impacting operational costs and grid demand. Many systems run inefficiently due to planning flaws, operational faults, or misconfigurations. While optimizing performance requires skilled professionals, labor shortages hinder large-scale interventions. However, digital tools and improved data availability create new service opportunities for energy efficiency, predictive maintenance, and demand-side management. To support research and practical solutions, we present an open-source dataset of electricity consumption from 1,408 households with heat pumps and smart electricity meters in the canton of Zurich, Switzerland, recorded at 15-minute and daily resolutions between 2018-11-03 and 2024-03-21. The dataset includes household metadata, weather data from 8 stations, and ground truth data from 410 field visit protocols collected by energy consultants during system optimizations. Additionally, the dataset includes a Python-based data loader to facilitate seamless data processing and exploration.
中文: 热泵对住宅供暖脱碳至关重要,但因安装和运行问题导致效率低下,为此发布了来自瑞士1408户家庭的开源数据集,以支持能源优化和预测性维护的研究。
English: Heat pumps are vital for decarbonizing home heating but face efficiency challenges due to installation and operational issues, prompting the release of an open-source dataset from 1,408 Swiss households to aid research in energy optimization and predictive maintenance.

Authors:Xu Zhang, Hao Zhou, Haoming Qin, Xiaobin Lu, Jiaxing Yan, Guanzhong Wang, Zeyu Chen, Yi Liu
Title: Enabling Versatile Controls for Video Diffusion Models
Abstract:
Despite substantial progress in text-to-video generation, achieving precise and flexible control over fine-grained spatiotemporal attributes remains a significant unresolved challenge in video generation research. To address these limitations, we introduce VCtrl (also termed PP-VCtrl), a novel framework designed to enable fine-grained control over pre-trained video diffusion models in a unified manner. VCtrl integrates diverse user-specified control signals-such as Canny edges, segmentation masks, and human keypoints-into pretrained video diffusion models via a generalizable conditional module capable of uniformly encoding multiple types of auxiliary signals without modifying the underlying generator. Additionally, we design a unified control signal encoding pipeline and a sparse residual connection mechanism to efficiently incorporate control representations. Comprehensive experiments and human evaluations demonstrate that VCtrl effectively enhances controllability and generation quality. The source code and pre-trained models are publicly available and implemented using the PaddlePaddle framework at http://github.com/PaddlePaddle/PaddleMIX/tree/develop/ppdiffusers/examples/ppvctrl.
中文摘要:VCtrl是一种新颖框架,通过通用条件模块整合多种控制信号,实现对预训练视频扩散模型的细粒度控制,有效提升了生成质量与可控性。
English Summary: VCtrl is a novel framework that enables fine-grained control over pre-trained video diffusion models by integrating diverse control signals through a generalizable conditional module, enhancing both controllability and generation quality.

Authors:Xiaofeng Mao, Yuefeng Chen, Rong Zhang, Hui Xue, Zhao Li, Hang Su
Title: EasyRobust: A Comprehensive and Easy-to-use Toolkit for Robust and Generalized Vision
Abstract:
Deep neural networks (DNNs) has shown great promise in computer vision tasks. However, machine vision achieved by DNNs cannot be as robust as human perception. Adversarial attacks and data distribution shifts have been known as two major scenarios which degrade machine performance and obstacle the wide deployment of machines "in the wild". In order to break these obstructions and facilitate the research of model robustness, we develop EasyRobust, a comprehensive and easy-to-use toolkit for training, evaluation and analysis of robust vision models. EasyRobust targets at two types of robustness: 1) Adversarial robustness enables the model to defense against malicious inputs crafted by worst-case perturbations, also known as adversarial examples; 2) Non-adversarial robustness enhances the model performance on natural test images with corruptions or distribution shifts. Thorough benchmarks on image classification enable EasyRobust to provide an accurate robustness evaluation on vision models. We wish our EasyRobust can help for training practically-robust models and promote academic and industrial progress in closing the gap between human and machine vision. Codes and models of EasyRobust have been open-sourced in https://github.com/alibaba/easyrobust.
Chinese: 深度神经网络在计算机视觉任务中表现出色,但在对抗性攻击和数据分布变化下缺乏人类感知的鲁棒性,为此我们开发了EasyRobust开源工具包,用于训练和评估稳健视觉模型以缩小人机视觉差距。
English: Deep neural networks excel in computer vision but lack human-level robustness against adversarial attacks and data shifts, prompting the development of EasyRobust, an open-source toolkit for training and evaluating robust vision models to bridge this gap.

Authors:Yingping Liang, Yutao Hu, Wenqi Shao, Ying Fu
Title: Distilling Monocular Foundation Model for Fine-grained Depth Completion
Abstract:
Depth completion involves predicting dense depth maps from sparse LiDAR inputs. However, sparse depth annotations from sensors limit the availability of dense supervision, which is necessary for learning detailed geometric features. In this paper, we propose a two-stage knowledge distillation framework that leverages powerful monocular foundation models to provide dense supervision for depth completion. In the first stage, we introduce a pre-training strategy that generates diverse training data from natural images, which distills geometric knowledge to depth completion. Specifically, we simulate LiDAR scans by utilizing monocular depth and mesh reconstruction, thereby creating training data without requiring ground-truth depth. Besides, monocular depth estimation suffers from inherent scale ambiguity in real-world settings. To address this, in the second stage, we employ a scale- and shift-invariant loss (SSI Loss) to learn real-world scales when fine-tuning on real-world datasets. Our two-stage distillation framework enables depth completion models to harness the strengths of monocular foundation models. Experimental results demonstrate that models trained with our two-stage distillation framework achieve state-of-the-art performance, ranking \textbf{first place} on the KITTI benchmark. Code is available at https://github.com/Sharpiless/DMD3C
Chinese: 本文提出了一种两阶段知识蒸馏框架,利用单目基础模型为深度补全提供密集监督,无需真实深度数据即可在KITTI基准测试中实现最优性能。
English: This paper introduces a two-stage knowledge distillation framework that leverages monocular foundation models to provide dense supervision for depth completion, achieving state-of-the-art performance on the KITTI benchmark without requiring ground-truth depth data.

Authors:Wei Zhang, Mengting Ma, Yizhen Jiang, Rongrong Lian, Zhenkai Wu, Kangning Cui, Xiaowen Ma
Title: Center-guided Classifier for Semantic Segmentation of Remote Sensing Images
Abstract:
Compared with natural images, remote sensing images (RSIs) have the unique characteristic. i.e., larger intraclass variance, which makes semantic segmentation for remote sensing images more challenging. Moreover, existing semantic segmentation models for remote sensing images usually employ a vanilla softmax classifier, which has three drawbacks: (1) non-direct supervision for the pixel representations during training; (2) inadequate modeling ability of parametric softmax classifiers under large intraclass variance; and (3) opaque process of classification decision. In this paper, we propose a novel classifier (called CenterSeg) customized for RSI semantic segmentation, which solves the abovementioned problems with multiple prototypes, direct supervision under Grassmann manifold, and interpretability strategy. Specifically, for each class, our CenterSeg obtains local class centers by aggregating corresponding pixel features based on ground-truth masks, and generates multiple prototypes through hard attention assignment and momentum updating. In addition, we introduce the Grassmann manifold and constrain the joint embedding space of pixel features and prototypes based on two additional regularization terms. Especially, during the inference, CenterSeg can further provide interpretability to the model by restricting the prototype as a sample of the training set. Experimental results on three remote sensing segmentation datasets validate the effectiveness of the model. Besides the superior performance, CenterSeg has the advantages of simplicity, lightweight, compatibility, and interpretability. Code is available at https://github.com/xwmaxwma/rssegmentation.
中文摘要:CenterSeg模型通过多原型、格拉斯曼流形监督和可解释性策略,解决了遥感图像语义分割中类内方差大的难题,在实现优越性能的同时兼具简洁性和兼容性。
English Summary: The CenterSeg model addresses the challenges of large intraclass variance in remote sensing image segmentation by introducing multiple prototypes, Grassmann manifold supervision, and interpretability strategies, achieving superior performance with simplicity and compatibility.

Authors:Ibtissam Saadi, Abdenour Hadid, Douglas W. Cunningham, Abdelmalik Taleb-Ahmed, Yassin El Hillali
Title: PE-CLIP: A Parameter-Efficient Fine-Tuning of Vision Language Models for Dynamic Facial Expression Recognition
Abstract:
Vision-Language Models (VLMs) like CLIP offer promising solutions for Dynamic Facial Expression Recognition (DFER) but face challenges such as inefficient full fine-tuning, high complexity, and poor alignment between textual and visual representations. Additionally, existing methods struggle with ineffective temporal modeling. To address these issues, we propose PE-CLIP, a parameter-efficient fine-tuning (PEFT) framework that adapts CLIP for DFER while significantly reducing trainable parameters while maintaining high accuracy. PE-CLIP introduces two specialized adapters: a Temporal Dynamic Adapter (TDA) and a Shared Adapter (ShA). The TDA is a GRU-based module with dynamic scaling that captures sequential dependencies while emphasizing informative temporal features and suppressing irrelevant variations. The ShA is a lightweight adapter that refines representations within both textual and visual encoders, ensuring consistency and efficiency. Additionally, we integrate Multi-modal Prompt Learning (MaPLe), introducing learnable prompts for visual and action unit-based textual inputs, enhancing semantic alignment between modalities and enabling efficient CLIP adaptation for dynamic tasks. We evaluate PE-CLIP on two benchmark datasets, DFEW and FERV39K, achieving competitive performance compared to state-of-the-art methods while requiring fewer trainable parameters. By balancing efficiency and accuracy, PE-CLIP sets a new benchmark in resource-efficient DFER. The source code of the proposed PE-CLIP will be publicly available at https://github.com/Ibtissam-SAADI/PE-CLIP .
中文:提出的PE-CLIP框架通过参数高效的适配器和多模态提示,解决了视觉语言模型在动态表情识别中的效率问题,以更少计算资源实现了优越性能。
English: The proposed PE-CLIP framework addresses inefficiencies in adapting Vision-Language Models for Dynamic Facial Expression Recognition by introducing parameter-efficient adapters and multi-modal prompts, achieving competitive accuracy with reduced computational resources.

Authors:Shicheng Li, Lei Li, Kun Ouyang, Shuhuai Ren, Yuanxin Liu, Yuanxing Zhang, Fuzheng Zhang, Lingpeng Kong, Qi Liu, Xu Sun
Title: TEMPLE:Temporal Preference Learning of Video LLMs via Difficulty Scheduling and Pre-SFT Alignment
Abstract:
Video Large Language Models (Video LLMs) have achieved significant success by leveraging a two-stage paradigm: pretraining on large-scale video-text data for vision-language alignment, followed by supervised fine-tuning (SFT) for task-specific capabilities. However, existing approaches struggle with temporal reasoning due to weak temporal correspondence in the data and reliance on the next-token prediction paradigm during training. To address these limitations, we propose TEMPLE (TEMporal Preference Learning), a systematic framework that enhances Video LLMs' temporal reasoning capabilities through Direct Preference Optimization (DPO). To facilitate this, we introduce an automated preference data generation pipeline that systematically constructs preference pairs by selecting videos that are rich in temporal information, designing video-specific perturbation strategies, and finally evaluating model responses on clean and perturbed video inputs. Our temporal alignment features two key innovations: curriculum learning which that progressively increases perturbation difficulty to improve model robustness and adaptability; and "Pre-SFT Alignment'', applying preference optimization before instruction tuning to prioritize fine-grained temporal comprehension. Extensive experiments demonstrate that our approach consistently improves Video LLM performance across multiple benchmarks with a relatively small set of self-generated DPO data. We further analyze the transferability of DPO data across architectures and the role of difficulty scheduling in optimization. Our findings highlight our TEMPLE as a scalable and efficient complement to SFT-based methods, paving the way for developing reliable Video LLMs. Code is available at https://github.com/lscpku/TEMPLE.
Chinese: 提出的TEMPLE框架通过自动生成偏好数据和基于课程学习的对齐方法,有效增强了视频大语言模型的时序推理能力,仅用少量数据即可在多个基准测试中显著提升性能。
English: The proposed TEMPLE framework enhances Video LLMs' temporal reasoning through automated preference data generation and curriculum-based alignment, significantly improving performance across benchmarks with minimal data.

Authors:Sirui Chen, Shen Han, Jiawei Chen, Binbin Hu, Sheng Zhou, Gang Wang, Yan Feng, Chun Chen, Can Wang
Title: Rankformer: A Graph Transformer for Recommendation based on Ranking Objective
Abstract:
Recommender Systems (RS) aim to generate personalized ranked lists for each user and are evaluated using ranking metrics. Although personalized ranking is a fundamental aspect of RS, this critical property is often overlooked in the design of model architectures. To address this issue, we propose Rankformer, a ranking-inspired recommendation model. The architecture of Rankformer is inspired by the gradient of the ranking objective, embodying a unique (graph) transformer architecture -- it leverages global information from all users and items to produce more informative representations and employs specific attention weights to guide the evolution of embeddings towards improved ranking performance. We further develop an acceleration algorithm for Rankformer, reducing its complexity to a linear level with respect to the number of positive instances. Extensive experimental results demonstrate that Rankformer outperforms state-of-the-art methods. The code is available at https://github.com/StupidThree/Rankformer.
中文: Rankformer是一种创新的推荐模型,通过将排序目标融入其变换器架构来提升个性化排序性能,并以线性计算复杂度实现了卓越的效果。
English: Rankformer is a novel recommendation model that incorporates ranking objectives into its transformer architecture to enhance personalized ranking performance, achieving superior results with linear computational complexity.

Authors:Linxi Liang, Jing Gong, Mingwei Liu, Chong Wang, Guangsheng Ou, Yanlin Wang, Xin Peng, Zibin Zheng
Title: RustEvo^2: An Evolving Benchmark for API Evolution in LLM-based Rust Code Generation
Abstract:
Large Language Models (LLMs) have become pivotal tools for automating code generation in software development. However, these models face significant challenges in producing version-aware code for rapidly evolving languages like Rust, where frequent Application Programming Interfaces (API) changes across versions lead to compatibility issues and correctness errors. Existing benchmarks lack systematic evaluation of how models navigate API transitions, relying on labor-intensive manual curation and offering limited version-specific insights. To address this gap, we present RustEvo, a novel framework for constructing dynamic benchmarks that evaluate the ability of LLMs to adapt to evolving Rust APIs. RustEvo automates dataset creation by synthesizing 588 API changes (380 from Rust standard libraries, 208 from 15 third-party crates) into programming tasks mirroring real-world challenges. These tasks cover four API evolution categories: Stabilizations, Signature Changes, Behavioral Changes, and Deprecations, reflecting their actual distribution in the Rust ecosystem. Experiments on state-of-the-art (SOTA) LLMs reveal significant performance variations: models achieve a 65.8% average success rate on stabilized APIs but only 38.0% on behavioral changes, highlighting difficulties in detecting semantic shifts without signature alterations. Knowledge cutoff dates strongly influence performance, with models scoring 56.1% on before-cutoff APIs versus 32.5% on after-cutoff tasks. Retrieval-Augmented Generation (RAG) mitigates this gap, improving success rates by 13.5% on average for APIs released after model training. Our findings underscore the necessity of our evolution-aware benchmarks to advance the adaptability of LLMs in fast-paced software ecosystems. The framework and the benchmarks are publicly released at https://github.com/SYSUSELab/RustEvo.
中文摘要:RustEvo是一个创新框架,通过自动化构建动态基准来评估大语言模型对Rust API演变的适应能力,揭示了模型在不同API变更类别中的显著性能差异,并证明检索增强生成技术能够有效缓解知识截止日期带来的限制。
English Summary: RustEvo is a novel framework that automates the creation of dynamic benchmarks to evaluate how well Large Language Models adapt to evolving Rust APIs, revealing significant performance variations across different API change categories and demonstrating that Retrieval-Augmented Generation can mitigate knowledge cutoff limitations.

Authors:Omar Coser, Christian Tamantini, Matteo Tortora, Leonardo Furia, Rosa Sicilia, Loredana Zollo, Paolo Soda
Title: Deep Learning for Human Locomotion Analysis in Lower-Limb Exoskeletons: A Comparative Study
Abstract:
Wearable robotics for lower-limb assistance have become a pivotal area of research, aiming to enhance mobility for individuals with physical impairments or augment the performance of able-bodied users. Accurate and adaptive control systems are essential to ensure seamless interaction between the wearer and the robotic device, particularly when navigating diverse and dynamic terrains. Despite the recent advances in neural networks for time series analysis, no attempts have been directed towards the classification of ground conditions, categorized into five classes and subsequently determining the ramp's slope and stair's height. In this respect, this paper presents an experimental comparison between eight deep neural network backbones to predict high-level locomotion parameters across diverse terrains. All the models are trained on the publicly available CAMARGO 2021 dataset. IMU-only data equally or outperformed IMU+EMG inputs, promoting a cost-effective and efficient design. Indeeds, using three IMU sensors, the LSTM achieved high terrain classification accuracy (0.94 +- 0.04) and precise ramp slope (1.95 +- 0.58°) and the CNN-LSTM a stair height (15.65 +- 7.40 mm) estimations. As a further contribution, SHAP analysis justified sensor reduction without performance loss, ensuring a lightweight setup. The system operates with ~2 ms inference time, supporting real-time applications. The code is code available at https://github.com/cosbidev/Human-Locomotion-Identification.
中文摘要:本文通过对比八种深度神经网络,利用IMU数据预测不同地形下的运动参数,实现了高精度的地形分类及坡道斜度与台阶高度测量,其轻量化系统具备实时运行能力。
English Summary: This paper compares eight deep neural networks for predicting locomotion parameters across various terrains using IMU data, achieving high accuracy in terrain classification and precise estimations of ramp slopes and stair heights with a lightweight, real-time capable system.

Authors:Dongseob Kim, Hyunjung Shim
Title: Classifier-guided CLIP Distillation for Unsupervised Multi-label Classification
Abstract:
Multi-label classification is crucial for comprehensive image understanding, yet acquiring accurate annotations is challenging and costly. To address this, a recent study suggests exploiting unsupervised multi-label classification leveraging CLIP, a powerful vision-language model. Despite CLIP's proficiency, it suffers from view-dependent predictions and inherent bias, limiting its effectiveness. We propose a novel method that addresses these issues by leveraging multiple views near target objects, guided by Class Activation Mapping (CAM) of the classifier, and debiasing pseudo-labels derived from CLIP predictions. Our Classifier-guided CLIP Distillation (CCD) enables selecting multiple local views without extra labels and debiasing predictions to enhance classification performance. Experimental results validate our method's superiority over existing techniques across diverse datasets. The code is available at https://github.com/k0u-id/CCD.
Chinese: 本研究提出了一种名为分类器引导的CLIP蒸馏(CCD)的新方法,通过利用类激活映射引导的多个局部视图并消除CLIP预测偏差,在无需额外标注的情况下显著提升了多标签图像分类性能,在多个数据集上验证了其优越性。
English: This study introduces Classifier-guided CLIP Distillation (CCD), a novel method that enhances multi-label image classification by leveraging multiple local views guided by Class Activation Mapping and debiasing CLIP predictions, achieving superior performance across diverse datasets without requiring additional labels.

Authors:Anshumann, Mohd Abbas Zaidi, Akhil Kedia, Jinwoo Ahn, Taehwak Kwon, Kangwook Lee, Haejun Lee, Joohyung Lee
Title: Sparse Logit Sampling: Accelerating Knowledge Distillation in LLMs
Abstract:
Knowledge distillation can be a cost-effective technique to distill knowledge in Large Language Models, if the teacher output logits can be pre-computed and cached. However, successfully applying this to pre-training remains largely unexplored. In this work, we prove that naive approaches for sparse knowledge distillation such as caching Top-K probabilities, while intuitive, provide biased estimates of teacher probability distribution to the student, resulting in suboptimal performance and calibration. We propose an importance-sampling-based method `Random Sampling Knowledge Distillation', which provides unbiased estimates, preserves the gradient in expectation, and requires storing significantly sparser logits. Our method enables faster training of student models with marginal overhead (<10%) compared to cross-entropy based training, while maintaining competitive performance compared to full distillation, across a range of model sizes from 300M to 3B.
Chinese: 知识蒸馏在预计算教师模型输出时可高效传递大语言模型知识,但缓存Top-K概率等简单方法会产生偏差,导致性能不佳;我们提出的随机采样知识蒸馏方法提供无偏估计且仅需稀疏对数,能在模型规模从3亿到30亿范围内实现更快训练并保持竞争力。
English: Knowledge distillation can efficiently transfer knowledge from large language models if teacher logits are pre-computed, but naive methods like caching Top-K probabilities introduce bias, leading to suboptimal results; our proposed Random Sampling Knowledge Distillation offers unbiased estimates with sparse logits, enabling faster training and competitive performance across model sizes.

Authors:Xiyue Guo, Jiarui Hu, Junjie Hu, Hujun Bao, Guofeng Zhang
Title: SGFormer: Satellite-Ground Fusion for 3D Semantic Scene Completion
Abstract:
Recently, camera-based solutions have been extensively explored for scene semantic completion (SSC). Despite their success in visible areas, existing methods struggle to capture complete scene semantics due to frequent visual occlusions. To address this limitation, this paper presents the first satellite-ground cooperative SSC framework, i.e., SGFormer, exploring the potential of satellite-ground image pairs in the SSC task. Specifically, we propose a dual-branch architecture that encodes orthogonal satellite and ground views in parallel, unifying them into a common domain. Additionally, we design a ground-view guidance strategy that corrects satellite image biases during feature encoding, addressing misalignment between satellite and ground views. Moreover, we develop an adaptive weighting strategy that balances contributions from satellite and ground views. Experiments demonstrate that SGFormer outperforms the state of the art on SemanticKITTI and SSCBench-KITTI-360 datasets. Our code is available on https://github.com/gxytcrc/SGFormer.
中文:本文提出首个卫星-地面协同的场景语义补全框架SGFormer,通过双分支架构和自适应策略整合卫星与地面视角,在基准数据集上实现了最优性能。
English: This paper introduces SGFormer, the first satellite-ground cooperative framework for scene semantic completion, which utilizes dual-branch architecture and adaptive strategies to unify satellite and ground views, achieving state-of-the-art performance on benchmark datasets.

Authors:Li Zhang, Longxi Gao, Mengwei Xu
Title: Does Chain-of-Thought Reasoning Help Mobile GUI Agent? An Empirical Study
Abstract:
Reasoning capabilities have significantly improved the performance of vision-language models (VLMs) in domains such as mathematical problem-solving, coding, and visual question-answering. However, their impact on real-world applications remains unclear. This paper presents the first empirical study on the effectiveness of reasoning-enabled VLMs in mobile GUI agents, a domain that requires interpreting complex screen layouts, understanding user instructions, and executing multi-turn interactions. We evaluate two pairs of commercial models--Gemini 2.0 Flash and Claude 3.7 Sonnet--comparing their base and reasoning-enhanced versions across two static benchmarks (ScreenSpot and AndroidControl) and one interactive environment (AndroidWorld). We surprisingly find the Claude 3.7 Sonnet reasoning model achieves state-of-the-art performance on AndroidWorld. However, reasoning VLMs generally offer marginal improvements over non-reasoning models on static benchmarks and even degrade performance in some agent setups. Notably, reasoning and non-reasoning VLMs fail on different sets of tasks, suggesting that reasoning does have an impact, but its benefits and drawbacks counterbalance each other. We attribute these inconsistencies to the limitations of benchmarks and VLMs. Based on the findings, we provide insights for further enhancing mobile GUI agents in terms of benchmarks, VLMs, and their adaptability in dynamically invoking reasoning VLMs. The experimental data are publicly available at https://github.com/LlamaTouch/VLM-Reasoning-Traces.
Chinese: 推理增强的视觉语言模型在移动端GUI智能体中表现不一,在交互环境中达到最优性能,但在静态基准测试中仅带来有限提升甚至性能下降,表明需改进基准测试并自适应调用推理能力。
English: Reasoning-enhanced vision-language models show mixed results in mobile GUI agents, achieving state-of-the-art performance in interactive environments but offering only marginal gains or even performance degradation in static benchmarks, highlighting the need for improved benchmarks and adaptive reasoning invocation.

Authors:Mengsong Wu, Tong Zhu, Han Han, Xiang Zhang, Wenbiao Shao, Wenliang Chen
Title: Chain-of-Tools: Utilizing Massive Unseen Tools in the CoT Reasoning of Frozen Language Models
Abstract:
Tool learning can further broaden the usage scenarios of large language models (LLMs). However most of the existing methods either need to finetune that the model can only use tools seen in the training data, or add tool demonstrations into the prompt with lower efficiency. In this paper, we present a new Tool Learning method Chain-of-Tools. It makes full use of the powerful semantic representation capability of frozen LLMs to finish tool calling in CoT reasoning with a huge and flexible tool pool which may contain unseen tools. Especially, to validate the effectiveness of our approach in the massive unseen tool scenario, we construct a new dataset SimpleToolQuestions. We conduct experiments on two numerical reasoning benchmarks (GSM8K-XL and FuncQA) and two knowledge-based question answering benchmarks (KAMEL and SimpleToolQuestions). Experimental results show that our approach performs better than the baseline. We also identify dimensions of the model output that are critical in tool selection, enhancing the model interpretability. Our code and data are available at: https://github.com/fairyshine/Chain-of-Tools .
中文: Chain-of-Tools是一种新的工具学习方法,利用冻结大语言模型的强大语义表示能力,通过思维链推理处理海量可见及未见工具,在多个基准测试中优于基线方法并增强了模型可解释性。
English: Chain-of-Tools is a novel tool learning method that leverages frozen LLMs' semantic capabilities to handle both seen and unseen tools through CoT reasoning, outperforming baselines on multiple benchmarks while improving interpretability.

Authors:Massa Baali, Xiang Li, Hao Chen, Syed Abdul Hannan, Rita Singh, Bhiksha Raj
Title: CAARMA: Class Augmentation with Adversarial Mixup Regularization
Abstract:
Speaker verification is a typical zero-shot learning task, where inference of unseen classes is performed by comparing embeddings of test instances to known examples. The models performing inference must hence naturally generate embeddings that cluster same-class instances compactly, while maintaining separation across classes. In order to learn to do so, they are typically trained on a large number of classes (speakers), often using specialized losses. However real-world speaker datasets often lack the class diversity needed to effectively learn this in a generalizable manner. We introduce CAARMA, a class augmentation framework that addresses this problem by generating synthetic classes through data mixing in the embedding space, expanding the number of training classes. To ensure the authenticity of the synthetic classes we adopt a novel adversarial refinement mechanism that minimizes categorical distinctions between synthetic and real classes. We evaluate CAARMA on multiple speaker verification tasks, as well as other representative zero-shot comparison-based speech analysis tasks and obtain consistent improvements: our framework demonstrates a significant improvement of 8\% over all baseline models. The code is available at: https://github.com/massabaali7/CAARMA/
中文摘要:CAARMA是一个通过嵌入空间混合生成合成类别并采用对抗性优化确保真实性的类别增强框架,在说话人验证任务中相比基线模型实现了8%的性能提升。
English Summary: CAARMA is a class augmentation framework that enhances speaker verification by generating synthetic classes through embedding space mixing and adversarial refinement, achieving an 8% performance improvement over baseline models.

Authors:Xuan Shen, Weize Ma, Jing Liu, Changdi Yang, Rui Ding, Quanyi Wang, Henghui Ding, Wei Niu, Yanzhi Wang, Pu Zhao, Jun Lin, Jiuxiang Gu
Title: QuartDepth: Post-Training Quantization for Real-Time Depth Estimation on the Edge
Abstract:
Monocular Depth Estimation (MDE) has emerged as a pivotal task in computer vision, supporting numerous real-world applications. However, deploying accurate depth estimation models on resource-limited edge devices, especially Application-Specific Integrated Circuits (ASICs), is challenging due to the high computational and memory demands. Recent advancements in foundational depth estimation deliver impressive results but further amplify the difficulty of deployment on ASICs. To address this, we propose QuartDepth which adopts post-training quantization to quantize MDE models with hardware accelerations for ASICs. Our approach involves quantizing both weights and activations to 4-bit precision, reducing the model size and computation cost. To mitigate the performance degradation, we introduce activation polishing and compensation algorithm applied before and after activation quantization, as well as a weight reconstruction method for minimizing errors in weight quantization. Furthermore, we design a flexible and programmable hardware accelerator by supporting kernel fusion and customized instruction programmability, enhancing throughput and efficiency. Experimental results demonstrate that our framework achieves competitive accuracy while enabling fast inference and higher energy efficiency on ASICs, bridging the gap between high-performance depth estimation and practical edge-device applicability. Code: https://github.com/shawnricecake/quart-depth
中文摘要:QuartDepth框架通过后训练量化将单目深度估计模型压缩至4位精度,结合激活优化和权重重建技术,在保持精度的同时显著提升ASIC芯片上的推理速度和能效,弥合了高性能深度估计与边缘设备实际应用之间的差距。
English Summary: QuartDepth is a framework that employs post-training quantization to reduce Monocular Depth Estimation models to 4-bit precision for efficient deployment on ASICs, incorporating activation polishing and weight reconstruction techniques to maintain accuracy while enhancing computational speed and energy efficiency.

Authors:Jinlong Li, Cristiano Saltori, Fabio Poiesi, Nicu Sebe
Title: Cross-Modal and Uncertainty-Aware Agglomeration for Open-Vocabulary 3D Scene Understanding
Abstract:
The lack of a large-scale 3D-text corpus has led recent works to distill open-vocabulary knowledge from vision-language models (VLMs). However, these methods typically rely on a single VLM to align the feature spaces of 3D models within a common language space, which limits the potential of 3D models to leverage the diverse spatial and semantic capabilities encapsulated in various foundation models. In this paper, we propose Cross-modal and Uncertainty-aware Agglomeration for Open-vocabulary 3D Scene Understanding dubbed CUA-O3D, the first model to integrate multiple foundation models-such as CLIP, DINOv2, and Stable Diffusion-into 3D scene understanding. We further introduce a deterministic uncertainty estimation to adaptively distill and harmonize the heterogeneous 2D feature embeddings from these models. Our method addresses two key challenges: (1) incorporating semantic priors from VLMs alongside the geometric knowledge of spatially-aware vision foundation models, and (2) using a novel deterministic uncertainty estimation to capture model-specific uncertainties across diverse semantic and geometric sensitivities, helping to reconcile heterogeneous representations during training. Extensive experiments on ScanNetV2 and Matterport3D demonstrate that our method not only advances open-vocabulary segmentation but also achieves robust cross-domain alignment and competitive spatial perception capabilities. The code will be available at: https://github.com/TyroneLi/CUA_O3D.
中文: 现有开放词汇3D场景理解方法因依赖单一视觉语言模型而受限,本文提出的CUA-O3D模型通过集成多基础模型并引入不确定性感知特征蒸馏技术,有效融合语义先验与几何知识,在场景分割和跨域对齐任务中实现了突破性性能。
English: Recent methods for open-vocabulary 3D scene understanding have been limited by relying on a single vision-language model, but our proposed CUA-O3D model integrates multiple foundation models and introduces uncertainty-aware feature distillation to enhance semantic and geometric capabilities, achieving superior performance in segmentation and cross-domain alignment.

Authors:Tianze Luo, Xingchen Miao, Wenbo Duan
Title: WaveFM: A High-Fidelity and Efficient Vocoder Based on Flow Matching
Abstract:
Flow matching offers a robust and stable approach to training diffusion models. However, directly applying flow matching to neural vocoders can result in subpar audio quality. In this work, we present WaveFM, a reparameterized flow matching model for mel-spectrogram conditioned speech synthesis, designed to enhance both sample quality and generation speed for diffusion vocoders. Since mel-spectrograms represent the energy distribution of waveforms, WaveFM adopts a mel-conditioned prior distribution instead of a standard Gaussian prior to minimize unnecessary transportation costs during synthesis. Moreover, while most diffusion vocoders rely on a single loss function, we argue that incorporating auxiliary losses, including a refined multi-resolution STFT loss, can further improve audio quality. To speed up inference without degrading sample quality significantly, we introduce a tailored consistency distillation method for WaveFM. Experiment results demonstrate that our model achieves superior performance in both quality and efficiency compared to previous diffusion vocoders, while enabling waveform generation in a single inference step.
中文: WaveFM通过采用梅尔谱条件先验和辅助损失来提升音频质量,并结合定制的一致性蒸馏方法,实现了在单步推理中快速生成高质量波形。
English: WaveFM enhances diffusion vocoders by using a mel-conditioned prior and auxiliary losses to improve audio quality and a tailored consistency distillation for faster, single-step waveform generation.

Authors:Martin Kostelník, Karel Beneš, Michal Hradiš
Title: TextBite: A Historical Czech Document Dataset for Logical Page Segmentation
Abstract:
Logical page segmentation is an important step in document analysis, enabling better semantic representations, information retrieval, and text understanding. Previous approaches define logical segmentation either through text or geometric objects, relying on OCR or precise geometry. To avoid the need for OCR, we define the task purely as segmentation in the image domain. Furthermore, to ensure the evaluation remains unaffected by geometrical variations that do not impact text segmentation, we propose to use only foreground text pixels in the evaluation metric and disregard all background pixels. To support research in logical document segmentation, we introduce TextBite, a dataset of historical Czech documents spanning the 18th to 20th centuries, featuring diverse layouts from newspapers, dictionaries, and handwritten records. The dataset comprises 8,449 page images with 78,863 annotated segments of logically and thematically coherent text. We propose a set of baseline methods combining text region detection and relation prediction. The dataset, baselines and evaluation framework can be accessed at https://github.com/DCGM/textbite-dataset.
中文: 本研究提出了一种纯图像域的逻辑页面分割方法,无需依赖OCR,并发布了包含历史捷克文献的TextBite数据集、基线方法及评估框架。
English: This study introduces a purely image-based approach to logical page segmentation, avoiding OCR dependency, and presents the TextBite dataset of historical Czech documents with baseline methods and an evaluation framework.

Authors:Alejandro Ariza-Casabona, Nikos Kanakaris, Daniele Malitesta
Title: ContextGNN goes to Elliot: Towards Benchmarking Relational Deep Learning for Static Link Prediction (aka Personalized Item Recommendation)
Abstract:
Relational deep learning (RDL) settles among the most exciting advances in machine learning for relational databases, leveraging the representational power of message passing graph neural networks (GNNs) to derive useful knowledge and run predicting tasks on tables connected through primary-to-foreign key links. The RDL paradigm has been successfully applied to recommendation lately, through its most recent representative deep learning architecture namely, ContextGNN. While acknowledging ContextGNN's improved performance on real-world recommendation datasets and tasks, preliminary tests for the more traditional static link prediction task (aka personalized item recommendation) on the popular Amazon Book dataset have demonstrated how ContextGNN has still room for improvement compared to other state-of-the-art GNN-based recommender systems. To this end, with this paper, we integrate ContextGNN within Elliot, a popular framework for reproducibility and benchmarking analyses, counting around 50 state-of-the-art recommendation models from the literature to date. On such basis, we run preliminary experiments on three standard recommendation datasets and against six state-of-the-art GNN-based recommender systems, confirming similar trends to those observed by the authors in their original paper. The code is publicly available on GitHub: https://github.com/danielemalitesta/Rel-DeepLearning-RecSys.
Chinese: 关系深度学习,尤其是ContextGNN,在推荐系统中显示出潜力,但在静态链接预测任务上仍需改进,这一点通过将其整合到Elliot框架中并与基于GNN的先进模型在标准数据集上的比较得到了验证。
English: Relational deep learning, particularly ContextGNN, shows promise in recommendation systems but requires further enhancement for static link prediction tasks, as demonstrated through integration with the Elliot framework and comparisons with other GNN-based models on standard datasets.

Authors:Moshiur Rahman Tonmoy, Md. Mithun Hossain, Nilanjan Dey, M. F. Mridha
Title: MobilePlantViT: A Mobile-friendly Hybrid ViT for Generalized Plant Disease Image Classification
Abstract:
Plant diseases significantly threaten global food security by reducing crop yields and undermining agricultural sustainability. AI-driven automated classification has emerged as a promising solution, with deep learning models demonstrating impressive performance in plant disease identification. However, deploying these models on mobile and edge devices remains challenging due to high computational demands and resource constraints, highlighting the need for lightweight, accurate solutions for accessible smart agriculture systems. To address this, we propose MobilePlantViT, a novel hybrid Vision Transformer (ViT) architecture designed for generalized plant disease classification, which optimizes resource efficiency while maintaining high performance. Extensive experiments across diverse plant disease datasets of varying scales show our model's effectiveness and strong generalizability, achieving test accuracies ranging from 80% to over 99%. Notably, with only 0.69 million parameters, our architecture outperforms the smallest versions of MobileViTv1 and MobileViTv2, despite their higher parameter counts. These results underscore the potential of our approach for real-world, AI-powered automated plant disease classification in sustainable and resource-efficient smart agriculture systems. All codes will be available in the GitHub repository: https://github.com/moshiurtonmoy/MobilePlantViT
中文摘要:提出的MobilePlantViT模型通过轻量级视觉变换器架构解决了植物病害分类中的计算难题,仅用69万参数即可实现高达99%的准确率,为可持续智慧农业提供了高效解决方案。
English Summary: The proposed MobilePlantViT model addresses computational challenges in plant disease classification by combining a lightweight Vision Transformer architecture with high accuracy, achieving up to 99% performance using only 0.69 million parameters for sustainable smart agriculture.

Authors:Songqiao Hu, Zidong Wang, Zeyi Liu, Zhen Shen, Xiao He
Title: SafeLink: Safety-Critical Control Under Dynamic and Irregular Unsafe Regions
Abstract:
Control barrier functions (CBFs) provide a theoretical foundation for safety-critical control in robotic systems. However, most existing methods rely on the analytical expressions of unsafe state regions, which are often impractical for irregular and dynamic unsafe regions. This paper introduces SafeLink, a novel CBF construction method based on cost-sensitive incremental random vector functional-link (RVFL) neural networks. By designing a valid cost function, SafeLink assigns different sensitivities to safe and unsafe state points, thereby eliminating false negatives in classification of unsafe state points. Furthermore, an incremental update theorem is established, enabling precise real-time adaptation to changes in unsafe regions. An analytical expression for the gradient of SafeLink is also derived to facilitate control input computation. The proposed method is validated on the endpoint position control task of a nonlinear two-link manipulator. Experimental results demonstrate that the method effectively learns the unsafe regions and rapidly adapts as these regions change, achieving an update speed significantly faster than comparison methods, while safely reaching the target position. The source code is available at https://github.com/songqiaohu/SafeLink.
中文:本文提出SafeLink方法,通过基于代价敏感增量随机向量函数链接神经网络的新型控制屏障函数构建方案,能实时精准学习并快速适应动态危险区域,在机械臂控制实验中展现出远超对比方法的更新速度与安全保障。
English: This paper presents SafeLink, a novel control barrier function method using cost-sensitive incremental neural networks to accurately learn and rapidly adapt to dynamic unsafe regions in real-time, validated on a robotic manipulator with superior speed and safety performance.

Authors:Felix Chen, Hangjie Yuan, Yunqiu Xu, Tao Feng, Jun Cen, Pengwei Liu, Zeying Huang, Yi Yang
Title: MathFlow: Enhancing the Perceptual Flow of MLLMs for Visual Mathematical Problems
Abstract:
Despite impressive performance across diverse tasks, Multimodal Large Language Models (MLLMs) have yet to fully demonstrate their potential in visual mathematical problem-solving, particularly in accurately perceiving and interpreting diagrams. Inspired by typical processes of humans, we hypothesize that the perception capabilities to extract meaningful information from diagrams is crucial, as it directly impacts subsequent inference processes. To validate this hypothesis, we developed FlowVerse, a comprehensive benchmark that categorizes all information used during problem-solving into four components, which are then combined into six problem versions for evaluation. Our preliminary results on FlowVerse reveal that existing MLLMs exhibit substantial limitations when extracting essential information and reasoned property from diagrams and performing complex reasoning based on these visual inputs. In response, we introduce MathFlow, a modular problem-solving pipeline that decouples perception and inference into distinct stages, thereby optimizing each independently. Given the perceptual limitations observed in current MLLMs, we trained MathFlow-P-7B as a dedicated perception model. Experimental results indicate that MathFlow-P-7B yields substantial performance gains when integrated with various closed-source and open-source inference models. This demonstrates the effectiveness of the MathFlow pipeline and its compatibility to diverse inference frameworks. The FlowVerse benchmark and code are available at https://github.com/MathFlow-zju/MathFlow.
中文: 多模态大语言模型在视觉数学解题中因图表感知能力不足而表现不佳,为此提出的MathFlow模块化流程将感知与推理分离,有效提升了各类推理模型的性能。
English: Multimodal Large Language Models struggle with visual mathematical problem-solving due to poor diagram perception, prompting the development of MathFlow, a modular pipeline that separates perception and inference stages to significantly enhance performance.

Authors:Xinyan Chen, Jiaxin Ge, Hongming Dai, Qiang Zhou, Qiuxuan Feng, Jingtong Hu, Yizhou Wang, Jiaming Liu, Shanghang Zhang
Title: EmpathyAgent: Can Embodied Agents Conduct Empathetic Actions?
Abstract:
Empathy is fundamental to human interactions, yet it remains unclear whether embodied agents can provide human-like empathetic support. Existing works have studied agents' tasks solving and social interactions abilities, but whether agents can understand empathetic needs and conduct empathetic behaviors remains overlooked. To address this, we introduce EmpathyAgent, the first benchmark to evaluate and enhance agents' empathetic actions across diverse scenarios. EmpathyAgent contains 10,000 multimodal samples with corresponding empathetic task plans and three different challenges. To systematically evaluate the agents' empathetic actions, we propose an empathy-specific evaluation suite that evaluates the agents' empathy process. We benchmark current models and found that exhibiting empathetic actions remains a significant challenge. Meanwhile, we train Llama3-8B using EmpathyAgent and find it can potentially enhance empathetic behavior. By establishing a standard benchmark for evaluating empathetic actions, we hope to advance research in empathetic embodied agents. Our code and data are publicly available at https://github.com/xinyan-cxy/EmpathyAgent.
中文: 该摘要介绍了EmpathyAgent,这是首个通过一万个多模态场景和专门评估套件来测评和增强具身智能体共情行为的基准,既揭示了现有模型的不足,也展示了通过Llama3-8B训练实现改进的潜力。
English: This abstract introduces EmpathyAgent, the first benchmark designed to evaluate and enhance empathetic behaviors in embodied agents through 10,000 multimodal scenarios and a specialized evaluation suite, revealing current models' limitations while demonstrating potential improvements via training with Llama3-8B.

Authors:Shuo Huang, Muhammad Umair Nasir, Steven James, Julian Togelius
Title: Word2Minecraft: Generating 3D Game Levels through Large Language Models
Abstract:
We present Word2Minecraft, a system that leverages large language models to generate playable game levels in Minecraft based on structured stories. The system transforms narrative elements-such as protagonist goals, antagonist challenges, and environmental settings-into game levels with both spatial and gameplay constraints. We introduce a flexible framework that allows for the customization of story complexity, enabling dynamic level generation. The system employs a scaling algorithm to maintain spatial consistency while adapting key game elements. We evaluate Word2Minecraft using both metric-based and human-based methods. Our results show that GPT-4-Turbo outperforms GPT-4o-Mini in most areas, including story coherence and objective enjoyment, while the latter excels in aesthetic appeal. We also demonstrate the system' s ability to generate levels with high map enjoyment, offering a promising step forward in the intersection of story generation and game design. We open-source the code at https://github.com/JMZ-kk/Word2Minecraft/tree/word2mc_v0
中文摘要:Word2Minecraft是一个利用大语言模型将结构化故事转化为可定制复杂度的可玩《我的世界》关卡的系统,其中GPT-4-Turbo在故事连贯性和游戏目标趣味性方面表现更优,相关代码已开源发布。
English Summary: Word2Minecraft is a system that uses large language models to convert structured stories into playable Minecraft levels with customizable complexity, demonstrating superior performance with GPT-4-Turbo in story coherence and player enjoyment while making the code publicly available.

Authors:Tidiane Camaret Ndir, Robin Tibor Schirrmeister, Tonio Ball
Title: EEG-CLIP : Learning EEG representations from natural language descriptions
Abstract:
Deep networks for electroencephalogram (EEG) decoding are often only trained to solve one specific task, such as pathology or age decoding. A more general task-agnostic approach is to train deep networks to match a (clinical) EEG recording to its corresponding textual medical report and vice versa. This approach was pioneered in the computer vision domain matching images and their text captions and subsequently allowed to do successful zero-shot decoding using textual class prompts. In this work, we follow this approach and develop a contrastive learning framework, EEG-CLIP, that aligns the EEG time series and the descriptions of the corresponding clinical text in a shared embedding space. We investigated its potential for versatile EEG decoding, evaluating performance in a range of few-shot and zero-shot settings. Overall, we show that EEG-CLIP manages to non-trivially align text and EEG representations. Our work presents a promising approach to learn general EEG representations, which could enable easier analyses of diverse decoding questions through zero-shot decoding or training task-specific models from fewer training examples. The code for reproducing our results is available at https://github.com/tidiane-camaret/EEGClip
中文: 本研究提出EEG-CLIP对比学习框架,通过将脑电图数据与临床文本映射到共享嵌入空间,实现了多功能的零样本和少样本解码,为更广泛的脑电分析应用开辟了新途径。
English: This study introduces EEG-CLIP, a contrastive learning framework that aligns EEG data with clinical text in a shared embedding space, enabling versatile zero-shot and few-shot decoding for broader EEG analysis applications.

Authors:Wenjing Zhang, Xuejiao Lei, Zhaoxiang Liu, Limin Han, Jiaojiao Zhao, Junting Guo, Zhenhong Long, Shu Yang, Meijuan An, Beibei Huang, Rongjia Du, Ning Wang, Kai Wang, Shiguo Lian
Title: Safety Evaluation and Enhancement of DeepSeek Models in Chinese Contexts
Abstract:
DeepSeek-R1, renowned for its exceptional reasoning capabilities and open-source strategy, is significantly influencing the global artificial intelligence landscape. However, it exhibits notable safety shortcomings. Recent research conducted by Robust Intelligence, a subsidiary of Cisco, in collaboration with the University of Pennsylvania, revealed that DeepSeek-R1 achieves a 100\% attack success rate when processing harmful prompts. Furthermore, multiple security firms and research institutions have identified critical security vulnerabilities within the model. Although China Unicom has uncovered safety vulnerabilities of R1 in Chinese contexts, the safety capabilities of the remaining distilled models in the R1 series have not yet been comprehensively evaluated. To address this gap, this study utilizes the comprehensive Chinese safety benchmark CHiSafetyBench to conduct an in-depth safety evaluation of the DeepSeek-R1 series distilled models. The objective is to assess the safety capabilities of these models in Chinese contexts both before and after distillation, and to further elucidate the adverse effects of distillation on model safety. Building on these findings, we implement targeted safety enhancements for the entire DeepSeek-R1 model series. Evaluation results indicate that the enhanced models achieve significant improvements in safety while maintaining reasoning capabilities without notable degradation. We open-source the safety-enhanced models at https://github.com/UnicomAI/DeepSeek-R1-Safe to serve as a valuable resource for future research and optimization of DeepSeek models.
中文: DeepSeek-R1存在显著的安全缺陷,但本研究通过针对性增强在保持推理能力的同时大幅提升了其安全性,并将安全增强模型开源以供后续研究。
English: DeepSeek-R1 exhibits significant safety vulnerabilities, particularly with harmful prompts, but this study enhances its safety through targeted improvements while preserving reasoning capabilities, with the safety-enhanced models being open-sourced for further research.

Authors:Rishabh Vishwakarma, Caroline Brophy, Catherine Hurley
Title: PieGlyph: An R package for creating axis invariant pie-glyphs for 2d plots
Abstract:
Effective visualisation of multidimensional data is crucial for generating insights. Glyph-based visualisations, which encode data dimensions onto multiple visual channels such as colour, shape, and size, provide an effective means of representing complex datasets. Pie-chart glyphs (pie-glyphs) are one such approach, where multiple data attributes are mapped to slices within a pie chart. This paper introduces the PieGlyph R package, which enables users to overlay any 2D plot with axis-invariant pie-glyphs, offering a compact and intuitive representation of multidimensional data. Unlike existing R packages such as scatterpie or ggforce, PieGlyph generates pie-glyphs independently of the plot axes by employing a nested coordinate system, ensuring they remain circular regardless of changes to the underlying coordinate system. This enhances interpretability, particularly in when visualising spatial data, as users can select the most appropriate map projection without distorting the glyphs' shape. Pie-glyphs are also particularly well-suited for visualising compositional data, where there is a natural sum-to-one constraint on the data attributes. PieGlyph is developed under the Grammar of Graphics paradigm using the ggplot2 framework and supports the generation of interactive pie-glyphs through the ggiraph package. Designed to integrate seamlessly with all features and extensions offered by ggplot2 and ggiraph, PieGlyph provides users with full flexibility in customising every aspect of the visualisation. This paper outlines the conceptual framework of PieGlyph, compares it with existing alternatives, and demonstrates its applications through example visualisations.
中文:PieGlyph R 包引入了轴不变饼图符号用于多维数据可视化,确保圆形符号在不同坐标系中保持无失真,并提升空间和成分数据的可解释性。
English: The PieGlyph R package introduces axis-invariant pie-glyphs for multidimensional data visualization, ensuring circular glyphs remain undistorted across coordinate systems and enhancing interpretability for spatial and compositional data.

Authors:Marc R. Schlichting, Vale Rasmussen, Heba Alazzeh, Houjun Liu, Kiana Jafari, Amelia F. Hardy, Dylan M. Asmar, Mykel J. Kochenderfer
Title: LeRAAT: LLM-Enabled Real-Time Aviation Advisory Tool
Abstract:
In aviation emergencies, high-stakes decisions must be made in an instant. Pilots rely on quick access to precise, context-specific information -- an area where emerging tools like large language models (LLMs) show promise in providing critical support. This paper introduces LeRAAT, a framework that integrates LLMs with the X-Plane flight simulator to deliver real-time, context-aware pilot assistance. The system uses live flight data, weather conditions, and aircraft documentation to generate recommendations aligned with aviation best practices and tailored to the particular situation. It employs a Retrieval-Augmented Generation (RAG) pipeline that extracts and synthesizes information from aircraft type-specific manuals, including performance specifications and emergency procedures, as well as aviation regulatory materials, such as FAA directives and standard operating procedures. We showcase the framework in both a virtual reality and traditional on-screen simulation, supporting a wide range of research applications such as pilot training, human factors research, and operational decision support.
Chinese: 本文介绍了LeRAAT框架,它将大型语言模型与X-Plane飞行模拟器相结合,通过检索增强生成技术整合实时飞行数据、天气条件和航空文档,为飞行员提供实时、情境感知的辅助支持。
English: This paper presents LeRAAT, a framework that integrates large language models with the X-Plane flight simulator to provide real-time, context-aware pilot assistance by synthesizing live flight data, weather conditions, and aviation documentation through a Retrieval-Augmented Generation pipeline.

Authors:Pengzhou Cheng, Zheng Wu, Zongru Wu, Aston Zhang, Zhuosheng Zhang, Gongshen Liu
Title: OS-Kairos: Adaptive Interaction for MLLM-Powered GUI Agents
Abstract:
Autonomous graphical user interface (GUI) agents powered by multimodal large language models have shown great promise. However, a critical yet underexplored issue persists: over-execution, where the agent executes tasks in a fully autonomous way, without adequate assessment of its action confidence to compromise an adaptive human-agent collaboration. This poses substantial risks in complex scenarios, such as those involving ambiguous user instructions, unexpected interruptions, and environmental hijacks. To address the issue, we introduce OS-Kairos, an adaptive GUI agent capable of predicting confidence levels at each interaction step and efficiently deciding whether to act autonomously or seek human intervention. OS-Kairos is developed through two key mechanisms: (i) collaborative probing that annotates confidence scores at each interaction step; (ii) confidence-driven interaction that leverages these confidence scores to elicit the ability of adaptive interaction. Experimental results show that OS-Kairos substantially outperforms existing models on our curated dataset featuring complex scenarios, as well as on established benchmarks such as AITZ and Meta-GUI, with 24.59\%$\sim$87.29\% improvements in task success rate. OS-Kairos facilitates an adaptive human-agent collaboration, prioritizing effectiveness, generality, scalability, and efficiency for real-world GUI interaction. The dataset and codes are available at https://github.com/Wuzheng02/OS-Kairos.
中文摘要:OS-Kairos是一种自适应图形界面代理,通过预测每个交互步骤的置信度来自主决定执行操作或请求人工干预,在复杂场景中显著优于现有模型。
English Summary: OS-Kairos is an adaptive GUI agent that predicts confidence levels at each interaction step to determine when to act autonomously or seek human intervention, significantly outperforming existing models in complex scenarios.

Authors:Haidong Wang, Qia Shan, JianHua Zhang, PengFei Xiao, Ao Liu
Title: An Audio-Visual Fusion Emotion Generation Model Based on Neuroanatomical Alignment
Abstract:
In the field of affective computing, traditional methods for generating emotions predominantly rely on deep learning techniques and large-scale emotion datasets. However, deep learning techniques are often complex and difficult to interpret, and standardizing large-scale emotional datasets are difficult and costly to establish. To tackle these challenges, we introduce a novel framework named Audio-Visual Fusion for Brain-like Emotion Learning(AVF-BEL). In contrast to conventional brain-inspired emotion learning methods, this approach improves the audio-visual emotion fusion and generation model through the integration of modular components, thereby enabling more lightweight and interpretable emotion learning and generation processes. The framework simulates the integration of the visual, auditory, and emotional pathways of the brain, optimizes the fusion of emotional features across visual and auditory modalities, and improves upon the traditional Brain Emotional Learning (BEL) model. The experimental results indicate a significant improvement in the similarity of the audio-visual fusion emotion learning generation model compared to single-modality visual and auditory emotion learning and generation model. Ultimately, this aligns with the fundamental phenomenon of heightened emotion generation facilitated by the integrated impact of visual and auditory stimuli. This contribution not only enhances the interpretability and efficiency of affective intelligence but also provides new insights and pathways for advancing affective computing technology. Our source code can be accessed here: https://github.com/OpenHUTB/emotion}{https://github.com/OpenHUTB/emotion.
中文摘要:提出的视听融合类脑情感学习框架通过优化多模态情感特征融合,克服了传统深度学习方法的复杂性,实现了更轻量化、可解释的情感生成系统。
English Summary: The proposed Audio-Visual Fusion for Brain-like Emotion Learning (AVF-BEL) framework overcomes limitations of complex deep learning methods by creating a more interpretable and lightweight system that significantly enhances emotion generation through optimized multimodal fusion.

Authors:Ruyi Xu, Guangxuan Xiao, Haofeng Huang, Junxian Guo, Song Han
Title: XAttention: Block Sparse Attention with Antidiagonal Scoring
Abstract:
Long-Context Transformer Models (LCTMs) are vital for real-world applications but suffer high computational costs due to attention's quadratic complexity. Block-sparse attention mitigates this by focusing computation on critical regions, yet existing methods struggle with balancing accuracy and efficiency due to costly block importance measurements. In this paper, we introduce XAttention, a plug-and-play framework that dramatically accelerates long-context inference in Transformers models using sparse attention. XAttention's key innovation is the insight that the sum of antidiagonal values (i.e., from the lower-left to upper-right) in the attention matrix provides a powerful proxy for block importance. This allows for precise identification and pruning of non-essential blocks, resulting in high sparsity and dramatically accelerated inference. Across comprehensive evaluations on demanding long-context benchmarks-including RULER and LongBench for language, VideoMME for video understanding, and VBench for video generation. XAttention achieves accuracy comparable to full attention while delivering substantial computational gains. We demonstrate up to 13.5x acceleration in attention computation. These results underscore XAttention's ability to unlock the practical potential of block sparse attention, paving the way for scalable and efficient deployment of LCTMs in real-world applications. Code is available at https://github.com/mit-han-lab/x-attention.
中文: XAttention是一种即插即用框架,通过使用注意力矩阵中反对角线值之和作为块重要性的代理,显著加速长上下文Transformer推理,在保持接近全注意力精度的同时实现高达13.5倍的计算加速。
English: XAttention is a plug-and-play framework that accelerates long-context Transformer inference by using the sum of antidiagonal values in attention matrices as a proxy for block importance, achieving near-full accuracy with up to 13.5x computational speedup.

Authors:Zigang Geng, Mengde Xu, Han Hu, Shuyang Gu
Title: Tokenize Image as a Set
Abstract:
This paper proposes a fundamentally new paradigm for image generation through set-based tokenization and distribution modeling. Unlike conventional methods that serialize images into fixed-position latent codes with a uniform compression ratio, we introduce an unordered token set representation to dynamically allocate coding capacity based on regional semantic complexity. This TokenSet enhances global context aggregation and improves robustness against local perturbations. To address the critical challenge of modeling discrete sets, we devise a dual transformation mechanism that bijectively converts sets into fixed-length integer sequences with summation constraints. Further, we propose Fixed-Sum Discrete Diffusion--the first framework to simultaneously handle discrete values, fixed sequence length, and summation invariance--enabling effective set distribution modeling. Experiments demonstrate our method's superiority in semantic-aware representation and generation quality. Our innovations, spanning novel representation and modeling strategies, advance visual generation beyond traditional sequential token paradigms. Our code and models are publicly available at https://github.com/Gengzigang/TokenSet.
中文摘要:本文提出TokenSet这一创新图像生成方法,通过无序令牌集合和带固定和约束的双重转换机制,动态分配编码容量并提升生成质量。
English Summary: This paper introduces TokenSet, a novel image generation method using unordered token sets and a dual transformation mechanism with Fixed-Sum Discrete Diffusion to dynamically allocate coding capacity and enhance generation quality.

Authors:Yang Sui, Yu-Neng Chuang, Guanchu Wang, Jiamu Zhang, Tianyi Zhang, Jiayi Yuan, Hongyi Liu, Andrew Wen, Shaochen Zhong, Na Zou, Hanjie Chen, Xia Hu
Title: Stop Overthinking: A Survey on Efficient Reasoning for Large Language Models
Abstract:
Large Language Models (LLMs) have demonstrated remarkable capabilities in complex tasks. Recent advancements in Large Reasoning Models (LRMs), such as OpenAI o1 and DeepSeek-R1, have further improved performance in System-2 reasoning domains like mathematics and programming by harnessing supervised fine-tuning (SFT) and reinforcement learning (RL) techniques to enhance the Chain-of-Thought (CoT) reasoning. However, while longer CoT reasoning sequences improve performance, they also introduce significant computational overhead due to verbose and redundant outputs, known as the "overthinking phenomenon". In this paper, we provide the first structured survey to systematically investigate and explore the current progress toward achieving efficient reasoning in LLMs. Overall, relying on the inherent mechanism of LLMs, we categorize existing works into several key directions: (1) model-based efficient reasoning, which considers optimizing full-length reasoning models into more concise reasoning models or directly training efficient reasoning models; (2) reasoning output-based efficient reasoning, which aims to dynamically reduce reasoning steps and length during inference; (3) input prompts-based efficient reasoning, which seeks to enhance reasoning efficiency based on input prompt properties such as difficulty or length control. Additionally, we introduce the use of efficient data for training reasoning models, explore the reasoning capabilities of small language models, and discuss evaluation methods and benchmarking. Project website: https://github.com/Eclipsess/Awesome-Efficient-Reasoning-LLMs
中文: 本综述系统研究提升大语言模型推理效率的方法,通过将现有工作分类为模型优化、动态步骤缩减和提示增强三大方向,以解决冗长推理链导致的计算效率问题。
English: This survey systematically investigates methods to enhance reasoning efficiency in Large Language Models by categorizing approaches into model optimization, dynamic step reduction, and prompt-based strategies, while addressing computational inefficiencies from verbose reasoning chains.

Authors:Liming Jiang, Qing Yan, Yumin Jia, Zichuan Liu, Hao Kang, Xin Lu
Title: InfiniteYou: Flexible Photo Recrafting While Preserving Your Identity
Abstract:
Achieving flexible and high-fidelity identity-preserved image generation remains formidable, particularly with advanced Diffusion Transformers (DiTs) like FLUX. We introduce InfiniteYou (InfU), one of the earliest robust frameworks leveraging DiTs for this task. InfU addresses significant issues of existing methods, such as insufficient identity similarity, poor text-image alignment, and low generation quality and aesthetics. Central to InfU is InfuseNet, a component that injects identity features into the DiT base model via residual connections, enhancing identity similarity while maintaining generation capabilities. A multi-stage training strategy, including pretraining and supervised fine-tuning (SFT) with synthetic single-person-multiple-sample (SPMS) data, further improves text-image alignment, ameliorates image quality, and alleviates face copy-pasting. Extensive experiments demonstrate that InfU achieves state-of-the-art performance, surpassing existing baselines. In addition, the plug-and-play design of InfU ensures compatibility with various existing methods, offering a valuable contribution to the broader community.
中文摘要:InfiniteYou(InfU)框架通过引入InfuseNet进行身份特征注入和多阶段训练策略,解决了扩散变换器中身份保持与生成质量的难题,实现了顶尖性能并具备即插即用的兼容性。
English Summary: The InfiniteYou (InfU) framework addresses identity preservation and quality issues in Diffusion Transformers by introducing InfuseNet for feature injection and a multi-stage training strategy, achieving state-of-the-art performance with plug-and-play compatibility.

Authors:Xueyan Zou, Yuchen Song, Ri-Zhao Qiu, Xuanbin Peng, Jianglong Ye, Sifei Liu, Xiaolong Wang
Title: M3: 3D-Spatial MultiModal Memory
Abstract:
We present 3D Spatial MultiModal Memory (M3), a multimodal memory system designed to retain information about medium-sized static scenes through video sources for visual perception. By integrating 3D Gaussian Splatting techniques with foundation models, M3 builds a multimodal memory capable of rendering feature representations across granularities, encompassing a wide range of knowledge. In our exploration, we identify two key challenges in previous works on feature splatting: (1) computational constraints in storing high-dimensional features for each Gaussian primitive, and (2) misalignment or information loss between distilled features and foundation model features. To address these challenges, we propose M3 with key components of principal scene components and Gaussian memory attention, enabling efficient training and inference. To validate M3, we conduct comprehensive quantitative evaluations of feature similarity and downstream tasks, as well as qualitative visualizations to highlight the pixel trace of Gaussian memory attention. Our approach encompasses a diverse range of foundation models, including vision-language models (VLMs), perception models, and large multimodal and language models (LMMs/LLMs). Furthermore, to demonstrate real-world applicability, we deploy M3's feature field in indoor scenes on a quadruped robot. Notably, we claim that M3 is the first work to address the core compression challenges in 3D feature distillation.
中文摘要:本文提出M3多模态记忆系统,通过结合3D高斯泼溅技术与基础模型构建可跨粒度渲染特征的多模态记忆,并针对特征存储计算瓶颈和特征对齐问题提出核心场景组件与高斯记忆注意力机制,在定量评估与机器人部署中验证了其有效性。
English Summary: The paper introduces M3, a multimodal memory system that uses 3D Gaussian Splatting and foundation models to create detailed scene representations while overcoming computational and feature alignment challenges through novel components like principal scene components and Gaussian memory attention.

Authors:SeungJu Cha, Kwanyoung Lee, Ye-Chan Kim, Hyunwoo Oh, Dong-Jin Kim
Title: VerbDiff: Text-Only Diffusion Models with Enhanced Interaction Awareness
Abstract:
Recent large-scale text-to-image diffusion models generate photorealistic images but often struggle to accurately depict interactions between humans and objects due to their limited ability to differentiate various interaction words. In this work, we propose VerbDiff to address the challenge of capturing nuanced interactions within text-to-image diffusion models. VerbDiff is a novel text-to-image generation model that weakens the bias between interaction words and objects, enhancing the understanding of interactions. Specifically, we disentangle various interaction words from frequency-based anchor words and leverage localized interaction regions from generated images to help the model better capture semantics in distinctive words without extra conditions. Our approach enables the model to accurately understand the intended interaction between humans and objects, producing high-quality images with accurate interactions aligned with specified verbs. Extensive experiments on the HICO-DET dataset demonstrate the effectiveness of our method compared to previous approaches.
中文摘要:VerbDiff是一种新型文生图模型,通过解耦交互动词与高频锚点词并利用局部交互区域增强语义理解,无需额外条件即可生成更精准的人物与物体交互图像,在HICO-DET数据集上的实验验证了其有效性。
English Summary: VerbDiff is a novel text-to-image model that enhances interaction depiction by disentangling interaction verbs from common anchors and using localized regions to improve semantic understanding without additional constraints, producing more accurate human-object interaction images as validated on HICO-DET.

Authors:Yifan Sun, Han Wang, Dongbai Li, Gang Wang, Huan Zhang
Title: The Emperor's New Clothes in Benchmarking? A Rigorous Examination of Mitigation Strategies for LLM Benchmark Data Contamination
Abstract:
Benchmark Data Contamination (BDC)-the inclusion of benchmark testing samples in the training set-has raised increasing concerns in Large Language Model (LLM) evaluation, leading to falsely inflated performance estimates and undermining evaluation reliability. To address this, researchers have proposed various mitigation strategies to update existing benchmarks, including modifying original questions or generating new ones based on them. However, a rigorous examination of the effectiveness of these mitigation strategies remains lacking. In this paper, we design a systematic and controlled pipeline along with two novel metrics-fidelity and contamination resistance-to provide a fine-grained and comprehensive assessment of existing BDC mitigation strategies. Previous assessment methods, such as accuracy drop and accuracy matching, focus solely on aggregate accuracy, often leading to incomplete or misleading conclusions. Our metrics address this limitation by emphasizing question-level evaluation result matching. Extensive experiments with 10 LLMs, 5 benchmarks, 20 BDC mitigation strategies, and 2 contamination scenarios reveal that no existing strategy significantly improves resistance over the vanilla case (i.e., no benchmark update) across all benchmarks, and none effectively balances fidelity and contamination resistance. These findings underscore the urgent need for designing more effective BDC mitigation strategies. Our code repository is available at https://github.com/ASTRAL-Group/BDC_mitigation_assessment.
中文: 基准数据污染导致大语言模型评估性能虚高,尽管已有多种缓解策略,但我们的研究通过新指标发现这些策略均无法有效平衡保真度与抗污染能力,亟需开发更优方案。
English: Benchmark Data Contamination in LLM evaluation leads to inflated performance estimates, and while mitigation strategies exist, our study using novel metrics reveals that none effectively balance fidelity and contamination resistance, highlighting the need for better solutions.

Authors:Chen Chen, Zhirui Wang, Taowei Sheng, Yi Jiang, Yundu Li, Peirui Cheng, Luning Zhang, Kaiqiang Chen, Yanfeng Hu, Xue Yang, Xian Sun
Title: SA-Occ: Satellite-Assisted 3D Occupancy Prediction in Real World
Abstract:
Existing vision-based 3D occupancy prediction methods are inherently limited in accuracy due to their exclusive reliance on street-view imagery, neglecting the potential benefits of incorporating satellite views. We propose SA-Occ, the first Satellite-Assisted 3D occupancy prediction model, which leverages GPS & IMU to integrate historical yet readily available satellite imagery into real-time applications, effectively mitigating limitations of ego-vehicle perceptions, involving occlusions and degraded performance in distant regions. To address the core challenges of cross-view perception, we propose: 1) Dynamic-Decoupling Fusion, which resolves inconsistencies in dynamic regions caused by the temporal asynchrony between satellite and street views; 2) 3D-Proj Guidance, a module that enhances 3D feature extraction from inherently 2D satellite imagery; and 3) Uniform Sampling Alignment, which aligns the sampling density between street and satellite views. Evaluated on Occ3D-nuScenes, SA-Occ achieves state-of-the-art performance, especially among single-frame methods, with a 39.05% mIoU (a 6.97% improvement), while incurring only 6.93 ms of additional latency per frame. Our code and newly curated dataset are available at https://github.com/chenchen235/SA-Occ.
Chinese: SA-Occ首次提出卫星辅助的3D占据预测模型,通过动态解耦融合等创新技术将历史卫星图像与实时街景相结合,以39.05% mIoU的最优性能实现感知突破,且仅增加6.93毫秒延迟。
English: SA-Occ introduces the first satellite-assisted 3D occupancy prediction model that integrates historical satellite imagery with real-time street views through innovative fusion techniques, achieving state-of-the-art performance with a 39.05% mIoU and minimal latency.

Authors:Yunzhi Yao, Jizhan Fang, Jia-Chen Gu, Ningyu Zhang, Shumin Deng, Huajun Chen, Nanyun Peng
Title: CaKE: Circuit-aware Editing Enables Generalizable Knowledge Learners
Abstract:
Knowledge Editing (KE) enables the modification of outdated or incorrect information in large language models (LLMs). While existing KE methods can update isolated facts, they often fail to generalize these updates to multi-hop reasoning tasks that rely on the modified knowledge. Through an analysis of reasoning circuits -- the neural pathways LLMs use for knowledge-based inference, we find that current layer-localized KE approaches (e.g., MEMIT, WISE), which edit only single or a few model layers, inadequately integrate updated knowledge into these reasoning pathways. To address this limitation, we present CaKE (Circuit-aware Knowledge Editing), a novel method that enhances the effective integration of updated knowledge in LLMs. By only leveraging a few curated data samples guided by our circuit-based analysis, CaKE stimulates the model to develop appropriate reasoning circuits for newly incorporated knowledge. Experiments show that CaKE enables more accurate and consistent use of edited knowledge across related reasoning tasks, achieving an average improvement of 20% in multi-hop reasoning accuracy on the MQuAKE dataset while requiring less memory than existing KE methods. We release the code and data in https://github.com/zjunlp/CaKE.
中文摘要:CaKE是一种新颖的知识编辑方法,通过基于推理电路的分析,有效提升大语言模型对新知识的整合能力,在多跳推理任务中实现更高的准确性和一致性。
English Summary: CaKE is a novel knowledge editing method that improves the integration of updated knowledge into large language models by leveraging circuit-based analysis, resulting in enhanced multi-hop reasoning accuracy and efficiency.

Authors:Ruonan Yu, Songhua Liu, Zhenxiong Tan, Xinchao Wang
Title: Ultra-Resolution Adaptation with Ease
Abstract:
Text-to-image diffusion models have achieved remarkable progress in recent years. However, training models for high-resolution image generation remains challenging, particularly when training data and computational resources are limited. In this paper, we explore this practical problem from two key perspectives: data and parameter efficiency, and propose a set of key guidelines for ultra-resolution adaptation termed \emph{URAE}. For data efficiency, we theoretically and empirically demonstrate that synthetic data generated by some teacher models can significantly promote training convergence. For parameter efficiency, we find that tuning minor components of the weight matrices outperforms widely-used low-rank adapters when synthetic data are unavailable, offering substantial performance gains while maintaining efficiency. Additionally, for models leveraging guidance distillation, such as FLUX, we show that disabling classifier-free guidance, \textit{i.e.}, setting the guidance scale to 1 during adaptation, is crucial for satisfactory performance. Extensive experiments validate that URAE achieves comparable 2K-generation performance to state-of-the-art closed-source models like FLUX1.1 [Pro] Ultra with only 3K samples and 2K iterations, while setting new benchmarks for 4K-resolution generation. Codes are available \href{https://github.com/Huage001/URAE}{here}.
中文: URAE框架通过利用合成数据加速训练和优化参数调整,显著提升了高分辨率图像生成效果,仅用少量数据和计算资源就实现了业界领先的2K和4K生成性能。
English: The URAE framework enhances high-resolution image generation by using synthetic data for faster training and optimized parameter tuning, achieving state-of-the-art 2K and 4K results with minimal data and computational resources.

Authors:Vivek Gopalakrishnan, Neel Dey, David-Dimitris Chlorogiannis, Andrew Abumoussa, Anna M. Larson, Darren B. Orbach, Sarah Frisken, Polina Golland
Title: Rapid patient-specific neural networks for intraoperative X-ray to volume registration
Abstract:
The integration of artificial intelligence in image-guided interventions holds transformative potential, promising to extract 3D geometric and quantitative information from conventional 2D imaging modalities during complex procedures. Achieving this requires the rapid and precise alignment of 2D intraoperative images (e.g., X-ray) with 3D preoperative volumes (e.g., CT, MRI). However, current 2D/3D registration methods fail across the broad spectrum of procedures dependent on X-ray guidance: traditional optimization techniques require custom parameter tuning for each subject, whereas neural networks trained on small datasets do not generalize to new patients or require labor-intensive manual annotations, increasing clinical burden and precluding application to new anatomical targets. To address these challenges, we present xvr, a fully automated framework for training patient-specific neural networks for 2D/3D registration. xvr uses physics-based simulation to generate abundant high-quality training data from a patient's own preoperative volumetric imaging, thereby overcoming the inherently limited ability of supervised models to generalize to new patients and procedures. Furthermore, xvr requires only 5 minutes of training per patient, making it suitable for emergency interventions as well as planned procedures. We perform the largest evaluation of a 2D/3D registration algorithm on real X-ray data to date and find that xvr robustly generalizes across a diverse dataset comprising multiple anatomical structures, imaging modalities, and hospitals. Across surgical tasks, xvr achieves submillimeter-accurate registration at intraoperative speeds, improving upon existing methods by an order of magnitude. xvr is released as open-source software freely available at https://github.com/eigenvivek/xvr.
中文: xvr框架通过基于物理的模拟训练患者特定的神经网络,实现了快速精确的2D/3D医学图像配准,无需人工标注即可在不同手术中达到亚毫米级精度。
English: The xvr framework enables rapid and precise 2D/3D medical image registration by training patient-specific neural networks through physics-based simulation, achieving submillimeter accuracy across diverse procedures without manual annotations.

Authors:Zeqiang Lai, Yunfei Zhao, Zibo Zhao, Haolin Liu, Fuyun Wang, Huiwen Shi, Xianghui Yang, Qingxiang Lin, Jingwei Huang, Yuhong Liu, Jie Jiang, Chunchao Guo, Xiangyu Yue
Title: Unleashing Vecset Diffusion Model for Fast Shape Generation
Abstract:
3D shape generation has greatly flourished through the development of so-called "native" 3D diffusion, particularly through the Vecset Diffusion Model (VDM). While recent advancements have shown promising results in generating high-resolution 3D shapes, VDM still struggles with high-speed generation. Challenges exist because of difficulties not only in accelerating diffusion sampling but also VAE decoding in VDM, areas under-explored in previous works. To address these challenges, we present FlashVDM, a systematic framework for accelerating both VAE and DiT in VDM. For DiT, FlashVDM enables flexible diffusion sampling with as few as 5 inference steps and comparable quality, which is made possible by stabilizing consistency distillation with our newly introduced Progressive Flow Distillation. For VAE, we introduce a lightning vecset decoder equipped with Adaptive KV Selection, Hierarchical Volume Decoding, and Efficient Network Design. By exploiting the locality of the vecset and the sparsity of shape surface in the volume, our decoder drastically lowers FLOPs, minimizing the overall decoding overhead. We apply FlashVDM to Hunyuan3D-2 to obtain Hunyuan3D-2 Turbo. Through systematic evaluation, we show that our model significantly outperforms existing fast 3D generation methods, achieving comparable performance to the state-of-the-art while reducing inference time by over 45x for reconstruction and 32x for generation. Code and models are available at https://github.com/Tencent/FlashVDM.
Chinese: FlashVDM通过优化扩散采样和VAE解码来加速3D形状生成,采用渐进流蒸馏和高效向量集解码器等技术,在保持相当质量的同时实现高达45倍的推理速度提升。
English: FlashVDM accelerates 3D shape generation by optimizing both diffusion sampling and VAE decoding, achieving comparable quality with up to 45x faster inference through techniques like Progressive Flow Distillation and an efficient vecset decoder.

Authors:Zhaochong An, Guolei Sun, Yun Liu, Runjia Li, Junlin Han, Ender Konukoglu, Serge Belongie
Title: Generalized Few-shot 3D Point Cloud Segmentation with Vision-Language Model
Abstract:
Generalized few-shot 3D point cloud segmentation (GFS-PCS) adapts models to new classes with few support samples while retaining base class segmentation. Existing GFS-PCS methods enhance prototypes via interacting with support or query features but remain limited by sparse knowledge from few-shot samples. Meanwhile, 3D vision-language models (3D VLMs), generalizing across open-world novel classes, contain rich but noisy novel class knowledge. In this work, we introduce a GFS-PCS framework that synergizes dense but noisy pseudo-labels from 3D VLMs with precise yet sparse few-shot samples to maximize the strengths of both, named GFS-VL. Specifically, we present a prototype-guided pseudo-label selection to filter low-quality regions, followed by an adaptive infilling strategy that combines knowledge from pseudo-label contexts and few-shot samples to adaptively label the filtered, unlabeled areas. Additionally, we design a novel-base mix strategy to embed few-shot samples into training scenes, preserving essential context for improved novel class learning. Moreover, recognizing the limited diversity in current GFS-PCS benchmarks, we introduce two challenging benchmarks with diverse novel classes for comprehensive generalization evaluation. Experiments validate the effectiveness of our framework across models and datasets. Our approach and benchmarks provide a solid foundation for advancing GFS-PCS in the real world. The code is at https://github.com/ZhaochongAn/GFS-VL
中文: 本文提出GFS-VL框架,通过原型引导的伪标签筛选和自适应填充策略,将3D视觉语言模型的密集伪标签与少量样本的精确信息相融合,并建立新基准推进三维点云分割在实际场景中的应用。
English: This paper introduces GFS-VL, a generalized few-shot 3D point cloud segmentation framework that synergizes dense pseudo-labels from 3D vision-language models with precise few-shot samples through prototype-guided selection and adaptive infilling, while also proposing new benchmarks to advance real-world applications.

Authors:Shuqi Lu, Haowei Lin, Lin Yao, Zhifeng Gao, Xiaohong Ji, Weinan E, Linfeng Zhang, Guolin Ke
Title: Uni-3DAR: Unified 3D Generation and Understanding via Autoregression on Compressed Spatial Tokens
Abstract:
Recent advancements in large language models and their multi-modal extensions have demonstrated the effectiveness of unifying generation and understanding through autoregressive next-token prediction. However, despite the critical role of 3D structural generation and understanding (3D GU) in AI for science, these tasks have largely evolved independently, with autoregressive methods remaining underexplored. To bridge this gap, we introduce Uni-3DAR, a unified framework that seamlessly integrates 3D GU tasks via autoregressive prediction. At its core, Uni-3DAR employs a novel hierarchical tokenization that compresses 3D space using an octree, leveraging the inherent sparsity of 3D structures. It then applies an additional tokenization for fine-grained structural details, capturing key attributes such as atom types and precise spatial coordinates in microscopic 3D structures. We further propose two optimizations to enhance efficiency and effectiveness. The first is a two-level subtree compression strategy, which reduces the octree token sequence by up to 8x. The second is a masked next-token prediction mechanism tailored for dynamically varying token positions, significantly boosting model performance. By combining these strategies, Uni-3DAR successfully unifies diverse 3D GU tasks within a single autoregressive framework. Extensive experiments across multiple microscopic 3D GU tasks, including molecules, proteins, polymers, and crystals, validate its effectiveness and versatility. Notably, Uni-3DAR surpasses previous state-of-the-art diffusion models by a substantial margin, achieving up to 256\% relative improvement while delivering inference speeds up to 21.8x faster. The code is publicly available at https://github.com/dptech-corp/Uni-3DAR.
中文: Uni-3DAR提出了一种统一的自动回归框架,通过基于八叉树的粗细粒度标记器和压缩策略,实现了跨尺度的三维生成与理解,在多种任务中大幅超越现有方法并显著提升推理速度。
English: Uni-3DAR introduces a unified autoregressive framework that uses a coarse-to-fine octree tokenizer and a novel compression strategy to enable cross-scale 3D generation and understanding, achieving significant performance improvements and faster inference speeds across diverse tasks.

Authors:Zhaowei Liu, Xin Guo, Fangqi Lou, Lingfeng Zeng, Jinyi Niu, Zixuan Wang, Jiajie Xu, Weige Cai, Ziwei Yang, Xueqian Zhao, Chao Li, Sheng Xu, Dezhi Chen, Yun Chen, Zuo Bai, Liwen Zhang
Title: Fin-R1: A Large Language Model for Financial Reasoning through Reinforcement Learning
Abstract:
Reasoning large language models are rapidly evolving across various domains. However, their capabilities in handling complex financial tasks still require in-depth exploration. In this paper, we introduce Fin-R1, a reasoning large language model specifically designed for the financial sector. Fin-R1 is built using a two-stage architecture, leveraging a financial reasoning dataset distilled and processed based on DeepSeek-R1. Through supervised fine-tuning (SFT) and reinforcement learning (RL) training, it demonstrates performance close to DeepSeek-R1 with a parameter size of 7 billion across a range of financial reasoning tasks. It achieves the state-of-the-art (SOTA) in the FinQA and ConvFinQA tasks between those LLMs in our evaluation, surpassing larger models in other tasks as well. Fin-R1 showcases strong reasoning and decision-making capabilities, providing solutions to various problems encountered in the financial domain. Our code is available at https://github.com/SUFE-AIFLM-Lab/Fin-R1.
中文:Fin-R1是一款专为金融领域设计的70亿参数推理大语言模型,通过两阶段训练方法在多项金融推理任务中实现了最先进的性能表现。
English: Fin-R1 is a specialized 7-billion-parameter reasoning large language model for the financial sector, achieving state-of-the-art performance in financial reasoning tasks through a two-stage training approach.

Authors:Max Gutbrod, David Rauber, Danilo Weber Nunes, Christoph Palm
Title: OpenMIBOOD: Open Medical Imaging Benchmarks for Out-Of-Distribution Detection
Abstract:
The growing reliance on Artificial Intelligence (AI) in critical domains such as healthcare demands robust mechanisms to ensure the trustworthiness of these systems, especially when faced with unexpected or anomalous inputs. This paper introduces the Open Medical Imaging Benchmarks for Out-Of-Distribution Detection (OpenMIBOOD), a comprehensive framework for evaluating out-of-distribution (OOD) detection methods specifically in medical imaging contexts. OpenMIBOOD includes three benchmarks from diverse medical domains, encompassing 14 datasets divided into covariate-shifted in-distribution, near-OOD, and far-OOD categories. We evaluate 24 post-hoc methods across these benchmarks, providing a standardized reference to advance the development and fair comparison of OOD detection methods. Results reveal that findings from broad-scale OOD benchmarks in natural image domains do not translate to medical applications, underscoring the critical need for such benchmarks in the medical field. By mitigating the risk of exposing AI models to inputs outside their training distribution, OpenMIBOOD aims to support the advancement of reliable and trustworthy AI systems in healthcare. The repository is available at https://github.com/remic-othr/OpenMIBOOD.
中文: 本文提出了OpenMIBOOD这一专门评估医学影像中分布外检测方法的框架,发现自然图像基准不适用于医疗领域,并通过标准化基准推动可靠人工智能系统的发展。
English: This paper introduces OpenMIBOOD, a specialized framework for evaluating out-of-distribution detection methods in medical imaging, revealing that natural image benchmarks are inadequate for healthcare applications and providing standardized benchmarks to advance reliable AI systems.

Authors:Quy-Anh Dang, Chris Ngo
Title: Reinforcement Learning for Reasoning in Small LLMs: What Works and What Doesn't
Abstract:
Enhancing the reasoning capabilities of large language models (LLMs) typically relies on massive computational resources and extensive datasets, limiting accessibility for resource-constrained settings. Our study investigates the potential of reinforcement learning (RL) to improve reasoning in small LLMs, focusing on a 1.5-billion-parameter model, DeepSeek-R1-Distill-Qwen-1.5B, under strict constraints: training on 4 NVIDIA A40 GPUs (48 GB VRAM each) within 24 hours. Adapting the Group Relative Policy Optimization (GRPO) algorithm and curating a compact, high-quality mathematical reasoning dataset, we conducted three experiments to explore model behavior and performance. Our results demonstrate rapid reasoning gains - e.g., AMC23 accuracy rising from 63% to 80% and AIME24 reaching 46.7%, surpassing o1-preview - using only 7,000 samples and a $42 training cost, compared to thousands of dollars for baseline models. However, challenges such as optimization instability and length constraints emerged with prolonged training. These findings highlight the efficacy of RL-based fine-tuning for small LLMs, offering a cost-effective alternative to large-scale approaches. We release our code and datasets as open-source resources, providing insights into trade-offs and laying a foundation for scalable, reasoning-capable LLMs in resource-limited environments. All are available at https://github.com/knoveleng/open-rs.
中文: 本研究证明,强化学习能够以极低成本有效提升小型语言模型的推理能力,相比传统方法仅用少量资源就实现了显著的准确率提升。
English: This study demonstrates that reinforcement learning can efficiently enhance reasoning in small language models using minimal resources, achieving significant accuracy improvements at a fraction of the cost compared to conventional methods.

Authors:Qizhi Pei, Lijun Wu, Zhuoshi Pan, Yu Li, Honglin Lin, Chenlin Ming, Xin Gao, Conghui He, Rui Yan
Title: MathFusion: Enhancing Mathematical Problem-solving of LLM through Instruction Fusion
Abstract:
Large Language Models (LLMs) have shown impressive progress in mathematical reasoning. While data augmentation is promising to enhance mathematical problem-solving ability, current approaches are predominantly limited to instance-level modifications-such as rephrasing or generating syntactic variations-which fail to capture and leverage the intrinsic relational structures inherent in mathematical knowledge. Inspired by human learning processes, where mathematical proficiency develops through systematic exposure to interconnected concepts, we introduce MathFusion, a novel framework that enhances mathematical reasoning through cross-problem instruction synthesis. MathFusion implements this through three fusion strategies: (1) sequential fusion, which chains related problems to model solution dependencies; (2) parallel fusion, which combines analogous problems to reinforce conceptual understanding; and (3) conditional fusion, which creates context-aware selective problems to enhance reasoning flexibility. By applying these strategies, we generate a new dataset, \textbf{MathFusionQA}, followed by fine-tuning models (DeepSeekMath-7B, Mistral-7B, Llama3-8B) on it. Experimental results demonstrate that MathFusion achieves substantial improvements in mathematical reasoning while maintaining high data efficiency, boosting performance by 18.0 points in accuracy across diverse benchmarks while requiring only 45K additional synthetic instructions, representing a substantial improvement over traditional single-instruction approaches. Our datasets, models, and code are publicly available at https://github.com/QizhiPei/mathfusion.
Chinese: MathFusion提出了一种通过跨问题指令合成增强大语言模型数学推理能力的新框架,在保持高数据效率的同时显著提升了模型性能。
English: MathFusion introduces a novel framework that enhances mathematical reasoning in LLMs through cross-problem instruction synthesis, achieving significant performance gains while maintaining high data efficiency.

Authors:Peihao Wu, Yongxiang Yao, Wenfei Zhang, Dong Wei, Yi Wan, Yansheng Li, Yongjun Zhang
Title: MapGlue: Multimodal Remote Sensing Image Matching
Abstract:
Multimodal remote sensing image (MRSI) matching is pivotal for cross-modal fusion, localization, and object detection, but it faces severe challenges due to geometric, radiometric, and viewpoint discrepancies across imaging modalities. Existing unimodal datasets lack scale and diversity, limiting deep learning solutions. This paper proposes MapGlue, a universal MRSI matching framework, and MapData, a large-scale multimodal dataset addressing these gaps. Our contributions are twofold. MapData, a globally diverse dataset spanning 233 sampling points, offers original images (7,000x5,000 to 20,000x15,000 pixels). After rigorous cleaning, it provides 121,781 aligned electronic map-visible image pairs (512x512 pixels) with hybrid manual-automated ground truth, addressing the scarcity of scalable multimodal benchmarks. MapGlue integrates semantic context with a dual graph-guided mechanism to extract cross-modal invariant features. This structure enables global-to-local interaction, enhancing descriptor robustness against modality-specific distortions. Extensive evaluations on MapData and five public datasets demonstrate MapGlue's superiority in matching accuracy under complex conditions, outperforming state-of-the-art methods. Notably, MapGlue generalizes effectively to unseen modalities without retraining, highlighting its adaptability. This work addresses longstanding challenges in MRSI matching by combining scalable dataset construction with a robust, semantics-driven framework. Furthermore, MapGlue shows strong generalization capabilities on other modality matching tasks for which it was not specifically trained. The dataset and code are available at https://github.com/PeihaoWu/MapGlue.
Chinese: 本文提出了MapGlue通用多模态遥感图像匹配框架,通过语义上下文和双图引导机制提取跨模态不变特征,并创建了MapData大规模多样化数据集,解决了该领域缺乏可扩展基准的难题。
English: This paper introduces MapGlue, a universal multimodal remote sensing image matching framework that integrates semantic context with a dual graph-guided mechanism for robust cross-modal feature extraction, and MapData, a large-scale, globally diverse dataset designed to address the scarcity of scalable benchmarks in the field.

Authors:Jiwoo Son, Zhikai Zhao, Federico Berto, Chuanbo Hua, Changhyun Kwon, Jinkyoo Park
Title: Neural Combinatorial Optimization for Real-World Routing
Abstract:
Vehicle Routing Problems (VRPs) are a class of NP-hard problems ubiquitous in several real-world logistics scenarios that pose significant challenges for optimization. Neural Combinatorial Optimization (NCO) has emerged as a promising alternative to classical approaches, as it can learn fast heuristics to solve VRPs. However, most research works in NCO for VRPs focus on simplified settings, which do not account for asymmetric distances and travel durations that cannot be derived by simple Euclidean distances and unrealistic data distributions, hindering real-world deployment. This work introduces RRNCO (Real Routing NCO) to bridge the gap of NCO between synthetic and real-world VRPs in the critical aspects of both data and modeling. First, we introduce a new, openly available dataset with real-world data containing a diverse dataset of locations, distances, and duration matrices from 100 cities, considering realistic settings with actual routing distances and durations obtained from Open Source Routing Machine (OSRM). Second, we propose a novel approach that efficiently processes both node and edge features through contextual gating, enabling the construction of more informed node embedding, and we finally incorporate an Adaptation Attention Free Module (AAFM) with neural adaptive bias mechanisms that effectively integrates not only distance matrices but also angular relationships between nodes, allowing our model to capture rich structural information. RRNCO achieves state-of-the-art results in real-world VRPs among NCO methods. We make our dataset and code publicly available at https://github.com/ai4co/real-routing-nco.
中文摘要:本文提出的RRNCO方法通过采用真实世界数据集和新型建模技术,填补了神经组合优化在合成与实际车辆路径问题之间的差距,并实现了最先进的性能表现。
English Summary: This paper introduces RRNCO, a Neural Combinatorial Optimization approach that bridges the gap between synthetic and real-world Vehicle Routing Problems by using realistic datasets and novel modeling techniques to achieve state-of-the-art performance.

Authors:Dong Chen, Boyue Zhao, Yi Zhang, Meng Zhao
Title: Selective Complementary Feature Fusion and Modal Feature Compression Interaction for Brain Tumor Segmentation
Abstract:
Efficient modal feature fusion strategy is the key to achieve accurate segmentation of brain glioma. However, due to the specificity of different MRI modes, it is difficult to carry out cross-modal fusion with large differences in modal features, resulting in the model ignoring rich feature information. On the other hand, the problem of multi-modal feature redundancy interaction occurs in parallel networks due to the proliferation of feature dimensions, further increase the difficulty of multi-modal feature fusion at the bottom end. In order to solve the above problems, we propose a noval complementary feature compression interaction network (CFCI-Net), which realizes the complementary fusion and compression interaction of multi-modal feature information with an efficient mode fusion strategy. Firstly, we propose a selective complementary feature fusion (SCFF) module, which adaptively fuses rich cross-modal feature information by complementary soft selection weights. Secondly, a modal feature compression interaction (MFCI) transformer is proposed to deal with the multi-mode fusion redundancy problem when the feature dimension surges. The MFCI transformer is composed of modal feature compression (MFC) and modal feature interaction (MFI) to realize redundancy feature compression and multi-mode feature interactive learning. %In MFI, we propose a hierarchical interactive attention mechanism based on multi-head attention. Evaluations on the BraTS2019 and BraTS2020 datasets demonstrate that CFCI-Net achieves superior results compared to state-of-the-art models. Code: https://github.com/CDmm0/CFCI-Net
中文: 提出的CFCI-Net通过选择性互补特征融合和模态特征压缩交互变换器,有效解决了脑胶质瘤分割中的跨模态融合难题,在BraTS数据集上取得了最优性能。
English: The proposed CFCI-Net with its selective complementary feature fusion and modal feature compression interaction transformer effectively addresses cross-modal fusion challenges in brain glioma segmentation, achieving state-of-the-art performance on BraTS datasets.

Authors:Mats Faulborn, Indira Sen, Max Pellert, Andreas Spitz, David Garcia
Title: Only a Little to the Left: A Theory-grounded Measure of Political Bias in Large Language Models
Abstract:
Prompt-based language models like GPT4 and LLaMa have been used for a wide variety of use cases such as simulating agents, searching for information, or for content analysis. For all of these applications and others, political biases in these models can affect their performance. Several researchers have attempted to study political bias in language models using evaluation suites based on surveys, such as the Political Compass Test (PCT), often finding a particular leaning favored by these models. However, there is some variation in the exact prompting techniques, leading to diverging findings, and most research relies on constrained-answer settings to extract model responses. Moreover, the Political Compass Test is not a scientifically valid survey instrument. In this work, we contribute a political bias measured informed by political science theory, building on survey design principles to test a wide variety of input prompts, while taking into account prompt sensitivity. We then prompt 11 different open and commercial models, differentiating between instruction-tuned and non-instruction-tuned models, and automatically classify their political stances from 88,110 responses. Leveraging this dataset, we compute political bias profiles across different prompt variations and find that while PCT exaggerates bias in certain models like GPT3.5, measures of political bias are often unstable, but generally more left-leaning for instruction-tuned models. Code and data are available on: https://github.com/MaFa211/theory_grounded_pol_bias
中文: 本研究提出了一种基于政治学理论的偏见测量方法,通过评估11种语言模型在不同提示下的表现,发现指令调优模型普遍存在左倾偏见,同时揭示了传统政治指南针测试在评估中的不稳定性与夸大倾向。
English: This study introduces a political science-informed bias measurement method that evaluates 11 language models across various prompts, revealing generally left-leaning biases in instruction-tuned models and highlighting the instability and exaggeration in traditional Political Compass Test assessments.

Authors:Shiyang Zhou, Haijin Zeng, Yunfan Lu, Tong Shao, Ke Tang, Yongyong Chen, Jie Liu, Jingyong Su
Title: Binarized Mamba-Transformer for Lightweight Quad Bayer HybridEVS Demosaicing
Abstract:
Quad Bayer demosaicing is the central challenge for enabling the widespread application of Hybrid Event-based Vision Sensors (HybridEVS). Although existing learning-based methods that leverage long-range dependency modeling have achieved promising results, their complexity severely limits deployment on mobile devices for real-world applications. To address these limitations, we propose a lightweight Mamba-based binary neural network designed for efficient and high-performing demosaicing of HybridEVS RAW images. First, to effectively capture both global and local dependencies, we introduce a hybrid Binarized Mamba-Transformer architecture that combines the strengths of the Mamba and Swin Transformer architectures. Next, to significantly reduce computational complexity, we propose a binarized Mamba (Bi-Mamba), which binarizes all projections while retaining the core Selective Scan in full precision. Bi-Mamba also incorporates additional global visual information to enhance global context and mitigate precision loss. We conduct quantitative and qualitative experiments to demonstrate the effectiveness of BMTNet in both performance and computational efficiency, providing a lightweight demosaicing solution suited for real-world edge devices. Our codes and models are available at https://github.com/Clausy9/BMTNet.
中文: 本文提出了一种轻量级的基于Mamba的二元神经网络BMTNet,通过结合Mamba和Swin Transformer架构的优势,在保持高性能的同时显著降低计算复杂度,为混合事件视觉传感器提供了一种适用于边缘设备的实时去马赛克解决方案。
English: This paper introduces a lightweight Mamba-based binary neural network called BMTNet, which combines Mamba and Swin Transformer architectures to efficiently perform Quad Bayer demosaicing for Hybrid Event-based Vision Sensors, achieving high performance with reduced computational complexity suitable for edge devices.

Authors:Jiyong Rao, Brian Nlong Zhao, Yu Wang
Title: Probabilistic Prompt Distribution Learning for Animal Pose Estimation
Abstract:
Multi-species animal pose estimation has emerged as a challenging yet critical task, hindered by substantial visual diversity and uncertainty. This paper challenges the problem by efficient prompt learning for Vision-Language Pretrained (VLP) models, \textit{e.g.} CLIP, aiming to resolve the cross-species generalization problem. At the core of the solution lies in the prompt designing, probabilistic prompt modeling and cross-modal adaptation, thereby enabling prompts to compensate for cross-modal information and effectively overcome large data variances under unbalanced data distribution. To this end, we propose a novel probabilistic prompting approach to fully explore textual descriptions, which could alleviate the diversity issues caused by long-tail property and increase the adaptability of prompts on unseen category instance. Specifically, we first introduce a set of learnable prompts and propose a diversity loss to maintain distinctiveness among prompts, thus representing diverse image attributes. Diverse textual probabilistic representations are sampled and used as the guidance for the pose estimation. Subsequently, we explore three different cross-modal fusion strategies at spatial level to alleviate the adverse impacts of visual uncertainty. Extensive experiments on multi-species animal pose benchmarks show that our method achieves the state-of-the-art performance under both supervised and zero-shot settings. The code is available at https://github.com/Raojiyong/PPAP.
中文摘要:本文提出一种新颖的概率提示方法,通过优化提示设计和跨模态融合策略,利用视觉语言预训练模型解决多物种动物姿态估计中的跨物种泛化难题。
English Summary: This paper introduces a novel probabilistic prompting approach for multi-species animal pose estimation that leverages vision-language pretrained models to overcome cross-species generalization challenges through optimized prompt design and cross-modal fusion strategies.

Authors:Abdullah Mamun, Diane J. Cook, Hassan Ghasemzadeh
Title: AIMI: Leveraging Future Knowledge and Personalization in Sparse Event Forecasting for Treatment Adherence
Abstract:
Adherence to prescribed treatments is crucial for individuals with chronic conditions to avoid costly or adverse health outcomes. For certain patient groups, intensive lifestyle interventions are vital for enhancing medication adherence. Accurate forecasting of treatment adherence can open pathways to developing an on-demand intervention tool, enabling timely and personalized support. With the increasing popularity of smartphones and wearables, it is now easier than ever to develop and deploy smart activity monitoring systems. However, effective forecasting systems for treatment adherence based on wearable sensors are still not widely available. We close this gap by proposing Adherence Forecasting and Intervention with Machine Intelligence (AIMI). AIMI is a knowledge-guided adherence forecasting system that leverages smartphone sensors and previous medication history to estimate the likelihood of forgetting to take a prescribed medication. A user study was conducted with 27 participants who took daily medications to manage their cardiovascular diseases. We designed and developed CNN and LSTM-based forecasting models with various combinations of input features and found that LSTM models can forecast medication adherence with an accuracy of 0.932 and an F-1 score of 0.936. Moreover, through a series of ablation studies involving convolutional and recurrent neural network architectures, we demonstrate that leveraging known knowledge about future and personalized training enhances the accuracy of medication adherence forecasting. Code available: https://github.com/ab9mamun/AIMI.
中文: 本研究提出AIMI系统,通过智能手机传感器和用药记录,基于LSTM模型实现对心血管患者服药依从性的精准预测,在实验中取得了优异性能。
English: The study introduces AIMI, a machine intelligence system that uses smartphone sensors and medication history to accurately forecast medication adherence, achieving high precision with LSTM models in a cardiovascular patient trial.

Authors:Tim Seizinger, Florin-Alexandru Vasluianu, Marcos V. Conde, Zongwei Wu, Radu Timofte
Title: Bokehlicious: Photorealistic Bokeh Rendering with Controllable Apertures
Abstract:
Bokeh rendering methods play a key role in creating the visually appealing, softly blurred backgrounds seen in professional photography. While recent learning-based approaches show promising results, generating realistic Bokeh with variable strength remains challenging. Existing methods require additional inputs and suffer from unrealistic Bokeh reproduction due to reliance on synthetic data. In this work, we propose Bokehlicious, a highly efficient network that provides intuitive control over Bokeh strength through an Aperture-Aware Attention mechanism, mimicking the physical lens aperture. To further address the lack of high-quality real-world data, we present RealBokeh, a novel dataset featuring 23,000 high-resolution (24-MP) images captured by professional photographers, covering diverse scenes with varied aperture and focal length settings. Evaluations on both our new RealBokeh and established Bokeh rendering benchmarks show that Bokehlicious consistently outperforms SOTA methods while significantly reducing computational cost and exhibiting strong zero-shot generalization. Our method and dataset further extend to defocus deblurring, achieving competitive results on the RealDOF benchmark. Our code and data can be found at https://github.com/TimSeizinger/Bokehlicious
Chinese: Bokehlicious 提出了一种高效网络,通过光圈感知注意力机制实现可调节的虚化效果,并提供了高质量的RealBokeh数据集以克服合成数据局限,在真实性和效率上均优于现有方法。
English: Bokehlicious introduces an efficient network with an Aperture-Aware Attention mechanism for adjustable Bokeh effects and a high-quality RealBokeh dataset to overcome synthetic data limitations, outperforming existing methods in realism and efficiency.

Authors:Qiang Zou, Shuli Cheng, Jiayi Chen
Title: PromptHash: Affinity-Prompted Collaborative Cross-Modal Learning for Adaptive Hashing Retrieval
Abstract:
Cross-modal hashing is a promising approach for efficient data retrieval and storage optimization. However, contemporary methods exhibit significant limitations in semantic preservation, contextual integrity, and information redundancy, which constrains retrieval efficacy. We present PromptHash, an innovative framework leveraging affinity prompt-aware collaborative learning for adaptive cross-modal hashing. We propose an end-to-end framework for affinity-prompted collaborative hashing, with the following fundamental technical contributions: (i) a text affinity prompt learning mechanism that preserves contextual information while maintaining parameter efficiency, (ii) an adaptive gated selection fusion architecture that synthesizes State Space Model with Transformer network for precise cross-modal feature integration, and (iii) a prompt affinity alignment strategy that bridges modal heterogeneity through hierarchical contrastive learning. To the best of our knowledge, this study presents the first investigation into affinity prompt awareness within collaborative cross-modal adaptive hash learning, establishing a paradigm for enhanced semantic consistency across modalities. Through comprehensive evaluation on three benchmark multi-label datasets, PromptHash demonstrates substantial performance improvements over existing approaches. Notably, on the NUS-WIDE dataset, our method achieves significant gains of 18.22% and 18.65% in image-to-text and text-to-image retrieval tasks, respectively. The code is publicly available at https://github.com/ShiShuMo/PromptHash.
中文摘要:PromptHash提出了一种利用亲和性提示感知协同学习的创新框架,通过自适应跨模态哈希技术显著提升检索性能,有效解决了语义保持和模态异质性问题。
English Summary: PromptHash introduces an innovative framework using affinity prompt-aware collaborative learning to enhance cross-modal hashing, achieving significant performance improvements in retrieval tasks by addressing semantic preservation and modal heterogeneity.

Authors:Wanshu Fan, Yue Wang, Cong Wang, Yunzhe Zhang, Wei Wang, Dongsheng Zhou
Title: Semantic-Guided Global-Local Collaborative Networks for Lightweight Image Super-Resolution
Abstract:
Single-Image Super-Resolution (SISR) plays a pivotal role in enhancing the accuracy and reliability of measurement systems, which are integral to various vision-based instrumentation and measurement applications. These systems often require clear and detailed images for precise object detection and recognition. However, images captured by visual measurement tools frequently suffer from degradation, including blurring and loss of detail, which can impede measurement accuracy.As a potential remedy, we in this paper propose a Semantic-Guided Global-Local Collaborative Network (SGGLC-Net) for lightweight SISR. Our SGGLC-Net leverages semantic priors extracted from a pre-trained model to guide the super-resolution process, enhancing image detail quality effectively. Specifically,we propose a Semantic Guidance Module that seamlessly integrates the semantic priors into the super-resolution network, enabling the network to more adeptly capture and utilize semantic priors, thereby enhancing image details. To further explore both local and non-local interactions for improved detail rendition,we propose a Global-Local Collaborative Module, which features three Global and Local Detail Enhancement Modules, as well as a Hybrid Attention Mechanism to work together to efficiently learn more useful features. Our extensive experiments show that SGGLC-Net achieves competitive PSNR and SSIM values across multiple benchmark datasets, demonstrating higher performance with the multi-adds reduction of 12.81G compared to state-of-the-art lightweight super-resolution approaches. These improvements underscore the potential of our approach to enhance the precision and effectiveness of visual measurement systems. Codes are at https://github.com/fanamber831/SGGLC-Net.
中文: 本文提出SGGLC-Net轻量级单图像超分辨率网络,通过语义引导和全局-局部协作增强图像细节,在降低计算量的同时实现了优越性能,可有效提升视觉测量系统的精度。
English: This paper introduces SGGLC-Net, a lightweight single-image super-resolution network that uses semantic guidance and global-local collaboration to enhance image details, achieving competitive performance with reduced computational costs for visual measurement systems.

Authors:Abdelrahman Elsayed, Sarim Hashmi, Mohammed Elseiagy, Hu Wang, Mohammad Yaqub, Ibrahim Almakky
Title: SALT: Parameter-Efficient Fine-Tuning via Singular Value Adaptation with Low-Rank Transformation
Abstract:
The complex nature of medical image segmentation calls for models that are specifically designed to capture detailed, domain-specific features. Large foundation models offer considerable flexibility, yet the cost of fine-tuning these models remains a significant barrier. Parameter-Efficient Fine-Tuning (PEFT) methods, such as Low-Rank Adaptation (LoRA), efficiently update model weights with low-rank matrices but may suffer from underfitting when the chosen rank is insufficient to capture domain-specific nuances. Conversely, full-rank Singular Value Decomposition (SVD) based methods provide comprehensive updates by modifying all singular values, yet they often lack flexibility and exhibit variable performance across datasets. We propose SALT (Singular Value Adaptation with Low-Rank Transformation), a method that selectively adapts the most influential singular values using trainable scale and shift parameters while complementing this with a low-rank update for the remaining subspace. This hybrid approach harnesses the advantages of both LoRA and SVD, enabling effective adaptation without relying on increasing model size or depth. Evaluated on 5 challenging medical datasets, ranging from as few as 20 samples to 1000, SALT outperforms state-of-the-art PEFT (LoRA and SVD) by 2% to 5% in Dice with only 3.9% trainable parameters, demonstrating robust adaptation even in low-resource settings. The code for SALT is available at: https://github.com/BioMedIA-MBZUAI/SALT
中文: 提出的SALT方法结合选择性奇异值适应与低秩更新,在医学图像分割任务中以仅3.9%可训练参数实现比现有高效调参技术高出2-5%的性能表现。
English: The proposed SALT method combines selective singular value adaptation with low-rank updates to outperform existing parameter-efficient fine-tuning techniques by 2-5% on medical image segmentation tasks while using only 3.9% trainable parameters.

Authors:Zhiyu Cao, Peifeng Li, Yaxin Fan, Qiaoming Zhu
Title: Incomplete Utterance Rewriting with Editing Operation Guidance and Utterance Augmentation
Abstract:
Although existing fashionable generation methods on Incomplete Utterance Rewriting (IUR) can generate coherent utterances, they often result in the inclusion of irrelevant and redundant tokens in rewritten utterances due to their inability to focus on critical tokens in dialogue context. Furthermore, the limited size of the training datasets also contributes to the insufficient training of the IUR model. To address the first issue, we propose a multi-task learning framework EO-IUR (Editing Operation-guided Incomplete Utterance Rewriting) that introduces the editing operation labels generated by sequence labeling module to guide generation model to focus on critical tokens. Furthermore, we introduce a token-level heterogeneous graph to represent dialogues. To address the second issue, we propose a two-dimensional utterance augmentation strategy, namely editing operation-based incomplete utterance augmentation and LLM-based historical utterance augmentation. The experimental results on three datasets demonstrate that our EO-IUR outperforms previous state-of-the-art (SOTA) baselines in both open-domain and task-oriented dialogue. The code will be available at https://github.com/Dewset/EO-IUR.
中文摘要:现有不完整话语改写方法因无法聚焦关键对话标记及训练数据有限,常生成冗余内容,而本文提出的EO-IUR框架通过引入编辑操作引导的多任务学习和二维话语增强策略,在多个数据集上实现了优于现有最优模型的性能表现。
English Summary: Current Incomplete Utterance Rewriting methods produce coherent but often redundant outputs due to insufficient focus on critical dialogue tokens and limited training data, which the proposed EO-IUR framework addresses through multi-task learning with editing operation guidance and a novel data augmentation strategy, outperforming state-of-the-art models across multiple datasets.

Authors:Zhihang Liu, Chen-Wei Xie, Pandeng Li, Liming Zhao, Longxiang Tang, Yun Zheng, Chuanbin Liu, Hongtao Xie
Title: Hybrid-Level Instruction Injection for Video Token Compression in Multi-modal Large Language Models
Abstract:
Recent Multi-modal Large Language Models (MLLMs) have been challenged by the computational overhead resulting from massive video frames, often alleviated through compression strategies. However, the visual content is not equally contributed to user instructions, existing strategies (\eg, average pool) inevitably lead to the loss of potentially useful information. To tackle this, we propose the Hybrid-level Instruction Injection Strategy for Conditional Token Compression in MLLMs (HICom), utilizing the instruction as a condition to guide the compression from both local and global levels. This encourages the compression to retain the maximum amount of user-focused information while reducing visual tokens to minimize computational burden. Specifically, the instruction condition is injected into the grouped visual tokens at the local level and the learnable tokens at the global level, and we conduct the attention mechanism to complete the conditional compression. From the hybrid-level compression, the instruction-relevant visual parts are highlighted while the temporal-spatial structure is also preserved for easier understanding of LLMs. To further unleash the potential of HICom, we introduce a new conditional pre-training stage with our proposed dataset HICom-248K. Experiments show that our HICom can obtain distinguished video understanding ability with fewer tokens, increasing the performance by 2.43\% average on three multiple-choice QA benchmarks and saving 78.8\% tokens compared with the SOTA method. The code is available at https://github.com/lntzm/HICom.
Chinese Summary: 提出的HICom方法通过指令引导混合层级令牌压缩,在多模态大语言模型中优化视频压缩,聚焦用户相关信息以提升性能并显著降低计算负担。
English Summary: The proposed HICom method enhances video compression in Multi-modal Large Language Models by using instructions to guide hybrid-level token compression, improving performance and reducing computational load by focusing on user-relevant information.

Authors:Sunqi Fan, Meng-Hao Guo, Shuojin Yang
Title: Agentic Keyframe Search for Video Question Answering
Abstract:
Video question answering (VideoQA) enables machines to extract and comprehend key information from videos through natural language interaction, which is a critical step towards achieving intelligence. However, the demand for a thorough understanding of videos and high computational costs still limit the widespread applications of VideoQA. To address it, we propose Agentic Keyframe Search (AKeyS), a simple yet powerful algorithm for identifying keyframes in the VideoQA task. It can effectively distinguish key information from redundant, irrelevant content by leveraging modern language agents to direct classical search algorithms. Specifically, we first segment the video and organize it as a tree structure. Then, AKeyS uses a language agent to estimate heuristics and movement costs while dynamically expanding nodes. Finally, the agent determines if sufficient keyframes have been collected based on termination conditions and provides answers. Extensive experiments on the EgoSchema and NExT-QA datasets show that AKeyS outperforms all previous methods with the highest keyframe searching efficiency, which means it can accurately identify key information and conduct effective visual reasoning with minimal computational overhead. For example, on the EgoSchema subset, it achieves 1.8% higher accuracy while processing only 43.5% of the frames compared to VideoTree. We believe that AKeyS represents a significant step towards building intelligent agents for video understanding. The code is publicly available at https://github.com/fansunqi/AKeyS.
Chinese: 提出的Agentic Keyframe Search (AKeyS)算法通过语言智能体引导搜索策略,在VideoQA任务中高效识别关键帧,在EgoSchema和NExT-QA数据集上的实验表明其能以最低计算成本实现最优准确率。
English: The proposed Agentic Keyframe Search (AKeyS) algorithm efficiently identifies keyframes in VideoQA tasks by leveraging language agents to guide search algorithms, achieving superior accuracy with minimal computational overhead as demonstrated on EgoSchema and NExT-QA datasets.

Authors:Xiaomeng Chu, Jiajun Deng, Guoliang You, Wei Liu, Xingchen Li, Jianmin Ji, Yanyong Zhang
Title: GraspCoT: Integrating Physical Property Reasoning for 6-DoF Grasping under Flexible Language Instructions
Abstract:
Flexible instruction-guided 6-DoF grasping is a significant yet challenging task for real-world robotic systems. Existing methods utilize the contextual understanding capabilities of the large language models (LLMs) to establish mappings between expressions and targets, allowing robots to comprehend users' intentions in the instructions. However, the LLM's knowledge about objects' physical properties remains underexplored despite its tight relevance to grasping. In this work, we propose GraspCoT, a 6-DoF grasp detection framework that integrates a Chain-of-Thought (CoT) reasoning mechanism oriented to physical properties, guided by auxiliary question-answering (QA) tasks. Particularly, we design a set of QA templates to enable hierarchical reasoning that includes three stages: target parsing, physical property analysis, and grasp action selection. Moreover, GraspCoT presents a unified multimodal LLM architecture, which encodes multi-view observations of 3D scenes into 3D-aware visual tokens, and then jointly embeds these visual tokens with CoT-derived textual tokens within LLMs to generate grasp pose predictions. Furthermore, we present IntentGrasp, a large-scale benchmark that fills the gap in public datasets for multi-object grasp detection under diverse and indirect verbal commands. Extensive experiments on IntentGrasp demonstrate the superiority of our method, with additional validation in real-world robotic applications confirming its practicality. The code is available at https://github.com/cxmomo/GraspCoT.
中文摘要:本文提出GraspCoT框架,通过面向物理属性的思维链推理机制改进机器人抓取,并在新基准IntentGrasp和实际应用中验证了其优越性。
English Summary: This paper introduces GraspCoT, a 6-DoF grasp detection framework that enhances robotic grasping through Chain-of-Thought reasoning about physical properties, validated on the new IntentGrasp benchmark and real-world applications.

Authors:Zeqi Zheng, Yanchen Huang, Yingchao Yu, Zizheng Zhu, Junfeng Tang, Zhaofei Yu, Yaochu Jin
Title: SpiLiFormer: Enhancing Spiking Transformers with Lateral Inhibition
Abstract:
Spiking Neural Networks (SNNs) based on Transformers have garnered significant attention due to their superior performance and high energy efficiency. However, the spiking attention modules of most existing Transformer-based SNNs are adapted from those of analog Transformers, failing to fully address the issue of over-allocating attention to irrelevant contexts. To fix this fundamental yet overlooked issue, we propose a Lateral Inhibition-inspired Spiking Transformer (SpiLiFormer). It emulates the brain's lateral inhibition mechanism, guiding the model to enhance attention to relevant tokens while suppressing attention to irrelevant ones. Our model achieves state-of-the-art (SOTA) performance across multiple datasets, including CIFAR-10 (+0.45%), CIFAR-100 (+0.48%), CIFAR10-DVS (+2.70%), N-Caltech101 (+1.94%), and ImageNet-1K (+1.6%). Notably, on the ImageNet-1K dataset, SpiLiFormer (69.9M parameters, 4 time steps, 384 resolution) outperforms E-SpikeFormer (173.0M parameters, 8 time steps, 384 resolution), a SOTA spiking Transformer, by 0.46% using only 39% of the parameters and half the time steps. The code and model checkpoints are publicly available at https://github.com/KirinZheng/SpiLiFormer.
中文摘要:SpiLiFormer模型通过引入侧向抑制机制,解决了基于Transformer的脉冲神经网络中注意力过度分配的问题,在多个数据集上以更高效率实现了最先进的性能。
English Summary: The SpiLiFormer model introduces a lateral inhibition mechanism to address the issue of over-allocating attention in Transformer-based Spiking Neural Networks, achieving state-of-the-art performance across multiple datasets with significantly improved efficiency.

Authors:Pengyu Liu, Guohua Dong, Dan Guo, Kun Li, Fengling Li, Xun Yang, Meng Wang, Xiaomin Ying
Title: A Survey on fMRI-based Brain Decoding for Reconstructing Multimodal Stimuli
Abstract:
In daily life, we encounter diverse external stimuli, such as images, sounds, and videos. As research in multimodal stimuli and neuroscience advances, fMRI-based brain decoding has become a key tool for understanding brain perception and its complex cognitive processes. Decoding brain signals to reconstruct stimuli not only reveals intricate neural mechanisms but also drives progress in AI, disease treatment, and brain-computer interfaces. Recent advancements in neuroimaging and image generation models have significantly improved fMRI-based decoding. While fMRI offers high spatial resolution for precise brain activity mapping, its low temporal resolution and signal noise pose challenges. Meanwhile, techniques like GANs, VAEs, and Diffusion Models have enhanced reconstructed image quality, and multimodal pre-trained models have boosted cross-modal decoding tasks. This survey systematically reviews recent progress in fMRI-based brain decoding, focusing on stimulus reconstruction from passive brain signals. It summarizes datasets, relevant brain regions, and categorizes existing methods by model structure. Additionally, it evaluates model performance and discusses their effectiveness. Finally, it identifies key challenges and proposes future research directions, offering valuable insights for the field. For more information and resources related to this survey, visit https://github.com/LpyNow/BrainDecodingImage.
中文: 本综述系统回顾了基于fMRI的大脑解码在刺激重建方面的最新进展,重点介绍了利用生成模型和多模态预训练的技术提升,同时探讨了时间分辨率等挑战并提出了未来研究方向。
English: This survey reviews recent advances in fMRI-based brain decoding for stimulus reconstruction, highlighting improved techniques using generative models and multimodal pre-training while addressing challenges like temporal resolution and proposing future research directions.

Authors:Zichen Liu, Kunlun Xu, Bing Su, Xu Zou, Yuxin Peng, Jiahuan Zhou
Title: STOP: Integrated Spatial-Temporal Dynamic Prompting for Video Understanding
Abstract:
Pre-trained on tremendous image-text pairs, vision-language models like CLIP have demonstrated promising zero-shot generalization across numerous image-based tasks. However, extending these capabilities to video tasks remains challenging due to limited labeled video data and high training costs. Recent video prompting methods attempt to adapt CLIP for video tasks by introducing learnable prompts, but they typically rely on a single static prompt for all video sequences, overlooking the diverse temporal dynamics and spatial variations that exist across frames. This limitation significantly hinders the model's ability to capture essential temporal information for effective video understanding. To address this, we propose an integrated Spatial-TempOral dynamic Prompting (STOP) model which consists of two complementary modules, the intra-frame spatial prompting and inter-frame temporal prompting. Our intra-frame spatial prompts are designed to adaptively highlight discriminative regions within each frame by leveraging intra-frame attention and temporal variation, allowing the model to focus on areas with substantial temporal dynamics and capture fine-grained spatial details. Additionally, to highlight the varying importance of frames for video understanding, we further introduce inter-frame temporal prompts, dynamically inserting prompts between frames with high temporal variance as measured by frame similarity. This enables the model to prioritize key frames and enhances its capacity to understand temporal dependencies across sequences. Extensive experiments on various video benchmarks demonstrate that STOP consistently achieves superior performance against state-of-the-art methods. The code is available at https://github.com/zhoujiahuan1991/CVPR2025-STOP.
中文:提出的STOP模型通过动态生成空间提示来突出帧内关键区域,以及时间提示来优先处理重要帧,从而提升视频理解能力,在多个视频基准测试中取得了优越性能。
English: The proposed STOP model enhances video understanding by dynamically generating spatial prompts to highlight key regions within frames and temporal prompts to prioritize important frames, achieving superior performance on video benchmarks.

Authors:Clive Tinashe Marimo, Benedikt Blumenstiel, Maximilian Nitsche, Johannes Jakubik, Thomas Brunschwiler
Title: Beyond the Visible: Multispectral Vision-Language Learning for Earth Observation
Abstract:
Vision-language models for Earth observation (EO) typically rely on the visual spectrum of data as the only model input, thus failing to leverage the rich spectral information available in the multispectral channels recorded by satellites. Therefore, we introduce Llama3-MS-CLIP, the first vision-language model pre-trained with contrastive learning on a large-scale multispectral dataset and report on the performance gains due to the extended spectral range. Furthermore, we present the largest-to-date image-caption dataset for multispectral data, consisting of one million Sentinel-2 samples and corresponding textual descriptions generated using Llama3-LLaVA-Next and Overture Maps data. We develop a scalable captioning pipeline, which is validated by domain experts. We evaluate Llama3-MS-CLIP on multispectral zero-shot image classification and retrieval using three datasets of varying complexity. Our results demonstrate that Llama3-MS-CLIP significantly outperforms other RGB-based approaches, improving classification accuracy by +6.77% on average and retrieval performance by +4.63% mAP compared to the second-best model. Our results emphasize the relevance of multispectral vision-language learning. The image-caption dataset, code, and model weights are available at https://github.com/IBM/MS-CLIP.
中文摘要:Llama3-MS-CLIP是首个基于多光谱卫星数据预训练的视觉语言模型,在分类任务中平均准确率提升6.77%,检索任务中mAP提高4.63%,显著优于传统RGB模型。
English Summary: Llama3-MS-CLIP is the first vision-language model pre-trained on multispectral satellite data, significantly outperforming RGB-based models with a 6.77% average accuracy improvement in classification and 4.63% mAP boost in retrieval tasks.

Authors:Yaxiong Chen, Minghong Wei, Zixuan Zheng, Jingliang Hu, Yilei Shi, Shengwu Xiong, Xiao Xiang Zhu, Lichao Mou
Title: CausalCLIPSeg: Unlocking CLIP's Potential in Referring Medical Image Segmentation with Causal Intervention
Abstract:
Referring medical image segmentation targets delineating lesions indicated by textual descriptions. Aligning visual and textual cues is challenging due to their distinct data properties. Inspired by large-scale pre-trained vision-language models, we propose CausalCLIPSeg, an end-to-end framework for referring medical image segmentation that leverages CLIP. Despite not being trained on medical data, we enforce CLIP's rich semantic space onto the medical domain by a tailored cross-modal decoding method to achieve text-to-pixel alignment. Furthermore, to mitigate confounding bias that may cause the model to learn spurious correlations instead of meaningful causal relationships, CausalCLIPSeg introduces a causal intervention module which self-annotates confounders and excavates causal features from inputs for segmentation judgments. We also devise an adversarial min-max game to optimize causal features while penalizing confounding ones. Extensive experiments demonstrate the state-of-the-art performance of our proposed method. Code is available at https://github.com/WUTCM-Lab/CausalCLIPSeg.
中文摘要:CausalCLIPSeg是一种端到端框架,通过定制跨模态解码将CLIP的语义空间应用于医学领域,并引入因果干预模块消除混杂偏差,从而在医学图像分割中实现文本与像素的精准对齐。
English Summary: CausalCLIPSeg is an end-to-end framework that adapts CLIP's vision-language capabilities for medical image segmentation by aligning text descriptions with pixel-level features and incorporating causal intervention to eliminate spurious correlations.

Authors:Yaxiong Chen, Chuang Du, Chunlei Li, Jingliang Hu, Yilei Shi, Shengwu Xiong, Xiao Xiang Zhu, Lichao Mou
Title: UniCrossAdapter: Multimodal Adaptation of CLIP for Radiology Report Generation
Abstract:
Automated radiology report generation aims to expedite the tedious and error-prone reporting process for radiologists. While recent works have made progress, learning to align medical images and textual findings remains challenging due to the relative scarcity of labeled medical data. For example, datasets for this task are much smaller than those used for image captioning in computer vision. In this work, we propose to transfer representations from CLIP, a large-scale pre-trained vision-language model, to better capture cross-modal semantics between images and texts. However, directly applying CLIP is suboptimal due to the domain gap between natural images and radiology. To enable efficient adaptation, we introduce UniCrossAdapter, lightweight adapter modules that are incorporated into CLIP and fine-tuned on the target task while keeping base parameters fixed. The adapters are distributed across modalities and their interaction to enhance vision-language alignment. Experiments on two public datasets demonstrate the effectiveness of our approach, advancing state-of-the-art in radiology report generation. The proposed transfer learning framework provides a means of harnessing semantic knowledge from large-scale pre-trained models to tackle data-scarce medical vision-language tasks. Code is available at https://github.com/chauncey-tow/MRG-CLIP.
中文摘要:本研究提出UniCrossAdapter轻量适配器,通过将其集成到CLIP模型中弥合自然图像与医学图像的领域差异,在有限标注数据条件下显著提升放射学报告生成的跨模态语义对齐效果。
English Summary: This study introduces UniCrossAdapter, lightweight adapters integrated into the CLIP model to bridge the domain gap between natural and medical images, enhancing cross-modal alignment for improved radiology report generation with limited labeled data.

Authors:Gaole Dai, Shiqi Jiang, Ting Cao, Yuanchun Li, Yuqing Yang, Rui Tan, Mo Li, Lili Qiu
Title: Advancing Mobile GUI Agents: A Verifier-Driven Approach to Practical Deployment
Abstract:
We propose V-Droid, a mobile GUI task automation agent. Unlike previous mobile agents that utilize Large Language Models (LLMs) as generators to directly generate actions at each step, V-Droid employs LLMs as verifiers to evaluate candidate actions before making final decisions. To realize this novel paradigm, we introduce a comprehensive framework for constructing verifier-driven mobile agents: the discretized action space construction coupled with the prefilling-only workflow to accelerate the verification process, the pair-wise progress preference training to significantly enhance the verifier's decision-making capabilities, and the scalable human-agent joint annotation scheme to efficiently collect the necessary data at scale. V-Droid obtains a substantial task success rate across several public mobile task automation benchmarks: 59.5% on AndroidWorld, 38.3% on AndroidLab, and 49% on MobileAgentBench, surpassing existing agents by 5.2%, 2.1%, and 9%, respectively. Furthermore, V-Droid achieves a remarkably low latency of 4.3s per step, which is 6.1X faster compared with existing mobile agents. The source code is available at https://github.com/V-Droid-Agent/V-Droid.
中文: V-Droid是一种移动GUI任务自动化代理,采用大语言模型作为验证器在执行前评估候选动作,在多项基准测试中不仅任务成功率更高,且响应速度比现有代理快6.1倍。
English: V-Droid is a mobile GUI task automation agent that uses LLMs as verifiers to evaluate candidate actions before execution, achieving higher task success rates and significantly faster response times than existing agents.

Authors:Sidi Yang, Binxiao Huang, Yulun Zhang, Dahai Yu, Yujiu Yang, Ngai Wong
Title: DnLUT: Ultra-Efficient Color Image Denoising via Channel-Aware Lookup Tables
Abstract:
While deep neural networks have revolutionized image denoising capabilities, their deployment on edge devices remains challenging due to substantial computational and memory requirements. To this end, we present DnLUT, an ultra-efficient lookup table-based framework that achieves high-quality color image denoising with minimal resource consumption. Our key innovation lies in two complementary components: a Pairwise Channel Mixer (PCM) that effectively captures inter-channel correlations and spatial dependencies in parallel, and a novel L-shaped convolution design that maximizes receptive field coverage while minimizing storage overhead. By converting these components into optimized lookup tables post-training, DnLUT achieves remarkable efficiency - requiring only 500KB storage and 0.1% energy consumption compared to its CNN contestant DnCNN, while delivering 20X faster inference. Extensive experiments demonstrate that DnLUT outperforms all existing LUT-based methods by over 1dB in PSNR, establishing a new state-of-the-art in resource-efficient color image denoising. The project is available at https://github.com/Stephen0808/DnLUT.
中文: DnLUT是一种基于查找表的超高效框架,通过创新的并行通道混合器和L形卷积设计,仅需500KB存储和DnCNN 0.1%的能耗即可实现高质量彩色图像去噪,推理速度提升20倍,并在PSNR指标上超越现有查找表方法超过1dB。
English: DnLUT is an ultra-efficient lookup table-based framework that achieves high-quality color image denoising with minimal resource consumption, requiring only 500KB storage and 0.1% energy compared to DnCNN while delivering 20X faster inference and outperforming existing LUT methods by over 1dB in PSNR.

Authors:Jiawei Wang, Kai Hu, Qiang Huo
Title: UniHDSA: A Unified Relation Prediction Approach for Hierarchical Document Structure Analysis
Abstract:
Document structure analysis, aka document layout analysis, is crucial for understanding both the physical layout and logical structure of documents, serving information retrieval, document summarization, knowledge extraction, etc. Hierarchical Document Structure Analysis (HDSA) specifically aims to restore the hierarchical structure of documents created using authoring software with hierarchical schemas. Previous research has primarily followed two approaches: one focuses on tackling specific subtasks of HDSA in isolation, such as table detection or reading order prediction, while the other adopts a unified framework that uses multiple branches or modules, each designed to address a distinct task. In this work, we propose a unified relation prediction approach for HDSA, called UniHDSA, which treats various HDSA sub-tasks as relation prediction problems and consolidates relation prediction labels into a unified label space. This allows a single relation prediction module to handle multiple tasks simultaneously, whether at a page-level or document-level structure analysis. To validate the effectiveness of UniHDSA, we develop a multimodal end-to-end system based on Transformer architectures. Extensive experimental results demonstrate that our approach achieves state-of-the-art performance on a hierarchical document structure analysis benchmark, Comp-HRDoc, and competitive results on a large-scale document layout analysis dataset, DocLayNet, effectively illustrating the superiority of our method across all sub-tasks. The Comp-HRDoc benchmark and UniHDSA's configurations are publicly available at https://github.com/microsoft/CompHRDoc.
Chinese: 本文提出UniHDSA方法,通过将层次化文档结构分析任务统一为关系预测问题,基于多模态Transformer架构实现了最优性能,在多个基准测试中验证了其全面优势。
English: This paper introduces UniHDSA, a unified relation prediction approach that consolidates various hierarchical document structure analysis tasks into a single module, achieving state-of-the-art performance on benchmarks through a multimodal Transformer-based system.

Authors:Baolong Bi, Shenghua Liu, Yiwei Wang, Yilong Xu, Junfeng Fang, Lingrui Mei, Xueqi Cheng
Title: Parameters vs. Context: Fine-Grained Control of Knowledge Reliance in Language Models
Abstract:
Retrieval-Augmented Generation (RAG) mitigates hallucinations in Large Language Models (LLMs) by integrating external knowledge. However, conflicts between parametric knowledge and retrieved context pose challenges, particularly when retrieved information is unreliable or the model's internal knowledge is outdated. In such cases, LLMs struggle to determine whether to rely more on their own parameters or the conflicted context. To address this, we propose **CK-PLUG**, a plug-and-play method for controlling LLMs' reliance on parametric and contextual knowledge. We introduce a novel knowledge consistency metric, Confidence Gain, which detects knowledge conflicts by measuring entropy shifts in token probability distributions after context insertion. CK-PLUG then enables fine-grained control over knowledge preference by adjusting the probability distribution of tokens with negative confidence gain through a single tuning parameter. Experiments demonstrate CK-PLUG's ability to significantly regulate knowledge reliance in counterfactual RAG scenarios while maintaining generation fluency and knowledge accuracy. For instance, on Llama3-8B, memory recall (MR) of RAG response can be adjusted within a broad range (9.9%-71.9%), compared to the baseline of 42.1%. Moreover, CK-PLUG supports adaptive control based on the model's confidence in both internal and external knowledge, achieving consistent performance improvements across various general RAG tasks. Our code is available at: $\href{https://github.com/byronBBL/CK-PLUG}{\text{this https URL}}$.
中文: 提出的CK-PLUG方法通过检测熵移的知识冲突并调整标记概率,动态控制大语言模型对参数化知识与上下文知识的依赖程度,在保持生成质量的同时实现了对知识偏好的有效调控,并在多种RAG场景中取得一致性能提升。
English: The proposed CK-PLUG method dynamically controls Large Language Models' reliance on parametric versus contextual knowledge by detecting knowledge conflicts through entropy shifts and adjusting token probabilities, achieving significant regulation of knowledge preference while maintaining generation quality across various RAG scenarios.

Authors:DongGeon Lee, Ahjeong Park, Hyeri Lee, Hyeonseo Nam, Yunho Maeng
Title: Typed-RAG: Type-Aware Decomposition of Non-Factoid Questions for Retrieval-Augmented Generation
Abstract:
Addressing non-factoid question answering (NFQA) remains challenging due to its open-ended nature, diverse user intents, and need for multi-aspect reasoning. These characteristics often reveal the limitations of conventional retrieval-augmented generation (RAG) approaches. To overcome these challenges, we propose Typed-RAG, a framework for type-aware decomposition of non-factoid questions (NFQs) within the RAG paradigm. Specifically, Typed-RAG first classifies an NFQ into a predefined type (e.g., Debate, Experience, Comparison). It then decomposes the question into focused sub-queries, each focusing on a single aspect. This decomposition enhances both retrieval relevance and answer quality. By combining the results of these sub-queries, Typed-RAG produces more informative and contextually aligned responses. Additionally, we construct Wiki-NFQA, a benchmark dataset for NFQA covering a wide range of NFQ types. Experiments show that Typed-RAG consistently outperforms existing QA approaches based on LLMs or RAG methods, validating the effectiveness of type-aware decomposition for improving both retrieval quality and answer generation in NFQA. Our code and dataset are available on https://github.com/TeamNLP/Typed-RAG.
Chinese: Typed-RAG框架通过将非事实性问题分类为预定义类型并分解为聚焦子查询,提升了检索相关性和答案质量,在Wiki-NFQA基准测试中的实验验证了其有效性。
English: The Typed-RAG framework enhances non-factoid question answering by classifying questions into predefined types and decomposing them into focused sub-queries, improving retrieval relevance and answer quality, as validated by experiments on the Wiki-NFQA benchmark.

Authors:Haiguang Wang, Daqi Liu, Hongwei Xie, Haisong Liu, Enhui Ma, Kaicheng Yu, Limin Wang, Bing Wang
Title: MiLA: Multi-view Intensive-fidelity Long-term Video Generation World Model for Autonomous Driving
Abstract:
In recent years, data-driven techniques have greatly advanced autonomous driving systems, but the need for rare and diverse training data remains a challenge, requiring significant investment in equipment and labor. World models, which predict and generate future environmental states, offer a promising solution by synthesizing annotated video data for training. However, existing methods struggle to generate long, consistent videos without accumulating errors, especially in dynamic scenes. To address this, we propose MiLA, a novel framework for generating high-fidelity, long-duration videos up to one minute. MiLA utilizes a Coarse-to-Re(fine) approach to both stabilize video generation and correct distortion of dynamic objects. Additionally, we introduce a Temporal Progressive Denoising Scheduler and Joint Denoising and Correcting Flow modules to improve the quality of generated videos. Extensive experiments on the nuScenes dataset show that MiLA achieves state-of-the-art performance in video generation quality. For more information, visit the project website: https://github.com/xiaomi-mlab/mila.github.io.
Chinese: MiLA框架通过粗到精的方法和专门模块,解决了自动驾驶训练中生成长时、一致性视频的难题,在nuScenes数据集上实现了最先进的性能。
English: The MiLA framework addresses the challenge of generating long, consistent videos for autonomous driving training by using a coarse-to-refine approach and specialized modules, achieving state-of-the-art performance on the nuScenes dataset.

Authors:Zhenglin Zhou, Fan Ma, Hehe Fan, Tat-Seng Chua
Title: Zero-1-to-A: Zero-Shot One Image to Animatable Head Avatars Using Video Diffusion
Abstract:
Animatable head avatar generation typically requires extensive data for training. To reduce the data requirements, a natural solution is to leverage existing data-free static avatar generation methods, such as pre-trained diffusion models with score distillation sampling (SDS), which align avatars with pseudo ground-truth outputs from the diffusion model. However, directly distilling 4D avatars from video diffusion often leads to over-smooth results due to spatial and temporal inconsistencies in the generated video. To address this issue, we propose Zero-1-to-A, a robust method that synthesizes a spatial and temporal consistency dataset for 4D avatar reconstruction using the video diffusion model. Specifically, Zero-1-to-A iteratively constructs video datasets and optimizes animatable avatars in a progressive manner, ensuring that avatar quality increases smoothly and consistently throughout the learning process. This progressive learning involves two stages: (1) Spatial Consistency Learning fixes expressions and learns from front-to-side views, and (2) Temporal Consistency Learning fixes views and learns from relaxed to exaggerated expressions, generating 4D avatars in a simple-to-complex manner. Extensive experiments demonstrate that Zero-1-to-A improves fidelity, animation quality, and rendering speed compared to existing diffusion-based methods, providing a solution for lifelike avatar creation. Code is publicly available at: https://github.com/ZhenglinZhou/Zero-1-to-A.
Chinese: Zero-1-to-A提出了一种渐进式学习方法,通过视频扩散模型构建时空一致性数据集,实现了高保真4D虚拟人生成,显著提升了动画质量和渲染效率。
English: Zero-1-to-A introduces a progressive learning method that constructs spatial and temporal consistency datasets through video diffusion models, enabling high-fidelity 4D avatar generation with improved animation quality and rendering efficiency.

Authors:Zhiyu An, Zhibo Hou, Wan Du
Title: Disentangling Uncertainties by Learning Compressed Data Representation
Abstract:
We study aleatoric and epistemic uncertainty estimation in a learned regressive system dynamics model. Disentangling aleatoric uncertainty (the inherent randomness of the system) from epistemic uncertainty (the lack of data) is crucial for downstream tasks such as risk-aware control and reinforcement learning, efficient exploration, and robust policy transfer. While existing approaches like Gaussian Processes, Bayesian networks, and model ensembles are widely adopted, they suffer from either high computational complexity or inaccurate uncertainty estimation. To address these limitations, we propose the Compressed Data Representation Model (CDRM), a framework that learns a neural network encoding of the data distribution and enables direct sampling from the output distribution. Our approach incorporates a novel inference procedure based on Langevin dynamics sampling, allowing CDRM to predict arbitrary output distributions rather than being constrained to a Gaussian prior. Theoretical analysis provides the conditions where CDRM achieves better memory and computational complexity compared to bin-based compression methods. Empirical evaluations show that CDRM demonstrates a superior capability to identify aleatoric and epistemic uncertainties separately, achieving AUROCs of 0.8876 and 0.9981 on a single test set containing a mixture of both uncertainties. Qualitative results further show that CDRM's capability extends to datasets with multimodal output distributions, a challenging scenario where existing methods consistently fail. Code and supplementary materials are available at https://github.com/ryeii/CDRM.
Chinese: 本文提出压缩数据表示模型(CDRM),该框架通过神经网络编码和朗之万动力学采样,能精确区分系统动力学模型中的偶然不确定性和认知不确定性,在计算效率与不确定性估计方面均优于现有方法。
English: This paper introduces the Compressed Data Representation Model (CDRM), a framework that accurately distinguishes aleatoric and epistemic uncertainties in system dynamics models through neural network encoding and Langevin dynamics sampling, outperforming existing methods in both computational efficiency and uncertainty estimation.

Authors:Jingyun Liu, Daiqin Yang, Zhenzhong Chen
Title: Frequency Enhancement for Image Demosaicking
Abstract:
Recovering high-frequency textures in image demosaicking remains a challenging issue. While existing methods introduced elaborate spatial learning methods, they still exhibit limited performance. To address this issue, a frequency enhancement approach is proposed. Based on the frequency analysis of color filter array (CFA)/demosaicked/ground truth images, we propose Dual-path Frequency Enhancement Network (DFENet), which reconstructs RGB images in a divide-and-conquer manner through fourier-domain frequency selection. In DFENet, two frequency selectors are employed, each selecting a set of frequency components for processing along separate paths. One path focuses on generating missing information through detail refinement in spatial domain, while the other aims at suppressing undesirable frequencies with the guidance of CFA images in frequency domain. Multi-level frequency supervision with a stagewise training strategy is employed to further improve the reconstruction performance. With these designs, the proposed DFENet outperforms other state-of-the-art algorithms on different datasets and demonstrates significant advantages on hard cases. Moreover, to better assess algorithms' ability to reconstruct high-frequency textures, a new dataset, LineSet37, is contributed, which consists of 37 artificially designed and generated images. These images feature complex line patterns and are prone to severe visual artifacts like color moiré after demosaicking. Experiments on LineSet37 offer a more targeted evaluation of performance on challenging cases. The code and dataset are available at https://github.com/VelvetReverie/DFENet-demosaicking.
中文摘要:提出的双路径频率增强网络(DFENet)通过空间域细节优化和频域干扰抑制的双路径设计,有效提升了图像去马赛克中高频纹理的恢复能力,在多个数据集上超越现有方法,并专门构建了针对高频重建评估的新数据集。
English Summary: The proposed Dual-path Frequency Enhancement Network (DFENet) improves image demosaicking by processing frequency components through separate spatial refinement and frequency suppression paths, outperforming existing methods and introducing a specialized dataset for evaluating high-frequency texture reconstruction.

Authors:Tsunehiko Tanaka, Edgar Simo-Serra
Title: Grammar and Gameplay-aligned RL for Game Description Generation with LLMs
Abstract:
Game Description Generation (GDG) is the task of generating a game description written in a Game Description Language (GDL) from natural language text. Previous studies have explored generation methods leveraging the contextual understanding capabilities of Large Language Models (LLMs); however, accurately reproducing the game features of the game descriptions remains a challenge. In this paper, we propose reinforcement learning-based fine-tuning of LLMs for GDG (RLGDG). Our training method simultaneously improves grammatical correctness and fidelity to game concepts by introducing both grammar rewards and concept rewards. Furthermore, we adopt a two-stage training strategy where Reinforcement Learning (RL) is applied following Supervised Fine-Tuning (SFT). Experimental results demonstrate that our proposed method significantly outperforms baseline methods using SFT alone. Our code is available at https://github.com/tsunehiko/rlgdg
中文: 本文提出RLGDG方法,通过结合语法奖励和概念奖励的强化学习微调大语言模型,显著提升了游戏描述生成的准确性和概念保真度,效果优于单纯监督微调。
English: This paper introduces RLGDG, a reinforcement learning-based fine-tuning method for Large Language Models that enhances Game Description Generation by combining grammar and concept rewards, significantly outperforming supervised fine-tuning alone.

Authors:Joanikij Chulev, Angela Mladenovska
Title: Line Space Clustering (LSC): Feature-Based Clustering using K-medians and Dynamic Time Warping for Versatility
Abstract:
Clustering high-dimensional data is a critical challenge in machine learning due to the curse of dimensionality and the presence of noise. Traditional clustering algorithms often fail to capture the intrinsic structures in such data. This paper explores a combination of clustering methods, which we called Line Space Clustering (LSC), a representation that transforms data points into lines in a newly defined feature space, enabling clustering based on the similarity of feature value patterns, essentially treating features as sequences. LSC employs a combined distance metric that uses Euclidean and Dynamic Time Warping (DTW) distances, weighted by a parameter α, allowing flexibility in emphasizing shape or magnitude similarities. We delve deeply into the mechanics of DTW and the Savitzky Golay filter, explaining their roles in the algorithm. Extensive experiments demonstrate the efficacy of LSC on synthetic and real-world datasets, showing that randomly experimenting with time-series optimized methods sometimes might surprisingly work on a complex dataset, particularly in noisy environments. Source code and experiments are available at: https://github.com/JoanikijChulev/LSC.
中文: 本文提出线空间聚类(LSC)新方法,通过将高维数据转换为线条并采用欧几里得-DTW混合距离度量,在噪声环境中有效捕捉特征模式的内在结构,实现精准聚类。
English: This paper introduces Line Space Clustering (LSC), a novel method that transforms high-dimensional data into lines and uses a combined Euclidean-DTW distance metric to effectively cluster noisy datasets by capturing intrinsic feature patterns.

Authors:Panagiota Moraiti, Efstathios Karypidis
Title: Technical Report for the 5th CLVision Challenge at CVPR: Addressing the Class-Incremental with Repetition using Unlabeled Data -- 4th Place Solution
Abstract:
This paper outlines our approach to the 5th CLVision challenge at CVPR, which addresses the Class-Incremental with Repetition (CIR) scenario. In contrast to traditional class incremental learning, this novel setting introduces unique challenges and research opportunities, particularly through the integration of unlabeled data into the training process. In the CIR scenario, encountered classes may reappear in later learning experiences, and each experience may involve only a subset of the overall class distribution. Additionally, the unlabeled data provided during training may include instances of unseen classes, or irrelevant classes which should be ignored. Our approach focuses on retaining previously learned knowledge by utilizing knowledge distillation and pseudo-labeling techniques. The key characteristic of our method is the exploitation of unlabeled data during training, in order to maintain optimal performance on instances of previously encountered categories and reduce the detrimental effects of catastrophic forgetting. Our method achieves an average accuracy of 16.68\% during the pre-selection phase and 21.19% during the final evaluation phase, outperforming the baseline accuracy of 9.39%. We provide the implementation code at https://github.com/panagiotamoraiti/continual-learning-challenge-2024 .
中文: 本文针对重复类增量学习场景,提出了一种结合知识蒸馏和伪标签技术的方法,利用未标记数据缓解灾难性遗忘,显著超越了基线准确率。
English: This paper presents a method for the Class-Incremental with Repetition challenge, utilizing knowledge distillation and pseudo-labeling on unlabeled data to mitigate catastrophic forgetting and improve accuracy over the baseline.

Authors:Jiaqi Liu, Jichao Zhang, Paolo Rota, Nicu Sebe
Title: Multi-focal Conditioned Latent Diffusion for Person Image Synthesis
Abstract:
The Latent Diffusion Model (LDM) has demonstrated strong capabilities in high-resolution image generation and has been widely employed for Pose-Guided Person Image Synthesis (PGPIS), yielding promising results. However, the compression process of LDM often results in the deterioration of details, particularly in sensitive areas such as facial features and clothing textures. In this paper, we propose a Multi-focal Conditioned Latent Diffusion (MCLD) method to address these limitations by conditioning the model on disentangled, pose-invariant features from these sensitive regions. Our approach utilizes a multi-focal condition aggregation module, which effectively integrates facial identity and texture-specific information, enhancing the model's ability to produce appearance realistic and identity-consistent images. Our method demonstrates consistent identity and appearance generation on the DeepFashion dataset and enables flexible person image editing due to its generation consistency. The code is available at https://github.com/jqliu09/mcld.
潜在扩散模型在生成图像时易丢失面部和服装等敏感区域的细节,因此提出的多焦点条件潜在扩散方法通过分离这些区域的特征来增强图像真实感,在DeepFashion等数据集上提升了身份一致性和编辑灵活性。
The Latent Diffusion Model often loses fine details in sensitive areas like faces and clothing, so the proposed Multi-focal Conditioned Latent Diffusion method enhances image realism by focusing on disentangled features from these regions, improving identity consistency and editing flexibility on datasets like DeepFashion.

Authors:Fausto German, Brian Keith, Chris North
Title: Narrative Trails: A Method for Coherent Storyline Extraction via Maximum Capacity Path Optimization
Abstract:
Traditional information retrieval is primarily concerned with finding relevant information from large datasets without imposing a structure within the retrieved pieces of data. However, structuring information in the form of narratives--ordered sets of documents that form coherent storylines--allows us to identify, interpret, and share insights about the connections and relationships between the ideas presented in the data. Despite their significance, current approaches for algorithmically extracting storylines from data are scarce, with existing methods primarily relying on intricate word-based heuristics and auxiliary document structures. Moreover, many of these methods are difficult to scale to large datasets and general contexts, as they are designed to extract storylines for narrow tasks. In this paper, we propose Narrative Trails, an efficient, general-purpose method for extracting coherent storylines in large text corpora. Specifically, our method uses the semantic-level information embedded in the latent space of deep learning models to build a sparse coherence graph and extract narratives that maximize the minimum coherence of the storylines. By quantitatively evaluating our proposed methods on two distinct narrative extraction tasks, we show the generalizability and scalability of Narrative Trails in multiple contexts while also simplifying the extraction pipeline.
中文: 传统信息检索主要关注从大数据集中查找相关信息而不对数据进行结构化,但本文提出的Narrative Trails方法利用深度学习模型中的语义信息构建稀疏连贯图来提取连贯叙事线,在多种场景下展现出可扩展性和简化流程的优势。
English: Traditional information retrieval focuses on finding relevant data without structuring it, but Narrative Trails introduces an efficient, general-purpose method using semantic information from deep learning models to extract coherent storylines, demonstrating scalability and simplicity across multiple contexts.

Authors:Cédric Vincent, Taehyoung Kim, Henri Meeß
Title: High Temporal Consistency through Semantic Similarity Propagation in Semi-Supervised Video Semantic Segmentation for Autonomous Flight
Abstract:
Semantic segmentation from RGB cameras is essential to the perception of autonomous flying vehicles. The stability of predictions through the captured videos is paramount to their reliability and, by extension, to the trustworthiness of the agents. In this paper, we propose a lightweight video semantic segmentation approach-suited to onboard real-time inference-achieving high temporal consistency on aerial data through Semantic Similarity Propagation across frames. SSP temporally propagates the predictions of an efficient image segmentation model with global registration alignment to compensate for camera movements. It combines the current estimation and the prior prediction with linear interpolation using weights computed from the features similarities of the two frames. Because data availability is a challenge in this domain, we propose a consistency-aware Knowledge Distillation training procedure for sparsely labeled datasets with few annotations. Using a large image segmentation model as a teacher to train the efficient SSP, we leverage the strong correlations between labeled and unlabeled frames in the same training videos to obtain high-quality supervision on all frames. KD-SSP obtains a significant temporal consistency increase over the base image segmentation model of 12.5% and 6.7% TC on UAVid and RuralScapes respectively, with higher accuracy and comparable inference speed. On these aerial datasets, KD-SSP provides a superior segmentation quality and inference speed trade-off than other video methods proposed for general applications and shows considerably higher consistency. Project page: https://github.com/FraunhoferIVI/SSP.
Chinese Summary: 本文提出KD-SSP轻量级视频语义分割方法,通过语义相似性传播和一致性感知知识蒸馏提升航拍数据的时序稳定性,在分割质量和速度上实现更优平衡。
English Summary: This paper introduces KD-SSP, a lightweight video semantic segmentation method that enhances temporal consistency in aerial data through Semantic Similarity Propagation and consistency-aware Knowledge Distillation, achieving superior segmentation quality and speed.

Authors:Yuming Gu, Phong Tran, Yujian Zheng, Hongyi Xu, Heyuan Li, Adilbek Karmanov, Hao Li
Title: DiffPortrait360: Consistent Portrait Diffusion for 360 View Synthesis
Abstract:
Generating high-quality 360-degree views of human heads from single-view images is essential for enabling accessible immersive telepresence applications and scalable personalized content creation. While cutting-edge methods for full head generation are limited to modeling realistic human heads, the latest diffusion-based approaches for style-omniscient head synthesis can produce only frontal views and struggle with view consistency, preventing their conversion into true 3D models for rendering from arbitrary angles. We introduce a novel approach that generates fully consistent 360-degree head views, accommodating human, stylized, and anthropomorphic forms, including accessories like glasses and hats. Our method builds on the DiffPortrait3D framework, incorporating a custom ControlNet for back-of-head detail generation and a dual appearance module to ensure global front-back consistency. By training on continuous view sequences and integrating a back reference image, our approach achieves robust, locally continuous view synthesis. Our model can be used to produce high-quality neural radiance fields (NeRFs) for real-time, free-viewpoint rendering, outperforming state-of-the-art methods in object synthesis and 360-degree head generation for very challenging input portraits.
中文: 该创新方法通过改进DiffPortrait3D框架,结合定制ControlNet和双重外观模块,可为不同类型头部及配饰生成一致的360度视图,并能创建高质量神经辐射场实现实时自由视角渲染,性能超越现有技术。
English: This novel method generates consistent 360-degree head views for various forms and accessories by enhancing the DiffPortrait3D framework with a custom ControlNet and dual appearance module, enabling high-quality neural radiance fields for real-time free-viewpoint rendering that surpasses current techniques.

Authors:Luc McCutcheon, Bahman Gharesifard, Saber Fallah
Title: Neural Lyapunov Function Approximation with Self-Supervised Reinforcement Learning
Abstract:
Control Lyapunov functions are traditionally used to design a controller which ensures convergence to a desired state, yet deriving these functions for nonlinear systems remains a complex challenge. This paper presents a novel, sample-efficient method for neural approximation of nonlinear Lyapunov functions, leveraging self-supervised Reinforcement Learning (RL) to enhance training data generation, particularly for inaccurately represented regions of the state space. The proposed approach employs a data-driven World Model to train Lyapunov functions from off-policy trajectories. The method is validated on both standard and goal-conditioned robotic tasks, demonstrating faster convergence and higher approximation accuracy compared to the state-of-the-art neural Lyapunov approximation baseline. The code is available at: https://github.com/CAV-Research-Lab/SACLA.git
中文: 本文提出了一种样本高效的神经网络方法,通过自监督强化学习和数据驱动世界模型来近似非线性李雅普诺夫函数,在机器人任务中相比现有基准实现了更快的收敛速度和更高的近似精度。
English: This paper introduces a sample-efficient neural method for approximating nonlinear Lyapunov functions using self-supervised reinforcement learning and a data-driven World Model, achieving superior convergence speed and accuracy in robotic tasks compared to existing baselines.

Authors:Matthew Massey, Abdullah-Al-Zubaer Imran
Title: EarthScape: A Multimodal Dataset for Surficial Geologic Mapping and Earth Surface Analysis
Abstract:
Surficial geologic mapping is essential for understanding Earth surface processes, addressing modern challenges such as climate change and national security, and supporting common applications in engineering and resource management. However, traditional mapping methods are labor-intensive, limiting spatial coverage and introducing potential biases. To address these limitations, we introduce EarthScape, a novel, AI-ready multimodal dataset specifically designed for surficial geologic mapping and Earth surface analysis. EarthScape integrates high-resolution aerial RGB and near-infrared (NIR) imagery, digital elevation models (DEM), multi-scale DEM-derived terrain features, and hydrologic and infrastructure vector data. The dataset provides detailed annotations for seven distinct surficial geologic classes encompassing various geological processes. We present a comprehensive data processing pipeline using open-sourced raw data and establish baseline benchmarks using different spatial modalities to demonstrate the utility of EarthScape. As a living dataset with a vision for expansion, EarthScape bridges the gap between computer vision and Earth sciences, offering a valuable resource for advancing research in multimodal learning, geospatial analysis, and geological mapping. Our code is available at https://github.com/masseygeo/earthscape.
中文: EarthScape是一个专为地表地质测绘设计的多模态AI数据集,它通过整合多种地理空间数据和七类详细地质标注,有效解决了传统测绘方法的局限性,搭建了计算机视觉与地球科学之间的研究桥梁。
English: EarthScape is an AI-ready multimodal dataset designed to overcome the limitations of traditional surficial geologic mapping by integrating diverse geospatial data and providing detailed annotations for seven geologic classes, serving as a bridge between computer vision and Earth sciences.

Authors:Federico Cocchi, Nicholas Moratelli, Davide Caffagni, Sara Sarto, Lorenzo Baraldi, Marcella Cornia, Rita Cucchiara
Title: LLaVA-MORE: A Comparative Study of LLMs and Visual Backbones for Enhanced Visual Instruction Tuning
Abstract:
Recent progress in Multimodal Large Language Models (MLLMs) has highlighted the critical roles of both the visual backbone and the underlying language model. While prior work has primarily focused on scaling these components to billions of parameters, the trade-offs between model size, architecture, and performance remain underexplored. Additionally, inconsistencies in training data and evaluation protocols have hindered direct comparisons, making it difficult to derive optimal design choices. In this paper, we introduce LLaVA-MORE, a new family of MLLMs that integrates recent language models with diverse visual backbones. To ensure fair comparisons, we employ a unified training protocol applied consistently across all architectures. Our analysis systematically explores both small- and medium-scale LLMs -- including Phi-4, LLaMA-3.1, and Gemma-2 -- to evaluate multimodal reasoning, generation, and instruction following, while examining the relationship between model size and performance. Beyond evaluating the LLM impact on final results, we conduct a comprehensive study of various visual encoders, ranging from CLIP-based architectures to alternatives such as DINOv2, SigLIP, and SigLIP2. Additional experiments investigate the effects of increased image resolution and variations in pre-training datasets. Overall, our results provide insights into the design of more effective MLLMs, offering a reproducible evaluation framework that facilitates direct comparisons and can guide future model development. Our source code and trained models are publicly available at: https://github.com/aimagelab/LLaVA-MORE.
Chinese: 本文介绍了LLaVA-MORE系列多模态大语言模型,通过统一训练协议系统评估语言模型与视觉骨干网络的相互作用,为设计更有效的MLLM提供了见解。
English: The paper introduces LLaVA-MORE, a family of multimodal large language models that systematically evaluates the interplay between language models and visual backbones using a unified training protocol to provide insights for designing more effective MLLMs.

Authors:Masud Ahmed, Zahid Hasan, Syed Arefinul Haque, Abu Zaher Md Faridee, Sanjay Purushotham, Suya You, Nirmalya Roy
Title: CAM-Seg: A Continuous-valued Embedding Approach for Semantic Image Generation
Abstract:
Traditional transformer-based semantic segmentation relies on quantized embeddings. However, our analysis reveals that autoencoder accuracy on segmentation mask using quantized embeddings (e.g. VQ-VAE) is 8% lower than continuous-valued embeddings (e.g. KL-VAE). Motivated by this, we propose a continuous-valued embedding framework for semantic segmentation. By reformulating semantic mask generation as a continuous image-to-embedding diffusion process, our approach eliminates the need for discrete latent representations while preserving fine-grained spatial and semantic details. Our key contribution includes a diffusion-guided autoregressive transformer that learns a continuous semantic embedding space by modeling long-range dependencies in image features. Our framework contains a unified architecture combining a VAE encoder for continuous feature extraction, a diffusion-guided transformer for conditioned embedding generation, and a VAE decoder for semantic mask reconstruction. Our setting facilitates zero-shot domain adaptation capabilities enabled by the continuity of the embedding space. Experiments across diverse datasets (e.g., Cityscapes and domain-shifted variants) demonstrate state-of-the-art robustness to distribution shifts, including adverse weather (e.g., fog, snow) and viewpoint variations. Our model also exhibits strong noise resilience, achieving robust performance ($\approx$ 95% AP compared to baseline) under gaussian noise, moderate motion blur, and moderate brightness/contrast variations, while experiencing only a moderate impact ($\approx$ 90% AP compared to baseline) from 50% salt and pepper noise, saturation and hue shifts. Code available: https://github.com/mahmed10/CAMSS.git
Chinese: 本文提出了一种用于语义分割的连续值嵌入框架,通过扩散引导的自回归变换器替代传统量化嵌入,在多个数据集上展现出对领域偏移和噪声的卓越鲁棒性。
English: This paper introduces a continuous-valued embedding framework for semantic segmentation that replaces traditional quantized embeddings with a diffusion-guided autoregressive transformer, achieving superior robustness to domain shifts and noise across various datasets.

Authors:Martin Ritzert, Polina Turishcheva, Laura Hansel, Paul Wollenhaupt, Marissa A. Weis, Alexander S. Ecker
Title: Hierarchical clustering with maximum density paths and mixture models
Abstract:
Hierarchical clustering is an effective, interpretable method for analyzing structure in data. It reveals insights at multiple scales without requiring a predefined number of clusters and captures nested patterns and subtle relationships, which are often missed by flat clustering approaches. However, existing hierarchical clustering methods struggle with high-dimensional data, especially when there are no clear density gaps between modes. In this work, we introduce t-NEB, a probabilistically grounded hierarchical clustering method, which yields state-of-the-art clustering performance on naturalistic high-dimensional data. t-NEB consists of three steps: (1) density estimation via overclustering; (2) finding maximum density paths between clusters; (3) creating a hierarchical structure via bottom-up cluster merging. t-NEB uses a probabilistic parametric density model for both overclustering and cluster merging, which yields both high clustering performance and a meaningful hierarchy, making it a valuable tool for exploratory data analysis. Code is available at https://github.com/ecker-lab/tneb clustering.
Chinese: t-NEB方法是一种基于概率的分层聚类技术,通过密度估计和自底向上合并克服了现有方法处理高维数据的局限,实现了最优聚类性能并构建出有意义的层次结构。
English: The t-NEB method is a probabilistically grounded hierarchical clustering technique that overcomes the limitations of existing methods with high-dimensional data by using density estimation and bottom-up merging to achieve state-of-the-art performance and meaningful hierarchies.

Authors:Alba Márquez-Rodríguez, Miguel Ángel Mohedano-Munoz, Manuel J. Marín-Jiménez, Eduardo Santamaría-García, Giulia Bastianelli, Pedro Jordano, Irene Mendoza
Title: A Bird Song Detector for improving bird identification through Deep Learning: a case study from Doñana
Abstract:
Passive Acoustic Monitoring is a key tool for biodiversity conservation, but the large volumes of unsupervised audio it generates present major challenges for extracting meaningful information. Deep Learning offers promising solutions. BirdNET, a widely used bird identification model, has shown success in many study systems but is limited at local scale due to biases in its training data, which focus on specific locations and target sounds rather than entire soundscapes. A key challenge in bird species identification is that many recordings either lack target species or contain overlapping vocalizations, complicating automatic identification. To address these problems, we developed a multi-stage pipeline for automatic bird vocalization identification in Doñana National Park (SW Spain), a wetland of high conservation concern. We deployed AudioMoth recorders in three main habitats across nine locations and manually annotated 461 minutes of audio, resulting in 3749 labeled segments spanning 34 classes. We first applied a Bird Song Detector to isolate bird vocalizations using spectrogram-based image processing. Then, species were classified using custom models trained at the local scale. Applying the Bird Song Detector before classification improved species identification, as all models performed better when analyzing only the segments where birds were detected. Specifically, the combination of detector and fine-tuned BirdNET outperformed the baseline without detection. This approach demonstrates the effectiveness of integrating a Bird Song Detector with local classification models. These findings highlight the need to adapt general-purpose tools to specific ecological challenges. Automatically detecting bird species helps track the health of this threatened ecosystem, given birds sensitivity to environmental change, and supports conservation planning to reduce biodiversity loss.
中文: 针对多尼亚纳国家公园开发的多阶段识别流程,通过结合鸟类鸣叫检测器与本地优化的分类模型,显著提升了鸟类自动识别的准确性,凸显了针对具体生态挑战定制化通用工具对于有效保护生物多样性的重要性。
English: A multi-stage pipeline integrating a Bird Song Detector with locally fine-tuned classification models was developed to enhance automatic bird species identification in Doñana National Park, demonstrating improved performance over general tools by adapting to specific ecological challenges for effective biodiversity conservation.

Authors:NVIDIA, :, Alisson Azzolini, Junjie Bai, Hannah Brandon, Jiaxin Cao, Prithvijit Chattopadhyay, Huayu Chen, Jinju Chu, Yin Cui, Jenna Diamond, Yifan Ding, Liang Feng, Francesco Ferroni, Rama Govindaraju, Jinwei Gu, Siddharth Gururani, Imad El Hanafi, Zekun Hao, Jacob Huffman, Jingyi Jin, Brendan Johnson, Rizwan Khan, George Kurian, Elena Lantz, Nayeon Lee, Zhaoshuo Li, Xuan Li, Maosheng Liao, Tsung-Yi Lin, Yen-Chen Lin, Ming-Yu Liu, Xiangyu Lu, Alice Luo, Andrew Mathau, Yun Ni, Lindsey Pavao, Wei Ping, David W. Romero, Misha Smelyanskiy, Shuran Song, Lyne Tchapmi, Andrew Z. Wang, Boxin Wang, Haoxiang Wang, Fangyin Wei, Jiashu Xu, Yao Xu, Dinghao Yang, Xiaodong Yang, Zhuolin Yang, Jingxu Zhang, Xiaohui Zeng, Zhe Zhang
Title: Cosmos-Reason1: From Physical Common Sense To Embodied Reasoning
Abstract:
Physical AI systems need to perceive, understand, and perform complex actions in the physical world. In this paper, we present the Cosmos-Reason1 models that can understand the physical world and generate appropriate embodied decisions (e.g., next step action) in natural language through long chain-of-thought reasoning processes. We begin by defining key capabilities for Physical AI reasoning, with a focus on physical common sense and embodied reasoning. To represent physical common sense, we use a hierarchical ontology that captures fundamental knowledge about space, time, and physics. For embodied reasoning, we rely on a two-dimensional ontology that generalizes across different physical embodiments. Building on these capabilities, we develop two multimodal large language models, Cosmos-Reason1-7B and Cosmos-Reason1-56B. We curate data and train our models in two stages: Physical AI supervised fine-tuning (SFT) and Physical AI reinforcement learning (RL). To evaluate our models, we build comprehensive benchmarks for physical common sense and embodied reasoning according to our ontologies. Evaluation results show that Physical AI SFT and RL bring significant improvements. To facilitate the development of Physical AI, we make our code and pre-trained models available under the NVIDIA Open Model License at https://github.com/nvidia-cosmos/cosmos-reason1.
中文:Cosmos-Reason1模型通过长链思维推理过程理解物理世界并生成具身决策,利用分层和二维本体表示物理常识和具身推理,经过监督微调和强化学习训练后展现出显著性能提升。
English: The Cosmos-Reason1 models are designed to understand the physical world and generate embodied decisions through long reasoning processes, utilizing hierarchical and two-dimensional ontologies for physical common sense and embodied reasoning, with significant improvements shown through supervised fine-tuning and reinforcement learning.

Authors:Noam Razin, Zixuan Wang, Hubert Strauss, Stanley Wei, Jason D. Lee, Sanjeev Arora
Title: What Makes a Reward Model a Good Teacher? An Optimization Perspective
Abstract:
The success of Reinforcement Learning from Human Feedback (RLHF) critically depends on the quality of the reward model. However, while this quality is primarily evaluated through accuracy, it remains unclear whether accuracy fully captures what makes a reward model an effective teacher. We address this question from an optimization perspective. First, we prove that regardless of how accurate a reward model is, if it induces low reward variance, then the RLHF objective suffers from a flat landscape. Consequently, even a perfectly accurate reward model can lead to extremely slow optimization, underperforming less accurate models that induce higher reward variance. We additionally show that a reward model that works well for one language model can induce low reward variance, and thus a flat objective landscape, for another. These results establish a fundamental limitation of evaluating reward models solely based on accuracy or independently of the language model they guide. Experiments using models of up to 8B parameters corroborate our theory, demonstrating the interplay between reward variance, accuracy, and reward maximization rate. Overall, our findings highlight that beyond accuracy, a reward model needs to induce sufficient variance for efficient~optimization.
Chinese: 基于人类反馈的强化学习(RLHF)的成功不仅取决于奖励模型的准确性,还要求其能产生足够的奖励方差,因为低方差会导致优化景观平坦和学习缓慢,即使模型准确性很高。
English: The effectiveness of Reinforcement Learning from Human Feedback (RLHF) relies not only on the accuracy of the reward model but also on its ability to induce sufficient reward variance, as low variance can lead to a flat optimization landscape and slow learning, even with high accuracy.

Authors:Noam Razin, Zixuan Wang, Hubert Strauss, Stanley Wei, Jason D. Lee, Sanjeev Arora
Title: What Makes a Reward Model a Good Teacher? An Optimization Perspective
Abstract:
The success of Reinforcement Learning from Human Feedback (RLHF) critically depends on the quality of the reward model. However, while this quality is primarily evaluated through accuracy, it remains unclear whether accuracy fully captures what makes a reward model an effective teacher. We address this question from an optimization perspective. First, we prove that regardless of how accurate a reward model is, if it induces low reward variance, then the RLHF objective suffers from a flat landscape. Consequently, even a perfectly accurate reward model can lead to extremely slow optimization, underperforming less accurate models that induce higher reward variance. We additionally show that a reward model that works well for one language model can induce low reward variance, and thus a flat objective landscape, for another. These results establish a fundamental limitation of evaluating reward models solely based on accuracy or independently of the language model they guide. Experiments using models of up to 8B parameters corroborate our theory, demonstrating the interplay between reward variance, accuracy, and reward maximization rate. Overall, our findings highlight that beyond accuracy, a reward model needs to induce sufficient variance for efficient~optimization.
Chinese: 基于人类反馈的强化学习(RLHF)的成功不仅取决于奖励模型的准确性,还要求其能产生足够的奖励方差,因为低方差会导致优化景观平坦和学习缓慢,即使模型准确性很高。
English: The effectiveness of Reinforcement Learning from Human Feedback (RLHF) relies not only on the accuracy of the reward model but also on its ability to induce sufficient reward variance, as low variance can lead to a flat optimization landscape and slow learning, even with high accuracy.

Authors:Foundation AI Team, Kiran Bhat, Nishchaie Khanna, Karun Channa, Tinghui Zhou, Yiheng Zhu, Xiaoxia Sun, Charles Shang, Anirudh Sudarshan, Maurice Chu, Daiqing Li, Kangle Deng, Jean-Philippe Fauconnier, Tijmen Verhulsdonck, Maneesh Agrawala, Kayvon Fatahalian, Alexander Weiss, Christian Reiser, Ravi Kiran Chirravuri, Ravali Kandur, Alejandro Pelaez, Akash Garg, Michael Palleschi, Jessica Wang, Skylar Litz, Leon Liu, Anying Li, David Harmon, Derek Liu, Liangjun Feng, Denis Goupil, Lukas Kuczynski, Jihyun Yoon, Naveen Marri, Peiye Zhuang, Yinan Zhang, Brian Yin, Haomiao Jiang, Marcel van Workum, Thomas Lane, Bryce Erickson, Salil Pathare, Kyle Price, Steve Han, Yiqing Wang, Anupam Singh, David Baszucki
Title: Cube: A Roblox View of 3D Intelligence
Abstract:
Foundation models trained on vast amounts of data have demonstrated remarkable reasoning and generation capabilities in the domains of text, images, audio and video. Our goal at Roblox is to build such a foundation model for 3D intelligence, a model that can support developers in producing all aspects of a Roblox experience, from generating 3D objects and scenes to rigging characters for animation to producing programmatic scripts describing object behaviors. We discuss three key design requirements for such a 3D foundation model and then present our first step towards building such a model. We expect that 3D geometric shapes will be a core data type and describe our solution for 3D shape tokenizer. We show how our tokenization scheme can be used in applications for text-to-shape generation, shape-to-text generation and text-to-scene generation. We demonstrate how these applications can collaborate with existing large language models (LLMs) to perform scene analysis and reasoning. We conclude with a discussion outlining our path to building a fully unified foundation model for 3D intelligence.
Chinese: Roblox正在构建一个3D智能基础模型,旨在帮助开发者生成3D对象、场景和脚本,目前重点开发了3D形状分词器,支持文本到形状及场景的生成应用。
English: Roblox is developing a foundation model for 3D intelligence to assist developers in generating 3D objects, scenes, and scripts, with a current focus on a 3D shape tokenizer that enables text-to-shape and scene generation applications.

Authors:Boshen Xu, Yuting Mei, Xinbi Liu, Sipeng Zheng, Qin Jin
Title: EgoDTM: Towards 3D-Aware Egocentric Video-Language Pretraining
Abstract:
Egocentric video-language pretraining has significantly advanced video representation learning. Humans perceive and interact with a fully 3D world, developing spatial awareness that extends beyond text-based understanding. However, most previous works learn from 1D text or 2D visual cues, such as bounding boxes, which inherently lack 3D understanding. To bridge this gap, we introduce EgoDTM, an Egocentric Depth- and Text-aware Model, jointly trained through large-scale 3D-aware video pretraining and video-text contrastive learning. EgoDTM incorporates a lightweight 3D-aware decoder to efficiently learn 3D-awareness from pseudo depth maps generated by depth estimation models. To further facilitate 3D-aware video pretraining, we enrich the original brief captions with hand-object visual cues by organically combining several foundation models. Extensive experiments demonstrate EgoDTM's superior performance across diverse downstream tasks, highlighting its superior 3D-aware visual understanding. Our code will be released at https://github.com/xuboshen/EgoDTM.
中文: EgoDTM提出了一种新型以自我为中心的视频-语言预训练模型,通过结合伪深度图和增强字幕来提升三维感知视觉理解能力,在多项任务中表现出卓越性能。
English: EgoDTM introduces a novel egocentric video-language pretraining model that enhances 3D-aware visual understanding by integrating pseudo depth maps and enriched captions, achieving superior performance across various tasks.

Authors:Ruichen Chen, Keith G. Mills, Di Niu
Title: FP4DiT: Towards Effective Floating Point Quantization for Diffusion Transformers
Abstract:
Diffusion Models (DM) have revolutionized the text-to-image visual generation process. However, the large computational cost and model footprint of DMs hinders practical deployment, especially on edge devices. Post-training quantization (PTQ) is a lightweight method to alleviate these burdens without the need for training or fine-tuning. While recent DM PTQ methods achieve W4A8 on integer-based PTQ, two key limitations remain: First, while most existing DM PTQ methods evaluate on classical DMs like Stable Diffusion XL, 1.5 or earlier, which use convolutional U-Nets, newer Diffusion Transformer (DiT) models like the PixArt series, Hunyuan and others adopt fundamentally different transformer backbones to achieve superior image synthesis. Second, integer (INT) quantization is prevailing in DM PTQ but doesn't align well with the network weight and activation distribution, while Floating-Point Quantization (FPQ) is still under-investigated, yet it holds the potential to better align the weight and activation distributions in low-bit settings for DiT. In response, we introduce FP4DiT, a PTQ method that leverages FPQ to achieve W4A6 quantization. Specifically, we extend and generalize the Adaptive Rounding PTQ technique to adequately calibrate weight quantization for FPQ and demonstrate that DiT activations depend on input patch data, necessitating robust online activation quantization techniques. Experimental results demonstrate that FP4DiT outperforms integer-based PTQ at W4A6 and W4A8 precision and generates convincing visual content on PixArt-$α$, PixArt-$Σ$ and Hunyuan in terms of several T2I metrics such as HPSv2 and CLIP.
中文摘要:FP4DiT提出了一种采用浮点量化的后训练量化方法,在扩散变换器上实现W4A6精度,优于基于整数的量化方案,并在PixArt和Hunyuan等模型上生成高质量的图像。
English Summary: FP4DiT introduces a post-training quantization method using floating-point quantization to achieve W4A6 precision for Diffusion Transformers, outperforming integer-based approaches and generating high-quality images on models like PixArt and Hunyuan.

Authors:Tongyao Zhu, Qian Liu, Haonan Wang, Shiqi Chen, Xiangming Gu, Tianyu Pang, Min-Yen Kan
Title: SkyLadder: Better and Faster Pretraining via Context Window Scheduling
Abstract:
Recent advancements in LLM pretraining have featured ever-expanding context windows to process longer sequences. However, our pilot study reveals that models pretrained with shorter context windows consistently outperform their long-context counterparts under a fixed token budget. This finding motivates us to explore an optimal context window scheduling strategy to better balance long-context capability with pretraining efficiency. To this end, we propose SkyLadder, a simple yet effective approach that implements a short-to-long context window transition. SkyLadder preserves strong standard benchmark performance, while matching or exceeding baseline results on long context tasks. Through extensive experiments, we pre-train 1B-parameter models (up to 32K context) and 3B-parameter models (8K context) on 100B tokens, demonstrating that SkyLadder yields consistent gains of up to 3.7% on common benchmarks, while achieving up to 22% faster training speeds compared to baselines. The code is at https://github.com/sail-sg/SkyLadder.
中文摘要:该研究提出SkyLadder方法,通过在预训练中采用从短到长的上下文窗口过渡策略,在保持强大长文本处理能力的同时,实现了基准测试性能提升最高达3.7%,训练速度比基线方法快22%。
English Summary: The study introduces SkyLadder, a method that transitions from short to long context windows during pretraining, achieving up to 3.7% better performance on benchmarks and 22% faster training than baselines while maintaining strong long-context capabilities.

Authors:Yang Tan, Chen Liu, Jingyuan Gao, Banghao Wu, Mingchen Li, Ruilin Wang, Lingrong Zhang, Huiqun Yu, Guisheng Fan, Liang Hong, Bingxin Zhou
Title: VenusFactory: A Unified Platform for Protein Engineering Data Retrieval and Language Model Fine-Tuning
Abstract:
Natural language processing (NLP) has significantly influenced scientific domains beyond human language, including protein engineering, where pre-trained protein language models (PLMs) have demonstrated remarkable success. However, interdisciplinary adoption remains limited due to challenges in data collection, task benchmarking, and application. This work presents VenusFactory, a versatile engine that integrates biological data retrieval, standardized task benchmarking, and modular fine-tuning of PLMs. VenusFactory supports both computer science and biology communities with choices of both a command-line execution and a Gradio-based no-code interface, integrating $40+$ protein-related datasets and $40+$ popular PLMs. All implementations are open-sourced on https://github.com/tyang816/VenusFactory.
中文: VenusFactory是一个多功能引擎,通过整合生物数据检索、任务基准测试和模块化微调,解决了蛋白质语言建模中的跨学科挑战,并提供命令行和无代码界面及丰富的数据集与模型。
English: VenusFactory is a versatile engine that addresses interdisciplinary challenges in protein language modeling by integrating biological data retrieval, task benchmarking, and modular fine-tuning, offering both command-line and no-code interfaces with extensive datasets and models.

Authors:Wei Tang, Yanpeng Sun, Qinying Gu, Zechao Li
Title: Visual Position Prompt for MLLM based Visual Grounding
Abstract:
Although Multimodal Large Language Models (MLLMs) excel at various image-related tasks, they encounter challenges in precisely aligning coordinates with spatial information within images, particularly in position-aware tasks such as visual grounding. This limitation arises from two key factors. First, MLLMs lack explicit spatial references, making it difficult to associate textual descriptions with precise image locations. Second, their feature extraction processes prioritize global context over fine-grained spatial details, leading to weak localization capability. To address these issues, we introduce VPP-LLaVA, an MLLM enhanced with Visual Position Prompt (VPP) to improve its grounding capability. VPP-LLaVA integrates two complementary mechanisms: the global VPP overlays a learnable, axis-like tensor onto the input image to provide structured spatial cues, while the local VPP incorporates position-aware queries to support fine-grained localization.To effectively train our model with spatial guidance, we further introduce VPP-SFT, a curated dataset of 0.6M high-quality visual grounding samples. Designed in a compact format, it enables efficient training and is significantly smaller than datasets used by other MLLMs (e.g., ~21M samples in MiniGPT-v2), yet still provides a strong performance boost. The resulting model, VPP-LLaVA, not only achieves state-of-the-art results on standard visual grounding benchmarks but also demonstrates strong zero-shot generalization to challenging unseen datasets. The code and dataset are available at https://github.com/WayneTomas/VPP-LLaVA.
中文: VPP-LLaVA通过引入视觉位置提示机制,有效解决了多模态大语言模型在图像空间定位中的不足,利用精简数据集训练后在视觉 grounding 任务中达到领先水平并展现强大泛化能力。
English: VPP-LLaVA enhances multimodal large language models by integrating Visual Position Prompts to address spatial alignment challenges, achieving state-of-the-art performance in visual grounding tasks through efficient training with a compact dataset.

Authors:Yuchen Ren, Zhengyu Zhao, Chenhao Lin, Bo Yang, Lu Zhou, Zhe Liu, Chao Shen
Title: Improving Adversarial Transferability on Vision Transformers via Forward Propagation Refinement
Abstract:
Vision Transformers (ViTs) have been widely applied in various computer vision and vision-language tasks. To gain insights into their robustness in practical scenarios, transferable adversarial examples on ViTs have been extensively studied. A typical approach to improving adversarial transferability is by refining the surrogate model. However, existing work on ViTs has restricted their surrogate refinement to backward propagation. In this work, we instead focus on Forward Propagation Refinement (FPR) and specifically refine two key modules of ViTs: attention maps and token embeddings. For attention maps, we propose Attention Map Diversification (AMD), which diversifies certain attention maps and also implicitly imposes beneficial gradient vanishing during backward propagation. For token embeddings, we propose Momentum Token Embedding (MTE), which accumulates historical token embeddings to stabilize the forward updates in both the Attention and MLP blocks. We conduct extensive experiments with adversarial examples transferred from ViTs to various CNNs and ViTs, demonstrating that our FPR outperforms the current best (backward) surrogate refinement by up to 7.0\% on average. We also validate its superiority against popular defenses and its compatibility with other transfer methods. Codes and appendix are available at https://github.com/RYC-98/FPR.
Chinese: 本文提出针对视觉Transformer的前向传播优化方法,通过多样化注意力图和稳定令牌嵌入来提升对抗样本的可迁移性,平均性能优于现有反向优化方法达7.0%。
English: This paper introduces Forward Propagation Refinement (FPR) for Vision Transformers, which enhances adversarial transferability by diversifying attention maps and stabilizing token embeddings, outperforming existing backward-based methods by up to 7.0% on average.

Authors:Pieter Pas, Panagiotis Patrinos
Title: Blocked Cholesky factorization updates of the Riccati recursion using hyperbolic Householder transformations
Abstract:
Newton systems in quadratic programming (QP) methods are often solved using direct Cholesky or LDL factorizations. When the linear systems in successive iterations differ by a low-rank modification (as is common in active set and augmented Lagrangian methods), updating the existing factorization can offer significant performance improvements over recomputing a full Cholesky factorization. We review the hyperbolic Householder transformation, and demonstrate its usefulness in describing low-rank Cholesky factorization updates. By applying this hyperbolic Householder-based framework to the well-known Riccati recursion for solving saddle-point problems with optimal control structure, we develop a novel algorithm for updating the factorizations used in optimization solvers for optimal control. Specifically, the proposed method can be used to efficiently solve the semismooth Newton systems that are at the core of the augmented Lagrangian-based QPALM-OCP solver. An optimized open-source implementation of the proposed factorization update routines is provided as well.
中文: 该摘要提出了一种基于双曲Householder变换的新算法,用于高效更新二次规划方法中的Cholesky分解,特别通过处理低秩修正避免完全重计算,从而优化了QPALM-OCP等最优控制求解器的性能。
English: The abstract presents a novel algorithm using hyperbolic Householder transformations to efficiently update Cholesky factorizations in quadratic programming methods, particularly benefiting optimal control solvers like QPALM-OCP by handling low-rank modifications without full recomputation.

Authors:Hao Tan, Zichang Tan, Jun Li, Ajian Liu, Jun Wan, Zhen Lei
Title: Recover and Match: Open-Vocabulary Multi-Label Recognition through Knowledge-Constrained Optimal Transport
Abstract:
Identifying multiple novel classes in an image, known as open-vocabulary multi-label recognition, is a challenging task in computer vision. Recent studies explore the transfer of powerful vision-language models such as CLIP. However, these approaches face two critical challenges: (1) The local semantics of CLIP are disrupted due to its global pre-training objectives, resulting in unreliable regional predictions. (2) The matching property between image regions and candidate labels has been neglected, relying instead on naive feature aggregation such as average pooling, which leads to spurious predictions from irrelevant regions. In this paper, we present RAM (Recover And Match), a novel framework that effectively addresses the above issues. To tackle the first problem, we propose Ladder Local Adapter (LLA) to enforce refocusing on local regions, recovering local semantics in a memory-friendly way. For the second issue, we propose Knowledge-Constrained Optimal Transport (KCOT) to suppress meaningless matching to non-GT labels by formulating the task as an optimal transport problem. As a result, RAM achieves state-of-the-art performance on various datasets from three distinct domains, and shows great potential to boost the existing methods. Code: https://github.com/EricTan7/RAM.
Chinese: RAM框架通过LLA恢复局部语义和KCOT优化区域与标签匹配,有效解决了开放词汇多标签识别中的关键难题,在多个数据集上实现了最先进的性能。
English: The RAM framework addresses open-vocabulary multi-label recognition challenges by introducing LLA to restore local semantics and KCOT to optimize region-label matching, achieving state-of-the-art results across multiple datasets.

Authors:Junnan Zhu, Min Xiao, Yining Wang, Feifei Zhai, Yu Zhou, Chengqing Zong
Title: TROVE: A Challenge for Fine-Grained Text Provenance via Source Sentence Tracing and Relationship Classification
Abstract:
LLMs have achieved remarkable fluency and coherence in text generation, yet their widespread adoption has raised concerns about content reliability and accountability. In high-stakes domains, it is crucial to understand where and how the content is created. To address this, we introduce the Text pROVEnance (TROVE) challenge, designed to trace each sentence of a target text back to specific source sentences within potentially lengthy or multi-document inputs. Beyond identifying sources, TROVE annotates the fine-grained relationships (quotation, compression, inference, and others), providing a deep understanding of how each target sentence is formed. To benchmark TROVE, we construct our dataset by leveraging three public datasets covering 11 diverse scenarios (e.g., QA and summarization) in English and Chinese, spanning source texts of varying lengths (0-5k, 5-10k, 10k+), emphasizing the multi-document and long-document settings essential for provenance. To ensure high-quality data, we employ a three-stage annotation process: sentence retrieval, GPT-4o provenance, and human provenance. We evaluate 11 LLMs under direct prompting and retrieval-augmented paradigms, revealing that retrieval is essential for robust performance, larger models perform better in complex relationship classification, and closed-source models often lead, yet open-source models show significant promise, particularly with retrieval augmentation. We make our dataset available here: https://github.com/ZNLP/ZNLP-Dataset.
中文: TROVE挑战通过将目标句子溯源至具体来源并标注细粒度关系,评估表明检索增强和大模型在复杂场景中提升性能,同时开源模型展现出潜力。
English: The TROVE challenge is introduced to trace text provenance by linking target sentences to their sources and annotating fine-grained relationships, with evaluations showing retrieval augmentation and larger models enhance performance in complex scenarios.

Authors:Yuanchao Yue, Hui Yuan, Zhengxin Li, Shuai Li, Wei Zhang
Title: EEPNet-V2: Patch-to-Pixel Solution for Efficient Cross-Modal Registration between LiDAR Point Cloud and Camera Image
Abstract:
The primary requirement for cross-modal data fusion is the precise alignment of data from different sensors. However, the calibration between LiDAR point clouds and camera images is typically time-consuming and needs external calibration board or specific environmental features. Cross-modal registration effectively solves this problem by aligning the data directly without requiring external calibration. However, due to the domain gap between the point cloud and the image, existing methods rarely achieve satisfactory registration accuracy while maintaining real-time performance. To address this issue, we propose a framework that projects point clouds into several 2D representations for matching with camera images, which not only leverages the geometric characteristic of LiDAR point clouds effectively but also bridge the domain gap between the point cloud and image. Moreover, to tackle the challenges of cross modal differences and the limited overlap between LiDAR point clouds and images in the image matching task, we introduce a multi-scale feature extraction network to effectively extract features from both camera images and the projection maps of LiDAR point cloud. Additionally, we propose a patch-to-pixel matching network to provide more effective supervision and achieve high accuracy. We validate the performance of our model through experiments on the KITTI and nuScenes datasets. Experimental results demonstrate the the proposed method achieves real-time performance and extremely high registration accuracy. Specifically, on the KITTI dataset, our model achieves a registration accuracy rate of over 99\%. Our code is released at: https://github.com/ESRSchao/EEPNet-V2.
中文: 该框架通过将激光雷达点云投影为二维表示与相机图像匹配,采用多尺度特征提取和像素块匹配网络,在KITTI数据集上实现了实时处理性能并达到超过99%的配准精度。
English: The proposed framework addresses cross-modal registration challenges by projecting LiDAR point clouds into 2D representations for matching with camera images, achieving real-time performance and over 99% accuracy on the KITTI dataset through multi-scale feature extraction and patch-to-pixel matching networks.

Authors:David Wan, Justin Chih-Yao Chen, Elias Stengel-Eskin, Mohit Bansal
Title: MAMM-Refine: A Recipe for Improving Faithfulness in Generation with Multi-Agent Collaboration
Abstract:
Multi-agent collaboration among models has shown promise in reasoning tasks but is underexplored in long-form generation tasks like summarization and question-answering. We extend multi-agent multi-model reasoning to generation, specifically to improving faithfulness through refinement, i.e., revising model-generated outputs to remove factual inconsistencies. We investigate how iterative collaboration among multiple instances and types of large language models (LLMs) enhances subtasks in the refinement process, such as error detection, critiquing unfaithful sentences, and making corrections based on critiques. We design intrinsic evaluations for each subtask, with our findings indicating that both multi-agent (multiple instances) and multi-model (diverse LLM types) approaches benefit error detection and critiquing. Additionally, reframing critiquing and refinement as reranking rather than generation tasks improves multi-agent performance. We consolidate these insights into a final "recipe" called Multi-Agent Multi-Model Refinement (MAMM-Refine), where multi-agent and multi-model collaboration significantly boosts performance on three summarization datasets as well as on long-form question answering, demonstrating the effectiveness and generalizability of our recipe.
Chinese Summary: 多智能体多模型协作通过迭代检测和修正错误,显著提升了摘要和长问答等长文本生成任务中的忠实度,其中MAMM-Refine方法在多个数据集上验证了其有效性和泛化能力。
English Summary: Multi-agent multi-model collaboration enhances long-form generation tasks by refining outputs for improved faithfulness, with the MAMM-Refine method demonstrating significant performance gains in summarization and question-answering through iterative error detection and correction.

Authors:Chentian Wei, Jiewei Chen, Jinzhu Xu
Title: Exploring Large Language Models for Word Games:Who is the Spy?
Abstract:
Word games hold significant research value for natural language processing (NLP), game theory, and related fields due to their rule-based and situational nature. This study explores how large language models (LLMs) can be effectively involved in word games and proposes a training-free framework. "Shei Shi Wo Di" or "Who is the Spy" in English, is a classic word game. Using this game as an example, we introduce a Chain-of-Thought (CoT)-based scheduling framework to enable LLMs to achieve excellent performance in tasks such as inferring role words and disguising their identities. We evaluate the framework's performance based on game success rates and the accuracy of the LLM agents' analytical results. Experimental results affirm the framework's effectiveness, demonstrating notable improvements in LLM performance across multiple datasets. This work highlights the potential of LLMs in mastering situational reasoning and social interactions within structured game environments. Our code is publicly available at https://github.com/ct-wei/Who-is-The-Spy.
中文: 本研究提出了一种无需训练的思维链调度框架,使大语言模型在词语游戏“谁是卧底”中表现出色,有效提升了情境推理和社交互动能力。
English: This study introduces a training-free framework using Chain-of-Thought reasoning to enable large language models to excel in the word game "Who is the Spy," demonstrating improved performance in situational reasoning and social interaction tasks.

Authors:Zechuan Li, Hongshan Yu, Yihao Ding, Jinhao Qiao, Basim Azam, Naveed Akhtar
Title: GO-N3RDet: Geometry Optimized NeRF-enhanced 3D Object Detector
Abstract:
We propose GO-N3RDet, a scene-geometry optimized multi-view 3D object detector enhanced by neural radiance fields. The key to accurate 3D object detection is in effective voxel representation. However, due to occlusion and lack of 3D information, constructing 3D features from multi-view 2D images is challenging. Addressing that, we introduce a unique 3D positional information embedded voxel optimization mechanism to fuse multi-view features. To prioritize neural field reconstruction in object regions, we also devise a double importance sampling scheme for the NeRF branch of our detector. We additionally propose an opacity optimization module for precise voxel opacity prediction by enforcing multi-view consistency constraints. Moreover, to further improve voxel density consistency across multiple perspectives, we incorporate ray distance as a weighting factor to minimize cumulative ray errors. Our unique modules synergetically form an end-to-end neural model that establishes new state-of-the-art in NeRF-based multi-view 3D detection, verified with extensive experiments on ScanNet and ARKITScenes. Code will be available at https://github.com/ZechuanLi/GO-N3RDet.
中文: GO-N3RDet是一种创新的多视角3D物体检测器,通过神经辐射场与场景几何优化相结合,并采用独特的体素增强和采样机制,实现了最先进的检测性能。
English: GO-N3RDet is a novel multi-view 3D object detector that integrates neural radiance fields with scene geometry optimization, achieving state-of-the-art performance through specialized voxel enhancement and sampling mechanisms.

Authors:Sejong Kim, Hyunseo Song, Hyunwoo Seo, Hyunjun Kim
Title: Optimizing Retrieval Strategies for Financial Question Answering Documents in Retrieval-Augmented Generation Systems
Abstract:
Retrieval-Augmented Generation (RAG) has emerged as a promising framework to mitigate hallucinations in Large Language Models (LLMs), yet its overall performance is dependent on the underlying retrieval system. In the finance domain, documents such as 10-K reports pose distinct challenges due to domain-specific vocabulary and multi-hierarchical tabular data. In this work, we introduce an efficient, end-to-end RAG pipeline that enhances retrieval for financial documents through a three-phase approach: pre-retrieval, retrieval, and post-retrieval. In the pre-retrieval phase, various query and corpus preprocessing techniques are employed to enrich input data. During the retrieval phase, we fine-tuned state-of-the-art (SOTA) embedding models with domain-specific knowledge and implemented a hybrid retrieval strategy that combines dense and sparse representations. Finally, the post-retrieval phase leverages Direct Preference Optimization (DPO) training and document selection methods to further refine the results. Evaluations on seven financial question answering datasets-FinDER, FinQABench, FinanceBench, TATQA, FinQA, ConvFinQA, and MultiHiertt-demonstrate substantial improvements in retrieval performance, leading to more accurate and contextually appropriate generation. These findings highlight the critical role of tailored retrieval techniques in advancing the effectiveness of RAG systems for financial applications. A fully replicable pipeline is available on GitHub: https://github.com/seohyunwoo-0407/GAR.
中文: 本研究提出了一种高效端到端的检索增强生成流程,通过三阶段方法优化金融文档检索,在多个金融问答数据集上显著提升了性能。
English: This study introduces an efficient end-to-end Retrieval-Augmented Generation pipeline that enhances financial document retrieval through a three-phase approach, significantly improving performance on financial question answering datasets.

Authors:Ananya Garg, Mohmmad Ayaan, Swara Parekh, Vikranth Udandarao
Title: Food Delivery Time Prediction in Indian Cities Using Machine Learning Models
Abstract:
Accurate prediction of food delivery times significantly impacts customer satisfaction, operational efficiency, and profitability in food delivery services. However, existing studies primarily utilize static historical data and often overlook dynamic, real-time contextual factors crucial for precise prediction, particularly in densely populated Indian cities. This research addresses these gaps by integrating real-time contextual variables such as traffic density, weather conditions, local events, and geospatial data (restaurant and delivery location coordinates) into predictive models. We systematically compare various machine learning algorithms, including Linear Regression, Decision Trees, Bagging, Random Forest, XGBoost, and LightGBM, on a comprehensive food delivery dataset specific to Indian urban contexts. Rigorous data preprocessing and feature selection significantly enhanced model performance. Experimental results demonstrate that the LightGBM model achieves superior predictive accuracy, with an R2 score of 0.76 and Mean Squared Error (MSE) of 20.59, outperforming traditional baseline approaches. Our study thus provides actionable insights for improving logistics strategies in complex urban environments. The complete methodology and code are publicly available for reproducibility and further research.
Chinese: 本研究通过整合交通、天气等实时因素与机器学习算法,提升了印度城市外卖送达时间的预测准确性,其中LightGBM模型以R²得分0.76表现最优,为复杂城市环境下的物流优化提供了可行方案。
English: This study enhances food delivery time prediction in Indian cities by integrating real-time factors like traffic and weather with machine learning, where LightGBM achieved the highest accuracy with an R² score of 0.76, offering practical logistics improvements.

Authors:Àlex Pujol Vidal, Sergio Escalera, Kamal Nasrollahi, Thomas B. Moeslund
Title: Machine Unlearning in Hyperbolic vs. Euclidean Multimodal Contrastive Learning: Adapting Alignment Calibration to MERU
Abstract:
Machine unlearning methods have become increasingly important for selective concept removal in large pre-trained models. While recent work has explored unlearning in Euclidean contrastive vision-language models, the effectiveness of concept removal in hyperbolic spaces remains unexplored. This paper investigates machine unlearning in hyperbolic contrastive learning by adapting Alignment Calibration to MERU, a model that embeds images and text in hyperbolic space to better capture semantic hierarchies. Through systematic experiments and ablation studies, we demonstrate that hyperbolic geometry offers distinct advantages for concept removal, achieving near perfect forgetting with reasonable performance on retained concepts, particularly when scaling to multiple concept removal. Our approach introduces hyperbolic-specific components including entailment calibration and norm regularization that leverage the unique properties of hyperbolic space. Comparative analysis with Euclidean models reveals fundamental differences in unlearning dynamics, with hyperbolic unlearning reorganizing the semantic hierarchy while Euclidean approaches merely disconnect cross-modal associations. These findings not only advance machine unlearning techniques but also provide insights into the geometric properties that influence concept representation and removal in multimodal models. Source code available at https://github.com/alex-pv01/HAC
中文: 本文提出一种针对双曲对比学习模型的机器遗忘方法,通过双曲空间特有的技术重组语义层次结构,在保持保留概念性能的同时实现了更优越的概念消除效果。
English: This paper introduces a machine unlearning method for hyperbolic contrastive learning models, demonstrating superior concept removal through hyperbolic-specific techniques that reorganize semantic hierarchies while maintaining performance on retained concepts.

Authors:Yang Li, Soumya Snigdha Kundu, Maxence Boels, Toktam Mahmoodi, Sebastien Ourselin, Tom Vercauteren, Prokar Dasgupta, Jonathan Shapey, Alejandro Granados
Title: UltraFlwr -- An Efficient Federated Medical and Surgical Object Detection Framework
Abstract:
Object detection shows promise for medical and surgical applications such as cell counting and tool tracking. However, its faces multiple real-world edge deployment challenges including limited high-quality annotated data, data sharing restrictions, and computational constraints. In this work, we introduce UltraFlwr, a framework for federated medical and surgical object detection. By leveraging Federated Learning (FL), UltraFlwr enables decentralized model training across multiple sites without sharing raw data. To further enhance UltraFlwr's efficiency, we propose YOLO-PA, a set of novel Partial Aggregation (PA) strategies specifically designed for YOLO models in FL. YOLO-PA significantly reduces communication overhead by up to 83% per round while maintaining performance comparable to Full Aggregation (FA) strategies. Our extensive experiments on BCCD and m2cai16-tool-locations datasets demonstrate that YOLO-PA not only provides better client models compared to client-wise centralized training and FA strategies, but also facilitates efficient training and deployment across resource-constrained edge devices. Further, we also establish one of the first benchmarks in federated medical and surgical object detection. This paper advances the feasibility of training and deploying detection models on the edge, making federated object detection more practical for time-critical and resource-constrained medical and surgical applications. UltraFlwr is publicly available at https://github.com/KCL-BMEIS/UltraFlwr.
中文:UltraFlwr提出了一种用于医疗目标检测的联邦学习框架,可在不共享数据的情况下实现分布式训练,其YOLO-PA策略将通信成本降低83%的同时保持与完全聚合相当的性能。
English: UltraFlwr introduces a federated learning framework for medical object detection that enables decentralized training without data sharing, while its YOLO-PA strategy reduces communication costs by 83% and maintains performance comparable to full aggregation.

Authors:Joost Luijmes, Alexander Gielisse, Roman Knyazhitskiy, Jan van Gemert
Title: ARC: Anchored Representation Clouds for High-Resolution INR Classification
Abstract:
Implicit neural representations (INRs) encode signals in neural network weights as a memory-efficient representation, decoupling sampling resolution from the associated resource costs. Current INR image classification methods are demonstrated on low-resolution data and are sensitive to image-space transformations. We attribute these issues to the global, fully-connected MLP neural network architecture encoding of current INRs, which lack mechanisms for local representation: MLPs are sensitive to absolute image location and struggle with high-frequency details. We propose ARC: Anchored Representation Clouds, a novel INR architecture that explicitly anchors latent vectors locally in image-space. By introducing spatial structure to the latent vectors, ARC captures local image data which in our testing leads to state-of-the-art implicit image classification of both low- and high-resolution images and increased robustness against image-space translation. Code can be found at https://github.com/JLuij/anchored_representation_clouds.
中文: 提出的ARC架构通过在图像空间中引入局部锚定潜在向量,克服了现有隐式神经表示的局限,实现了跨分辨率的最先进图像分类和更强的变换鲁棒性。
English: The proposed ARC architecture introduces locally anchored latent vectors in image-space to overcome the limitations of current implicit neural representations, achieving state-of-the-art image classification across resolutions and enhanced robustness to transformations.

Authors:Xing He, Zhe Zhu, Liangliang Nan, Honghua Chen, Jing Qin, Mingqiang Wei
Title: PointSFDA: Source-free Domain Adaptation for Point Cloud Completion
Abstract:
Conventional methods for point cloud completion, typically trained on synthetic datasets, face significant challenges when applied to out-of-distribution real-world scans. In this paper, we propose an effective yet simple source-free domain adaptation framework for point cloud completion, termed \textbf{PointSFDA}. Unlike unsupervised domain adaptation that reduces the domain gap by directly leveraging labeled source data, PointSFDA uses only a pretrained source model and unlabeled target data for adaptation, avoiding the need for inaccessible source data in practical scenarios. Being the first source-free domain adaptation architecture for point cloud completion, our method offers two core contributions. First, we introduce a coarse-to-fine distillation solution to explicitly transfer the global geometry knowledge learned from the source dataset. Second, as noise may be introduced due to domain gaps, we propose a self-supervised partial-mask consistency training strategy to learn local geometry information in the target domain. Extensive experiments have validated that our method significantly improves the performance of state-of-the-art networks in cross-domain shape completion. Our code is available at \emph{\textcolor{magenta}{https://github.com/Starak-x/PointSFDA}}.
中文摘要:本文提出PointSFDA框架,通过粗到精的知识蒸馏迁移全局几何信息,并采用自监督局部掩码一致性训练学习目标域局部几何,有效提升了点云补全在跨域场景中的性能。
English Summary: The paper introduces PointSFDA, a source-free domain adaptation framework that enhances point cloud completion by transferring global geometry knowledge through coarse-to-fine distillation and learning local geometry via self-supervised consistency training, significantly improving cross-domain performance.

Authors:Nikola Đukić, Tim Lebailly, Tinne Tuytelaars
Title: Object-Centric Pretraining via Target Encoder Bootstrapping
Abstract:
Object-centric representation learning has recently been successfully applied to real-world datasets. This success can be attributed to pretrained non-object-centric foundation models, whose features serve as reconstruction targets for slot attention. However, targets must remain frozen throughout the training, which sets an upper bound on the performance object-centric models can attain. Attempts to update the target encoder by bootstrapping result in large performance drops, which can be attributed to its lack of object-centric inductive biases, causing the object-centric model's encoder to drift away from representations useful as reconstruction targets. To address these limitations, we propose Object-CEntric Pretraining by Target Encoder BOotstrapping, a self-distillation setup for training object-centric models from scratch, on real-world data, for the first time ever. In OCEBO, the target encoder is updated as an exponential moving average of the object-centric model, thus explicitly being enriched with object-centric inductive biases introduced by slot attention while removing the upper bound on performance present in other models. We mitigate the slot collapse caused by random initialization of the target encoder by introducing a novel cross-view patch filtering approach that limits the supervision to sufficiently informative patches. When pretrained on 241k images from COCO, OCEBO achieves unsupervised object discovery performance comparable to that of object-centric models with frozen non-object-centric target encoders pretrained on hundreds of millions of images. The code and pretrained models are publicly available at https://github.com/djukicn/ocebo.
Chinese: OCEBO方法提出了一种自蒸馏框架,通过将目标编码器更新为以目标为中心的模型的指数移动平均值,首次实现了在真实世界数据上从头训练以目标为中心的模型,克服了固定目标带来的性能限制同时保持了目标为中心的归纳偏差。
English: The proposed OCEBO method introduces a self-distillation framework that enables training object-centric models from scratch on real-world data by updating the target encoder as an exponential moving average of the object-centric model, overcoming performance limitations of frozen targets while maintaining object-centric inductive biases.

Authors:Alejandro Almodóvar, Adrián Javaloy, Juan Parras, Santiago Zazo, Isabel Valera
Title: DeCaFlow: A Deconfounding Causal Generative Model
Abstract:
We introduce DeCaFlow, a deconfounding causal generative model. Training once per dataset using just observational data and the underlying causal graph, DeCaFlow enables accurate causal inference on continuous variables under the presence of hidden confounders. Specifically, we extend previous results on causal estimation under hidden confounding to show that a single instance of DeCaFlow provides correct estimates for all causal queries identifiable with do-calculus, leveraging proxy variables to adjust for the causal effects when do-calculus alone is insufficient. Moreover, we show that counterfactual queries are identifiable as long as their interventional counterparts are identifiable, and thus are also correctly estimated by DeCaFlow. Our empirical results on diverse settings (including the Ecoli70 dataset, with 3 independent hidden confounders, tens of observed variables and hundreds of causal queries) show that DeCaFlow outperforms existing approaches, while demonstrating its out-of-the-box applicability to any given causal graph. An implementation can be found in https://github.com/aalmodovares/DeCaFlow
中文: DeCaFlow是一种解混因果生成模型,通过单次训练即可基于观测数据准确估计所有可识别的因果与反事实查询,在多种场景下均优于现有方法。
English: DeCaFlow is a deconfounding causal generative model that accurately estimates all identifiable causal and counterfactual queries from observational data using a single training instance, outperforming existing methods across diverse settings.

Authors:Zinqin Huang, Gu Wang, Chenyangguang Zhang, Ruida Zhang, Xiu Li, Xiangyang Ji
Title: GIVEPose: Gradual Intra-class Variation Elimination for RGB-based Category-Level Object Pose Estimation
Abstract:
Recent advances in RGBD-based category-level object pose estimation have been limited by their reliance on precise depth information, restricting their broader applicability. In response, RGB-based methods have been developed. Among these methods, geometry-guided pose regression that originated from instance-level tasks has demonstrated strong performance. However, we argue that the NOCS map is an inadequate intermediate representation for geometry-guided pose regression method, as its many-to-one correspondence with category-level pose introduces redundant instance-specific information, resulting in suboptimal results. This paper identifies the intra-class variation problem inherent in pose regression based solely on the NOCS map and proposes the Intra-class Variation-Free Consensus (IVFC) map, a novel coordinate representation generated from the category-level consensus model. By leveraging the complementary strengths of the NOCS map and the IVFC map, we introduce GIVEPose, a framework that implements Gradual Intra-class Variation Elimination for category-level object pose estimation. Extensive evaluations on both synthetic and real-world datasets demonstrate that GIVEPose significantly outperforms existing state-of-the-art RGB-based approaches, achieving substantial improvements in category-level object pose estimation. Our code is available at https://github.com/ziqin-h/GIVEPose.
中文: 本文提出GIVEPose框架,通过引入类内无差异共识坐标表示解决NOCS映射在类别级物体姿态估计中的不足,在合成与真实数据集上显著优于当前最先进的RGB方法。
English: This paper introduces GIVEPose, a framework that overcomes the limitations of NOCS maps in category-level object pose estimation by proposing an Intra-class Variation-Free Consensus map and achieves superior performance compared to existing RGB-based methods.

Authors:Yang Liu, Qianqian Xu, Peisong Wen, Siran Dai, Qingming Huang
Title: When the Future Becomes the Past: Taming Temporal Correspondence for Self-supervised Video Representation Learning
Abstract:
The past decade has witnessed notable achievements in self-supervised learning for video tasks. Recent efforts typically adopt the Masked Video Modeling (MVM) paradigm, leading to significant progress on multiple video tasks. However, two critical challenges remain: 1) Without human annotations, the random temporal sampling introduces uncertainty, increasing the difficulty of model training. 2) Previous MVM methods primarily recover the masked patches in the pixel space, leading to insufficient information compression for downstream tasks. To address these challenges jointly, we propose a self-supervised framework that leverages Temporal Correspondence for video Representation learning (T-CoRe). For challenge 1), we propose a sandwich sampling strategy that selects two auxiliary frames to reduce reconstruction uncertainty in a two-side-squeezing manner. Addressing challenge 2), we introduce an auxiliary branch into a self-distillation architecture to restore representations in the latent space, generating high-level semantic representations enriched with temporal information. Experiments of T-CoRe consistently present superior performance across several downstream tasks, demonstrating its effectiveness for video representation learning. The code is available at https://github.com/yafeng19/T-CORE.
中文: T-CoRe框架通过三明治采样策略降低时序不确定性,并采用自蒸馏架构在潜在空间恢复表征,有效解决了视频自监督学习中的关键挑战,在多项下游任务中表现优异。
English: The proposed T-CoRe framework tackles self-supervised video learning challenges by introducing sandwich sampling to reduce temporal uncertainty and a self-distillation branch for latent-space representation recovery, achieving superior performance across multiple downstream tasks.

Authors:Zonghao Ying, Guangyi Zheng, Yongxin Huang, Deyue Zhang, Wenxin Zhang, Quanchen Zou, Aishan Liu, Xianglong Liu, Dacheng Tao
Title: Towards Understanding the Safety Boundaries of DeepSeek Models: Evaluation and Findings
Abstract:
This study presents the first comprehensive safety evaluation of the DeepSeek models, focusing on evaluating the safety risks associated with their generated content. Our evaluation encompasses DeepSeek's latest generation of large language models, multimodal large language models, and text-to-image models, systematically examining their performance regarding unsafe content generation. Notably, we developed a bilingual (Chinese-English) safety evaluation dataset tailored to Chinese sociocultural contexts, enabling a more thorough evaluation of the safety capabilities of Chinese-developed models. Experimental results indicate that despite their strong general capabilities, DeepSeek models exhibit significant safety vulnerabilities across multiple risk dimensions, including algorithmic discrimination and sexual content. These findings provide crucial insights for understanding and improving the safety of large foundation models. Our code is available at https://github.com/NY1024/DeepSeek-Safety-Eval.
中文总结:本研究首次对DeepSeek模型进行全面安全评估,发现尽管具备强大通用能力,这些模型在多个风险维度仍存在显著安全漏洞。
English Summary: This study conducts the first comprehensive safety evaluation of DeepSeek models, revealing significant safety vulnerabilities across multiple risk dimensions despite their strong general capabilities.

Authors:Imanol G. Estepa, Jesús M. Rodríguez-de-Vera, Ignacio Sarasúa, Bhalaji Nagarajan, Petia Radeva
Title: Conjuring Positive Pairs for Efficient Unification of Representation Learning and Image Synthesis
Abstract:
While representation learning and generative modeling seek to understand visual data, unifying both domains remains unexplored. Recent Unified Self-Supervised Learning (SSL) methods have started to bridge the gap between both paradigms. However, they rely solely on semantic token reconstruction, which requires an external tokenizer during training -- introducing a significant overhead. In this work, we introduce Sorcen, a novel unified SSL framework, incorporating a synergic Contrastive-Reconstruction objective. Our Contrastive objective, "Echo Contrast", leverages the generative capabilities of Sorcen, eliminating the need for additional image crops or augmentations during training. Sorcen "generates" an echo sample in the semantic token space, forming the contrastive positive pair. Sorcen operates exclusively on precomputed tokens, eliminating the need for an online token transformation during training, thereby significantly reducing computational overhead. Extensive experiments on ImageNet-1k demonstrate that Sorcen outperforms the previous Unified SSL SoTA by 0.4%, 1.48 FID, 1.76%, and 1.53% on linear probing, unconditional image generation, few-shot learning, and transfer learning, respectively, while being 60.8% more efficient. Additionally, Sorcen surpasses previous single-crop MIM SoTA in linear probing and achieves SoTA performance in unconditional image generation, highlighting significant improvements and breakthroughs in Unified SSL models.
中文: Sorcen提出了一种结合对比与重建目标的统一自监督学习框架,无需外部标记器或额外图像裁剪,在多个基准测试中实现了更高的效率和性能突破。
English: Sorcen introduces a unified self-supervised learning framework with a synergic contrastive-reconstruction objective, eliminating the need for external tokenizers and additional image crops while outperforming previous methods in efficiency and performance across multiple benchmarks.

Authors:Cheng Wang, Lingxin Kong, Massimiliano Tamborski, Stefano V. Albrecht
Title: HAD-Gen: Human-like and Diverse Driving Behavior Modeling for Controllable Scenario Generation
Abstract:
Simulation-based testing has emerged as an essential tool for verifying and validating autonomous vehicles (AVs). However, contemporary methodologies, such as deterministic and imitation learning-based driver models, struggle to capture the variability of human-like driving behavior. Given these challenges, we propose HAD-Gen, a general framework for realistic traffic scenario generation that simulates diverse human-like driving behaviors. The framework first clusters the vehicle trajectory data into different driving styles according to safety features. It then employs maximum entropy inverse reinforcement learning on each of the clusters to learn the reward function corresponding to each driving style. Using these reward functions, the method integrates offline reinforcement learning pre-training and multi-agent reinforcement learning algorithms to obtain general and robust driving policies. Multi-perspective simulation results show that our proposed scenario generation framework can simulate diverse, human-like driving behaviors with strong generalization capability. The proposed framework achieves a 90.96% goal-reaching rate, an off-road rate of 2.08%, and a collision rate of 6.91% in the generalization test, outperforming prior approaches by over 20% in goal-reaching performance. The source code is released at https://github.com/RoboSafe-Lab/Sim4AD.
中文: HAD-Gen 是一个创新框架,通过聚类轨迹数据和强化学习模拟多样化类人驾驶行为来生成真实交通场景,在泛化测试中实现了90.96%的目标到达率,性能超越现有方法20%以上。
English: HAD-Gen is a novel framework that generates realistic traffic scenarios by simulating diverse human-like driving behaviors through clustering trajectory data and reinforcement learning, achieving superior performance with a 90.96% goal-reaching rate and outperforming previous methods by over 20%.

Authors:Haoyi Li, Angela Yifei Yuan, Soyeon Caren Han, Christopher Leckie
Title: SPADE: Structured Prompting Augmentation for Dialogue Enhancement in Machine-Generated Text Detection
Abstract:
The increasing capability of large language models (LLMs) to generate synthetic content has heightened concerns about their misuse, driving the development of Machine-Generated Text (MGT) detection models. However, these detectors face significant challenges due to the lack of high-quality synthetic datasets for training. To address this issue, we propose SPADE, a structured framework for detecting synthetic dialogues using prompt-based positive and negative samples. Our proposed methods yield 14 new dialogue datasets, which we benchmark against eight MGT detection models. The results demonstrate improved generalization performance when utilizing a mixed dataset produced by proposed augmentation frameworks, offering a practical approach to enhancing LLM application security. Considering that real-world agents lack knowledge of future opponent utterances, we simulate online dialogue detection and examine the relationship between chat history length and detection accuracy. Our open-source datasets, code and prompts can be downloaded from https://github.com/AngieYYF/SPADE-customer-service-dialogue.
中文:SPADE框架通过基于提示的正负样本提出结构化合成对话检测方法,构建的新数据集提升了检测模型的泛化能力,为大型语言模型应用安全提供了实用解决方案。
English: The SPADE framework introduces a structured approach using prompt-based samples to enhance synthetic dialogue detection, creating new datasets that improve model generalization and security for LLM applications.

Authors:Saad Lahlali, Sandra Kara, Hejer Ammar, Florian Chabot, Nicolas Granger, Hervé Le Borgne, Quoc-Cuong Pham
Title: xMOD: Cross-Modal Distillation for 2D/3D Multi-Object Discovery from 2D motion
Abstract:
Object discovery, which refers to the task of localizing objects without human annotations, has gained significant attention in 2D image analysis. However, despite this growing interest, it remains under-explored in 3D data, where approaches rely exclusively on 3D motion, despite its several challenges. In this paper, we present a novel framework that leverages advances in 2D object discovery which are based on 2D motion to exploit the advantages of such motion cues being more flexible and generalizable and to bridge the gap between 2D and 3D modalities. Our primary contributions are twofold: (i) we introduce DIOD-3D, the first baseline for multi-object discovery in 3D data using 2D motion, incorporating scene completion as an auxiliary task to enable dense object localization from sparse input data; (ii) we develop xMOD, a cross-modal training framework that integrates 2D and 3D data while always using 2D motion cues. xMOD employs a teacher-student training paradigm across the two modalities to mitigate confirmation bias by leveraging the domain gap. During inference, the model supports both RGB-only and point cloud-only inputs. Additionally, we propose a late-fusion technique tailored to our pipeline that further enhances performance when both modalities are available at inference. We evaluate our approach extensively on synthetic (TRIP-PD) and challenging real-world datasets (KITTI and Waymo). Notably, our approach yields a substantial performance improvement compared with the 2D object discovery state-of-the-art on all datasets with gains ranging from +8.7 to +15.1 in F1@50 score. The code is available at https://github.com/CEA-LIST/xMOD
中文: 本文提出了一种新颖框架,通过利用2D运动线索来弥合2D与3D物体发现之间的鸿沟,借助跨模态训练和场景补全技术实现了显著的性能提升。
English: This paper introduces a novel framework that bridges 2D and 3D object discovery by leveraging 2D motion cues, achieving significant performance improvements through cross-modal training and scene completion techniques.

Authors:Yunwei Lan, Zhigao Cui, Chang Liu, Jialun Peng, Nian Wang, Xin Luo, Dong Liu
Title: Exploiting Diffusion Prior for Real-World Image Dehazing with Unpaired Training
Abstract:
Unpaired training has been verified as one of the most effective paradigms for real scene dehazing by learning from unpaired real-world hazy and clear images. Although numerous studies have been proposed, current methods demonstrate limited generalization for various real scenes due to limited feature representation and insufficient use of real-world prior. Inspired by the strong generative capabilities of diffusion models in producing both hazy and clear images, we exploit diffusion prior for real-world image dehazing, and propose an unpaired framework named Diff-Dehazer. Specifically, we leverage diffusion prior as bijective mapping learners within the CycleGAN, a classic unpaired learning framework. Considering that physical priors contain pivotal statistics information of real-world data, we further excavate real-world knowledge by integrating physical priors into our framework. Furthermore, we introduce a new perspective for adequately leveraging the representation ability of diffusion models by removing degradation in image and text modalities, so as to improve the dehazing effect. Extensive experiments on multiple real-world datasets demonstrate the superior performance of our method. Our code https://github.com/ywxjm/Diff-Dehazer.
中文总结:Diff-Dehazer框架通过将扩散模型先验与物理知识融入CycleGAN,利用图像和文本模态的退化去除机制,在非配对图像去雾任务中实现了卓越性能。
English Summary: The proposed Diff-Dehazer framework leverages diffusion model priors integrated with physical knowledge within CycleGAN to achieve superior unpaired image dehazing by effectively removing degradation across image and text modalities.

Authors:Michael Neri, Federica Battisti
Title: Low-Complexity Patch-based No-Reference Point Cloud Quality Metric exploiting Weighted Structure and Texture Features
Abstract:
During the compression, transmission, and rendering of point clouds, various artifacts are introduced, affecting the quality perceived by the end user. However, evaluating the impact of these distortions on the overall quality is a challenging task. This study introduces PST-PCQA, a no-reference point cloud quality metric based on a low-complexity, learning-based framework. It evaluates point cloud quality by analyzing individual patches, integrating local and global features to predict the Mean Opinion Score. In summary, the process involves extracting features from patches, combining them, and using correlation weights to predict the overall quality. This approach allows us to assess point cloud quality without relying on a reference point cloud, making it particularly useful in scenarios where reference data is unavailable. Experimental tests on three state-of-the-art datasets show good prediction capabilities of PST-PCQA, through the analysis of different feature pooling strategies and its ability to generalize across different datasets. The ablation study confirms the benefits of evaluating quality on a patch-by-patch basis. Additionally, PST-PCQA's light-weight structure, with a small number of parameters to learn, makes it well-suited for real-time applications and devices with limited computational capacity. For reproducibility purposes, we made code, model, and pretrained weights available at https://github.com/michaelneri/PST-PCQA.
中文: 本研究提出PST-PCQA无参考质量评估方法,通过分析点云块并结合局部与全局特征来预测质量,在多个数据集上表现优异,其轻量级结构适用于实时应用场景。
English: This study presents PST-PCQA, a no-reference quality metric that evaluates point clouds by analyzing patches with local and global features, demonstrating effective performance across datasets with a lightweight structure suitable for real-time applications.

Authors:Matthew Low, Arian Prabowo, Hao Xue, Flora Salim
Title: Embedding spatial context in urban traffic forecasting with contrastive pre-training
Abstract:
Urban traffic forecasting is a commonly encountered problem, with wide-ranging applications in fields such as urban planning, civil engineering and transport. In this paper, we study the enhancement of traffic forecasting with pre-training, focusing on spatio-temporal graph methods. While various machine learning methods to solve traffic forecasting problems have been explored and extensively studied, there is a gap of a more contextual approach: studying how relevant non-traffic data can improve prediction performance on traffic forecasting problems. We call this data spatial context. We introduce a novel method of combining road and traffic information through the notion of a traffic quotient graph, a quotient graph formed from road geometry and traffic sensors. We also define a way to encode this relationship in the form of a geometric encoder, pre-trained using contrastive learning methods and enhanced with OpenStreetMap data. We introduce and discuss ways to integrate this geometric encoder with existing graph neural network (GNN)-based traffic forecasting models, using a contrastive pre-training paradigm. We demonstrate the potential for this hybrid model to improve generalisation and performance with zero additional traffic data. Code for this paper is available at https://github.com/mattchrlw/forecasting-on-new-roads.
中文: 本文提出了一种新颖的交通预测方法,通过结合道路几何和开放街道地图数据预训练几何编码器来增强时空图模型,无需额外交通数据即可提升预测性能。
English: This paper introduces a novel traffic forecasting method that enhances spatio-temporal graph models by pre-training a geometric encoder with road geometry and OpenStreetMap data, improving performance without requiring additional traffic data.

Authors:Yaxiong Chen, Junjian Hu, Chunlei Li, Zixuan Zheng, Jingliang Hu, Yilei Shi, Shengwu Xiong, Xiao Xiang Zhu, Lichao Mou
Title: One-Shot Medical Video Object Segmentation via Temporal Contrastive Memory Networks
Abstract:
Video object segmentation is crucial for the efficient analysis of complex medical video data, yet it faces significant challenges in data availability and annotation. We introduce the task of one-shot medical video object segmentation, which requires separating foreground and background pixels throughout a video given only the mask annotation of the first frame. To address this problem, we propose a temporal contrastive memory network comprising image and mask encoders to learn feature representations, a temporal contrastive memory bank that aligns embeddings from adjacent frames while pushing apart distant ones to explicitly model inter-frame relationships and stores these features, and a decoder that fuses encoded image features and memory readouts for segmentation. We also collect a diverse, multi-source medical video dataset spanning various modalities and anatomies to benchmark this task. Extensive experiments demonstrate state-of-the-art performance in segmenting both seen and unseen structures from a single exemplar, showing ability to generalize from scarce labels. This highlights the potential to alleviate annotation burdens for medical video analysis. Code is available at https://github.com/MedAITech/TCMN.
中文:研究者提出了一种时序对比记忆网络,用于一次性医学视频对象分割,仅需首帧标注即可完成全视频分割,在多样化医学数据集上实现了领先性能,显著减轻了标注负担。
English: The authors propose a temporal contrastive memory network for one-shot medical video object segmentation, which utilizes a single annotated frame to segment entire videos and demonstrates state-of-the-art performance on a diverse medical dataset, effectively reducing annotation burdens.

Authors:Zihan Cao, Yu Zhong, Liang-Jian Deng
Title: Taming Flow Matching with Unbalanced Optimal Transport into Fast Pansharpening
Abstract:
Pansharpening, a pivotal task in remote sensing for fusing high-resolution panchromatic and multispectral imagery, has garnered significant research interest. Recent advancements employing diffusion models based on stochastic differential equations (SDEs) have demonstrated state-of-the-art performance. However, the inherent multi-step sampling process of SDEs imposes substantial computational overhead, hindering practical deployment. While existing methods adopt efficient samplers, knowledge distillation, or retraining to reduce sampling steps (e.g., from 1,000 to fewer steps), such approaches often compromise fusion quality. In this work, we propose the Optimal Transport Flow Matching (OTFM) framework, which integrates the dual formulation of unbalanced optimal transport (UOT) to achieve one-step, high-quality pansharpening. Unlike conventional OT formulations that enforce rigid distribution alignment, UOT relaxes marginal constraints to enhance modeling flexibility, accommodating the intrinsic spectral and spatial disparities in remote sensing data. Furthermore, we incorporate task-specific regularization into the UOT objective, enhancing the robustness of the flow model. The OTFM framework enables simulation-free training and single-step inference while maintaining strict adherence to pansharpening constraints. Experimental evaluations across multiple datasets demonstrate that OTFM matches or exceeds the performance of previous regression-based models and leading diffusion-based methods while only needing one sampling step. Codes are available at https://github.com/294coder/PAN-OTFM.
Chinese: 本文提出的最优传输流匹配(OTFM)框架利用非平衡最优传输实现单步高质量全色锐化,在保持计算效率的同时超越了现有方法的性能。
English: This paper introduces the Optimal Transport Flow Matching (OTFM) framework, which leverages unbalanced optimal transport to achieve high-quality pansharpening in a single sampling step, outperforming existing methods while maintaining computational efficiency.

Authors:Yifan Li, Shuai Yang, Jiaying Liu
Title: Language-based Image Colorization: A Benchmark and Beyond
Abstract:
Image colorization aims to bring colors back to grayscale images. Automatic image colorization methods, which requires no additional guidance, struggle to generate high-quality images due to color ambiguity, and provides limited user controllability. Thanks to the emergency of cross-modality datasets and models, language-based colorization methods are proposed to fully utilize the efficiency and flexibly of text descriptions to guide colorization. In view of the lack of a comprehensive review of language-based colorization literature, we conduct a thorough analysis and benchmarking. We first briefly summarize existing automatic colorization methods. Then, we focus on language-based methods and point out their core challenge on cross-modal alignment. We further divide these methods into two categories: one attempts to train a cross-modality network from scratch, while the other utilizes the pre-trained cross-modality model to establish the textual-visual correspondence. Based on the analyzed limitations of existing language-based methods, we propose a simple yet effective method based on distilled diffusion model. Extensive experiments demonstrate that our simple baseline can produces better results than previous complex methods with 14 times speed up. To the best of our knowledge, this is the first comprehensive review and benchmark on language-based image colorization field, providing meaningful insights for the community. The code is available at https://github.com/lyf1212/Color-Turbo.
中文摘要:本文首次对基于语言的图像上色方法进行全面综述与基准测试,提出了一种简单高效的蒸馏扩散模型,在实现14倍加速的同时超越了先前复杂方法的性能表现。
English Summary: This paper presents the first comprehensive review and benchmark of language-based image colorization methods, proposing a simple yet effective distilled diffusion model that outperforms previous approaches with significantly faster processing.

Authors:Tingxiu Chen, Yilei Shi, Zixuan Zheng, Bingcong Yan, Jingliang Hu, Xiao Xiang Zhu, Lichao Mou
Title: Ultrasound Image-to-Video Synthesis via Latent Dynamic Diffusion Models
Abstract:
Ultrasound video classification enables automated diagnosis and has emerged as an important research area. However, publicly available ultrasound video datasets remain scarce, hindering progress in developing effective video classification models. We propose addressing this shortage by synthesizing plausible ultrasound videos from readily available, abundant ultrasound images. To this end, we introduce a latent dynamic diffusion model (LDDM) to efficiently translate static images to dynamic sequences with realistic video characteristics. We demonstrate strong quantitative results and visually appealing synthesized videos on the BUSV benchmark. Notably, training video classification models on combinations of real and LDDM-synthesized videos substantially improves performance over using real data alone, indicating our method successfully emulates dynamics critical for discrimination. Our image-to-video approach provides an effective data augmentation solution to advance ultrasound video analysis. Code is available at https://github.com/MedAITech/U_I2V.
中文摘要:本研究提出一种潜在动态扩散模型(LDDM),通过将静态超声图像转化为动态视频序列来解决数据集稀缺问题,结合真实数据训练能显著提升视频分类模型的诊断性能。
English Summary: The study introduces a latent dynamic diffusion model (LDDM) that synthesizes realistic ultrasound videos from static images to address dataset scarcity, significantly enhancing video classification performance when combined with real data.

Authors:Xiaohao Liu, Xiaobo Xia, See-Kiong Ng, Tat-Seng Chua
Title: Continual Multimodal Contrastive Learning
Abstract:
Multimodal Contrastive Learning (MCL) advances in aligning different modalities and generating multimodal representations in a joint space. By leveraging contrastive learning across diverse modalities, large-scale multimodal data enhances representational quality. However, a critical yet often overlooked challenge remains: multimodal data is rarely collected in a single process, and training from scratch is computationally expensive. Instead, emergent multimodal data can be used to optimize existing models gradually, i.e., models are trained on a sequence of modality pair data. We define this problem as Continual Multimodal Contrastive Learning (CMCL), an underexplored yet crucial research direction at the intersection of multimodal and continual learning. In this paper, we formulate CMCL through two specialized principles of stability and plasticity. We theoretically derive a novel optimization-based method, which projects updated gradients from dual sides onto subspaces where any gradient is prevented from interfering with the previously learned knowledge. Two upper bounds provide theoretical insights on both stability and plasticity in our solution. Beyond our theoretical contributions, we conduct experiments on multiple datasets by comparing our method against advanced continual learning baselines. The empirical results further support our claims and demonstrate the efficacy of our method. Our codes are available at https://github.com/Xiaohao-Liu/CMCL.
中文: 本文提出持续多模态对比学习(CMCL),通过一种新颖的优化方法解决序列多模态数据的高效模型更新难题,该方法在保持稳定性和可塑性的同时,得到了理论分析和实验验证的双重支持。
English: This paper introduces Continual Multimodal Contrastive Learning (CMCL), addressing the challenge of efficiently updating models with sequential multimodal data through a novel optimization method that ensures stability and plasticity, supported by both theoretical analysis and experimental validation.

Authors:Zixuan Zheng, Yilei Shi, Chunlei Li, Jingliang Hu, Xiao Xiang Zhu, Lichao Mou
Title: Reducing Annotation Burden: Exploiting Image Knowledge for Few-Shot Medical Video Object Segmentation via Spatiotemporal Consistency Relearning
Abstract:
Few-shot video object segmentation aims to reduce annotation costs; however, existing methods still require abundant dense frame annotations for training, which are scarce in the medical domain. We investigate an extremely low-data regime that utilizes annotations from only a few video frames and leverages existing labeled images to minimize costly video annotations. Specifically, we propose a two-phase framework. First, we learn a few-shot segmentation model using labeled images. Subsequently, to improve performance without full supervision, we introduce a spatiotemporal consistency relearning approach on medical videos that enforces consistency between consecutive frames. Constraints are also enforced between the image model and relearning model at both feature and prediction levels. Experiments demonstrate the superiority of our approach over state-of-the-art few-shot segmentation methods. Our model bridges the gap between abundant annotated medical images and scarce, sparsely labeled medical videos to achieve strong video segmentation performance in this low data regime. Code is available at https://github.com/MedAITech/RAB.
中文: 本研究提出一个两阶段框架用于少样本医学视频目标分割,先利用标注图像训练模型,再通过时空一致性再学习和跨模型约束,在极少量视频标注下实现优越的分割性能。
English: This study introduces a two-phase framework for few-shot medical video object segmentation that first trains on labeled images and then applies spatiotemporal consistency relearning with cross-model constraints to achieve superior performance with minimal video annotations.

Authors:Thanh-Son Nguyen, Hong Yang, Tzeh Yuan Neoh, Hao Zhang, Ee Yeo Keat, Basura Fernando
Title: Neuro Symbolic Knowledge Reasoning for Procedural Video Question Answering
Abstract:
We introduce PKR-QA (Procedural Knowledge Reasoning Question Answering), a new benchmark for question answering over procedural tasks that require structured reasoning. PKR-QA is constructed semi-automatically using a procedural knowledge graph (PKG), which encodes task-specific knowledge across diverse domains. The PKG is built by curating and linking information from the COIN instructional video dataset and the ontology, enriched with commonsense knowledge from ConceptNet and structured outputs from Large Language Models (LLMs), followed by manual verification. To generate question-answer pairs, we design graph traversal templates where each template is applied systematically over PKG. To enable interpretable reasoning, we propose a neurosymbolic approach called Knowledge Module Learning (KML), which learns procedural relations via neural modules and composes them for structured reasoning with LLMs. Experiments demonstrate that this paradigm improves reasoning performance on PKR-QA and enables step-by-step reasoning traces that facilitate interpretability. Code and dataset will be released soon https://github.com/LUNAProject22/KML.
中文: 我们推出了PKR-QA,这是一个基于融合常识与大型语言模型数据的知识图谱半自动构建的程序性问答基准,并提出了KML这一神经符号方法,有效提升了推理性能与可解释性。
English: We introduce PKR-QA, a benchmark for procedural question answering built semi-automatically from a knowledge graph enriched with commonsense and LLM data, and propose KML, a neurosymbolic method that enhances reasoning performance and interpretability.

Authors:Zihan Cao, Yu Zhong, Ziqi Wang, Liang-Jian Deng
Title: MMAIF: Multi-task and Multi-degradation All-in-One for Image Fusion with Language Guidance
Abstract:
Image fusion, a fundamental low-level vision task, aims to integrate multiple image sequences into a single output while preserving as much information as possible from the input. However, existing methods face several significant limitations: 1) requiring task- or dataset-specific models; 2) neglecting real-world image degradations (\textit{e.g.}, noise), which causes failure when processing degraded inputs; 3) operating in pixel space, where attention mechanisms are computationally expensive; and 4) lacking user interaction capabilities. To address these challenges, we propose a unified framework for multi-task, multi-degradation, and language-guided image fusion. Our framework includes two key components: 1) a practical degradation pipeline that simulates real-world image degradations and generates interactive prompts to guide the model; 2) an all-in-one Diffusion Transformer (DiT) operating in latent space, which fuses a clean image conditioned on both the degraded inputs and the generated prompts. Furthermore, we introduce principled modifications to the original DiT architecture to better suit the fusion task. Based on this framework, we develop two versions of the model: Regression-based and Flow Matching-based variants. Extensive qualitative and quantitative experiments demonstrate that our approach effectively addresses the aforementioned limitations and outperforms previous restoration+fusion and all-in-one pipelines. Codes are available at https://github.com/294coder/MMAIF.
中文: 本文提出了一种统一的图像融合框架,通过引入模拟真实退化的流程和潜在空间中的一体化扩散变换器,解决了现有方法的关键局限,实现了多任务、多退化条件下的语言引导融合,并展现出卓越性能。
English: This paper introduces a unified framework for image fusion that addresses key limitations of existing methods by incorporating a degradation pipeline and an all-in-one Diffusion Transformer in latent space, enabling multi-task, multi-degradation processing with language guidance and demonstrating superior performance.

Authors:Siyuan Yan, Ming Hu, Yiwen Jiang, Xieji Li, Hao Fei, Philipp Tschandl, Harald Kittler, Zongyuan Ge
Title: Derm1M: A Million-scale Vision-Language Dataset Aligned with Clinical Ontology Knowledge for Dermatology
Abstract:
The emergence of vision-language models has transformed medical AI, enabling unprecedented advances in diagnostic capability and clinical applications. However, progress in dermatology has lagged behind other medical domains due to the lack of standard image-text pairs. Existing dermatological datasets are limited in both scale and depth, offering only single-label annotations across a narrow range of diseases instead of rich textual descriptions, and lacking the crucial clinical context needed for real-world applications. To address these limitations, we present Derm1M, the first large-scale vision-language dataset for dermatology, comprising 1,029,761 image-text pairs. Built from diverse educational resources and structured around a standard ontology collaboratively developed by experts, Derm1M provides comprehensive coverage for over 390 skin conditions across four hierarchical levels and 130 clinical concepts with rich contextual information such as medical history, symptoms, and skin tone. To demonstrate Derm1M potential in advancing both AI research and clinical application, we pretrained a series of CLIP-like models, collectively called DermLIP, on this dataset. The DermLIP family significantly outperforms state-of-the-art foundation models on eight diverse datasets across multiple tasks, including zero-shot skin disease classification, clinical and artifacts concept identification, few-shot/full-shot learning, and cross-modal retrieval. Our dataset and code will be publicly available at https://github.com/SiyuanYan1/Derm1M upon acceptance.
Chinese: Derm1M数据集通过提供超过一百万张带有丰富临床背景的图像-文本对,解决了皮肤病学中缺乏大规模视觉-语言数据的问题,并基于此开发的DermLIP模型在多项诊断任务中实现了最先进的性能。
English: The Derm1M dataset addresses the lack of large-scale vision-language data in dermatology by providing over one million image-text pairs with rich clinical context, enabling the development of DermLIP models that achieve state-of-the-art performance across multiple diagnostic tasks.

Authors:Siwei Wen, Junyan Ye, Peilin Feng, Hengrui Kang, Zichen Wen, Yize Chen, Jiang Wu, Wenjun Wu, Conghui He, Weijia Li
Title: Spot the Fake: Large Multimodal Model-Based Synthetic Image Detection with Artifact Explanation
Abstract:
With the rapid advancement of Artificial Intelligence Generated Content (AIGC) technologies, synthetic images have become increasingly prevalent in everyday life, posing new challenges for authenticity assessment and detection. Despite the effectiveness of existing methods in evaluating image authenticity and locating forgeries, these approaches often lack human interpretability and do not fully address the growing complexity of synthetic data. To tackle these challenges, we introduce FakeVLM, a specialized large multimodal model designed for both general synthetic image and DeepFake detection tasks. FakeVLM not only excels in distinguishing real from fake images but also provides clear, natural language explanations for image artifacts, enhancing interpretability. Additionally, we present FakeClue, a comprehensive dataset containing over 100,000 images across seven categories, annotated with fine-grained artifact clues in natural language. FakeVLM demonstrates performance comparable to expert models while eliminating the need for additional classifiers, making it a robust solution for synthetic data detection. Extensive evaluations across multiple datasets confirm the superiority of FakeVLM in both authenticity classification and artifact explanation tasks, setting a new benchmark for synthetic image detection. The dataset and code will be released in: https://github.com/opendatalab/FakeVLM.
Chinese: 针对现有合成图像检测方法的不足,FakeVLM作为多模态模型不仅能有效检测伪造图像,还能提供可解释的自然语言说明,并得到FakeClue数据集的全面支持。
English: To address the limitations of existing synthetic image detection methods, FakeVLM is introduced as a multimodal model that excels in detecting fakes and providing interpretable natural language explanations, supported by the comprehensive FakeClue dataset.

Authors:Honglin Lin, Zhuoshi Pan, Yu Li, Qizhi Pei, Xin Gao, Mengzhang Cai, Conghui He, Lijun Wu
Title: MetaLadder: Ascending Mathematical Solution Quality via Analogical-Problem Reasoning Transfer
Abstract:
Large Language Models (LLMs) have demonstrated promising capabilities in solving mathematical reasoning tasks, leveraging Chain-of-Thought (CoT) data as a vital component in guiding answer generation. Current paradigms typically generate CoT and answers directly for a given problem, diverging from human problem-solving strategies to some extent. Humans often solve problems by recalling analogous cases and leveraging their solutions to reason about the current task. Inspired by this cognitive process, we propose \textbf{MetaLadder}, a novel framework that explicitly prompts LLMs to recall and reflect on meta-problems, those structurally or semantically analogous problems, alongside their CoT solutions before addressing the target problem. Additionally, we introduce a problem-restating mechanism to enhance the model's comprehension of the target problem by regenerating the original question, which further improves reasoning accuracy. Therefore, the model can achieve reasoning transfer from analogical problems, mimicking human-like "learning from examples" and generalization abilities. Extensive experiments on mathematical benchmarks demonstrate that our MetaLadder significantly boosts LLMs' problem-solving accuracy, largely outperforming standard CoT-based methods (\textbf{10.3\%} accuracy gain) and other methods. Our code and data has been released at https://github.com/LHL3341/MetaLadder.
中文摘要:MetaLadder框架通过让大语言模型在解决问题前先回忆类似问题及其推理过程,并采用问题重述机制增强理解,实现了类比推理迁移,在数学基准测试中比标准方法显著提升10.3%的准确率。
English Summary: The proposed MetaLadder framework enhances LLMs' mathematical reasoning by prompting them to recall analogous problems and their solutions before addressing target tasks, achieving a 10.3% accuracy improvement over standard methods through problem-restating and analogical reasoning.

Authors:Henrique Morimitsu, Xiaobin Zhu, Roberto M. Cesar, Xiangyang Ji, Xu-Cheng Yin
Title: DPFlow: Adaptive Optical Flow Estimation with a Dual-Pyramid Framework
Abstract:
Optical flow estimation is essential for video processing tasks, such as restoration and action recognition. The quality of videos is constantly increasing, with current standards reaching 8K resolution. However, optical flow methods are usually designed for low resolution and do not generalize to large inputs due to their rigid architectures. They adopt downscaling or input tiling to reduce the input size, causing a loss of details and global information. There is also a lack of optical flow benchmarks to judge the actual performance of existing methods on high-resolution samples. Previous works only conducted qualitative high-resolution evaluations on hand-picked samples. This paper fills this gap in optical flow estimation in two ways. We propose DPFlow, an adaptive optical flow architecture capable of generalizing up to 8K resolution inputs while trained with only low-resolution samples. We also introduce Kubric-NK, a new benchmark for evaluating optical flow methods with input resolutions ranging from 1K to 8K. Our high-resolution evaluation pushes the boundaries of existing methods and reveals new insights about their generalization capabilities. Extensive experimental results show that DPFlow achieves state-of-the-art results on the MPI-Sintel, KITTI 2015, Spring, and other high-resolution benchmarks.
中文: 本文提出了DPFlow,一种仅用低分辨率样本训练即可泛化至8K输入的自适应光流架构,并引入Kubric-NK新基准来评估1K至8K分辨率的光流方法,在多个基准测试中取得了最优结果。
English: This paper introduces DPFlow, an adaptive optical flow architecture that generalizes to 8K resolution using only low-resolution training data, and Kubric-NK, a new benchmark for evaluating optical flow methods across resolutions from 1K to 8K, achieving state-of-the-art results on multiple benchmarks.

Authors:Henrique Morimitsu, Xiaobin Zhu, Roberto M. Cesar, Xiangyang Ji, Xu-Cheng Yin
Title: DPFlow: Adaptive Optical Flow Estimation with a Dual-Pyramid Framework
Abstract:
Optical flow estimation is essential for video processing tasks, such as restoration and action recognition. The quality of videos is constantly increasing, with current standards reaching 8K resolution. However, optical flow methods are usually designed for low resolution and do not generalize to large inputs due to their rigid architectures. They adopt downscaling or input tiling to reduce the input size, causing a loss of details and global information. There is also a lack of optical flow benchmarks to judge the actual performance of existing methods on high-resolution samples. Previous works only conducted qualitative high-resolution evaluations on hand-picked samples. This paper fills this gap in optical flow estimation in two ways. We propose DPFlow, an adaptive optical flow architecture capable of generalizing up to 8K resolution inputs while trained with only low-resolution samples. We also introduce Kubric-NK, a new benchmark for evaluating optical flow methods with input resolutions ranging from 1K to 8K. Our high-resolution evaluation pushes the boundaries of existing methods and reveals new insights about their generalization capabilities. Extensive experimental results show that DPFlow achieves state-of-the-art results on the MPI-Sintel, KITTI 2015, Spring, and other high-resolution benchmarks.
中文: 本文提出了DPFlow,一种仅用低分辨率样本训练即可泛化至8K输入的自适应光流架构,并引入Kubric-NK新基准来评估1K至8K分辨率的光流方法,在多个基准测试中取得了最优结果。
English: This paper introduces DPFlow, an adaptive optical flow architecture that generalizes to 8K resolution using only low-resolution training data, and Kubric-NK, a new benchmark for evaluating optical flow methods across resolutions from 1K to 8K, achieving state-of-the-art results on multiple benchmarks.

Authors:Yuhang Liu, Wenjie Zhao, Yunhui Guo
Title: H2ST: Hierarchical Two-Sample Tests for Continual Out-of-Distribution Detection
Abstract:
Task Incremental Learning (TIL) is a specialized form of Continual Learning (CL) in which a model incrementally learns from non-stationary data streams. Existing TIL methodologies operate under the closed-world assumption, presuming that incoming data remains in-distribution (ID). However, in an open-world setting, incoming samples may originate from out-of-distribution (OOD) sources, with their task identities inherently unknown. Continually detecting OOD samples presents several challenges for current OOD detection methods: reliance on model outputs leads to excessive dependence on model performance, selecting suitable thresholds is difficult, hindering real-world deployment, and binary ID/OOD classification fails to provide task-level identification. To address these issues, we propose a novel continual OOD detection method called the Hierarchical Two-sample Tests (H2ST). H2ST eliminates the need for threshold selection through hypothesis testing and utilizes feature maps to better exploit model capabilities without excessive dependence on model performance. The proposed hierarchical architecture enables task-level detection with superior performance and lower overhead compared to non-hierarchical classifier two-sample tests. Extensive experiments and analysis validate the effectiveness of H2ST in open-world TIL scenarios and its superiority to the existing methods. Code is available at \href{https://github.com/YuhangLiuu/H2ST}{https://github.com/YuhangLiuu/H2ST}.
中文: 本文提出分层双样本检验(H2ST)这一新型持续离群检测方法,通过假设检验避免阈值选择,利用特征映射实现任务级识别,在开放世界任务增量学习场景中展现出优于现有方法的性能与更低开销。
English: This paper introduces Hierarchical Two-sample Tests (H2ST), a novel continual out-of-distribution detection method that addresses challenges in open-world Task Incremental Learning by eliminating threshold selection through hypothesis testing and enabling task-level identification with superior performance and lower overhead.

Authors:Jeff Jewett, Sandhya Saisubramanian
Title: Learning with Expert Abstractions for Efficient Multi-Task Continuous Control
Abstract:
Decision-making in complex, continuous multi-task environments is often hindered by the difficulty of obtaining accurate models for planning and the inefficiency of learning purely from trial and error. While precise environment dynamics may be hard to specify, human experts can often provide high-fidelity abstractions that capture the essential high-level structure of a task and user preferences in the target environment. Existing hierarchical approaches often target discrete settings and do not generalize across tasks. We propose a hierarchical reinforcement learning approach that addresses these limitations by dynamically planning over the expert-specified abstraction to generate subgoals to learn a goal-conditioned policy. To overcome the challenges of learning under sparse rewards, we shape the reward based on the optimal state value in the abstract model. This structured decision-making process enhances sample efficiency and facilitates zero-shot generalization. Our empirical evaluation on a suite of procedurally generated continuous control environments demonstrates that our approach outperforms existing hierarchical reinforcement learning methods in terms of sample efficiency, task completion rate, scalability to complex tasks, and generalization to novel scenarios.
中文摘要:本文提出一种分层强化学习方法,利用专家提供的抽象模型规划子目标并重塑奖励机制,在连续控制任务中显著提升样本效率并实现零样本泛化能力。
English Summary: This paper introduces a hierarchical reinforcement learning method that leverages expert-provided abstractions to plan subgoals and shape rewards, improving sample efficiency and enabling zero-shot generalization in continuous control tasks.

Authors:Fatemeh Dehrouyeh, Ibrahim Shaer, Soodeh Nikan, Firouz Badrkhani Ajaei, Abdallah Shami
Title: Pruning-Based TinyML Optimization of Machine Learning Models for Anomaly Detection in Electric Vehicle Charging Infrastructure
Abstract:
With the growing need for real-time processing on IoT devices, optimizing machine learning (ML) models' size, latency, and computational efficiency is essential. This paper investigates a pruning method for anomaly detection in resource-constrained environments, specifically targeting Electric Vehicle Charging Infrastructure (EVCI). Using the CICEVSE2024 dataset, we trained and optimized three models-Multi-Layer Perceptron (MLP), Long Short-Term Memory (LSTM), and XGBoost-through hyperparameter tuning with Optuna, further refining them using SHapley Additive exPlanations (SHAP)-based feature selection (FS) and unstructured pruning techniques. The optimized models achieved significant reductions in model size and inference times, with only a marginal impact on their performance. Notably, our findings indicate that, in the context of EVCI, pruning and FS can enhance computational efficiency while retaining critical anomaly detection capabilities.
中文: 本研究通过采用基于SHAP的特征选择和剪枝技术,开发了针对电动汽车充电基础设施异常检测的优化机器学习模型,在性能损失极小的前提下显著减小了模型规模并缩短了推理时间。
English: This study develops optimized machine learning models for anomaly detection in Electric Vehicle Charging Infrastructure by employing SHAP-based feature selection and pruning techniques, achieving substantial reductions in model size and inference time with minimal performance loss.

Authors:Jake Fawkes, Michael O'Riordan, Athanasios Vlontzos, Oriol Corcoll, Ciarán Mark Gilligan-Lee
Title: The Hardness of Validating Observational Studies with Experimental Data
Abstract:
Observational data is often readily available in large quantities, but can lead to biased causal effect estimates due to the presence of unobserved confounding. Recent works attempt to remove this bias by supplementing observational data with experimental data, which, when available, is typically on a smaller scale due to the time and cost involved in running a randomised controlled trial. In this work, we prove a theorem that places fundamental limits on this ``best of both worlds'' approach. Using the framework of impossible inference, we show that although it is possible to use experimental data to \emph{falsify} causal effect estimates from observational data, in general it is not possible to \emph{validate} such estimates. Our theorem proves that while experimental data can be used to detect bias in observational studies, without additional assumptions on the smoothness of the correction function, it can not be used to remove it. We provide a practical example of such an assumption, developing a novel Gaussian Process based approach to construct intervals which contain the true treatment effect with high probability, both inside and outside of the support of the experimental data. We demonstrate our methodology on both simulated and semi-synthetic datasets and make the \href{https://github.com/Jakefawkes/Obs_and_exp_data}{code available}.
中文: 本研究证明,尽管实验数据能够识别观察性因果估计中的偏差,但若无额外假设通常无法验证或修正这些估计,并提出了一种高斯过程方法来高概率地估计处理效应。
English: This study demonstrates that while experimental data can identify bias in observational causal estimates, it generally cannot validate or correct them without additional assumptions, proposing a Gaussian Process method to estimate treatment effects with high probability.

Authors:Akram Khatami-Rizi, Ahmad Mahmoudi-Aznaveh
Title: Involution and BSConv Multi-Depth Distillation Network for Lightweight Image Super-Resolution
Abstract:
Single-image super-resolution (SISR) is a fundamental problem in computer vision that aims to reconstruct high-resolution (HR) images from low-resolution (LR) inputs. Although convolutional neural networks (CNNs) have achieved substantial advancements, deeper architectures often introduce excessive parameters, higher memory usage, and computational cost, limiting their applicability on resource-constrained devices. Recent research has thus focused on lightweight architectures that preserve accuracy while reducing complexity. This paper presents the Involution and BSConv Multi-Depth Distillation Network (IBMDN), a lightweight and effective architecture for SISR. The proposed IBMDN comprises Involution and BSConv Multi-Depth Distillation Blocks (IBMDB) and a Contrast and High-Frequency Attention Block (CHFAB). IBMDB employs varying combinations of Involution and BSConv at multiple depths to perform efficient feature extraction while minimizing computational complexity. CHFAB, a lightweight self-attention mechanism, focuses on extracting high-frequency and contrast information to enhance perceptual quality in the reconstructed images. The flexible design of IBMDB enables it to be seamlessly integrated into diverse SISR frameworks, including information distillation, transformer-based, and GAN-based models. Extensive experiments demonstrate that incorporating IBMDB significantly reduces memory usage, parameters, and floating-point operations (FLOPs), while achieving improvements in both pixel-wise accuracy and visual quality. The source code is available at: https://github.com/akramkhatami/IBMDN.
中文: 本文提出IBMDN轻量级单图像超分辨率网络,通过多深度蒸馏块和注意力机制在降低计算资源消耗的同时,有效提升重建图像的像素精度与视觉质量。
English: The paper introduces IBMDN, a lightweight single-image super-resolution network that uses innovative blocks to reduce computational costs while enhancing image quality through efficient feature extraction and attention mechanisms.

Authors:Sebastian Zhao, Alan Zhu, Hussein Mozannar, David Sontag, Ameet Talwalkar, Valerie Chen
Title: CodingGenie: A Proactive LLM-Powered Programming Assistant
Abstract:
While developers increasingly adopt tools powered by large language models (LLMs) in day-to-day workflows, these tools still require explicit user invocation. To seamlessly integrate LLM capabilities to a developer's workflow, we introduce CodingGenie, a proactive assistant integrated into the code editor. CodingGenie autonomously provides suggestions, ranging from bug fixing to unit testing, based on the current code context and allows users to customize suggestions by providing a task description and selecting what suggestions are shown. We demonstrate multiple use cases to show how proactive suggestions from CodingGenie can improve developer experience, and also analyze the cost of adding proactivity. We believe this open-source tool will enable further research into proactive assistants. CodingGenie is open-sourced at https://github.com/sebzhao/CodingGenie/ and video demos are available at https://sebzhao.github.io/CodingGenie/.
中文: CodingGenie是一款集成在代码编辑器中的主动式编程助手,能根据当前代码上下文自主提供从错误修复到单元测试等建议,并通过开源方式推动主动辅助工具的深入研究。
English: CodingGenie is a proactive coding assistant that autonomously provides contextual suggestions like bug fixes and unit tests within code editors, enhancing developer workflows through customizable, open-source integration.

Authors:Chen Gong, Kecen Li, Zinan Lin, Tianhao Wang
Title: DPImageBench: A Unified Benchmark for Differentially Private Image Synthesis
Abstract:
Differentially private (DP) image synthesis aims to generate artificial images that retain the properties of sensitive images while protecting the privacy of individual images within the dataset. Despite recent advancements, we find that inconsistent--and sometimes flawed--evaluation protocols have been applied across studies. This not only impedes the understanding of current methods but also hinders future advancements. To address the issue, this paper introduces DPImageBench for DP image synthesis, with thoughtful design across several dimensions: (1) Methods. We study eleven prominent methods and systematically characterize each based on model architecture, pretraining strategy, and privacy mechanism. (2) Evaluation. We include nine datasets and seven fidelity and utility metrics to thoroughly assess them. Notably, we find that a common practice of selecting downstream classifiers based on the highest accuracy on the sensitive test set not only violates DP but also overestimates the utility scores. DPImageBench corrects for these mistakes. (3) Platform. Despite the methods and evaluation protocols, DPImageBench provides a standardized interface that accommodates current and future implementations within a unified framework. With DPImageBench, we have several noteworthy findings. For example, contrary to the common wisdom that pretraining on public image datasets is usually beneficial, we find that the distributional similarity between pretraining and sensitive images significantly impacts the performance of the synthetic images and does not always yield improvements. In addition, adding noise to low-dimensional features, such as the high-level characteristics of sensitive images, is less affected by the privacy budget compared to adding noise to high-dimensional features, like weight gradients. The former methods perform better than the latter under a low privacy budget.
中文摘要:本文提出DPImageBench基准测试,旨在解决差分隐私图像合成中评估标准不一致的问题,通过系统评估方法、数据集和指标,揭示了预训练策略与噪声添加技术的重要发现。
English Summary: This paper introduces DPImageBench, a comprehensive benchmark addressing inconsistent evaluation protocols in differentially private image synthesis, which systematically evaluates methods, datasets, and metrics while revealing key insights about pretraining strategies and noise addition techniques.

Authors:Chen Gong, Kecen Li, Zinan Lin, Tianhao Wang
Title: DPImageBench: A Unified Benchmark for Differentially Private Image Synthesis
Abstract:
Differentially private (DP) image synthesis aims to generate artificial images that retain the properties of sensitive images while protecting the privacy of individual images within the dataset. Despite recent advancements, we find that inconsistent--and sometimes flawed--evaluation protocols have been applied across studies. This not only impedes the understanding of current methods but also hinders future advancements. To address the issue, this paper introduces DPImageBench for DP image synthesis, with thoughtful design across several dimensions: (1) Methods. We study eleven prominent methods and systematically characterize each based on model architecture, pretraining strategy, and privacy mechanism. (2) Evaluation. We include nine datasets and seven fidelity and utility metrics to thoroughly assess them. Notably, we find that a common practice of selecting downstream classifiers based on the highest accuracy on the sensitive test set not only violates DP but also overestimates the utility scores. DPImageBench corrects for these mistakes. (3) Platform. Despite the methods and evaluation protocols, DPImageBench provides a standardized interface that accommodates current and future implementations within a unified framework. With DPImageBench, we have several noteworthy findings. For example, contrary to the common wisdom that pretraining on public image datasets is usually beneficial, we find that the distributional similarity between pretraining and sensitive images significantly impacts the performance of the synthetic images and does not always yield improvements. In addition, adding noise to low-dimensional features, such as the high-level characteristics of sensitive images, is less affected by the privacy budget compared to adding noise to high-dimensional features, like weight gradients. The former methods perform better than the latter under a low privacy budget.
中文摘要:本文提出DPImageBench基准测试,旨在解决差分隐私图像合成中评估标准不一致的问题,通过系统评估方法、数据集和指标,揭示了预训练策略与噪声添加技术的重要发现。
English Summary: This paper introduces DPImageBench, a comprehensive benchmark addressing inconsistent evaluation protocols in differentially private image synthesis, which systematically evaluates methods, datasets, and metrics while revealing key insights about pretraining strategies and noise addition techniques.

Authors:Yicheng Fu, Zikui Wang, Liuxin Yang, Meiqing Huo, Zhongdongming Dai
Title: ConQuer: A Framework for Concept-Based Quiz Generation
Abstract:
Quizzes play a crucial role in education by reinforcing students' understanding of key concepts and encouraging self-directed exploration. However, compiling high-quality quizzes can be challenging and require deep expertise and insight into specific subject matter. Although LLMs have greatly enhanced the efficiency of quiz generation, concerns remain regarding the quality of these AI-generated quizzes and their educational impact on students. To address these issues, we introduce ConQuer, a concept-based quiz generation framework that leverages external knowledge sources. We employ comprehensive evaluation dimensions to assess the quality of the generated quizzes, using LLMs as judges. Our experiment results demonstrate a 4.8% improvement in evaluation scores and a 77.52% win rate in pairwise comparisons against baseline quiz sets. Ablation studies further underscore the effectiveness of each component in our framework. Code available at https://github.com/sofyc/ConQuer.
中文: ConQuer是一个基于概念的测验生成框架,通过整合外部知识提升AI生成测验的质量,在评估分数和成对比较中均取得显著提升。
English: ConQuer is a concept-based quiz generation framework that enhances AI-generated quiz quality by integrating external knowledge, achieving significant improvements in evaluation scores and pairwise comparisons.

Authors:Yi Liao, Yongsheng Gao, Weichuan Zhang
Title: Dynamic Accumulated Attention Map for Interpreting Evolution of Decision-Making in Vision Transformer
Abstract:
Various Vision Transformer (ViT) models have been widely used for image recognition tasks. However, existing visual explanation methods can not display the attention flow hidden inside the inner structure of ViT models, which explains how the final attention regions are formed inside a ViT for its decision-making. In this paper, a novel visual explanation approach, Dynamic Accumulated Attention Map (DAAM), is proposed to provide a tool that can visualize, for the first time, the attention flow from the top to the bottom through ViT networks. To this end, a novel decomposition module is proposed to construct and store the spatial feature information by unlocking the [class] token generated by the self-attention module of each ViT block. The module can also obtain the channel importance coefficients by decomposing the classification score for supervised ViT models. Because of the lack of classification score in self-supervised ViT models, we propose dimension-wise importance weights to compute the channel importance coefficients. Such spatial features are linearly combined with the corresponding channel importance coefficients, forming the attention map for each block. The dynamic attention flow is revealed by block-wisely accumulating each attention map. The contribution of this work focuses on visualizing the evolution dynamic of the decision-making attention for any intermediate block inside a ViT model by proposing a novel decomposition module and dimension-wise importance weights. The quantitative and qualitative analysis consistently validate the effectiveness and superior capacity of the proposed DAAM for not only interpreting ViT models with the fully-connected layers as the classifier but also self-supervised ViT models. The code is available at https://github.com/ly9802/DynamicAccumulatedAttentionMap.
中文: 本文提出了一种名为DAAM的新型视觉解释方法,通过分解空间特征和计算通道重要性系数,动态可视化Vision Transformer模型内部的注意力流动,从而能够解释监督式和非监督式ViT模型的决策过程。
English: The paper introduces DAAM, a novel visual explanation method that dynamically visualizes attention flow within Vision Transformer models by decomposing spatial features and computing channel importance coefficients, enabling interpretation of both supervised and self-supervised ViTs.

Authors:Merkourios Simos, Alberto Silvio Chiappa, Alexander Mathis
Title: Reinforcement learning-based motion imitation for physiologically plausible musculoskeletal motor control
Abstract:
How do humans move? The quest to understand human motion has broad applications in numerous fields, ranging from computer animation and motion synthesis to neuroscience, human prosthetics and rehabilitation. Although advances in reinforcement learning (RL) have produced impressive results in capturing human motion using simplified humanoids, controlling physiologically accurate models of the body remains an open challenge. In this work, we present a model-free motion imitation framework (KINESIS) to advance the understanding of muscle-based motor control. Using a musculoskeletal model of the lower body with 80 muscle actuators and 20 DoF, we demonstrate that KINESIS achieves strong imitation performance on 1.9 hours of motion capture data, is controllable by natural language through pre-trained text-to-motion generative models, and can be fine-tuned to carry out high-level tasks such as target goal reaching. Importantly, KINESIS generates muscle activity patterns that correlate well with human EMG activity. The physiological plausibility makes KINESIS a promising model for tackling challenging problems in human motor control theory, which we highlight by investigating Bernstein's redundancy problem in the context of locomotion. Code, videos and benchmarks will be available at https://github.com/amathislab/Kinesis.
中文: 本研究提出的KINESIS无模型框架通过精确的肌肉骨骼系统成功模拟人体运动,不仅能通过自然语言控制生成与肌电信号高度吻合的肌肉活动模式,还为探索运动控制理论提供了新途径。
English: This study introduces KINESIS, a model-free framework that effectively imitates human motion using a detailed musculoskeletal model, demonstrates controllability via natural language, and generates physiologically plausible muscle activity patterns for advancing motor control research.

Authors:Shuo Xing, Zezhou Sun, Shuangyu Xie, Kaiyuan Chen, Yanjia Huang, Yuping Wang, Jiachen Li, Dezhen Song, Zhengzhong Tu
Title: Can Large Vision Language Models Read Maps Like a Human?
Abstract:
In this paper, we introduce MapBench-the first dataset specifically designed for human-readable, pixel-based map-based outdoor navigation, curated from complex path finding scenarios. MapBench comprises over 1600 pixel space map path finding problems from 100 diverse maps. In MapBench, LVLMs generate language-based navigation instructions given a map image and a query with beginning and end landmarks. For each map, MapBench provides Map Space Scene Graph (MSSG) as an indexing data structure to convert between natural language and evaluate LVLM-generated results. We demonstrate that MapBench significantly challenges state-of-the-art LVLMs both zero-shot prompting and a Chain-of-Thought (CoT) augmented reasoning framework that decomposes map navigation into sequential cognitive processes. Our evaluation of both open-source and closed-source LVLMs underscores the substantial difficulty posed by MapBench, revealing critical limitations in their spatial reasoning and structured decision-making capabilities. We release all the code and dataset in https://github.com/taco-group/MapBench.
中文:MapBench是首个专为人类可读地图导航设计的数据集,包含1600多个路径规划问题,通过直接和结构化提示方法显著挑战了大型视觉语言模型的空间推理能力。
English: MapBench is the first dataset for human-readable map navigation, featuring over 1600 pathfinding problems that challenge LVLMs' spatial reasoning through both direct and structured prompting methods.

Authors:Sara Sarto, Marcella Cornia, Rita Cucchiara
Title: Image Captioning Evaluation in the Age of Multimodal LLMs: Challenges and Future Perspectives
Abstract:
The evaluation of machine-generated image captions is a complex and evolving challenge. With the advent of Multimodal Large Language Models (MLLMs), image captioning has become a core task, increasing the need for robust and reliable evaluation metrics. This survey provides a comprehensive overview of advancements in image captioning evaluation, analyzing the evolution, strengths, and limitations of existing metrics. We assess these metrics across multiple dimensions, including correlation with human judgment, ranking accuracy, and sensitivity to hallucinations. Additionally, we explore the challenges posed by the longer and more detailed captions generated by MLLMs and examine the adaptability of current metrics to these stylistic variations. Our analysis highlights some limitations of standard evaluation approaches and suggests promising directions for future research in image captioning assessment.
中文: 本文综述了图像描述评估指标的发展历程与局限性,重点分析了多模态大语言模型生成详细描述带来的挑战,并提出了未来研究方向。
English: This survey comprehensively reviews the evolution and limitations of image captioning evaluation metrics, emphasizing the challenges posed by MLLMs' detailed outputs and proposing future research directions.

Authors:Justus Westerhoff, Golzar Atefi, Mario Koddenbrock, Alexei Figueroa, Alexander Löser, Erik Rodner, Felix A. Gers
Title: Robust Weight Imprinting: Insights from Neural Collapse and Proxy-Based Aggregation
Abstract:
The capacity of a foundation model allows for adaptation to new downstream tasks. Weight imprinting is a universal and efficient method to fulfill this purpose. It has been reinvented several times, but it has not been systematically studied. In this paper, we propose a framework for imprinting, identifying three main components: generation, normalization, and aggregation. This allows us to conduct an in-depth analysis of imprinting and a comparison of the existing work. We reveal the benefits of representing novel data with multiple proxies in the generation step and show the importance of proper normalization. We determine proxies through clustering and propose a novel variant of imprinting that outperforms previous work. We motivate this by the neural collapse phenomenon -- an important connection that we can draw for the first time. Our results show an increase of up to 4\% in challenging scenarios with complex data distributions for new classes. Finally, we publicly release our code at https://github.com/DATEXIS/multi-imprinting/.
基础模型能够通过权重印记高效适应新任务,本文系统分析了该方法并提出了一个框架,揭示了使用多个代理和适当归一化的优势,在复杂场景下性能提升高达4%。
Foundation models can efficiently adapt to new tasks through weight imprinting, a method systematically analyzed in this paper, which introduces a framework revealing the benefits of multiple proxies and proper normalization, achieving up to 4% improvement in complex scenarios.

Authors:Guowei Wang, Changxing Ding
Title: Effortless Active Labeling for Long-Term Test-Time Adaptation
Abstract:
Long-term test-time adaptation (TTA) is a challenging task due to error accumulation. Recent approaches tackle this issue by actively labeling a small proportion of samples in each batch, yet the annotation burden quickly grows as the batch number increases. In this paper, we investigate how to achieve effortless active labeling so that a maximum of one sample is selected for annotation in each batch. First, we annotate the most valuable sample in each batch based on the single-step optimization perspective in the TTA context. In this scenario, the samples that border between the source- and target-domain data distributions are considered the most feasible for the model to learn in one iteration. Then, we introduce an efficient strategy to identify these samples using feature perturbation. Second, we discover that the gradient magnitudes produced by the annotated and unannotated samples have significant variations. Therefore, we propose balancing their impact on model optimization using two dynamic weights. Extensive experiments on the popular ImageNet-C, -R, -K, -A and PACS databases demonstrate that our approach consistently outperforms state-of-the-art methods with significantly lower annotation costs.
中文: 本文提出了一种轻松主动标注策略,通过特征扰动识别每个批次中最有价值的样本进行标注,并利用动态权重平衡标注与未标注样本对模型优化的影响,在多个数据集上以极低标注成本实现了优于现有方法的性能。
English: This paper introduces an effortless active labeling strategy for long-term test-time adaptation that selects at most one sample per batch for annotation based on feature perturbation and balances optimization impact with dynamic weights, achieving superior performance with minimal annotation costs across multiple datasets.

Authors:Arjun V Sudhakar, Hadi Nekoei, Mathieu Reymond, Miao Liu, Janarthanan Rajendran, Sarath Chandar
Title: A Generalist Hanabi Agent
Abstract:
Traditional multi-agent reinforcement learning (MARL) systems can develop cooperative strategies through repeated interactions. However, these systems are unable to perform well on any other setting than the one they have been trained on, and struggle to successfully cooperate with unfamiliar collaborators. This is particularly visible in the Hanabi benchmark, a popular 2-to-5 player cooperative card-game which requires complex reasoning and precise assistance to other agents. Current MARL agents for Hanabi can only learn one specific game-setting (e.g., 2-player games), and play with the same algorithmic agents. This is in stark contrast to humans, who can quickly adjust their strategies to work with unfamiliar partners or situations. In this paper, we introduce Recurrent Replay Relevance Distributed DQN (R3D2), a generalist agent for Hanabi, designed to overcome these limitations. We reformulate the task using text, as language has been shown to improve transfer. We then propose a distributed MARL algorithm that copes with the resulting dynamic observation- and action-space. In doing so, our agent is the first that can play all game settings concurrently, and extend strategies learned from one setting to other ones. As a consequence, our agent also demonstrates the ability to collaborate with different algorithmic agents -- agents that are themselves unable to do so. The implementation code is available at: $\href{https://github.com/chandar-lab/R3D2-A-Generalist-Hanabi-Agent}{R3D2-A-Generalist-Hanabi-Agent}$
中文: 传统多智能体强化学习系统难以适应新环境和陌生合作者,而提出的R3D2智能体通过文本重构和分布式算法,实现了全游戏模式通用并能与不同算法智能体成功协作。
English: Traditional multi-agent reinforcement learning systems lack adaptability to new settings and collaborators, but the proposed R3D2 agent overcomes this by using text-based reformulation and a distributed algorithm to play all game settings and cooperate with diverse agents.

Authors:Kasra Borazjani, Payam Abdisarabshali, Naji Khosravan, Seyyedali Hosseinalipour
Title: Redefining non-IID Data in Federated Learning for Computer Vision Tasks: Migrating from Labels to Embeddings for Task-Specific Data Distributions
Abstract:
Federated Learning (FL) represents a paradigm shift in distributed machine learning (ML), enabling clients to train models collaboratively while keeping their raw data private. This paradigm shift from traditional centralized ML introduces challenges due to the non-iid (non-independent and identically distributed) nature of data across clients, significantly impacting FL's performance. Existing literature, predominantly model data heterogeneity by imposing label distribution skew across clients. In this paper, we show that label distribution skew fails to fully capture the real-world data heterogeneity among clients in computer vision tasks beyond classification. Subsequently, we demonstrate that current approaches overestimate FL's performance by relying on label/class distribution skew, exposing an overlooked gap in the literature. By utilizing pre-trained deep neural networks to extract task-specific data embeddings, we define task-specific data heterogeneity through the lens of each vision task and introduce a new level of data heterogeneity called embedding-based data heterogeneity. Our methodology involves clustering data points based on embeddings and distributing them among clients using the Dirichlet distribution. Through extensive experiments, we evaluate the performance of different FL methods under our revamped notion of data heterogeneity, introducing new benchmark performance measures to the literature. We further unveil a series of open research directions that can be pursued.
中文摘要:本文挑战了联邦学习中标签分布偏斜在捕捉计算机视觉任务中真实数据异质性方面的不足,提出了基于嵌入的异质性概念和新基准,揭示了现有文献对联邦学习性能的高估。
English Summary: This paper challenges the adequacy of label distribution skew for capturing real-world data heterogeneity in federated learning for computer vision tasks, proposing embedding-based heterogeneity and new benchmarks that reveal performance overestimations in existing literature.

Authors:Huaqiu Li, Xiaowan Hu, Haoqian Wang
Title: Interpretable Unsupervised Joint Denoising and Enhancement for Real-World low-light Scenarios
Abstract:
Real-world low-light images often suffer from complex degradations such as local overexposure, low brightness, noise, and uneven illumination. Supervised methods tend to overfit to specific scenarios, while unsupervised methods, though better at generalization, struggle to model these degradations due to the lack of reference images. To address this issue, we propose an interpretable, zero-reference joint denoising and low-light enhancement framework tailored for real-world scenarios. Our method derives a training strategy based on paired sub-images with varying illumination and noise levels, grounded in physical imaging principles and retinex theory. Additionally, we leverage the Discrete Cosine Transform (DCT) to perform frequency domain decomposition in the sRGB space, and introduce an implicit-guided hybrid representation strategy that effectively separates intricate compounded degradations. In the backbone network design, we develop retinal decomposition network guided by implicit degradation representation mechanisms. Extensive experiments demonstrate the superiority of our method. Code will be available at https://github.com/huaqlili/unsupervised-light-enhance-ICLR2025.
中文: 本文提出了一种无需参考图像的无监督、可解释性低光增强与去噪联合框架,通过频域分解和隐式引导表示策略,有效解决真实场景中复杂的图像退化问题。
English: This paper introduces an unsupervised, interpretable framework for joint denoising and low-light enhancement in real-world images, utilizing frequency domain decomposition and implicit-guided representation to effectively address complex degradations without reference images.

Authors:Tao Yu, Yi-Fan Zhang, Chaoyou Fu, Junkang Wu, Jinda Lu, Kun Wang, Xingyu Lu, Yunhang Shen, Guibin Zhang, Dingjie Song, Yibo Yan, Tianlong Xu, Qingsong Wen, Zhang Zhang, Yan Huang, Liang Wang, Tieniu Tan
Title: Aligning Multimodal LLM with Human Preference: A Survey
Abstract:
Large language models (LLMs) can handle a wide variety of general tasks with simple prompts, without the need for task-specific training. Multimodal Large Language Models (MLLMs), built upon LLMs, have demonstrated impressive potential in tackling complex tasks involving visual, auditory, and textual data. However, critical issues related to truthfulness, safety, o1-like reasoning, and alignment with human preference remain insufficiently addressed. This gap has spurred the emergence of various alignment algorithms, each targeting different application scenarios and optimization goals. Recent studies have shown that alignment algorithms are a powerful approach to resolving the aforementioned challenges. In this paper, we aim to provide a comprehensive and systematic review of alignment algorithms for MLLMs. Specifically, we explore four key aspects: (1) the application scenarios covered by alignment algorithms, including general image understanding, multi-image, video, and audio, and extended multimodal applications; (2) the core factors in constructing alignment datasets, including data sources, model responses, and preference annotations; (3) the benchmarks used to evaluate alignment algorithms; and (4) a discussion of potential future directions for the development of alignment algorithms. This work seeks to help researchers organize current advancements in the field and inspire better alignment methods. The project page of this paper is available at https://github.com/BradyFU/Awesome-Multimodal-Large-Language-Models/tree/Alignment.
中文: 本文系统综述了多模态大语言模型的校准算法,涵盖其应用场景、数据集构建、评估基准及未来方向,旨在提升模型的真实性和安全性。
English: This paper provides a comprehensive review of alignment algorithms for multimodal large language models, addressing their applications, dataset construction, evaluation benchmarks, and future directions to enhance truthfulness and safety.

Authors:Ayesha Ishaq, Jean Lahoud, Fahad Shahbaz Khan, Salman Khan, Hisham Cholakkal, Rao Muhammad Anwer
Title: Tracking Meets Large Multimodal Models for Driving Scenario Understanding
Abstract:
Large Multimodal Models (LMMs) have recently gained prominence in autonomous driving research, showcasing promising capabilities across various emerging benchmarks. LMMs specifically designed for this domain have demonstrated effective perception, planning, and prediction skills. However, many of these methods underutilize 3D spatial and temporal elements, relying mainly on image data. As a result, their effectiveness in dynamic driving environments is limited. We propose to integrate tracking information as an additional input to recover 3D spatial and temporal details that are not effectively captured in the images. We introduce a novel approach for embedding this tracking information into LMMs to enhance their spatiotemporal understanding of driving scenarios. By incorporating 3D tracking data through a track encoder, we enrich visual queries with crucial spatial and temporal cues while avoiding the computational overhead associated with processing lengthy video sequences or extensive 3D inputs. Moreover, we employ a self-supervised approach to pretrain the tracking encoder to provide LMMs with additional contextual information, significantly improving their performance in perception, planning, and prediction tasks for autonomous driving. Experimental results demonstrate the effectiveness of our approach, with a gain of 9.5% in accuracy, an increase of 7.04 points in the ChatGPT score, and 9.4% increase in the overall score over baseline models on DriveLM-nuScenes benchmark, along with a 3.7% final score improvement on DriveLM-CARLA. Our code is available at https://github.com/mbzuai-oryx/TrackingMeetsLMM
中文摘要:大型多模态模型在自动驾驶中常忽略三维时空信息,为此我们通过创新的跟踪编码器整合追踪数据,显著提升了多个基准测试中的性能表现。
English Summary: Large Multimodal Models in autonomous driving often underutilize 3D spatiotemporal data, so we enhance them by integrating tracking information through a novel encoder, achieving significant performance improvements across benchmarks.

Authors:Jiacheng Guo, Yue Wu, Jiahao Qiu, Kaixuan Huang, Xinzhe Juan, Ling Yang, Mengdi Wang
Title: Temporal Consistency for LLM Reasoning Process Error Identification
Abstract:
Verification is crucial for effective mathematical reasoning. We present a new temporal consistency method where verifiers iteratively refine their judgments based on the previous assessment. Unlike one-round verification or multi-model debate approaches, our method leverages consistency in a sequence of self-reflection actions to improve verification accuracy. Empirical evaluations across diverse mathematical process error identification benchmarks (Mathcheck, ProcessBench, and PRM800K) show consistent performance improvements over baseline methods. When applied to the recent DeepSeek R1 distilled models, our method demonstrates strong performance, enabling 7B/8B distilled models to outperform all 70B/72B models and GPT-4o on ProcessBench. Notably, the distilled 14B model with our method achieves performance comparable to Deepseek-R1. Our codes are available at https://github.com/jcguo123/Temporal-Consistency
中文: 本文提出了一种时序一致性方法,通过迭代性自我反思提升数学推理验证效果,在多个基准测试中表现优异,使小型蒸馏模型性能超越包括GPT-4o在内的大型模型。
English: This paper introduces a temporal consistency method that enhances mathematical reasoning verification through iterative self-reflection, achieving superior performance on multiple benchmarks and enabling smaller distilled models to outperform larger ones, including GPT-4o.

Authors:NVIDIA, :, Hassan Abu Alhaija, Jose Alvarez, Maciej Bala, Tiffany Cai, Tianshi Cao, Liz Cha, Joshua Chen, Mike Chen, Francesco Ferroni, Sanja Fidler, Dieter Fox, Yunhao Ge, Jinwei Gu, Ali Hassani, Michael Isaev, Pooya Jannaty, Shiyi Lan, Tobias Lasser, Huan Ling, Ming-Yu Liu, Xian Liu, Yifan Lu, Alice Luo, Qianli Ma, Hanzi Mao, Fabio Ramos, Xuanchi Ren, Tianchang Shen, Xinglong Sun, Shitao Tang, Ting-Chun Wang, Jay Wu, Jiashu Xu, Stella Xu, Kevin Xie, Yuchong Ye, Xiaodong Yang, Xiaohui Zeng, Yu Zeng
Title: Cosmos-Transfer1: Conditional World Generation with Adaptive Multimodal Control
Abstract:
We introduce Cosmos-Transfer, a conditional world generation model that can generate world simulations based on multiple spatial control inputs of various modalities such as segmentation, depth, and edge. In the design, the spatial conditional scheme is adaptive and customizable. It allows weighting different conditional inputs differently at different spatial locations. This enables highly controllable world generation and finds use in various world-to-world transfer use cases, including Sim2Real. We conduct extensive evaluations to analyze the proposed model and demonstrate its applications for Physical AI, including robotics Sim2Real and autonomous vehicle data enrichment. We further demonstrate an inference scaling strategy to achieve real-time world generation with an NVIDIA GB200 NVL72 rack. To help accelerate research development in the field, we open-source our models and code at https://github.com/nvidia-cosmos/cosmos-transfer1.
中文:Cosmos-Transfer 是一种条件性世界生成模型,通过可自定义的空间输入(如分割和深度)实现高度可控的模拟,应用于机器人和自动驾驶领域,并已开源供研究使用。
English: Cosmos-Transfer is a conditional world generation model that uses customizable spatial inputs like segmentation and depth to enable highly controllable simulations, with applications in robotics and autonomous vehicles, and it is open-sourced for research.

Authors:Fardin Saad, Pradeep K. Murukannaiah, Munindar P. Singh
Title: Gricean Norms as a Basis for Effective Collaboration
Abstract:
Effective human-AI collaboration hinges not only on the AI agent's ability to follow explicit instructions but also on its capacity to navigate ambiguity, incompleteness, invalidity, and irrelevance in communication. Gricean conversational and inference norms facilitate collaboration by aligning unclear instructions with cooperative principles. We propose a normative framework that integrates Gricean norms and cognitive frameworks -- common ground, relevance theory, and theory of mind -- into large language model (LLM) based agents. The normative framework adopts the Gricean maxims of quantity, quality, relation, and manner, along with inference, as Gricean norms to interpret unclear instructions, which are: ambiguous, incomplete, invalid, or irrelevant. Within this framework, we introduce Lamoids, GPT-4 powered agents designed to collaborate with humans. To assess the influence of Gricean norms in human-AI collaboration, we evaluate two versions of a Lamoid: one with norms and one without. In our experiments, a Lamoid collaborates with a human to achieve shared goals in a grid world (Doors, Keys, and Gems) by interpreting both clear and unclear natural language instructions. Our results reveal that the Lamoid with Gricean norms achieves higher task accuracy and generates clearer, more accurate, and contextually relevant responses than the Lamoid without norms. This improvement stems from the normative framework, which enhances the agent's pragmatic reasoning, fostering effective human-AI collaboration and enabling context-aware communication in LLM-based agents.
有效的人机协作需要AI代理运用格莱斯会话准则处理模糊指令,配备此规范的Lamoid代理在合作任务中展现出更高的准确性和语境适应性。
Effective human-AI collaboration requires AI agents to handle unclear instructions using Gricean conversational norms, as demonstrated by the improved performance of Lamoid agents equipped with these principles in collaborative tasks.

Authors:Xinyu Fang, Zhijian Chen, Kai Lan, Lixin Ma, Shengyuan Ding, Yingji Liang, Xiangyu Zhao, Farong Wen, Zicheng Zhang, Guofeng Zhang, Haodong Duan, Kai Chen, Dahua Lin
Title: Creation-MMBench: Assessing Context-Aware Creative Intelligence in MLLM
Abstract:
Creativity is a fundamental aspect of intelligence, involving the ability to generate novel and appropriate solutions across diverse contexts. While Large Language Models (LLMs) have been extensively evaluated for their creative capabilities, the assessment of Multimodal Large Language Models (MLLMs) in this domain remains largely unexplored. To address this gap, we introduce Creation-MMBench, a multimodal benchmark specifically designed to evaluate the creative capabilities of MLLMs in real-world, image-based tasks. The benchmark comprises 765 test cases spanning 51 fine-grained tasks. To ensure rigorous evaluation, we define instance-specific evaluation criteria for each test case, guiding the assessment of both general response quality and factual consistency with visual inputs. Experimental results reveal that current open-source MLLMs significantly underperform compared to proprietary models in creative tasks. Furthermore, our analysis demonstrates that visual fine-tuning can negatively impact the base LLM's creative abilities. Creation-MMBench provides valuable insights for advancing MLLM creativity and establishes a foundation for future improvements in multimodal generative intelligence. Full data and evaluation code is released on https://github.com/open-compass/Creation-MMBench.
Chinese: Creation-MMBench是一个专门评估多模态大语言模型在图像任务中创造力的新基准,研究发现开源模型表现远逊于专有模型,且视觉微调可能削弱模型的创新能力。
English: Creation-MMBench is a new multimodal benchmark designed to evaluate the creative capabilities of Multimodal Large Language Models (MLLMs) in image-based tasks, revealing that current open-source models significantly lag behind proprietary ones and that visual fine-tuning can impair creativity.

Authors:Bo Peng, Ruichong Zhang, Daniel Goldstein, Eric Alcaide, Xingjian Du, Haowen Hou, Jiaju Lin, Jiaxing Liu, Janna Lu, William Merrill, Guangyu Song, Kaifeng Tan, Saiteja Utpala, Nathan Wilce, Johan S. Wind, Tianyi Wu, Daniel Wuttke, Christian Zhou-Zheng
Title: RWKV-7 "Goose" with Expressive Dynamic State Evolution
Abstract:
We present RWKV-7 "Goose", a new sequence modeling architecture with constant memory usage and constant inference time per token. Despite being trained on dramatically fewer tokens than other top models, our 2.9 billion parameter language model achieves a new 3B SoTA on multilingual tasks and matches the current 3B SoTA on English language downstream performance. RWKV-7 introduces a newly generalized formulation of the delta rule with vector-valued gating and in-context learning rates, as well as a relaxed value replacement rule. We show that RWKV-7 can perform state tracking and recognize all regular languages, while retaining parallelizability of training. This exceeds the capabilities of Transformers under standard complexity conjectures, which are limited to $\mathsf{TC}^0$. To demonstrate RWKV-7's language modeling capability, we also present an extended open source 3.1 trillion token multilingual corpus, and train four RWKV-7 models ranging from 0.19 billion to 2.9 billion parameters on this dataset. To foster openness, reproduction, and adoption, we release our models and dataset component listing at https://huggingface.co/RWKV, and our training and inference code at https://github.com/RWKV/RWKV-LM all under the Apache 2.0 License.
中文: RWKV-7 "Goose" 是一种新型序列建模架构,尽管训练数据量较少,却能在多语言任务中达到顶尖性能,同时保持恒定内存和推理时间,并能执行状态跟踪和识别所有正则语言,超越了Transformer的固有局限。
English: RWKV-7 "Goose" is a novel sequence modeling architecture that achieves state-of-the-art performance in multilingual tasks with constant memory and inference time, despite training on fewer tokens, and demonstrates capabilities beyond Transformers by performing state tracking and recognizing all regular languages.

Authors:Aleksandra Eliseeva, Alexander Kovrigin, Ilia Kholkin, Egor Bogomolov, Yaroslav Zharov
Title: EnvBench: A Benchmark for Automated Environment Setup
Abstract:
Recent advances in Large Language Models (LLMs) have enabled researchers to focus on practical repository-level tasks in software engineering domain. In this work, we consider a cornerstone task for automating work with software repositories-environment setup, i.e., a task of configuring a repository-specific development environment on a system. Existing studies on environment setup introduce innovative agentic strategies, but their evaluation is often based on small datasets that may not capture the full range of configuration challenges encountered in practice. To address this gap, we introduce a comprehensive environment setup benchmark EnvBench. It encompasses 329 Python and 665 JVM-based (Java, Kotlin) repositories, with a focus on repositories that present genuine configuration challenges, excluding projects that can be fully configured by simple deterministic scripts. To enable further benchmark extension and usage for model tuning, we implement two automatic metrics: a static analysis check for missing imports in Python and a compilation check for JVM languages. We demonstrate the applicability of our benchmark by evaluating three environment setup approaches, including a simple zero-shot baseline and two agentic workflows, that we test with two powerful LLM backbones, GPT-4o and GPT-4o-mini. The best approach manages to successfully configure 6.69% repositories for Python and 29.47% repositories for JVM, suggesting that EnvBench remains challenging for current approaches. Our benchmark suite is publicly available at https://github.com/JetBrains-Research/EnvBench. The dataset and experiment trajectories are available at https://jb.gg/envbench.
中文: 本文提出了EnvBench这一全面评估软件仓库环境配置任务的基准,涵盖329个Python和665个基于JVM的具有挑战性配置的项目,实验表明现有方法仅能成功配置6.69%的Python项目和29.47%的JVM项目,凸显了该基准的难度。
English: This paper introduces EnvBench, a comprehensive benchmark for evaluating environment setup tasks in software repositories, covering 329 Python and 665 JVM-based projects with challenging configurations, and demonstrates its difficulty as current methods achieve only 6.69% success for Python and 29.47% for JVM repositories.

Authors:Vlad Hondru, Eduard Hogea, Darian Onchis, Radu Tudor Ionescu
Title: ExDDV: A New Dataset for Explainable Deepfake Detection in Video
Abstract:
The ever growing realism and quality of generated videos makes it increasingly harder for humans to spot deepfake content, who need to rely more and more on automatic deepfake detectors. However, deepfake detectors are also prone to errors, and their decisions are not explainable, leaving humans vulnerable to deepfake-based fraud and misinformation. To this end, we introduce ExDDV, the first dataset and benchmark for Explainable Deepfake Detection in Video. ExDDV comprises around 5.4K real and deepfake videos that are manually annotated with text descriptions (to explain the artifacts) and clicks (to point out the artifacts). We evaluate a number of vision-language models on ExDDV, performing experiments with various fine-tuning and in-context learning strategies. Our results show that text and click supervision are both required to develop robust explainable models for deepfake videos, which are able to localize and describe the observed artifacts. Our novel dataset and code to reproduce the results are available at https://github.com/vladhondru25/ExDDV.
Chinese: 随着深度伪造视频的真实感日益增强,人类和自动化检测系统都面临挑战,为此我们推出了首个可解释深度伪造检测数据集ExDDV,通过文本和点击标注提升模型对伪造痕迹的定位与描述能力,确保检测结果既可靠又可解释。
English: The increasing realism of deepfake videos challenges both human detection and automated systems, which often lack explainability, prompting the introduction of ExDDV—the first dataset and benchmark for explainable deepfake detection, using text and click annotations to enhance model robustness and localization of artifacts.

Authors:Merijn Floren, Jean-Philippe Noël, Jan Swevers
Title: Inference and Learning of Nonlinear LFR State-Space Models
Abstract:
Estimating the parameters of nonlinear block-oriented state-space models from input-output data typically involves solving a highly non-convex optimization problem, which is prone to poor local minima and slow convergence. This paper presents a computationally efficient initialization method for nonlinear linear fractional representation (NL-LFR) models using periodic data. By first inferring the latent signals and subsequently estimating the model parameters, the approach generates initial estimates for use in a later nonlinear optimization step. The proposed method shows robustness against poor local minima, and achieves a twofold error reduction compared to the state-of-the-art on a challenging benchmark dataset.
中文: 本文提出了一种利用周期数据对非线性线性分式表示模型进行高效初始化的方法,通过推断潜在信号和估计参数来避免不良局部极小值,并在基准数据集上实现比现有技术误差减半的效果。
English: This paper introduces an efficient initialization method for nonlinear linear fractional representation models using periodic data, which infers latent signals and estimates parameters to avoid poor local minima and reduce error by half compared to existing methods.

Authors:Maximilian Beck, Korbinian Pöppel, Phillip Lippe, Sepp Hochreiter
Title: Tiled Flash Linear Attention: More Efficient Linear RNN and xLSTM Kernels
Abstract:
Linear RNNs with gating recently demonstrated competitive performance compared to Transformers in language modeling. Although their linear compute scaling in sequence length offers theoretical runtime advantages over Transformers, realizing these benefits in practice requires optimized custom kernels, as Transformers rely on the highly efficient Flash Attention kernels (Dao, 2024). Leveraging the chunkwise-parallel formulation of linear RNNs, Flash Linear Attention (FLA) (Yang & Zhang, 2024) shows that linear RNN kernels are faster than Flash Attention, by parallelizing over chunks of the input sequence. However, since the chunk size of FLA is limited, many intermediate states must be materialized in GPU memory. This leads to low arithmetic intensity and causes high memory consumption and IO cost, especially for long-context pre-training. In this work, we present Tiled Flash Linear Attention (TFLA), a novel kernel algorithm for linear RNNs, that enables arbitrary large chunk sizes and high arithmetic intensity by introducing an additional level of sequence parallelization within each chunk. First, we apply TFLA to the xLSTM with matrix memory, the mLSTM (Beck et al., 2024). Second, we propose an mLSTM variant with sigmoid input gate and reduced computation for even faster kernel runtimes at equal language modeling performance. In our speed benchmarks, we show that our new mLSTM kernels based on TFLA outperform highly optimized Flash Attention, Linear Attention and Mamba kernels, setting a new state of the art for efficient long-context sequence modeling primitives.
Chinese: 瓦片式闪存线性注意力(TFLA)是一种新颖的线性RNN核算法,通过支持大块大小和高算术强度,实现了高效的长上下文建模,在速度基准测试中超越了现有注意力机制。
English: Tiled Flash Linear Attention (TFLA) is a novel kernel algorithm for linear RNNs that enables efficient long-context modeling by allowing large chunk sizes and high arithmetic intensity, outperforming existing attention mechanisms in speed benchmarks.

Authors:Sai Coumar, Zachary Kingston
Title: Evaluating Machine Learning Approaches for ASCII Art Generation
Abstract:
Generating structured ASCII art using computational techniques demands a careful interplay between aesthetic representation and computational precision, requiring models that can effectively translate visual information into symbolic text characters. Although Convolutional Neural Networks (CNNs) have shown promise in this domain, the comparative performance of deep learning architectures and classical machine learning methods remains unexplored. This paper explores the application of contemporary ML and DL methods to generate structured ASCII art, focusing on three key criteria: fidelity, character classification accuracy, and output quality. We investigate deep learning architectures, including Multilayer Perceptrons (MLPs), ResNet, and MobileNetV2, alongside classical approaches such as Random Forests, Support Vector Machines (SVMs) and k-Nearest Neighbors (k-NN), trained on an augmented synthetic dataset of ASCII characters. Our results show that complex neural network architectures often fall short in producing high-quality ASCII art, whereas classical machine learning classifiers, despite their simplicity, achieve performance similar to CNNs. Our findings highlight the strength of classical methods in bridging model simplicity with output quality, offering new insights into ASCII art synthesis and machine learning on image data with low dimensionality.
中文摘要:本研究表明,在生成结构化ASCII艺术时,随机森林和支持向量机等经典机器学习方法能够取得与复杂深度学习架构相当的性能,突显了它们在模型简洁性与输出质量之间实现平衡的有效性。
English Summary: This study demonstrates that classical machine learning methods like Random Forests and SVMs can achieve comparable performance to complex deep learning architectures in generating structured ASCII art, emphasizing their effectiveness in balancing model simplicity with output quality.

Authors:Yali Bi, Enyu Che, Yinan Chen, Yuanpeng He, Jingwei Qu
Title: Multi-Prototype Embedding Refinement for Semi-Supervised Medical Image Segmentation
Abstract:
Medical image segmentation aims to identify anatomical structures at the voxel-level. Segmentation accuracy relies on distinguishing voxel differences. Compared to advancements achieved in studies of the inter-class variance, the intra-class variance receives less attention. Moreover, traditional linear classifiers, limited by a single learnable weight per class, struggle to capture this finer distinction. To address the above challenges, we propose a Multi-Prototype-based Embedding Refinement method for semi-supervised medical image segmentation. Specifically, we design a multi-prototype-based classification strategy, rethinking the segmentation from the perspective of structural relationships between voxel embeddings. The intra-class variations are explored by clustering voxels along the distribution of multiple prototypes in each class. Next, we introduce a consistency constraint to alleviate the limitation of linear classifiers. This constraint integrates different classification granularities from a linear classifier and the proposed prototype-based classifier. In the thorough evaluation on two popular benchmarks, our method achieves superior performance compared with state-of-the-art methods. Code is available at https://github.com/Briley-byl123/MPER.
Chinese: 本研究提出了一种基于多原型嵌入优化的半监督医学图像分割方法,通过多原型聚类和一致性约束改进类内差异建模,在基准测试中取得了优于现有技术的性能。
English: This study introduces a Multi-Prototype-based Embedding Refinement method for semi-supervised medical image segmentation, which enhances intra-class variance modeling through multi-prototype clustering and consistency constraints, achieving state-of-the-art performance on benchmark datasets.

Authors:Chenxiao Yang, Nathan Srebro, David McAllester, Zhiyuan Li
Title: PENCIL: Long Thoughts with Short Memory
Abstract:
While state-of-the-art LLMs have demonstrated great promise of using long Chains-of-Thought (CoT) to boost reasoning, scaling it up to more challenging problems at test-time is fundamentally limited by suboptimal memory usage -- intermediate computations accumulate indefinitely in context even when no longer needed for future thoughts. We introduce PENCIL, which incorporates a novel reduction mechanism into the autoregressive generation process that recursively cleans up intermediate thoughts based on patterns learned from training. By iteratively generating and erasing thoughts, PENCIL can think deeper to solve harder problems using shorter context and less compute. Empirically, we observe PENCIL is significantly more effective and efficient than CoT. For example, we demonstrate PENCIL with a small 25M-parameter transformer and 2048 context length solves Einstein's puzzle -- a task that challenges much larger models like GPT-4. Theoretically, we prove PENCIL can perform universal efficient computation by simulating any Turing machines with optimal time and space complexity, and thus can solve arbitrary computable tasks that are otherwise intractable for vanilla CoT.
中文: PENCIL通过引入一种新颖的消减机制,在推理过程中递归清理中间思考,从而能以更短的上下文和更少的计算资源解决比传统思维链方法更复杂的问题。
English: PENCIL introduces a novel reduction mechanism that recursively cleans up intermediate thoughts during reasoning, enabling deeper problem-solving with shorter context and less computation than traditional Chain-of-Thought methods.

Authors:Yu Cheng, Fajie Yuan
Title: LeanVAE: An Ultra-Efficient Reconstruction VAE for Video Diffusion Models
Abstract:
Recent advances in Latent Video Diffusion Models (LVDMs) have revolutionized video generation by leveraging Video Variational Autoencoders (Video VAEs) to compress intricate video data into a compact latent space. However, as LVDM training scales, the computational overhead of Video VAEs becomes a critical bottleneck, particularly for encoding high-resolution videos. To address this, we propose LeanVAE, a novel and ultra-efficient Video VAE framework that introduces two key innovations: (1) a lightweight architecture based on a Neighborhood-Aware Feedforward (NAF) module and non-overlapping patch operations, drastically reducing computational cost, and (2) the integration of wavelet transforms and compressed sensing techniques to enhance reconstruction quality. Extensive experiments validate LeanVAE's superiority in video reconstruction and generation, particularly in enhancing efficiency over existing Video VAEs. Our model offers up to 50x fewer FLOPs and 44x faster inference speed while maintaining competitive reconstruction quality, providing insights for scalable, efficient video generation. Our models and code are available at https://github.com/westlake-repl/LeanVAE
中文: 提出的LeanVAE框架采用轻量级架构和基于小波的增强技术,在保持高质量视频重建的同时,显著降低了计算成本并加快了推理速度。
English: The proposed LeanVAE framework introduces a lightweight architecture and wavelet-based enhancements to significantly reduce computational costs and accelerate inference in video generation, while maintaining high reconstruction quality.

Authors:Weihang Su, Baoqing Yue, Qingyao Ai, Yiran Hu, Jiaqi Li, Changyue Wang, Kaiyuan Zhang, Yueyue Wu, Yiqun Liu
Title: JuDGE: Benchmarking Judgment Document Generation for Chinese Legal System
Abstract:
This paper introduces JuDGE (Judgment Document Generation Evaluation), a novel benchmark for evaluating the performance of judgment document generation in the Chinese legal system. We define the task as generating a complete legal judgment document from the given factual description of the case. To facilitate this benchmark, we construct a comprehensive dataset consisting of factual descriptions from real legal cases, paired with their corresponding full judgment documents, which serve as the ground truth for evaluating the quality of generated documents. This dataset is further augmented by two external legal corpora that provide additional legal knowledge for the task: one comprising statutes and regulations, and the other consisting of a large collection of past judgment documents. In collaboration with legal professionals, we establish a comprehensive automated evaluation framework to assess the quality of generated judgment documents across various dimensions. We evaluate various baseline approaches, including few-shot in-context learning, fine-tuning, and a multi-source retrieval-augmented generation (RAG) approach, using both general and legal-domain LLMs. The experimental results demonstrate that, while RAG approaches can effectively improve performance in this task, there is still substantial room for further improvement. All the codes and datasets are available at: https://github.com/oneal2000/JuDGE.
中文摘要:本文提出JuDGE这一中国法律判决文书生成评估新基准,通过包含真实案例的完整数据集和自动化评估框架证明检索增强方法能有效提升生成质量,但仍需进一步改进。
English Summary: This paper presents JuDGE, a new benchmark for evaluating judgment document generation in Chinese law, featuring a comprehensive dataset and automated evaluation framework that shows retrieval-augmented methods improve performance but require further development.

Authors:Tingyang Xiao, Xiaolin Zhou, Liu Liu, Wei Sui, Wei Feng, Jiaxiong Qiu, Xinjie Wang, Zhizhong Su
Title: GeoFlow-SLAM: A Robust Tightly-Coupled RGBD-Inertial and Legged Odometry Fusion SLAM for Dynamic Legged Robotics
Abstract:
This paper presents GeoFlow-SLAM, a robust and effective Tightly-Coupled RGBD-inertial SLAM for legged robotics undergoing aggressive and high-frequency motions.By integrating geometric consistency, legged odometry constraints, and dual-stream optical flow (GeoFlow), our method addresses three critical challenges:feature matching and pose initialization failures during fast locomotion and visual feature scarcity in texture-less scenes.Specifically, in rapid motion scenarios, feature matching is notably enhanced by leveraging dual-stream optical flow, which combines prior map points and poses. Additionally, we propose a robust pose initialization method for fast locomotion and IMU error in legged robots, integrating IMU/Legged odometry, inter-frame Perspective-n-Point (PnP), and Generalized Iterative Closest Point (GICP). Furthermore, a novel optimization framework that tightly couples depth-to-map and GICP geometric constraints is first introduced to improve the robustness and accuracy in long-duration, visually texture-less environments. The proposed algorithms achieve state-of-the-art (SOTA) on collected legged robots and open-source datasets. To further promote research and development, the open-source datasets and code will be made publicly available at https://github.com/HorizonRobotics/GeoFlowSlam
中文: GeoFlow-SLAM 是一种用于腿式机器人的鲁棒RGBD-惯性SLAM系统,通过结合几何约束和双流光流技术,有效解决了快速运动和纹理缺失场景中的特征匹配难题,实现了业界领先的性能。
English: GeoFlow-SLAM is a robust RGBD-inertial SLAM system for legged robots that integrates geometric constraints and dual-stream optical flow to overcome feature matching failures in aggressive motions and texture-less environments, achieving state-of-the-art performance.

Authors:Chenting Wang, Kunchang Li, Tianxiang Jiang, Xiangyu Zeng, Yi Wang, Limin Wang
Title: Make Your Training Flexible: Towards Deployment-Efficient Video Models
Abstract:
Popular video training methods mainly operate on a fixed number of tokens sampled from a predetermined spatiotemporal grid, resulting in sub-optimal accuracy-computation trade-offs due to inherent video redundancy. They also lack adaptability to varying computational budgets for downstream tasks, hindering applications of the most competitive model in real-world scenes. We thus propose a new test setting, Token Optimization, for maximized input information across budgets, which optimizes the size-limited set of input tokens through token selection from more suitably sampled videos. To this end, we propose a novel augmentation tool termed Flux. By making the sampling grid flexible and leveraging token selection, it is easily adopted in most popular video training frameworks, boosting model robustness with nearly no additional cost. We integrate Flux in large-scale video pre-training, and the resulting FluxViT establishes new state-of-the-art results across extensive tasks at standard costs. Notably, with 1/4 tokens only, it can still match the performance of previous state-of-the-art models with Token Optimization, yielding nearly 90\% savings. All models and data are available at https://github.com/OpenGVLab/FluxViT.
中文摘要:提出的Flux方法通过灵活采样视频并优化令牌选择,以极低成本显著提升了视频训练的精度与计算效率,同时增强了模型在不同计算预算下的适应能力。
English Summary: The proposed Flux method enhances video training by dynamically optimizing token selection from flexibly sampled videos, achieving superior accuracy-computation efficiency and adaptability across budgets with minimal cost.

Authors:Junjin Xiao, Qing Zhang, Yonewei Nie, Lei Zhu, Wei-Shi Zheng
Title: RoGSplat: Learning Robust Generalizable Human Gaussian Splatting from Sparse Multi-View Images
Abstract:
This paper presents RoGSplat, a novel approach for synthesizing high-fidelity novel views of unseen human from sparse multi-view images, while requiring no cumbersome per-subject optimization. Unlike previous methods that typically struggle with sparse views with few overlappings and are less effective in reconstructing complex human geometry, the proposed method enables robust reconstruction in such challenging conditions. Our key idea is to lift SMPL vertices to dense and reliable 3D prior points representing accurate human body geometry, and then regress human Gaussian parameters based on the points. To account for possible misalignment between SMPL model and images, we propose to predict image-aligned 3D prior points by leveraging both pixel-level features and voxel-level features, from which we regress the coarse Gaussians. To enhance the ability to capture high-frequency details, we further render depth maps from the coarse 3D Gaussians to help regress fine-grained pixel-wise Gaussians. Experiments on several benchmark datasets demonstrate that our method outperforms state-of-the-art methods in novel view synthesis and cross-dataset generalization. Our code is available at https://github.com/iSEE-Laboratory/RoGSplat.
中文: RoGSplat提出了一种无需逐对象优化的新方法,通过利用基于SMPL的3D先验点和多级特征,从稀疏多视角图像合成未见人体的高保真新视图,在挑战性条件下实现鲁棒重建和卓越性能。
English: RoGSplat introduces a novel method for synthesizing high-fidelity novel views of unseen humans from sparse multi-view images without per-subject optimization, leveraging SMPL-based 3D prior points and multi-level features to achieve robust reconstruction and superior performance in challenging conditions.

Authors:Rui Cao, Wei Tu, Dongsheng Chen, Wenyu Zhang
Title: Mapping Urban Villages in China: Progress and Challenges
Abstract:
The shift toward high-quality urbanization has brought increased attention to the issue of "urban villages", which has become a prominent social problem in China. However, there is a lack of available geospatial data on urban villages, making it crucial to prioritize urban village mapping. In order to assess the current progress in urban village mapping and identify challenges and future directions, we have conducted a comprehensive review, which to the best of our knowledge is the first of its kind in this field. Our review begins by providing a clear context for urban villages and elaborating the method for literature review, then summarizes the study areas, data sources, and approaches used for urban village mapping in China. We also address the challenges and future directions for further research. Through thorough investigation, we find that current studies only cover very limited study areas and periods and lack sufficient investigation into the scalability, transferability, and interpretability of identification approaches due to the challenges in concept fuzziness and variances, spatial heterogeneity and variances of urban villages, and data availability. Future research can complement and further the current research in the following potential directions in order to achieve large-area mapping across the whole nation...
中文: 该综述指出中国城中村测绘存在研究范围有限、方法可扩展性不足等问题,并提出未来需通过解决概念模糊性、空间异质性等挑战来实现全国范围的高效测绘。
English: The review highlights the critical need for comprehensive urban village mapping in China, revealing current limitations in study scope and methodology while suggesting future directions to overcome challenges like spatial heterogeneity and data constraints.

Authors:Mingtian Tan, Mike A. Merrill, Zack Gottesman, Tim Althoff, David Evans, Tom Hartvigsen
Title: Inferring Events from Time Series using Language Models
Abstract:
Time series data measure how environments change over time and drive decision-making in critical domains like finance and healthcare. A common goal in analyzing time series data is to understand the underlying events that cause the observed variations. We conduct the first study of whether Large Language Models (LLMs) can infer events described with natural language from time series data. We evaluate 18 LLMs on a task to match event sequences with real-valued time series data using a new benchmark we develop using sports data. Several current LLMs demonstrate promising abilities, with OpenAI's o1 performing the best but with DS-R1-distill-Qwen-32B outperforming proprietary models such as GPT-4o. From insights derived from analyzing reasoning failures, we also find clear avenues to improve performance. By applying post-training optimizations, i.e., distillation and self-improvement, we significantly enhance the performance of the Qwen2.5 1.5B, achieving results second only to o1. All resources needed to reproduce our work are available: https://github.com/BennyTMT/GAMETime
中文摘要:本研究首次评估了18个大语言模型从时间序列数据中推断自然语言事件的能力,发现OpenAI的o1模型表现最佳,而通过蒸馏技术显著提升了较小模型的性能。
English Summary: This study pioneers the evaluation of 18 large language models' ability to infer natural language events from time series data, finding OpenAI's o1 model performs best while distillation techniques significantly boost smaller models' performance.

Authors:Zining Wang, Tongkun Guan, Pei Fu, Chen Duan, Qianyi Jiang, Zhentao Guo, Shan Guo, Junfeng Luo, Wei Shen, Xiaokang Yang
Title: Marten: Visual Question Answering with Mask Generation for Multi-modal Document Understanding
Abstract:
Multi-modal Large Language Models (MLLMs) have introduced a novel dimension to document understanding, i.e., they endow large language models with visual comprehension capabilities; however, how to design a suitable image-text pre-training task for bridging the visual and language modality in document-level MLLMs remains underexplored. In this study, we introduce a novel visual-language alignment method that casts the key issue as a Visual Question Answering with Mask generation (VQAMask) task, optimizing two tasks simultaneously: VQA-based text parsing and mask generation. The former allows the model to implicitly align images and text at the semantic level. The latter introduces an additional mask generator (discarded during inference) to explicitly ensure alignment between visual texts within images and their corresponding image regions at a spatially-aware level. Together, they can prevent model hallucinations when parsing visual text and effectively promote spatially-aware feature representation learning. To support the proposed VQAMask task, we construct a comprehensive image-mask generation pipeline and provide a large-scale dataset with 6M data (MTMask6M). Subsequently, we demonstrate that introducing the proposed mask generation task yields competitive document-level understanding performance. Leveraging the proposed VQAMask, we introduce Marten, a training-efficient MLLM tailored for document-level understanding. Extensive experiments show that our Marten consistently achieves significant improvements among 8B-MLLMs in document-centric tasks. Code and datasets are available at https://github.com/PriNing/Marten.
中文: 本研究提出了一种新颖的视觉问答掩码生成方法,通过语义和空间层面的双重对齐增强多模态文档理解,由此开发的Marten模型在文档任务中展现出卓越性能。
English: This study introduces a novel Visual Question Answering with Mask generation (VQAMask) method that enhances multi-modal document understanding by simultaneously aligning images and text at semantic and spatial levels, resulting in the development of Marten, an efficient MLLM that demonstrates superior performance in document-centric tasks.

Authors:Hao Zhang, Mingyue Cheng, Qi Liu, Junzhe Jiang, Xianquan Wang, Rujiao Zhang, Chenyi Lei, Enhong Chen
Title: A Comprehensive Survey on Cross-Domain Recommendation: Taxonomy, Progress, and Prospects
Abstract:
Recommender systems (RS) have become crucial tools for information filtering in various real world scenarios. And cross domain recommendation (CDR) has been widely explored in recent years in order to provide better recommendation results in the target domain with the help of other domains. The CDR technology has developed rapidly, yet there is a lack of a comprehensive survey summarizing recent works. Therefore, in this paper, we will summarize the progress and prospects based on the main procedure of CDR, including Cross Domain Relevance, Cross Domain Interaction, Cross Domain Representation Enhancement and Model Optimization. To help researchers better understand and engage in this field, we also organize the applications and resources, and highlight several current important challenges and future directions of CDR. More details of the survey articles are available at https://github.com/USTCAGI/Awesome-Cross-Domain Recommendation-Papers-and-Resources.
Chinese: 本文对跨领域推荐(CDR)进行了全面综述,总结了其进展、应用和资源,并指出了当前的重要挑战和未来方向,以帮助研究人员更好地理解和参与该领域。
English: This paper provides a comprehensive survey of cross-domain recommendation (CDR) by summarizing its progress, applications, and resources, while highlighting key challenges and future directions to aid researchers in the field.

Authors:Weihong Chen, Xuemiao Xu, Haoxin Yang, Yi Xie, Peng Xiao, Cheng Xu, Huaidong Zhang, Pheng-Ann Heng
Title: SCJD: Sparse Correlation and Joint Distillation for Efficient 3D Human Pose Estimation
Abstract:
Existing 3D Human Pose Estimation (HPE) methods achieve high accuracy but suffer from computational overhead and slow inference, while knowledge distillation methods fail to address spatial relationships between joints and temporal correlations in multi-frame inputs. In this paper, we propose Sparse Correlation and Joint Distillation (SCJD), a novel framework that balances efficiency and accuracy for 3D HPE. SCJD introduces Sparse Correlation Input Sequence Downsampling to reduce redundancy in student network inputs while preserving inter-frame correlations. For effective knowledge transfer, we propose Dynamic Joint Spatial Attention Distillation, which includes Dynamic Joint Embedding Distillation to enhance the student's feature representation using the teacher's multi-frame context feature, and Adjacent Joint Attention Distillation to improve the student network's focus on adjacent joint relationships for better spatial understanding. Additionally, Temporal Consistency Distillation aligns the temporal correlations between teacher and student networks through upsampling and global supervision. Extensive experiments demonstrate that SCJD achieves state-of-the-art performance. Code is available at https://github.com/wileychan/SCJD.
中文:提出的SCJD框架通过稀疏输入降采样和动态关节注意力蒸馏技术,在提升三维人体姿态估计效率的同时,通过优化时空建模保持了精度优势。
English: The proposed SCJD framework enhances 3D human pose estimation by introducing sparse input downsampling and dynamic joint attention distillation to improve efficiency while maintaining accuracy through better spatial and temporal modeling.

Authors:Shengping Zhang, Xiaoyu Han, Weigang Zhang, Xiangyuan Lan, Hongxun Yao, Qingming Huang
Title: Limb-Aware Virtual Try-On Network with Progressive Clothing Warping
Abstract:
Image-based virtual try-on aims to transfer an in-shop clothing image to a person image. Most existing methods adopt a single global deformation to perform clothing warping directly, which lacks fine-grained modeling of in-shop clothing and leads to distorted clothing appearance. In addition, existing methods usually fail to generate limb details well because they are limited by the used clothing-agnostic person representation without referring to the limb textures of the person image. To address these problems, we propose Limb-aware Virtual Try-on Network named PL-VTON, which performs fine-grained clothing warping progressively and generates high-quality try-on results with realistic limb details. Specifically, we present Progressive Clothing Warping (PCW) that explicitly models the location and size of in-shop clothing and utilizes a two-stage alignment strategy to progressively align the in-shop clothing with the human body. Moreover, a novel gravity-aware loss that considers the fit of the person wearing clothing is adopted to better handle the clothing edges. Then, we design Person Parsing Estimator (PPE) with a non-limb target parsing map to semantically divide the person into various regions, which provides structural constraints on the human body and therefore alleviates texture bleeding between clothing and body regions. Finally, we introduce Limb-aware Texture Fusion (LTF) that focuses on generating realistic details in limb regions, where a coarse try-on result is first generated by fusing the warped clothing image with the person image, then limb textures are further fused with the coarse result under limb-aware guidance to refine limb details. Extensive experiments demonstrate that our PL-VTON outperforms the state-of-the-art methods both qualitatively and quantitatively.
中文: PL-VTON模型通过渐进式服装变形和肢体感知纹理融合技术,解决了虚拟试衣中的服装变形和肢体细节缺失问题,在效果上超越了现有最优方法。
English: The PL-VTON model addresses clothing distortion and poor limb detail in virtual try-on by introducing progressive clothing warping and limb-aware texture fusion, achieving superior results over existing methods.

Authors:Guy Bar-Shalom, Fabrizio Frasca, Derek Lim, Yoav Gelberg, Yftah Ziser, Ran El-Yaniv, Gal Chechik, Haggai Maron
Title: Learning on LLM Output Signatures for gray-box Behavior Analysis
Abstract:
Large Language Models (LLMs) have achieved widespread adoption, yet our understanding of their behavior remains limited, particularly in detecting data contamination and hallucinations. While recently proposed probing techniques provide insights through activation analysis, they require ``white-box'' access to model internals, often unavailable. Current ``gray-box'' approaches typically analyze only the probability of the actual tokens in the sequence with simple task-specific heuristics. Importantly, these methods overlook the rich information contained in the full token distribution at each processing step. To address these limitations, we propose that gray-box analysis should leverage the complete observable output of LLMs, consisting of both the previously used token probabilities as well as the complete token distribution sequences - a unified data type we term LOS (LLM Output Signature). To this end, we develop a transformer-based approach to process LOS that theoretically guarantees approximation of existing techniques while enabling more nuanced analysis. Our approach achieves superior performance on hallucination and data contamination detection in gray-box settings, significantly outperforming existing baselines. Furthermore, it demonstrates strong transfer capabilities across datasets and LLMs, suggesting that LOS captures fundamental patterns in LLM behavior. Our code is available at: https://github.com/BarSGuy/LLM-Output-Signatures-Network.
中文: 本文提出LOS-Net,一种基于轻量级注意力机制的模型,通过利用完整的下一个词符概率分布序列(称为LLM输出签名),在不访问模型内部参数的情况下,有效检测大语言模型中的幻觉和训练数据污染问题。
English: This paper introduces LOS-Net, a lightweight attention-based model that utilizes the full sequence of next-token probability distributions, termed LLM Output Signature, to effectively detect hallucinations and training data contamination in Large Language Models without accessing internal model parameters.

Authors:Guy Bar-Shalom, Fabrizio Frasca, Derek Lim, Yoav Gelberg, Yftah Ziser, Ran El-Yaniv, Gal Chechik, Haggai Maron
Title: Beyond Next Token Probabilities: Learnable, Fast Detection of Hallucinations and Data Contamination on LLM Output Distributions
Abstract:
The automated detection of hallucinations and training data contamination is pivotal to the safe deployment of Large Language Models (LLMs). These tasks are particularly challenging in settings where no access to model internals is available. Current approaches in this setup typically leverage only the probabilities of actual tokens in the text, relying on simple task-specific heuristics. Crucially, they overlook the information contained in the full sequence of next-token probability distributions. We propose to go beyond hand-crafted decision rules by learning directly from the complete observable output of LLMs -- consisting not only of next-token probabilities, but also the full sequence of next-token distributions. We refer to this as the LLM Output Signature (LOS), and treat it as a reference data type for detecting hallucinations and data contamination. To that end, we introduce LOS-Net, a lightweight attention-based architecture trained on an efficient encoding of the LOS, which can provably approximate a broad class of existing techniques for both tasks. Empirically, LOS-Net achieves superior performance across diverse benchmarks and LLMs, while maintaining extremely low detection latency. Furthermore, it demonstrates promising transfer capabilities across datasets and LLMs. Full code is available at https://github.com/BarSGuy/Beyond-next-token-probabilities.
中文: 本文提出LOS-Net,一种基于轻量级注意力机制的模型,通过利用完整的下一个词符概率分布序列(称为LLM输出签名),在不访问模型内部参数的情况下,有效检测大语言模型中的幻觉和训练数据污染问题。
English: This paper introduces LOS-Net, a lightweight attention-based model that utilizes the full sequence of next-token probability distributions, termed LLM Output Signature, to effectively detect hallucinations and training data contamination in Large Language Models without accessing internal model parameters.

Authors:Runsong Zhu, Shi Qiu, Zhengzhe Liu, Ka-Hei Hui, Qianyi Wu, Pheng-Ann Heng, Chi-Wing Fu
Title: Rethinking End-to-End 2D to 3D Scene Segmentation in Gaussian Splatting
Abstract:
Lifting multi-view 2D instance segmentation to a radiance field has proven to be effective to enhance 3D understanding. Existing methods rely on direct matching for end-to-end lifting, yielding inferior results; or employ a two-stage solution constrained by complex pre- or post-processing. In this work, we design a new end-to-end object-aware lifting approach, named Unified-Lift that provides accurate 3D segmentation based on the 3D Gaussian representation. To start, we augment each Gaussian point with an additional Gaussian-level feature learned using a contrastive loss to encode instance information. Importantly, we introduce a learnable object-level codebook to account for individual objects in the scene for an explicit object-level understanding and associate the encoded object-level features with the Gaussian-level point features for segmentation predictions. While promising, achieving effective codebook learning is non-trivial and a naive solution leads to degraded performance. Therefore, we formulate the association learning module and the noisy label filtering module for effective and robust codebook learning. We conduct experiments on three benchmarks: LERF-Masked, Replica, and Messy Rooms datasets. Both qualitative and quantitative results manifest that our Unified-Lift clearly outperforms existing methods in terms of segmentation quality and time efficiency. The code is publicly available at \href{https://github.com/Runsong123/Unified-Lift}{https://github.com/Runsong123/Unified-Lift}.
Chinese: 本文提出Unified-Lift方法,通过结合高斯点特征与可学习的对象级码本实现端到端的对象感知三维分割,在精度和效率上均优于现有方法。
English: This paper introduces Unified-Lift, an end-to-end object-aware lifting method that enhances 3D segmentation by integrating Gaussian-level features with a learnable object-level codebook, outperforming existing approaches in accuracy and efficiency.

Authors:Wei Lu, Si-Bao Chen, Hui-Dong Li, Qing-Ling Shu, Chris H. Q. Ding, Jin Tang, Bin Luo
Title: LEGNet: Lightweight Edge-Gaussian Driven Network for Low-Quality Remote Sensing Image Object Detection
Abstract:
Remote sensing object detection (RSOD) often suffers from degradations such as low spatial resolution, sensor noise, motion blur, and adverse illumination. These factors diminish feature distinctiveness, leading to ambiguous object representations and inadequate foreground-background separation. Existing RSOD methods exhibit limitations in robust detection of low-quality objects. To address these pressing challenges, we introduce LEGNet, a lightweight backbone network featuring a novel Edge-Gaussian Aggregation (EGA) module specifically engineered to enhance feature representation derived from low-quality remote sensing images. EGA module integrates: (a) orientation-aware Scharr filters to sharpen crucial edge details often lost in low-contrast or blurred objects, and (b) Gaussian-prior-based feature refinement to suppress noise and regularize ambiguous feature responses, enhancing foreground saliency under challenging conditions. EGA module alleviates prevalent problems in reduced contrast, structural discontinuities, and ambiguous feature responses prevalent in degraded images, effectively improving model robustness while maintaining computational efficiency. Comprehensive evaluations across five benchmarks (DOTA-v1.0, v1.5, DIOR-R, FAIR1M-v1.0, and VisDrone2019) demonstrate that LEGNet achieves state-of-the-art performance, particularly in detecting low-quality objects. The code is available at https://github.com/lwCVer/LEGNet.
中文: LEGNet提出了一种轻量级主干网络,其边缘-高斯聚合模块通过锐化边缘和抑制噪声来增强低质量遥感图像中的特征表示,在多个基准测试中实现了最先进的检测性能。
English: LEGNet introduces a lightweight backbone with an Edge-Gaussian Aggregation module that enhances feature representation in low-quality remote sensing images by sharpening edges and suppressing noise, achieving state-of-the-art detection performance across multiple benchmarks.

Authors:Zixuan Zheng, Yilei Shi, Chunlei Li, Jingliang Hu, Xiao Xiang Zhu, Lichao Mou
Title: Rethinking Cell Counting Methods: Decoupling Counting and Localization
Abstract:
Cell counting in microscopy images is vital in medicine and biology but extremely tedious and time-consuming to perform manually. While automated methods have advanced in recent years, state-of-the-art approaches tend to increasingly complex model designs. In this paper, we propose a conceptually simple yet effective decoupled learning scheme for automated cell counting, consisting of separate counter and localizer networks. In contrast to jointly learning counting and density map estimation, we show that decoupling these objectives surprisingly improves results. The counter operates on intermediate feature maps rather than pixel space to leverage global context and produce count estimates, while also generating coarse density maps. The localizer then reconstructs high-resolution density maps that precisely localize individual cells, conditional on the original images and coarse density maps from the counter. Besides, to boost counting accuracy, we further introduce a global message passing module to integrate cross-region patterns. Extensive experiments on four datasets demonstrate that our approach, despite its simplicity, challenges common practice and achieves state-of-the-art performance by significant margins. Our key insight is that decoupled learning alleviates the need to learn counting on high-resolution density maps directly, allowing the model to focus on global features critical for accurate estimates. Code is available at https://github.com/MedAITech/DCL.
中文: 本文提出解耦学习方案,通过分离计数与定位网络,利用全局上下文生成粗密度图并重构高分辨率定位,以更简单的模型实现了最优性能。
English: This paper introduces a decoupled learning approach that separates cell counting and localization into distinct networks, achieving superior accuracy by focusing on global context and simplifying high-resolution density map reconstruction.

Authors:Mykyta Syromiatnikov, Victoria Ruvinskaya, Nataliia Komleva
Title: Empowering Smaller Models: Tuning LLaMA and Gemma with Chain-of-Thought for Ukrainian Exam Tasks
Abstract:
Leading large language models have demonstrated impressive capabilities in reasoning-intensive tasks, such as standardized educational testing. However, they often require extensive training in low-resource settings with inaccessible infrastructure. Small or compact models, though more efficient, frequently lack sufficient support for underrepresented languages, leaving a performance gap in critical domains. This work explores the potential of parameter-efficient fine-tuning of compact open-weight language models to handle reasoning-intensive tasks in the underrepresented Ukrainian language, building on the findings of the ZNO-Eval benchmark. Parameter-efficient fine-tuning of LLaMA 3.1 (8 billion parameters), LLaMA 3.2 (3 billion parameters), and Gemma 2 (9 billion parameters) models on chain-of-thought solutions resulted in a modest test score improvement of up to 17.4% on complex matching tasks and 1.6% overall compared to tuning on answer letters alone, offering enhanced interpretability and robustness. In addition, the proposed tuning method with joint task topic and step-by-step solution generation outperforms standard chain-of-thought tuning in matching tasks and provides a 5.4% gain over the best LLaMA 3.2 model due to guiding the model to recall and apply domain-relevant information. Contrasting obtained results with zero-shot evaluations of leading open-weight and proprietary models such as Qwen, DeepSeek R1, OpenAI o1 and o3, Gemini, and Claude, highlight that fine-tuning LLaMA and Gemma models with 2,032 step-by-step solutions and 20 to 50 million trainable parameters on a single A100 GPU lets them outperform GPT-4o mini, Mistral Large, and larger open-weight models. This research also evaluates how merging the quantized adapter with the base model influences the generation quality. Source code and tuned models are available at https://github.com/NLPForUA/ZNO.
中文: 本研究证明,通过对LLaMA和Gemma等紧凑型语言模型进行参数高效微调,能显著提升其在乌克兰语推理任务中的表现,使其在保持计算效率的同时超越GPT-4o mini和Mistral Large等更大模型。
English: This study demonstrates that parameter-efficient fine-tuning of compact language models like LLaMA and Gemma significantly enhances their performance on reasoning tasks in Ukrainian, enabling them to surpass larger models including GPT-4o mini and Mistral Large while maintaining computational efficiency.

Authors:Yaxiong Chen, Yujie Wang, Zixuan Zheng, Jingliang Hu, Yilei Shi, Shengwu Xiong, Xiao Xiang Zhu, Lichao Mou
Title: Striving for Simplicity: Simple Yet Effective Prior-Aware Pseudo-Labeling for Semi-Supervised Ultrasound Image Segmentation
Abstract:
Medical ultrasound imaging is ubiquitous, but manual analysis struggles to keep pace. Automated segmentation can help but requires large labeled datasets, which are scarce. Semi-supervised learning leveraging both unlabeled and limited labeled data is a promising approach. State-of-the-art methods use consistency regularization or pseudo-labeling but grow increasingly complex. Without sufficient labels, these models often latch onto artifacts or allow anatomically implausible segmentations. In this paper, we present a simple yet effective pseudo-labeling method with an adversarially learned shape prior to regularize segmentations. Specifically, we devise an encoder-twin-decoder network where the shape prior acts as an implicit shape model, penalizing anatomically implausible but not ground-truth-deviating predictions. Without bells and whistles, our simple approach achieves state-of-the-art performance on two benchmarks under different partition protocols. We provide a strong baseline for future semi-supervised medical image segmentation. Code is available at https://github.com/WUTCM-Lab/Shape-Prior-Semi-Seg.
中文摘要:本文提出了一种简单有效的伪标签方法,通过对抗学习形状先验来规范医学图像分割,采用编码器-双解码器网络防止解剖学上不合理的预测,在两个基准测试中实现了最先进的性能。
English Summary: This paper introduces a simple yet effective pseudo-labeling method with an adversarially learned shape prior to regularize medical image segmentations, achieving state-of-the-art performance on two benchmarks through an encoder-twin-decoder network that prevents anatomically implausible predictions.

Authors:Jiankang Wang, Zhihan Zhang, Zhihang Liu, Yang Li, Jiannan Ge, Hongtao Xie, Yongdong Zhang
Title: SpaceVLLM: Endowing Multimodal Large Language Model with Spatio-Temporal Video Grounding Capability
Abstract:
Multimodal large language models (MLLMs) have made remarkable progress in either temporal or spatial localization. However, they struggle to perform spatio-temporal video grounding. This limitation stems from two major challenges. Firstly, it is difficult to extract accurate spatio-temporal information of each frame in the video. Secondly, the substantial number of visual tokens makes it challenging to precisely map visual tokens of each frame to their corresponding spatial coordinates. To address these issues, we introduce SpaceVLLM, a MLLM endowed with spatio-temporal video grounding capability. Specifically, we adopt a set of interleaved Spatio-Temporal Aware Queries to capture temporal perception and dynamic spatial information. Moreover, we propose a Query-Guided Space Decoder to establish a corresponding connection between the queries and spatial coordinates. Additionally, due to the lack of spatio-temporal datasets, we construct the Unified Spatio-Temporal Grounding (Uni-STG) dataset, comprising 480K instances across three tasks. This dataset fully exploits the potential of MLLM to simultaneously facilitate localization in both temporal and spatial dimensions. Extensive experiments demonstrate that SpaceVLLM achieves the state-of-the-art performance across 11 benchmarks covering temporal, spatial, spatio-temporal and video understanding tasks, highlighting the effectiveness of our approach. Our code, datasets and model will be released at https://github.com/Jayce1kk/SpaceVLLM.
Chinese: SpaceVLLM通过引入时空感知查询和查询引导的空间解码器,解决了多模态模型在视频时空定位中的难题,并利用新构建的Uni-STG数据集在11个基准测试中取得了最优性能。
English: SpaceVLLM introduces spatio-temporal aware queries and a query-guided decoder to overcome multimodal models' limitations in video grounding, achieving state-of-the-art results across 11 benchmarks with its newly created Uni-STG dataset.

Authors:Huy-Hoang Bui, Bach-Thuan Bui, Quang-Vinh Tran, Yasuyuki Fujii, Joo-Ho Lee
Title: A-SCoRe: Attention-based Scene Coordinate Regression for wide-ranging scenarios
Abstract:
Visual localization is considered to be one of the crucial parts in many robotic and vision systems. While state-of-the art methods that relies on feature matching have proven to be accurate for visual localization, its requirements for storage and compute are burdens. Scene coordinate regression (SCR) is an alternative approach that remove the barrier for storage by learning to map 2D pixels to 3D scene coordinates. Most popular SCR use Convolutional Neural Network (CNN) to extract 2D descriptor, which we would argue that it miss the spatial relationship between pixels. Inspired by the success of vision transformer architecture, we present a new SCR architecture, called A-ScoRe, an Attention-based model which leverage attention on descriptor map level to produce meaningful and high-semantic 2D descriptors. Since the operation is performed on descriptor map, our model can work with multiple data modality whether it is a dense or sparse from depth-map, SLAM to Structure-from-Motion (SfM). This versatility allows A-SCoRe to operate in different kind of environments, conditions and achieve the level of flexibility that is important for mobile robots. Results show our methods achieve comparable performance with State-of-the-art methods on multiple benchmark while being light-weighted and much more flexible. Code and pre-trained models are public in our repository: https://github.com/ais-lab/A-SCoRe.
中文摘要:A-SCoRe模型采用基于注意力的视觉定位架构,通过描述符层面的注意力机制有效捕捉像素空间关系,在保持轻量化的同时实现了与主流方法相当的跨场景适应能力。
English Summary: The A-SCoRe model introduces an attention-based architecture for visual localization that overcomes storage limitations and captures spatial relationships between pixels, achieving lightweight yet competitive performance across diverse environments.

Authors:Siwei Han, Peng Xia, Ruiyi Zhang, Tong Sun, Yun Li, Hongtu Zhu, Huaxiu Yao
Title: MDocAgent: A Multi-Modal Multi-Agent Framework for Document Understanding
Abstract:
Document Question Answering (DocQA) is a very common task. Existing methods using Large Language Models (LLMs) or Large Vision Language Models (LVLMs) and Retrieval Augmented Generation (RAG) often prioritize information from a single modal, failing to effectively integrate textual and visual cues. These approaches struggle with complex multi-modal reasoning, limiting their performance on real-world documents. We present MDocAgent (A Multi-Modal Multi-Agent Framework for Document Understanding), a novel RAG and multi-agent framework that leverages both text and image. Our system employs five specialized agents: a general agent, a critical agent, a text agent, an image agent and a summarizing agent. These agents engage in multi-modal context retrieval, combining their individual insights to achieve a more comprehensive understanding of the document's content. This collaborative approach enables the system to synthesize information from both textual and visual components, leading to improved accuracy in question answering. Preliminary experiments on five benchmarks like MMLongBench, LongDocURL demonstrate the effectiveness of our MDocAgent, achieve an average improvement of 12.1% compared to current state-of-the-art method. This work contributes to the development of more robust and comprehensive DocQA systems capable of handling the complexities of real-world documents containing rich textual and visual information. Our data and code are available at https://github.com/aiming-lab/MDocAgent.
中文: MDocAgent提出了一种多模态多智能体框架,通过五个专业智能体协同整合文本和图像分析,在五个基准测试中平均性能提升12.1%,显著提升了文档问答中的多模态推理能力。
English: MDocAgent introduces a multi-modal multi-agent framework that integrates text and image analysis through five specialized agents, achieving a 12.1% average improvement on benchmarks by enhancing multi-modal reasoning for document question answering.

Authors:Mu Chen, Liulei Li, Wenguan Wang, Yi Yang
Title: DIFFVSGG: Diffusion-Driven Online Video Scene Graph Generation
Abstract:
Top-leading solutions for Video Scene Graph Generation (VSGG) typically adopt an offline pipeline. Though demonstrating promising performance, they remain unable to handle real-time video streams and consume large GPU memory. Moreover, these approaches fall short in temporal reasoning, merely aggregating frame-level predictions over a temporal context. In response, we introduce DIFFVSGG, an online VSGG solution that frames this task as an iterative scene graph update problem. Drawing inspiration from Latent Diffusion Models (LDMs) which generate images via denoising a latent feature embedding, we unify the decoding of object classification, bounding box regression, and graph generation three tasks using one shared feature embedding. Then, given an embedding containing unified features of object pairs, we conduct a step-wise Denoising on it within LDMs, so as to deliver a clean embedding which clearly indicates the relationships between objects. This embedding then serves as the input to task-specific heads for object classification, scene graph generation, etc. DIFFVSGG further facilitates continuous temporal reasoning, where predictions for subsequent frames leverage results of past frames as the conditional inputs of LDMs, to guide the reverse diffusion process for current frames. Extensive experiments on three setups of Action Genome demonstrate the superiority of DIFFVSGG.
中文: DIFFVSGG是一种在线视频场景图生成方法,它将任务重构为迭代图更新,利用统一的潜在扩散模型高效解码物体与关系特征,实现实时处理、连续时序推理并降低GPU内存消耗。
English: DIFFVSGG is an online Video Scene Graph Generation method that reframes the task as iterative graph updates using a unified latent diffusion model to efficiently decode object and relationship features, enabling real-time processing and continuous temporal reasoning with reduced GPU memory.

Authors:Yixuan Li, Changli Tang, Jimin Zhuang, Yudong Yang, Guangzhi Sun, Wei Li, Zejun Ma, Chao Zhang
Title: Improving LLM Video Understanding with 16 Frames Per Second
Abstract:
Human vision is dynamic and continuous. However, in video understanding with multimodal large language models (LLMs), existing methods primarily rely on static features extracted from images sampled at a fixed low frame rate of frame-per-second (FPS) $\leqslant$2, leading to critical visual information loss. In this paper, we introduce F-16, the first multimodal LLM designed for high-frame-rate video understanding. By increasing the frame rate to 16 FPS and compressing visual tokens within each 1-second clip, F-16 efficiently captures dynamic visual features while preserving key semantic information. Experimental results demonstrate that higher frame rates considerably enhance video understanding across multiple benchmarks, providing a new approach to improving video LLMs beyond scaling model size or training data. F-16 achieves state-of-the-art performance among 7-billion-parameter video LLMs on both general and fine-grained video understanding benchmarks, such as Video-MME and TemporalBench. Furthermore, F-16 excels in complex spatiotemporal tasks, including high-speed sports analysis (\textit{e.g.}, basketball, football, gymnastics, and diving), outperforming SOTA proprietary visual models like GPT-4o and Gemini-1.5-pro. Additionally, we introduce a novel decoding method for F-16 that enables highly efficient low-frame-rate inference without requiring model retraining. We will release the source code, model checkpoints, and data at \href{https://github.com/bytedance/F-16}{https://github.com/bytedance/F-16}.
中文: F-16是一种多模态大语言模型,通过处理16帧每秒并压缩视觉标记,在多项基准测试和复杂任务中实现最优性能,无需扩大模型规模或训练数据。
English: F-16 is a multimodal LLM that enhances video understanding by processing 16 FPS with compressed visual tokens, achieving state-of-the-art results in various benchmarks and complex tasks without increasing model size or data.

Authors:Xinqing Li, Ruiqi Song, Qingyu Xie, Ye Wu, Nanxin Zeng, Yunfeng Ai
Title: SimWorld: A Unified Benchmark for Simulator-Conditioned Scene Generation via World Model
Abstract:
With the rapid advancement of autonomous driving technology, a lack of data has become a major obstacle to enhancing perception model accuracy. Researchers are now exploring controllable data generation using world models to diversify datasets. However, previous work has been limited to studying image generation quality on specific public datasets. There is still relatively little research on how to build data generation engines for real-world application scenes to achieve large-scale data generation for challenging scenes. In this paper, a simulator-conditioned scene generation engine based on world model is proposed. By constructing a simulation system consistent with real-world scenes, simulation data and labels, which serve as the conditions for data generation in the world model, for any scenes can be collected. It is a novel data generation pipeline by combining the powerful scene simulation capabilities of the simulation engine with the robust data generation capabilities of the world model. In addition, a benchmark with proportionally constructed virtual and real data, is provided for exploring the capabilities of world models in real-world scenes. Quantitative results show that these generated images significantly improve downstream perception models performance. Finally, we explored the generative performance of the world model in urban autonomous driving scenarios. All the data and code will be available at https://github.com/Li-Zn-H/SimWorld.
中文: 本文提出了一种基于世界模型的仿真条件场景生成引擎,通过结合仿真系统与生成模型构建新型数据生成流程,有效提升自动驾驶感知模型在真实场景中的性能表现。
English: This paper introduces a simulator-conditioned scene generation engine that integrates simulation capabilities with world models to produce diverse datasets, significantly enhancing perception model performance in autonomous driving applications.

Authors:Kang Yang, Tianci Bu, Lantao Li, Chunxu Li, Yongcai Wang, Deying Li
Title: Is Discretization Fusion All You Need for Collaborative Perception?
Abstract:
Collaborative perception in multi-agent system enhances overall perceptual capabilities by facilitating the exchange of complementary information among agents. Current mainstream collaborative perception methods rely on discretized feature maps to conduct fusion, which however, lacks flexibility in extracting and transmitting the informative features and can hardly focus on the informative features during fusion. To address these problems, this paper proposes a novel Anchor-Centric paradigm for Collaborative Object detection (ACCO). It avoids grid precision issues and allows more flexible and efficient anchor-centric communication and fusion. ACCO is composed by three main components: (1) Anchor featuring block (AFB) that targets to generate anchor proposals and projects prepared anchor queries to image features. (2) Anchor confidence generator (ACG) is designed to minimize communication by selecting only the features in the confident anchors to transmit. (3) A local-global fusion module, in which local fusion is anchor alignment-based fusion (LAAF) and global fusion is conducted by spatial-aware cross-attention (SACA). LAAF and SACA run in multi-layers, so agents conduct anchor-centric fusion iteratively to adjust the anchor proposals. Comprehensive experiments are conducted to evaluate ACCO on OPV2V and Dair-V2X datasets, which demonstrate ACCO's superiority in reducing the communication volume, and in improving the perception range and detection performances. Code can be found at: \href{https://github.com/sidiangongyuan/ACCO}{https://github.com/sidiangongyuan/ACCO}.
中文: 本文提出ACCO这一以锚点为中心的协同目标检测新范式,通过灵活的锚点通信与融合机制,在显著降低通信量的同时有效提升了感知范围与检测性能。
English: This paper introduces ACCO, a novel anchor-centric collaborative object detection method that enhances perception by enabling flexible anchor-based communication and fusion, significantly reducing communication volume while improving detection performance and range.

Authors:Dongkwan Lee, Kyomin Hwang, Nojun Kwak
Title: Unlocking the Potential of Unlabeled Data in Semi-Supervised Domain Generalization
Abstract:
We address the problem of semi-supervised domain generalization (SSDG), where the distributions of train and test data differ, and only a small amount of labeled data along with a larger amount of unlabeled data are available during training. Existing SSDG methods that leverage only the unlabeled samples for which the model's predictions are highly confident (confident-unlabeled samples), limit the full utilization of the available unlabeled data. To the best of our knowledge, we are the first to explore a method for incorporating the unconfident-unlabeled samples that were previously disregarded in SSDG setting. To this end, we propose UPCSC to utilize these unconfident-unlabeled samples in SSDG that consists of two modules: 1) Unlabeled Proxy-based Contrastive learning (UPC) module, treating unconfident-unlabeled samples as additional negative pairs and 2) Surrogate Class learning (SC) module, generating positive pairs for unconfident-unlabeled samples using their confusing class set. These modules are plug-and-play and do not require any domain labels, which can be easily integrated into existing approaches. Experiments on four widely used SSDG benchmarks demonstrate that our approach consistently improves performance when attached to baselines and outperforms competing plug-and-play methods. We also analyze the role of our method in SSDG, showing that it enhances class-level discriminability and mitigates domain gaps. The code is available at https://github.com/dongkwani/UPCSC.
中文: 本文提出UPCSC方法,通过利用未置信的未标记样本进行对比学习和代理类别学习,无需领域标签即可在半监督领域泛化中提升模型性能。
English: This paper introduces UPCSC, a novel method for semi-supervised domain generalization that leverages unconfident-unlabeled samples through contrastive learning and surrogate class modules to enhance model performance without requiring domain labels.

Authors:Barza Nisar, Steven L. Waslander
Title: PSA-SSL: Pose and Size-aware Self-Supervised Learning on LiDAR Point Clouds
Abstract:
Self-supervised learning (SSL) on 3D point clouds has the potential to learn feature representations that can transfer to diverse sensors and multiple downstream perception tasks. However, recent SSL approaches fail to define pretext tasks that retain geometric information such as object pose and scale, which can be detrimental to the performance of downstream localization and geometry-sensitive 3D scene understanding tasks, such as 3D semantic segmentation and 3D object detection. We propose PSA-SSL, a novel extension to point cloud SSL that learns object pose and size-aware (PSA) features. Our approach defines a self-supervised bounding box regression pretext task, which retains object pose and size information. Furthermore, we incorporate LiDAR beam pattern augmentation on input point clouds, which encourages learning sensor-agnostic features. Our experiments demonstrate that with a single pretrained model, our light-weight yet effective extensions achieve significant improvements on 3D semantic segmentation with limited labels across popular autonomous driving datasets (Waymo, nuScenes, SemanticKITTI). Moreover, our approach outperforms other state-of-the-art SSL methods on 3D semantic segmentation (using up to 10 times less labels), as well as on 3D object detection. Our code will be released on https://github.com/TRAILab/PSA-SSL.
中文: 提出的PSA-SSL方法通过边界框回归任务和激光雷达束增强,在三维点云自监督学习中融入物体姿态与尺寸感知,在多个自动驾驶数据集上以少量标注数据实现了三维语义分割和物体检测的显著性能提升。
English: The proposed PSA-SSL method enhances self-supervised learning on 3D point clouds by incorporating object pose and size awareness through a bounding box regression task and LiDAR beam augmentation, achieving superior performance in 3D semantic segmentation and object detection with limited labeled data across autonomous driving datasets.

Authors:Xiaoying Xing, Chia-Wen Kuo, Li Fuxin, Yulei Niu, Fan Chen, Ming Li, Ying Wu, Longyin Wen, Sijie Zhu
Title: Where do Large Vision-Language Models Look at when Answering Questions?
Abstract:
Large Vision-Language Models (LVLMs) have shown promising performance in vision-language understanding and reasoning tasks. However, their visual understanding behaviors remain underexplored. A fundamental question arises: to what extent do LVLMs rely on visual input, and which image regions contribute to their responses? It is non-trivial to interpret the free-form generation of LVLMs due to their complicated visual architecture (e.g., multiple encoders and multi-resolution) and variable-length outputs. In this paper, we extend existing heatmap visualization methods (e.g., iGOS++) to support LVLMs for open-ended visual question answering. We propose a method to select visually relevant tokens that reflect the relevance between generated answers and input image. Furthermore, we conduct a comprehensive analysis of state-of-the-art LVLMs on benchmarks designed to require visual information to answer. Our findings offer several insights into LVLM behavior, including the relationship between focus region and answer correctness, differences in visual attention across architectures, and the impact of LLM scale on visual understanding. The code and data are available at https://github.com/bytedance/LVLM_Interpretation.
Chinese: 本研究扩展了热力图可视化方法,以解释大型视觉语言模型在开放式问答中如何利用视觉输入,揭示了其关注区域、架构差异及语言模型规模对视觉理解影响的重要发现。
English: This study extends heatmap visualization methods to interpret how Large Vision-Language Models (LVLMs) utilize visual inputs for open-ended question answering, revealing key insights into their focus regions, architectural differences, and the impact of language model scale on visual understanding.

Authors:Donggon Jang, Yucheol Cho, Suin Lee, Taehyeon Kim, Dae-Shik Kim
Title: MMR: A Large-scale Benchmark Dataset for Multi-target and Multi-granularity Reasoning Segmentation
Abstract:
The fusion of Large Language Models with vision models is pioneering new possibilities in user-interactive vision-language tasks. A notable application is reasoning segmentation, where models generate pixel-level segmentation masks by comprehending implicit meanings in human instructions. However, seamless human-AI interaction demands more than just object-level recognition; it requires understanding both objects and the functions of their detailed parts, particularly in multi-target scenarios. For example, when instructing a robot to \textit{turn on the TV"}, there could be various ways to accomplish this command. Recognizing multiple objects capable of turning on the TV, such as the TV itself or a remote control (multi-target), provides more flexible options and aids in finding the optimized scenario. Furthermore, understanding specific parts of these objects, like the TV's button or the remote's button (part-level), is important for completing the action. Unfortunately, current reasoning segmentation datasets predominantly focus on a single target object-level reasoning, which limits the detailed recognition of an object's parts in multi-target contexts. To address this gap, we construct a large-scale dataset called Multi-target and Multi-granularity Reasoning (MMR). MMR comprises 194K complex and implicit instructions that consider multi-target, object-level, and part-level aspects, based on pre-existing image-mask sets. This dataset supports diverse and context-aware interactions by hierarchically providing object and part information. Moreover, we propose a straightforward yet effective framework for multi-target, object-level, and part-level reasoning segmentation. Experimental results on MMR show that the proposed method can reason effectively in multi-target and multi-granularity scenarios, while the existing reasoning segmentation model still has room for improvement.
Chinese: 大型语言模型与视觉模型的融合正在推动视觉语言任务的发展,尤其在推理分割领域,但现有数据集缺乏多目标和部件级理解,为此构建了MMR数据集并提出新框架,在复杂场景中展现出更优的推理能力。
English: The integration of Large Language Models with vision models is advancing vision-language tasks, particularly in reasoning segmentation, but current datasets lack multi-target and part-level understanding, leading to the creation of the MMR dataset and a new framework that demonstrates improved performance in these complex scenarios.

Authors:Sunbowen Lee, Yicheng Gong, Chao Deng
Title: Counterfactual experience augmented off-policy reinforcement learning
Abstract:
Reinforcement learning control algorithms face significant challenges due to out-of-distribution and inefficient exploration problems. While model-based reinforcement learning enhances the agent's reasoning and planning capabilities by constructing virtual environments, training such virtual environments can be very complex. In order to build an efficient inference model and enhance the representativeness of learning data, we propose the Counterfactual Experience Augmentation (CEA) algorithm. CEA leverages variational autoencoders to model the dynamic patterns of state transitions and introduces randomness to model non-stationarity. This approach focuses on expanding the learning data in the experience pool through counterfactual inference and performs exceptionally well in environments that follow the bisimulation assumption. Environments with bisimulation properties are usually represented by discrete observation and action spaces, we propose a sampling method based on maximum kernel density estimation entropy to extend CEA to various environments. By providing reward signals for counterfactual state transitions based on real information, CEA constructs a complete counterfactual experience to alleviate the out-of-distribution problem of the learning data, and outperforms general SOTA algorithms in environments with difference properties. Finally, we discuss the similarities, differences and properties of generated counterfactual experiences and real experiences. The code is available at https://github.com/Aegis1863/CEA.
Chinese: 反事实经验增强(CEA)算法通过变分自编码器建模状态转移并生成反事实经验,有效缓解强化学习中的分布外和探索效率问题,在多种环境下均优于现有先进方法。
English: The Counterfactual Experience Augmentation (CEA) algorithm addresses out-of-distribution and exploration challenges in reinforcement learning by using variational autoencoders to model state transitions and generate counterfactual experiences, outperforming state-of-the-art methods across diverse environments.

Authors:Chunlei Li, Yilei Shi, Jingliang Hu, Xiao Xiang Zhu, Lichao Mou
Title: Scale-Aware Contrastive Reverse Distillation for Unsupervised Medical Anomaly Detection
Abstract:
Unsupervised anomaly detection using deep learning has garnered significant research attention due to its broad applicability, particularly in medical imaging where labeled anomalous data are scarce. While earlier approaches leverage generative models like autoencoders and generative adversarial networks (GANs), they often fall short due to overgeneralization. Recent methods explore various strategies, including memory banks, normalizing flows, self-supervised learning, and knowledge distillation, to enhance discrimination. Among these, knowledge distillation, particularly reverse distillation, has shown promise. Following this paradigm, we propose a novel scale-aware contrastive reverse distillation model that addresses two key limitations of existing reverse distillation methods: insufficient feature discriminability and inability to handle anomaly scale variations. Specifically, we introduce a contrastive student-teacher learning approach to derive more discriminative representations by generating and exploring out-of-normal distributions. Further, we design a scale adaptation mechanism to softly weight contrastive distillation losses at different scales to account for the scale variation issue. Extensive experiments on benchmark datasets demonstrate state-of-the-art performance, validating the efficacy of the proposed method. Code is available at https://github.com/MedAITech/SCRD4AD.
Chinese Summary: 本文提出了一种新颖的尺度感知对比反向蒸馏模型,通过对比学习增强特征区分度,并采用自适应加权机制处理异常尺度变化,在基准数据集上实现了最先进的性能。
English Summary: This paper introduces a novel scale-aware contrastive reverse distillation model that enhances anomaly detection by improving feature discriminability through contrastive learning and addressing scale variations with adaptive weighting, achieving state-of-the-art results on benchmark datasets.

Authors:Jinping Wang, Weiwei Song, Hao Chen, Jinchang Ren, Huimin Zhao
Title: FusDreamer: Label-efficient Remote Sensing World Model for Multimodal Data Classification
Abstract:
World models significantly enhance hierarchical understanding, improving data integration and learning efficiency. To explore the potential of the world model in the remote sensing (RS) field, this paper proposes a label-efficient remote sensing world model for multimodal data fusion (FusDreamer). The FusDreamer uses the world model as a unified representation container to abstract common and high-level knowledge, promoting interactions across different types of data, \emph{i.e.}, hyperspectral (HSI), light detection and ranging (LiDAR), and text data. Initially, a new latent diffusion fusion and multimodal generation paradigm (LaMG) is utilized for its exceptional information integration and detail retention capabilities. Subsequently, an open-world knowledge-guided consistency projection (OK-CP) module incorporates prompt representations for visually described objects and aligns language-visual features through contrastive learning. In this way, the domain gap can be bridged by fine-tuning the pre-trained world models with limited samples. Finally, an end-to-end multitask combinatorial optimization (MuCO) strategy can capture slight feature bias and constrain the diffusion process in a collaboratively learnable direction. Experiments conducted on four typical datasets indicate the effectiveness and advantages of the proposed FusDreamer. The corresponding code will be released at https://github.com/Cimy-wang/FusDreamer.
中文: 本文提出FusDreamer,一种标签高效的遥感世界模型,通过潜在扩散融合和知识引导的一致性整合多模态数据,以弥合领域差距并用有限样本提升学习效果。
English: This paper introduces FusDreamer, a label-efficient remote sensing world model that integrates multimodal data through latent diffusion fusion and knowledge-guided consistency to bridge domain gaps and enhance learning with limited samples.

Authors:Ali Mollaahmadi Dehaghi, Hossein KhademSohi, Reza Razavi, Steve Drew, Mohammad Moshirpour
Title: FedVSR: Towards Model-Agnostic Federated Learning in Video Super-Resolution
Abstract:
Video super-resolution aims to enhance low-resolution videos by leveraging both spatial and temporal information. While deep learning has led to impressive progress, it typically requires centralized data, which raises privacy concerns. Federated learning offers a privacy-friendly solution, but general FL frameworks often struggle with low-level vision tasks, resulting in blurry, low-quality outputs. To address this, we introduce FedVSR, the first FL framework specifically designed for VSR. It is model-agnostic and stateless, and introduces a lightweight loss function based on the DWT to better preserve high-frequency details during local training. Additionally, a loss-aware aggregation strategy combines both DWT-based and task-specific losses to guide global updates effectively. Extensive experiments across multiple VSR models and datasets demonstrate that FedVSR consistently outperforms existing FL methods, achieving up to 0.82 dB higher PSNR, 0.0327 higher SSIM, and 0.0251 lower LPIPS. These results underscore FedVSR's ability to bridge the gap between privacy and performance, setting a new benchmark for federated learning in low-level vision tasks. The code is available at: https://github.com/alimd94/FedVSR
Chinese: FedVSR是首个专为视频超分辨率设计的联邦学习框架,通过基于离散小波变换的轻量损失函数和损失感知聚合策略,在保护隐私的同时显著提升了图像质量,性能优于现有方法。
English: FedVSR is a novel federated learning framework tailored for video super-resolution that enhances privacy and performance by using a DWT-based loss function and loss-aware aggregation, achieving superior results over existing methods.

Authors:Keqi Chen, Vinkle Srivastav, Didier Mutter, Nicolas Padoy
Title: Learning from Synchronization: Self-Supervised Uncalibrated Multi-View Person Association in Challenging Scenes
Abstract:
Multi-view person association is a fundamental step towards multi-view analysis of human activities. Although the person re-identification features have been proven effective, they become unreliable in challenging scenes where persons share similar appearances. Therefore, cross-view geometric constraints are required for a more robust association. However, most existing approaches are either fully-supervised using ground-truth identity labels or require calibrated camera parameters that are hard to obtain. In this work, we investigate the potential of learning from synchronization, and propose a self-supervised uncalibrated multi-view person association approach, Self-MVA, without using any annotations. Specifically, we propose a self-supervised learning framework, consisting of an encoder-decoder model and a self-supervised pretext task, cross-view image synchronization, which aims to distinguish whether two images from different views are captured at the same time. The model encodes each person's unified geometric and appearance features, and we train it by utilizing synchronization labels for supervision after applying Hungarian matching to bridge the gap between instance-wise and image-wise distances. To further reduce the solution space, we propose two types of self-supervised linear constraints: multi-view re-projection and pairwise edge association. Extensive experiments on three challenging public benchmark datasets (WILDTRACK, MVOR, and SOLDIERS) show that our approach achieves state-of-the-art results, surpassing existing unsupervised and fully-supervised approaches. Code is available at https://github.com/CAMMA-public/Self-MVA.
中文摘要:本文提出Self-MVA方法,通过跨视角图像同步的自监督学习实现无需标注或相机标定的多视角行人关联,在多个基准数据集上达到最优性能。
English Summary: This paper introduces Self-MVA, a self-supervised approach for multi-view person association that learns from cross-view synchronization without annotations or camera calibration, achieving state-of-the-art performance on benchmark datasets.

Authors:Yushan Jiang, Kanghui Ning, Zijie Pan, Xuyang Shen, Jingchao Ni, Wenchao Yu, Anderson Schneider, Haifeng Chen, Yuriy Nevmyvaka, Dongjin Song
Title: Multi-modal Time Series Analysis: A Tutorial and Survey
Abstract:
Multi-modal time series analysis has recently emerged as a prominent research area in data mining, driven by the increasing availability of diverse data modalities, such as text, images, and structured tabular data from real-world sources. However, effective analysis of multi-modal time series is hindered by data heterogeneity, modality gap, misalignment, and inherent noise. Recent advancements in multi-modal time series methods have exploited the multi-modal context via cross-modal interactions based on deep learning methods, significantly enhancing various downstream tasks. In this tutorial and survey, we present a systematic and up-to-date overview of multi-modal time series datasets and methods. We first state the existing challenges of multi-modal time series analysis and our motivations, with a brief introduction of preliminaries. Then, we summarize the general pipeline and categorize existing methods through a unified cross-modal interaction framework encompassing fusion, alignment, and transference at different levels (\textit{i.e.}, input, intermediate, output), where key concepts and ideas are highlighted. We also discuss the real-world applications of multi-modal analysis for both standard and spatial time series, tailored to general and specific domains. Finally, we discuss future research directions to help practitioners explore and exploit multi-modal time series. The up-to-date resources are provided in the GitHub repository: https://github.com/UConn-DSIS/Multi-modal-Time-Series-Analysis
中文:本教程系统概述了多模态时间序列分析,通过跨模态交互方法解决数据异构性和模态差异等挑战,同时探讨了实际应用和未来研究方向。
English: This tutorial provides a systematic overview of multi-modal time series analysis, addressing challenges like data heterogeneity and modality gaps through cross-modal interaction methods, while also exploring applications and future research directions.

Authors:Sai Coumar, Gilbert Chang, Nihar Kodkani, Zachary Kingston
Title: Foam: A Tool for Spherical Approximation of Robot Geometry
Abstract:
Many applications in robotics require primitive spherical geometry, especially in cases where efficient distance queries are necessary. Manual creation of spherical models is time-consuming and prone to errors. This paper presents Foam, a tool to generate spherical approximations of robot geometry from an input Universal Robot Description Format (URDF) file. Foam provides a robust preprocessing pipeline to handle mesh defects and a number of configuration parameters to control the level and approximation of the spherization, and generates an output URDF with collision geometry specified only by spheres. We demonstrate Foam on a number of standard robot models on common tasks, and demonstrate improved collision checking and distance query performance with only a minor loss in fidelity compared to the true collision geometry. We release our tool as an open source Python library and containerized command-line application to facilitate adoption across the robotics community.
中文: 本文提出Foam工具,可通过URDF文件自动生成机器人几何的球形近似模型,在保证精度的同时显著提升碰撞检测与距离查询效率,并开源发布以促进广泛应用。
English: This paper introduces Foam, an open-source tool that automatically generates spherical approximations of robot geometry from URDF files to enhance collision checking and distance query efficiency with minimal fidelity loss.

Authors:Maan Qraitem, Piotr Teterwak, Kate Saenko, Bryan A. Plummer
Title: Web Artifact Attacks Disrupt Vision Language Models
Abstract:
Vision-language models (VLMs) (e.g. CLIP, LLaVA) are trained on large-scale, lightly curated web datasets, leading them to learn unintended correlations between semantic concepts and unrelated visual signals. These associations degrade model accuracy by causing predictions to rely on incidental patterns rather than genuine visual understanding. Prior work has weaponized these correlations as an attack vector to manipulate model predictions, such as inserting a deceiving class text onto the image in a "typographic" attack. These attacks succeed due to VLMs' text-heavy bias-a result of captions that echo visible words rather than describing content. However, this attack has focused solely on text that matches the target class exactly, overlooking a broader range of correlations, including non-matching text and graphical symbols, which arise from the abundance of branding content in web-scale data. To address this gap, we introduce "artifact-based" attacks: a novel class of manipulations that mislead models using both non-matching text and graphical elements. Unlike typographic attacks, these artifacts are not predefined, making them simultaneously harder to defend against and more challenging to find. We address this by framing artifact attacks as a search problem and demonstrate their effectiveness across five datasets, with some artifacts reinforcing each other to reach 100% attack success rates. These attacks transfer across models with up to 90% effectiveness, making it possible to attack unseen models. To defend against these attacks, we extend prior work's artifact aware prompting to the graphical setting. We see a moderate reduction of success rates of up to 15% relative to standard prompts, suggesting a promising direction for enhancing model robustness. Code: https://github.com/mqraitem/Web-Artifact-Attacks
中文: 视觉语言模型易受新型“基于伪影”攻击的影响,这些攻击利用网络数据中的意外关联,通过不匹配文本和图形元素误导预测,并在不同模型间具有高迁移性。
English: Vision-language models are vulnerable to novel "artifact-based" attacks that exploit unintended correlations from web data, using both non-matching text and graphical elements to mislead predictions with high transferability across models.

Authors:Chiara Plizzari, Alessio Tonioni, Yongqin Xian, Achin Kulshrestha, Federico Tombari
Title: Omnia de EgoTempo: Benchmarking Temporal Understanding of Multi-Modal LLMs in Egocentric Videos
Abstract:
Understanding fine-grained temporal dynamics is crucial in egocentric videos, where continuous streams capture frequent, close-up interactions with objects. In this work, we bring to light that current egocentric video question-answering datasets often include questions that can be answered using only few frames or commonsense reasoning, without being necessarily grounded in the actual video. Our analysis shows that state-of-the-art Multi-Modal Large Language Models (MLLMs) on these benchmarks achieve remarkably high performance using just text or a single frame as input. To address these limitations, we introduce EgoTempo, a dataset specifically designed to evaluate temporal understanding in the egocentric domain. EgoTempo emphasizes tasks that require integrating information across the entire video, ensuring that models would need to rely on temporal patterns rather than static cues or pre-existing knowledge. Extensive experiments on EgoTempo show that current MLLMs still fall short in temporal reasoning on egocentric videos, and thus we hope EgoTempo will catalyze new research in the field and inspire models that better capture the complexity of temporal dynamics. Dataset and code are available at https://github.com/google-research-datasets/egotempo.git.
中文: 现有第一人称视频问答数据集和模型常忽略真实时序推理,因此作者提出EgoTempo新数据集,要求整合全视频信息以评估和改进多模态大语言模型的时序理解能力。
English: Current egocentric video QA datasets and models often bypass true temporal reasoning, so the authors introduce EgoTempo, a new dataset requiring full-video integration to assess and advance temporal understanding in MLLMs.

Authors:Shiran Yuan, Hao Zhao
Title: Next-Scale Autoregressive Models are Zero-Shot Single-Image Object View Synthesizers
Abstract:
Methods based on diffusion backbones have recently revolutionized novel view synthesis (NVS). However, those models require pretrained 2D diffusion checkpoints (e.g., Stable Diffusion) as the basis for geometrical priors. Since such checkpoints require exorbitant amounts of data and compute to train, this greatly limits the scalability of diffusion-based NVS models. We present Next-Scale Autoregression Conditioned by View (ArchonView), a method that significantly exceeds state-of-the-art methods despite being trained from scratch with 3D rendering data only and no 2D pretraining. We achieve this by incorporating both global (pose-augmented semantics) and local (multi-scale hierarchical encodings) conditioning into a backbone based on the next-scale autoregression paradigm. Our model also exhibits robust performance even for difficult camera poses where previous methods fail, and is several times faster in inference speed compared to diffusion. We experimentally verify that performance scales with model and dataset size, and conduct extensive demonstration of our method's synthesis quality across several tasks. Our code is open-sourced at https://github.com/Shiran-Yuan/ArchonView.
中文: ArchonView提出了一种无需二维预训练的新视角合成方法,通过结合全局与局部条件的下一尺度自回归技术,在性能和推理速度上均超越了现有最佳方法。
English: ArchonView introduces a novel view synthesis method that surpasses state-of-the-art techniques by using next-scale autoregression with global and local conditioning, achieving superior performance without 2D pretraining and offering faster inference.

Authors:Dingkang Liang, Dingyuan Zhang, Xin Zhou, Sifan Tu, Tianrui Feng, Xiaofan Li, Yumeng Zhang, Mingyang Du, Xiao Tan, Xiang Bai
Title: Seeing the Future, Perceiving the Future: A Unified Driving World Model for Future Generation and Perception
Abstract:
We present UniFuture, a simple yet effective driving world model that seamlessly integrates future scene generation and perception within a single framework. Unlike existing models focusing solely on pixel-level future prediction or geometric reasoning, our approach jointly models future appearance (i.e., RGB image) and geometry (i.e., depth), ensuring coherent predictions. Specifically, during the training, we first introduce a Dual-Latent Sharing scheme, which transfers image and depth sequence in a shared latent space, allowing both modalities to benefit from shared feature learning. Additionally, we propose a Multi-scale Latent Interaction mechanism, which facilitates bidirectional refinement between image and depth features at multiple spatial scales, effectively enhancing geometry consistency and perceptual alignment. During testing, our UniFuture can easily predict high-consistency future image-depth pairs by only using the current image as input. Extensive experiments on the nuScenes dataset demonstrate that UniFuture outperforms specialized models on future generation and perception tasks, highlighting the advantages of a unified, structurally-aware world model. The project page is at https://github.com/dk-liang/UniFuture.
中文: UniFuture是一种统一的驾驶世界模型,通过双潜在共享和多尺度交互机制联合预测未来RGB图像和深度图,在nuScenes数据集上实现了优于专用模型的生成与感知性能。
English: UniFuture is a unified driving world model that jointly predicts future RGB images and depth maps through dual-latent sharing and multi-scale interaction, achieving superior performance in generation and perception tasks on nuScenes dataset.

Authors:Pingyu Wu, Daiheng Gao, Jing Tang, Huimin Chen, Wenbo Zhou, Weiming Zhang, Nenghai Yu
Title: MES-RAG: Bringing Multi-modal, Entity-Storage, and Secure Enhancements to RAG
Abstract:
Retrieval-Augmented Generation (RAG) improves Large Language Models (LLMs) by using external knowledge, but it struggles with precise entity information retrieval. In this paper, we proposed MES-RAG framework, which enhances entity-specific query handling and provides accurate, secure, and consistent responses. MES-RAG introduces proactive security measures that ensure system integrity by applying protections prior to data access. Additionally, the system supports real-time multi-modal outputs, including text, images, audio, and video, seamlessly integrating into existing RAG architectures. Experimental results demonstrate that MES-RAG significantly improves both accuracy and recall, highlighting its effectiveness in advancing the security and utility of question-answering, increasing accuracy to 0.83 (+0.25) on targeted task. Our code and data are available at https://github.com/wpydcr/MES-RAG.
中文:MES-RAG框架通过增强实体查询处理能力、采用主动安全措施及支持实时多模态输出,显著提升了检索增强生成系统的准确性和召回率,有效推进问答系统的安全性与实用性。
English: The MES-RAG framework enhances Retrieval-Augmented Generation by improving entity-specific query handling with proactive security measures and real-time multi-modal outputs, significantly boosting accuracy and recall in question-answering systems.

Authors:Lin-Han Jia, Lan-Zhe Guo, Zhi Zhou, Si-Ye Han, Zi-Wen Li, Yu-Feng Li
Title: Achieving Unbiased Multi-Instance Learning via Balanced Fine-Grained Positive-Unlabeled Learning
Abstract:
In real-world applications, it is often challenging to detect anomalous samples when the anomalous information they contain is extremely limited. In such cases, both macro-level and micro-level detection using multi-instance learning (MIL) encounter significant difficulties. The former struggles because normal and anomalous samples are highly similar and hard to distinguish at the macro level, while the latter is limited by the lack of labels at the micro level. In MIL, micro-level labels are inferred from macro-level labels, which can lead to severe bias. Moreover, the more imbalanced the distribution between normal and anomalous samples, the more pronounced these limitations become. In this study, we observe that the MIL problem can be elegantly transformed into a fine-grained Positive-Unlabeled (PU) learning problem. This transformation allows us to address the imbalance issue in an unbiased manner using a micro-level balancing mechanism. To this end, we propose a novel framework-Balanced Fine-Grained Positive-Unlabeled (BFGPU)-based on rigorous theoretical foundations to address the challenges above. Extensive experiments on both public and real-world datasets demonstrate the effectiveness of BFGPU, which outperforms existing methods, even in extreme scenarios where both macro and micro-level distributions are highly imbalanced. The code is open-sourced at https://github.com/BFGPU/BFGPU.
中文: 本研究通过将多示例学习重新定义为细粒度PU学习问题,提出了基于严格理论基础的BFGPU框架,利用微观平衡机制有效解决了异常样本稀缺的双重不平衡问题,并在合成和真实数据集上验证了其有效性。
English: The study tackles the dual imbalance in Multi-Instance Learning by reframing it as a fine-grained PU learning problem and introduces the BFGPU framework, which effectively addresses the scarcity of anomalies through micro-level balancing mechanisms, as validated by experiments on synthetic and real-world datasets.

Authors:Zhaodong Wu, Qiaochu Zhao, Ming Hu, Yulong Li, Haochen Xue, Kang Dang, Zhengyong Jiang, Angelos Stefanidis, Qiufeng Wang, Imran Razzak, Zongyuan Ge, Junjun He, Yu Qiao, Zhong Zheng, Feilong Tang, Jionglong Su
Title: MSWAL: 3D Multi-class Segmentation of Whole Abdominal Lesions Dataset
Abstract:
With the significantly increasing incidence and prevalence of abdominal diseases, there is a need to embrace greater use of new innovations and technology for the diagnosis and treatment of patients. Although deep-learning methods have notably been developed to assist radiologists in diagnosing abdominal diseases, existing models have the restricted ability to segment common lesions in the abdomen due to missing annotations for typical abdominal pathologies in their training datasets. To address the limitation, we introduce MSWAL, the first 3D Multi-class Segmentation of the Whole Abdominal Lesions dataset, which broadens the coverage of various common lesion types, such as gallstones, kidney stones, liver tumors, kidney tumors, pancreatic cancer, liver cysts, and kidney cysts. With CT scans collected from 694 patients (191,417 slices) of different genders across various scanning phases, MSWAL demonstrates strong robustness and generalizability. The transfer learning experiment from MSWAL to two public datasets, LiTS and KiTS, effectively demonstrates consistent improvements, with Dice Similarity Coefficient (DSC) increase of 3.00% for liver tumors and 0.89% for kidney tumors, demonstrating that the comprehensive annotations and diverse lesion types in MSWAL facilitate effective learning across different domains and data distributions. Furthermore, we propose Inception nnU-Net, a novel segmentation framework that effectively integrates an Inception module with the nnU-Net architecture to extract information from different receptive fields, achieving significant enhancement in both voxel-level DSC and region-level F1 compared to the cutting-edge public algorithms on MSWAL. Our dataset will be released after being accepted, and the code is publicly released at https://github.com/tiuxuxsh76075/MSWAL-.
中文: MSWAL数据集通过提供多种常见腹部病变的全面3D标注,解决了现有模型因训练数据标注缺失导致的病灶分割能力受限问题,同时提出的Inception nnU-Net框架通过多尺度特征整合实现了更优的分割性能。
English: The MSWAL dataset addresses the limitation of existing models in segmenting abdominal lesions by providing comprehensive 3D annotations for various common pathologies, while the proposed Inception nnU-Net framework demonstrates superior segmentation performance through multi-scale feature integration.

Authors:Jingyuan Xue, Longfei Wei, Dongjing Jiang, Fang Sheng, Russell Greiner, Jianfei Zhang
Title: Survival Analysis with Machine Learning for Predicting Li-ion Battery Remaining Useful Life
Abstract:
Battery degradation significantly impacts the reliability and efficiency of energy storage systems, particularly in electric vehicles and industrial applications. Predicting the remaining useful life (RUL) of lithium-ion batteries is crucial for optimizing maintenance schedules, reducing costs, and improving safety. Traditional RUL prediction methods often struggle with nonlinear degradation patterns and uncertainty quantification. To address these challenges, we propose a hybrid survival analysis framework integrating survival data reconstruction, survival model learning, and survival probability estimation. Our approach transforms battery voltage time series into time-to-failure data using path signatures. The multiple Cox-based survival models and machine-learning-based methods, such as DeepHit and MTLR, are learned to predict battery failure-free probabilities over time. Experiments conducted on the Toyota battery and NASA battery datasets demonstrate the effectiveness of our approach, achieving high time-dependent AUC and concordance index (C-Index) while maintaining a low integrated Brier score. The data and source codes for this work are available to the public at https://github.com/thinkxca/rul.
Chinese: 本研究提出的混合生存分析框架通过将电压数据转化为失效时间指标并采用多种生存模型,有效预测锂离子电池的剩余使用寿命,在基准数据集上展现出高精度指标的优越性能。
English: The proposed hybrid survival analysis framework effectively predicts lithium-ion battery remaining useful life by transforming voltage data into time-to-failure metrics and employing multiple survival models, demonstrating superior performance on benchmark datasets with high accuracy scores.

Authors:Juhyeong Kim, Sungyoon Choi, Youngbin Lee, Yejin Kim, Yongmin Choi, Yongjae Lee
Title: Decision by Supervised Learning with Deep Ensembles: A Practical Framework for Robust Portfolio Optimization
Abstract:
We propose Decision by Supervised Learning (DSL), a practical framework for robust portfolio optimization. DSL reframes portfolio construction as a supervised learning problem: models are trained to predict optimal portfolio weights, using cross-entropy loss and portfolios constructed by maximizing the Sharpe or Sortino ratio. To further enhance stability and reliability, DSL employs Deep Ensemble methods, substantially reducing variance in portfolio allocations. Through comprehensive backtesting across diverse market universes and neural architectures, shows superior performance compared to both traditional strategies and leading machine learning-based methods, including Prediction-Focused Learning and End-to-End Learning. We show that increasing the ensemble size leads to higher median returns and more stable risk-adjusted performance. The code is available at https://github.com/DSLwDE/DSLwDE.
中文: 决策监督学习(DSL)框架将投资组合优化转化为监督学习问题,采用交叉熵损失和深度集成方法,通过广泛回测显示出优于传统策略和主流机器学习方法的性能。
English: The Decision by Supervised Learning (DSL) framework transforms portfolio optimization into a supervised learning task using cross-entropy loss and Deep Ensemble methods, demonstrating superior performance over traditional and machine learning approaches through extensive backtesting.

Authors:Juhyeong Kim, Sungyoon Choi, Youngbin Lee, Yejin Kim, Yongmin Choi, Yongjae Lee
Title: Decision by Supervised Learning with Deep Ensembles: A Practical Framework for Robust Portfolio Optimization
Abstract:
We propose Decision by Supervised Learning (DSL), a practical framework for robust portfolio optimization. DSL reframes portfolio construction as a supervised learning problem: models are trained to predict optimal portfolio weights, using cross-entropy loss and portfolios constructed by maximizing the Sharpe or Sortino ratio. To further enhance stability and reliability, DSL employs Deep Ensemble methods, substantially reducing variance in portfolio allocations. Through comprehensive backtesting across diverse market universes and neural architectures, shows superior performance compared to both traditional strategies and leading machine learning-based methods, including Prediction-Focused Learning and End-to-End Learning. We show that increasing the ensemble size leads to higher median returns and more stable risk-adjusted performance. The code is available at https://github.com/DSLwDE/DSLwDE.
中文: 决策监督学习(DSL)框架将投资组合优化转化为监督学习问题,采用交叉熵损失和深度集成方法,通过广泛回测显示出优于传统策略和主流机器学习方法的性能。
English: The Decision by Supervised Learning (DSL) framework transforms portfolio optimization into a supervised learning task using cross-entropy loss and Deep Ensemble methods, demonstrating superior performance over traditional and machine learning approaches through extensive backtesting.

Authors:Ananya Agarwal, Fnu Alusi, Arbie Hsu, Arif Syraj, Ellen Veomett
Title: States of Disarray: Cleaning Data for Gerrymandering Analysis
Abstract:
The mathematics of redistricting is an area of study that has exploded in recent years. In particular, many different research groups and expert witnesses in court cases have used outlier analysis to argue that a proposed map is a gerrymander. This outlier analysis relies on having an ensemble of potential redistricting maps against which the proposed map is compared. Arguably the most widely-accepted method of creating such an ensemble is to use a Markov Chain Monte Carlo (MCMC) process. This process requires that various pieces of data be gathered, cleaned, and coalesced into a single file that can be used as the seed of the MCMC process. In this article, we describe how we have begun this cleaning process for each state, and made the resulting data available for the public at https://github.com/eveomett-states . At the time of submission, we have data for 22 states available for researchers, students, and the general public to easily access and analyze. We will continue the data cleaning process for each state, and we hope that the availability of these datasets will both further research in this area, and increase the public's interest in and understanding of modern techniques to detect gerrymandering.
Chinese: 本文介绍了为22个州清理并公开重划选区数据的过程,旨在通过马尔可夫链蒙特卡洛方法进行异常分析,以推动研究和提高公众对识别不公正划分选区的认识。
English: This article details the process of cleaning and publicly releasing redistricting data for 22 states to facilitate outlier analysis using Markov Chain Monte Carlo methods, aiming to advance research and public awareness in detecting gerrymandering.

Authors:Hao Cui, Zahra Shamsi, Gowoon Cheon, Xuejian Ma, Shutong Li, Maria Tikhanovskaya, Peter Norgaard, Nayantara Mudur, Martyna Plomecka, Paul Raccuglia, Yasaman Bahri, Victor V. Albert, Pranesh Srinivasan, Haining Pan, Philippe Faist, Brian Rohr, Ekin Dogus Cubuk, Muratahan Aykol, Amil Merchant, Michael J. Statt, Dan Morris, Drew Purves, Elise Kleeman, Ruth Alcantara, Matthew Abraham, Muqthar Mohammad, Ean Phing VanLee, Chenfei Jiang, Elizabeth Dorfman, Eun-Ah Kim, Michael P Brenner, Viren Jain, Sameera Ponda, Subhashini Venugopalan
Title: CURIE: Evaluating LLMs On Multitask Scientific Long Context Understanding and Reasoning
Abstract:
Scientific problem-solving involves synthesizing information while applying expert knowledge. We introduce CURIE, a scientific long-Context Understanding,Reasoning and Information Extraction benchmark to measure the potential of Large Language Models (LLMs) in scientific problem-solving and assisting scientists in realistic workflows. This benchmark introduces ten challenging tasks with a total of 580 problems and solution pairs curated by experts in six disciplines - materials science, condensed matter physics, quantum computing, geospatial analysis, biodiversity, and proteins - covering both experimental and theoretical work-flows in science. We evaluate a range of closed and open LLMs on tasks in CURIE which requires domain expertise, comprehension of long in-context information,and multi-step reasoning. While Gemini Flash 2.0 and Claude-3 show consistent high comprehension across domains, the popular GPT-4o and command-R+ fail dramatically on protein sequencing tasks. With the best performance at 32% there is much room for improvement for all models. We hope that insights gained from CURIE can guide the future development of LLMs in sciences. Evaluation code and data are in https://github.com/google/curie
Chinese: CURIE 是一个科学基准,旨在评估大语言模型在六个学科领域的十项挑战性任务中的理解、推理和信息提取能力,结果显示尽管部分模型表现稳定,但整体仍有巨大提升空间。
English: CURIE is a scientific benchmark designed to evaluate Large Language Models' capabilities in understanding, reasoning, and extracting information across ten challenging tasks in six disciplines, revealing significant room for improvement despite some models showing consistent comprehension.

Authors:Jia Xu, Tianyi Wei, Bojian Hou, Patryk Orzechowski, Shu Yang, Ruochen Jin, Rachael Paulbeck, Joost Wagenaar, George Demiris, Li Shen
Title: MentalChat16K: A Benchmark Dataset for Conversational Mental Health Assistance
Abstract:
We introduce MentalChat16K, an English benchmark dataset combining a synthetic mental health counseling dataset and a dataset of anonymized transcripts from interventions between Behavioral Health Coaches and Caregivers of patients in palliative or hospice care. Covering a diverse range of conditions like depression, anxiety, and grief, this curated dataset is designed to facilitate the development and evaluation of large language models for conversational mental health assistance. By providing a high-quality resource tailored to this critical domain, MentalChat16K aims to advance research on empathetic, personalized AI solutions to improve access to mental health support services. The dataset prioritizes patient privacy, ethical considerations, and responsible data usage. MentalChat16K presents a valuable opportunity for the research community to innovate AI technologies that can positively impact mental well-being. The dataset is available at https://huggingface.co/datasets/ShenLab/MentalChat16K and the code and documentation are hosted on GitHub at https://github.com/ChiaPatricia/MentalChat16K.
中文: MentalChat16K是一个结合合成与匿名心理健康咨询记录的英文数据集,旨在推动共情式对话助手的AI技术发展,同时严格保障数据隐私与伦理规范。
English: MentalChat16K is a specialized English dataset combining synthetic and anonymized mental health counseling transcripts, designed to advance AI development for empathetic conversational assistance while ensuring privacy and ethical data use.

Authors:Mohammod N. I. Suvon, Shuo Zhou, Prasun C. Tripathi, Wenrui Fan, Samer Alabed, Bishesh Khanal, Venet Osmani, Andrew J. Swift, Chen, Chen, Haiping Lu
Title: Multimodal Latent Fusion of ECG Leads for Early Assessment of Pulmonary Hypertension
Abstract:
Recent advancements in early assessment of pulmonary hypertension (PH) primarily focus on applying machine learning methods to centralized diagnostic modalities, such as 12-lead electrocardiogram (12L-ECG). Despite their potential, these approaches fall short in decentralized clinical settings, e.g., point-of-care and general practice, where handheld 6-lead ECG (6L-ECG) can offer an alternative but is limited by the scarcity of labeled data for developing reliable models. To address this, we propose a lead-specific electrocardiogram multimodal variational autoencoder (\textsc{LS-EMVAE}), which incorporates a hierarchical modality expert (HiME) fusion mechanism and a latent representation alignment loss. HiME combines mixture-of-experts and product-of-experts to enable flexible, adaptive latent fusion, while the alignment loss improves coherence among lead-specific and shared representations. To alleviate data scarcity and enhance representation learning, we adopt a transfer learning strategy: the model is first pre-trained on a large unlabeled 12L-ECG dataset and then fine-tuned on smaller task-specific labeled 6L-ECG datasets. We validate \textsc{LS-EMVAE} across two retrospective cohorts in a 6L-ECG setting: 892 subjects from the ASPIRE registry for (1) PH detection and (2) phenotyping pre-/post-capillary PH, and 16,416 subjects from UK Biobank for (3) predicting elevated pulmonary atrial wedge pressure, where it consistently outperforms unimodal and multimodal baseline methods and demonstrates strong generalizability and interpretability. The code is available at https://github.com/Shef-AIRE/LS-EMVAE.
中文: 针对分散式临床环境中6导联心电图标记数据稀缺的问题,本研究提出LS-EMVAE模型,通过分层专家融合机制和迁移学习策略,在肺动脉高压检测与分型任务中显著优于现有方法,并展现出优异的泛化能力。
English: Recent machine learning methods for pulmonary hypertension assessment using 12-lead ECG face limitations in decentralized settings with 6-lead ECG due to scarce labeled data, prompting the development of LS-EMVAE, a multimodal variational autoencoder that employs transfer learning and hierarchical fusion to enhance performance and generalizability across clinical tasks.

Authors:Noah Y. Siegel, Nicolas Heess, Maria Perez-Ortiz, Oana-Maria Camburu
Title: Verbosity Tradeoffs and the Impact of Scale on the Faithfulness of LLM Self-Explanations
Abstract:
When asked to explain their decisions, LLMs can often give explanations which sound plausible to humans. But are these explanations faithful, i.e. do they convey the factors actually responsible for the decision? In this work, we analyse counterfactual faithfulness across 75 models from 13 families. We analyze the tradeoff between conciseness and comprehensiveness, how correlational faithfulness metrics assess this tradeoff, and the extent to which metrics can be gamed. This analysis motivates two new metrics: the phi-CCT, a simplified variant of the Correlational Counterfactual Test (CCT) which avoids the need for token probabilities while explaining most of the variance of the original test; and F-AUROC, which eliminates sensitivity to imbalanced intervention distributions and captures a model's ability to produce explanations with different levels of detail. Our findings reveal a clear scaling trend: larger and more capable models are consistently more faithful on all metrics we consider. Our code is available at https://github.com/google-deepmind/corr_faith.
中文摘要:本研究评估了大语言模型解释的忠实性,提出了两个新指标,发现模型规模越大,在所有测试指标上表现出的忠实性越高。
English Summary: This study evaluates the faithfulness of explanations provided by large language models, introducing two new metrics that reveal a clear scaling trend where larger models demonstrate greater faithfulness across all tested metrics.

Authors:Haoyang Li, Liang Wang, Chao Wang, Jing Jiang, Yan Peng, Guodong Long
Title: DPC: Dual-Prompt Collaboration for Tuning Vision-Language Models
Abstract:
The Base-New Trade-off (BNT) problem universally exists during the optimization of CLIP-based prompt tuning, where continuous fine-tuning on base (target) classes leads to a simultaneous decrease of generalization ability on new (unseen) classes. Existing approaches attempt to regulate the prompt tuning process to balance BNT by appending constraints. However, imposed on the same target prompt, these constraints fail to fully avert the mutual exclusivity between the optimization directions for base and new. As a novel solution to this challenge, we propose the plug-and-play Dual-Prompt Collaboration (DPC) framework, the first that decoupling the optimization processes of base and new tasks at the prompt level. Specifically, we clone a learnable parallel prompt based on the backbone prompt, and introduce a variable Weighting-Decoupling framework to independently control the optimization directions of dual prompts specific to base or new tasks, thus avoiding the conflict in generalization. Meanwhile, we propose a Dynamic Hard Negative Optimizer, utilizing dual prompts to construct a more challenging optimization task on base classes for enhancement. For interpretability, we prove the feature channel invariance of the prompt vector during the optimization process, providing theoretical support for the Weighting-Decoupling of DPC. Extensive experiments on multiple backbones demonstrate that DPC can significantly improve base performance without introducing any external knowledge beyond the base classes, while maintaining generalization to new classes. Code is available at: https://github.com/JREion/DPC.
Chinese: 提出的双提示协作(DPC)框架通过解耦基础任务和新任务的优化过程,解决了CLIP提示调优中的基础-新任务权衡问题,在无需外部知识的情况下提升基础类性能的同时保持了对未知类的泛化能力。
English: The proposed Dual-Prompt Collaboration (DPC) framework addresses the Base-New Trade-off problem in CLIP-based prompt tuning by decoupling optimization processes for base and new tasks, enhancing base performance while preserving generalization to unseen classes without external knowledge.

Authors:Yingyue Li, Bencheng Liao, Wenyu Liu, Xinggang Wang
Title: MaTVLM: Hybrid Mamba-Transformer for Efficient Vision-Language Modeling
Abstract:
With the advancement of RNN models with linear complexity, the quadratic complexity challenge of transformers has the potential to be overcome. Notably, the emerging Mamba-2 has demonstrated competitive performance, bridging the gap between RNN models and transformers. However, due to sequential processing and vanishing gradients, RNN models struggle to capture long-range dependencies, limiting contextual understanding. This results in slow convergence, high resource demands, and poor performance on downstream understanding and complex reasoning tasks. In this work, we present a hybrid model MaTVLM by substituting a portion of the transformer decoder layers in a pre-trained VLM with Mamba-2 layers. Leveraging the inherent relationship between attention and Mamba-2, we initialize Mamba-2 with corresponding attention weights to accelerate convergence. Subsequently, we employ a single-stage distillation process, using the pre-trained VLM as the teacher model to transfer knowledge to the MaTVLM, further enhancing convergence speed and performance. Furthermore, we investigate the impact of differential distillation loss within our training framework. We evaluate the MaTVLM on multiple benchmarks, demonstrating competitive performance against the teacher model and existing VLMs while surpassing both Mamba-based VLMs and models of comparable parameter scales. Remarkably, the MaTVLM achieves up to 3.6x faster inference than the teacher model while reducing GPU memory consumption by 27.5%, all without compromising performance. Code and models are released at http://github.com/hustvl/MaTVLM.
中文:MaTVLM混合模型将Mamba-2层融入预训练视觉语言Transformer中,通过权重初始化和知识蒸馏技术,在保持性能的同时实现了更快的推理速度和更低的内存消耗。
English: The MaTVLM hybrid model integrates Mamba-2 layers into a pre-trained visual-language transformer, achieving competitive performance with faster inference and lower memory usage through weight initialization and knowledge distillation.

Authors:Ling Yang, Kaixin Zhu, Juanxi Tian, Bohan Zeng, Mingbao Lin, Hongjuan Pei, Wentao Zhang, Shuicheng Yan
Title: WideRange4D: Enabling High-Quality 4D Reconstruction with Wide-Range Movements and Scenes
Abstract:
With the rapid development of 3D reconstruction technology, research in 4D reconstruction is also advancing, existing 4D reconstruction methods can generate high-quality 4D scenes. However, due to the challenges in acquiring multi-view video data, the current 4D reconstruction benchmarks mainly display actions performed in place, such as dancing, within limited scenarios. In practical scenarios, many scenes involve wide-range spatial movements, highlighting the limitations of existing 4D reconstruction datasets. Additionally, existing 4D reconstruction methods rely on deformation fields to estimate the dynamics of 3D objects, but deformation fields struggle with wide-range spatial movements, which limits the ability to achieve high-quality 4D scene reconstruction with wide-range spatial movements. In this paper, we focus on 4D scene reconstruction with significant object spatial movements and propose a novel 4D reconstruction benchmark, WideRange4D. This benchmark includes rich 4D scene data with large spatial variations, allowing for a more comprehensive evaluation of the generation capabilities of 4D generation methods. Furthermore, we introduce a new 4D reconstruction method, Progress4D, which generates stable and high-quality 4D results across various complex 4D scene reconstruction tasks. We conduct both quantitative and qualitative comparison experiments on WideRange4D, showing that our Progress4D outperforms existing state-of-the-art 4D reconstruction methods. Project: https://github.com/Gen-Verse/WideRange4D
中文摘要:本文针对现有4D重建方法难以处理大范围空间运动的局限性,提出了包含丰富空间变化数据的WideRange4D新基准和Progress4D新方法,在复杂4D场景重建任务中实现了更优性能。
English Summary: This paper introduces WideRange4D, a novel benchmark addressing the limitations of existing 4D reconstruction methods in handling large spatial movements, and proposes Progress4D, a new method that achieves superior performance in complex 4D scene reconstruction tasks.

Authors:Johan Edstedt
Title: Less Biased Noise Scale Estimation for Threshold-Robust RANSAC
Abstract:
The gold-standard for robustly estimating relative pose through image matching is RANSAC. While RANSAC is powerful, it requires setting the inlier threshold that determines whether the error of a correspondence under an estimated model is sufficiently small to be included in its consensus set. Setting this threshold is typically done by hand, and is difficult to tune without an access to ground truth data. Thus, a method capable of automatically determining the optimal threshold would be desirable. In this paper we revisit inlier noise scale estimation, which is an attractive approach as the inlier noise scale is linear to the optimal threshold. We revisit the noise scale estimation method SIMFIT and find bias in the estimate of the noise scale. In particular, we fix underestimates from using the same data for fitting the model as estimating the inlier noise, and from not taking the threshold itself into account. Secondly, since the optimal threshold within a scene is approximately constant we propose a multi-pair extension of SIMFIT++, by filtering of estimates, which improves results. Our approach yields robust performance across a range of thresholds, shown in Figure 1. Code is available at https://github.com/Parskatt/simfitpp
中文: 本文重新审视并改进了SIMFIT方法,通过修正噪声尺度估计偏差和引入多对扩展,实现了无需手动调参的鲁棒相对位姿估计。
English: This paper revisits and improves SIMFIT for automatic inlier noise scale estimation, addressing bias and underestimation to enhance robust relative pose estimation without manual threshold tuning.

Authors:Nhi Pham, Artur Jesslen, Bernt Schiele, Adam Kortylewski, Jonas Fischer
Title: Interpretable 3D Neural Object Volumes for Robust Conceptual Reasoning
Abstract:
With the rise of deep neural networks, especially in safety-critical applications, robustness and interpretability are crucial to ensure their trustworthiness. Recent advances in 3D-aware classifiers that map image features to volumetric representation of objects, rather than relying solely on 2D appearance, have greatly improved robustness on out-of-distribution (OOD) data. Such classifiers have not yet been studied from the perspective of interpretability. Meanwhile, current concept-based XAI methods often neglect OOD robustness. We aim to address both aspects with CAVE - Concept Aware Volumes for Explanations - a new direction that unifies interpretability and robustness in image classification. We design CAVE as a robust and inherently interpretable classifier that learns sparse concepts from 3D object representation. We further propose 3D Consistency (3D-C), a metric to measure spatial consistency of concepts. Unlike existing metrics that rely on human-annotated parts on images, 3D-C leverages ground-truth object meshes as a common surface to project and compare explanations across concept-based methods. CAVE achieves competitive classification performance while discovering consistent and meaningful concepts across images in various OOD settings. Code available at https://github.com/phamleyennhi/CAVE.
Chinese: CAVE提出了一种统一方法,通过从3D物体表征中学习稀疏概念来实现鲁棒且可解释的图像分类,在分布外场景下保持竞争力并发现一致的概念。
English: CAVE introduces a unified approach for robust and interpretable image classification by learning sparse concepts from 3D object representations, achieving competitive performance and consistent concept discovery across out-of-distribution settings.

Authors:Maximilian Beck, Korbinian Pöppel, Phillip Lippe, Richard Kurle, Patrick M. Blies, Günter Klambauer, Sebastian Böck, Sepp Hochreiter
Title: xLSTM 7B: A Recurrent LLM for Fast and Efficient Inference
Abstract:
Recent breakthroughs in solving reasoning, math and coding problems with Large Language Models (LLMs) have been enabled by investing substantial computation budgets at inference time. Therefore, inference speed is one of the most critical properties of LLM architectures, and there is a growing need for LLMs that are efficient and fast at inference. Recently, LLMs built on the xLSTM architecture have emerged as a powerful alternative to Transformers, offering linear compute scaling with sequence length and constant memory usage, both highly desirable properties for efficient inference. However, such xLSTM-based LLMs have yet to be scaled to larger models and assessed and compared with respect to inference speed and efficiency. In this work, we introduce xLSTM 7B, a 7-billion-parameter LLM that combines xLSTM's architectural benefits with targeted optimizations for fast and efficient inference. Our experiments demonstrate that xLSTM 7B achieves performance on downstream tasks comparable to other similar-sized LLMs, while providing significantly faster inference speeds and greater efficiency compared to Llama- and Mamba-based LLMs. These results establish xLSTM 7B as the fastest and most efficient 7B LLM, offering a solution for tasks that require large amounts of test-time computation. Our work highlights xLSTM's potential as a foundational architecture for methods building on heavy use of LLM inference. Our model weights, model code and training code are open-source.
中文: 基于xLSTM架构的最新突破催生了xLSTM 7B模型,这个70亿参数的模型在保持优异任务性能的同时,实现了比同类模型更快的推理速度和更高的效率。
English: Recent advances in xLSTM architecture enable the development of xLSTM 7B, a 7-billion-parameter model that outperforms similar-sized LLMs in inference speed and efficiency while maintaining competitive task performance.

Authors:Xinyu Lian, Zichao Yu, Ruiming Liang, Yitong Wang, Li Ray Luo, Kaixu Chen, Yuanzhen Zhou, Qihong Tang, Xudong Xu, Zhaoyang Lyu, Bo Dai, Jiangmiao Pang
Title: Infinite Mobility: Scalable High-Fidelity Synthesis of Articulated Objects via Procedural Generation
Abstract:
Large-scale articulated objects with high quality are desperately needed for multiple tasks related to embodied AI. Most existing methods for creating articulated objects are either data-driven or simulation based, which are limited by the scale and quality of the training data or the fidelity and heavy labour of the simulation. In this paper, we propose Infinite Mobility, a novel method for synthesizing high-fidelity articulated objects through procedural generation. User study and quantitative evaluation demonstrate that our method can produce results that excel current state-of-the-art methods and are comparable to human-annotated datasets in both physics property and mesh quality. Furthermore, we show that our synthetic data can be used as training data for generative models, enabling next-step scaling up. Code is available at https://github.com/Intern-Nexus/Infinite-Mobility
中文: "无限机动性"方法通过程序化生成高保真铰接物体,在物理属性和网格质量上超越现有最优方法并媲美人工作数据集,同时可作为生成模型的扩展训练数据。
English: The proposed "Infinite Mobility" method procedurally generates high-fidelity articulated objects that surpass current state-of-the-art techniques and match human-annotated datasets in physics and mesh quality, while also serving as scalable training data for generative models.

Authors:Dengyun Peng, Yuhang Zhou, Qiguang Chen, Jinhao Liu, Jingjing Chen, Libo Qin
Title: DLPO: Towards a Robust, Efficient, and Generalizable Prompt Optimization Framework from a Deep-Learning Perspective
Abstract:
Large Language Models (LLMs) have achieved remarkable success across diverse tasks, largely driven by well-designed prompts. However, crafting and selecting such prompts often requires considerable human effort, significantly limiting its scalability. To mitigate this, recent studies have explored automated prompt optimization as a promising solution. Despite these efforts, existing methods still face critical challenges in robustness, efficiency, and generalization. To systematically address these challenges, we first conduct an empirical analysis to identify the limitations of current reflection-based prompt optimization paradigm. Building on these insights, we propose 7 innovative approaches inspired by traditional deep learning paradigms for prompt optimization (DLPO), seamlessly integrating these concepts into text-based gradient optimization. Through these advancements, we progressively tackle the aforementioned challenges and validate our methods through extensive experimentation. We hope our study not only provides valuable guidance for future research but also offers a comprehensive understanding of the challenges and potential solutions in prompt optimization. Our code is available at https://github.com/sfasfaffa/DLPO.
中文摘要:本研究受深度学习范式启发提出七种创新方法,通过经验分析和广泛实验实现大语言模型的自动提示优化,有效解决了鲁棒性、效率与泛化能力等关键挑战。
English Summary: This study introduces seven innovative methods inspired by deep learning paradigms to automate prompt optimization for LLMs, addressing key challenges in robustness, efficiency, and generalization through empirical analysis and extensive experimentation.

Authors:Qi Zhang, Xiuyuan Chen, Ziyi He, Kun Wang, Lianming Wu, Hongxing Shen, Jianqi Sun
Title: U2AD: Uncertainty-based Unsupervised Anomaly Detection Framework for Detecting T2 Hyperintensity in MRI Spinal Cord
Abstract:
T2 hyperintensities in spinal cord MR images are crucial biomarkers for conditions such as degenerative cervical myelopathy. However, current clinical diagnoses primarily rely on manual evaluation. Deep learning methods have shown promise in lesion detection, but most supervised approaches are heavily dependent on large, annotated datasets. Unsupervised anomaly detection (UAD) offers a compelling alternative by eliminating the need for abnormal data annotations. However, existing UAD methods rely on curated normal datasets and their performance frequently deteriorates when applied to clinical datasets due to domain shifts. We propose an Uncertainty-based Unsupervised Anomaly Detection framework, termed U2AD, to address these limitations. Unlike traditional methods, U2AD is designed to be trained and tested within the same clinical dataset, following a "mask-and-reconstruction" paradigm built on a Vision Transformer-based architecture. We introduce an uncertainty-guided masking strategy to resolve task conflicts between normal reconstruction and anomaly detection to achieve an optimal balance. Specifically, we employ a Monte-Carlo sampling technique to estimate reconstruction uncertainty mappings during training. By iteratively optimizing reconstruction training under the guidance of both epistemic and aleatoric uncertainty, U2AD reduces overall reconstruction variance while emphasizing regions. Experimental results demonstrate that U2AD outperforms existing supervised and unsupervised methods in patient-level identification and segment-level localization tasks. This framework establishes a new benchmark for incorporating uncertainty guidance into UAD, highlighting its clinical utility in addressing domain shifts and task conflicts in medical image anomaly detection. Our code is available: https://github.com/zhibaishouheilab/U2AD
中文摘要:提出的U2AD框架通过基于视觉变换器的不确定性引导掩码重建方法,在无需异常标注数据的情况下实现了优于现有方法的脊髓异常检测性能。
English Summary: The proposed U2AD framework leverages uncertainty-guided mask-and-reconstruction with Vision Transformers to outperform existing methods in spinal cord anomaly detection while eliminating dependency on annotated abnormal data.

Authors:Qing Zhou, Junyu Gao, Qi Wang
Title: Scale Efficient Training for Large Datasets
Abstract:
The rapid growth of dataset scales has been a key driver in advancing deep learning research. However, as dataset scale increases, the training process becomes increasingly inefficient due to the presence of low-value samples, including excessive redundant samples, overly challenging samples, and inefficient easy samples that contribute little to model improvement.To address this challenge, we propose Scale Efficient Training (SeTa) for large datasets, a dynamic sample pruning approach that losslessly reduces training time. To remove low-value samples, SeTa first performs random pruning to eliminate redundant samples, then clusters the remaining samples according to their learning difficulty measured by loss. Building upon this clustering, a sliding window strategy is employed to progressively remove both overly challenging and inefficient easy clusters following an easy-to-hard curriculum.We conduct extensive experiments on large-scale synthetic datasets, including ToCa, SS1M, and ST+MJ, each containing over 3 million samples.SeTa reduces training costs by up to 50\% while maintaining or improving performance, with minimal degradation even at 70\% cost reduction. Furthermore, experiments on various scale real datasets across various backbones (CNNs, Transformers, and Mambas) and diverse tasks (instruction tuning, multi-view stereo, geo-localization, composed image retrieval, referring image segmentation) demonstrate the powerful effectiveness and universality of our approach. Code is available at https://github.com/mrazhou/SeTa.
中文: 提出的规模高效训练(SeTa)方法通过动态剪除大规模数据集中的低价值样本,可在保持性能的同时将训练时间减少高达50%,并在多种数据集和架构中验证了其有效性。
English: The proposed Scale Efficient Training (SeTa) method dynamically prunes low-value samples from large datasets to reduce training time by up to 50% without compromising performance, demonstrating effectiveness across multiple datasets and architectures.

Authors:Jiaming Kang, Keyan Chen, Zhengxia Zou, Zhenwei Shi
Title: TriDF: Triplane-Accelerated Density Fields for Few-Shot Remote Sensing Novel View Synthesis
Abstract:
Remote sensing novel view synthesis (NVS) offers significant potential for 3D interpretation of remote sensing scenes, with important applications in urban planning and environmental monitoring. However, remote sensing scenes frequently lack sufficient multi-view images due to acquisition constraints. While existing NVS methods tend to overfit when processing limited input views, advanced few-shot NVS methods are computationally intensive and perform sub-optimally in remote sensing scenes. This paper presents TriDF, an efficient hybrid 3D representation for fast remote sensing NVS from as few as 3 input views. Our approach decouples color and volume density information, modeling them independently to reduce the computational burden on implicit radiance fields and accelerate reconstruction. We explore the potential of the triplane representation in few-shot NVS tasks by mapping high-frequency color information onto this compact structure, and the direct optimization of feature planes significantly speeds up convergence. Volume density is modeled as continuous density fields, incorporating reference features from neighboring views through image-based rendering to compensate for limited input data. Additionally, we introduce depth-guided optimization based on point clouds, which effectively mitigates the overfitting problem in few-shot NVS. Comprehensive experiments across multiple remote sensing scenes demonstrate that our hybrid representation achieves a 30x speed increase compared to NeRF-based methods, while simultaneously improving rendering quality metrics over advanced few-shot methods (7.4% increase in PSNR, 12.2% in SSIM, and 18.7% in LPIPS). The code is publicly available at https://github.com/kanehub/TriDF
中文: 本文提出的TriDF混合三维表示方法,仅需三个输入视图即可实现快速遥感新视角合成,相比基于NeRF的方法提速30倍,同时显著提升了多项渲染质量指标。
English: This paper introduces TriDF, an efficient hybrid 3D representation that enables fast remote sensing novel view synthesis from just three input views, achieving a 30x speed increase over NeRF-based methods while improving rendering quality metrics.

Authors:Ying Jiao, Luc De Raedt, Giuseppe Marra
Title: Valid Text-to-SQL Generation with Unification-based DeepStochLog
Abstract:
Large language models have been used to translate natural language questions to SQL queries. Without hard constraints on syntax and database schema, they occasionally produce invalid queries that are not executable. These failures limit the usage of these systems in real-life scenarios. We propose a neurosymbolic framework that imposes SQL syntax and schema constraints with unification-based definite clause grammars and thus guarantees the generation of valid queries. Our framework also builds a bi-directional interface to language models to leverage their natural language understanding abilities. The evaluation results on a subset of SQL grammars show that all our output queries are valid. This work is the first step towards extending language models with unification-based grammars. We demonstrate this extension enhances the validity, execution accuracy, and ground truth alignment of the underlying language model by a large margin. Our code is available at https://github.com/ML-KULeuven/deepstochlog-lm.
中文摘要:本文提出了一种神经符号框架,通过基于合一的定子句语法施加SQL语法和数据库模式约束,确保生成的所有查询均有效,大幅提升了语言模型在自然语言转SQL任务中的表现。
English Summary: This paper introduces a neurosymbolic framework that enforces SQL syntax and schema constraints using unification-based grammars to ensure all generated queries are valid, significantly improving the language model's performance in natural language to SQL translation.

Authors:Witold Wydmański, Marek Śmieja
Title: GFSNetwork: Differentiable Feature Selection via Gumbel-Sigmoid Relaxation
Abstract:
Feature selection in deep learning remains a critical challenge, particularly for high-dimensional tabular data where interpretability and computational efficiency are paramount. We present GFSNetwork, a novel neural architecture that performs differentiable feature selection through temperature-controlled Gumbel-Sigmoid sampling. Unlike traditional methods, where the user has to define the requested number of features, GFSNetwork selects it automatically during an end-to-end process. Moreover, GFSNetwork maintains constant computational overhead regardless of the number of input features. We evaluate GFSNetwork on a series of classification and regression benchmarks, where it consistently outperforms recent methods including DeepLasso, attention maps, as well as traditional feature selectors, while using significantly fewer features. Furthermore, we validate our approach on real-world metagenomic datasets, demonstrating its effectiveness in high-dimensional biological data. Concluding, our method provides a scalable solution that bridges the gap between neural network flexibility and traditional feature selection interpretability. We share our python implementation of GFSNetwork at https://github.com/wwydmanski/GFSNetwork, as well as a PyPi package (gfs_network).
中文: GFSNetwork提出了一种新颖的神经网络架构,能够通过可微分特征选择自动处理高维数据,在基准测试和实际应用中优于现有方法,同时保持计算效率和可解释性。
English: GFSNetwork introduces a novel neural architecture for automatic and differentiable feature selection in high-dimensional data, outperforming existing methods in benchmarks and real-world applications while maintaining computational efficiency and interpretability.

Authors:Yinqiao Wang, Hao Xu, Pheng-Ann Heng, Chi-Wing Fu
Title: UniHOPE: A Unified Approach for Hand-Only and Hand-Object Pose Estimation
Abstract:
Estimating the 3D pose of hand and potential hand-held object from monocular images is a longstanding challenge. Yet, existing methods are specialized, focusing on either bare-hand or hand interacting with object. No method can flexibly handle both scenarios and their performance degrades when applied to the other scenario. In this paper, we propose UniHOPE, a unified approach for general 3D hand-object pose estimation, flexibly adapting both scenarios. Technically, we design a grasp-aware feature fusion module to integrate hand-object features with an object switcher to dynamically control the hand-object pose estimation according to grasping status. Further, to uplift the robustness of hand pose estimation regardless of object presence, we generate realistic de-occluded image pairs to train the model to learn object-induced hand occlusions, and formulate multi-level feature enhancement techniques for learning occlusion-invariant features. Extensive experiments on three commonly-used benchmarks demonstrate UniHOPE's SOTA performance in addressing hand-only and hand-object scenarios. Code will be released on https://github.com/JoyboyWang/UniHOPE_Pytorch.
Chinese: UniHOPE提出了一种统一的三维手-物姿态估计方法,通过抓取感知特征融合和遮挡不变学习,灵活适应空手和手持物体两种场景,实现了最先进的性能。
English: UniHOPE is a unified method for 3D hand-object pose estimation that adapts to both bare-hand and hand-object scenarios through grasp-aware feature fusion and occlusion-invariant learning, achieving state-of-the-art performance.

Authors:Ling-An Zeng, Gaojie Wu, Ancong Wu, Jian-Fang Hu, Wei-Shi Zheng
Title: Progressive Human Motion Generation Based on Text and Few Motion Frames
Abstract:
Although existing text-to-motion (T2M) methods can produce realistic human motion from text description, it is still difficult to align the generated motion with the desired postures since using text alone is insufficient for precisely describing diverse postures. To achieve more controllable generation, an intuitive way is to allow the user to input a few motion frames describing precise desired postures. Thus, we explore a new Text-Frame-to-Motion (TF2M) generation task that aims to generate motions from text and very few given frames. Intuitively, the closer a frame is to a given frame, the lower the uncertainty of this frame is when conditioned on this given frame. Hence, we propose a novel Progressive Motion Generation (PMG) method to progressively generate a motion from the frames with low uncertainty to those with high uncertainty in multiple stages. During each stage, new frames are generated by a Text-Frame Guided Generator conditioned on frame-aware semantics of the text, given frames, and frames generated in previous stages. Additionally, to alleviate the train-test gap caused by multi-stage accumulation of incorrectly generated frames during testing, we propose a Pseudo-frame Replacement Strategy for training. Experimental results show that our PMG outperforms existing T2M generation methods by a large margin with even one given frame, validating the effectiveness of our PMG. Code is available at https://github.com/qinghuannn/PMG.
中文: 本文提出的渐进式运动生成方法通过结合少量用户提供的帧来增强文本到运动的生成,实现了更精确的姿态控制,并显著超越了现有方法。
English: This paper introduces a Progressive Motion Generation method that enhances text-to-motion synthesis by incorporating a few user-provided frames, enabling more precise posture control and significantly outperforming existing methods.

Authors:Mikkel Jordahn, Jonas Vestergaard Jensen, Mikkel N. Schmidt, Michael Riis Andersen
Title: On Local Posterior Structure in Deep Ensembles
Abstract:
Bayesian Neural Networks (BNNs) often improve model calibration and predictive uncertainty quantification compared to point estimators such as maximum-a-posteriori (MAP). Similarly, deep ensembles (DEs) are also known to improve calibration, and therefore, it is natural to hypothesize that deep ensembles of BNNs (DE-BNNs) should provide even further improvements. In this work, we systematically investigate this across a number of datasets, neural network architectures, and BNN approximation methods and surprisingly find that when the ensembles grow large enough, DEs consistently outperform DE-BNNs on in-distribution data. To shine light on this observation, we conduct several sensitivity and ablation studies. Moreover, we show that even though DE-BNNs outperform DEs on out-of-distribution metrics, this comes at the cost of decreased in-distribution performance. As a final contribution, we open-source the large pool of trained models to facilitate further research on this topic.
中文: 贝叶斯神经网络的深度集成虽然在理论上具有优势,但在分布内数据上表现反而不如标准深度集成,尽管其在分布外检测方面表现优异,但这是以牺牲分布内性能为代价的。
English: Deep ensembles of Bayesian Neural Networks surprisingly underperform standard deep ensembles on in-distribution data despite theoretical advantages, though they excel in out-of-distribution detection at the cost of in-distribution performance.

Authors:Fangzhi Xu, Hang Yan, Chang Ma, Haiteng Zhao, Jun Liu, Qika Lin, Zhiyong Wu
Title: $ϕ$-Decoding: Adaptive Foresight Sampling for Balanced Inference-Time Exploration and Exploitation
Abstract:
Inference-time optimization scales computation to derive deliberate reasoning steps for effective performance. While previous search-based strategies address the short-sightedness of auto-regressive generation, the vast search space leads to excessive exploration and insufficient exploitation. To strike an efficient balance to derive the optimal step, we frame the decoding strategy as foresight sampling, leveraging simulated future steps to obtain globally optimal step estimation. Built on it, we propose a novel decoding strategy, named $ϕ$-Decoding. To provide a precise and expressive estimation of step value, $ϕ$-Decoding approximates two distributions via foresight and clustering. Sampling from the joint distribution, the optimal steps can be selected for exploitation. To support adaptive computation allocation, we propose in-width and in-depth pruning strategies, featuring a light-weight solution to achieve inference efficiency. Extensive experiments across seven benchmarks show $ϕ$-Decoding outperforms strong baselines in both performance and efficiency. Additional analysis demonstrates its generalization across various LLMs and scalability across a wide range of computing budgets. The code will be released at https://github.com/xufangzhi/phi-Decoding, and the open-source PyPI package is coming soon.
中文:提出的$ϕ$-解码策略通过前瞻采样和聚类优化推理步骤,在多个基准测试中实现了卓越的性能与效率,并通过剪枝技术支持自适应计算分配。
English: The proposed $ϕ$-Decoding strategy uses foresight sampling and clustering to optimize reasoning steps, achieving superior performance and efficiency across benchmarks while supporting adaptive computation through pruning techniques.

Authors:Zhifu Tian, Tao Hu, Chaoyang Niu, Di Wu, Shu Wang
Title: Sampling Innovation-Based Adaptive Compressive Sensing
Abstract:
Scene-aware Adaptive Compressive Sensing (ACS) has attracted significant interest due to its promising capability for efficient and high-fidelity acquisition of scene images. ACS typically prescribes adaptive sampling allocation (ASA) based on previous samples in the absence of ground truth. However, when confronting unknown scenes, existing ACS methods often lack accurate judgment and robust feedback mechanisms for ASA, thus limiting the high-fidelity sensing of the scene. In this paper, we introduce a Sampling Innovation-Based ACS (SIB-ACS) method that can effectively identify and allocate sampling to challenging image reconstruction areas, culminating in high-fidelity image reconstruction. An innovation criterion is proposed to judge ASA by predicting the decrease in image reconstruction error attributable to sampling increments, thereby directing more samples towards regions where the reconstruction error diminishes significantly. A sampling innovation-guided multi-stage adaptive sampling (AS) framework is proposed, which iteratively refines the ASA through a multi-stage feedback process. For image reconstruction, we propose a Principal Component Compressed Domain Network (PCCD-Net), which efficiently and faithfully reconstructs images under AS scenarios. Extensive experiments demonstrate that the proposed SIB-ACS method significantly outperforms the state-of-the-art methods in terms of image reconstruction fidelity and visual effects. Codes are available at https://github.com/giant-pandada/SIB-ACS_CVPR2025.
中文摘要:提出的SIB-ACS方法通过创新准则和多阶段自适应采样框架,能有效识别并分配采样至图像重建困难区域,在重建保真度方面显著优于现有方法。
English Summary: The proposed SIB-ACS method introduces an innovation criterion and multi-stage adaptive sampling framework to effectively allocate samples to challenging reconstruction areas, significantly outperforming existing methods in image reconstruction fidelity.

Authors:Yijie Liu, Xinyi Shang, Yiqun Zhang, Yang Lu, Chen Gong, Jing-Hao Xue, Hanzi Wang
Title: Mind the Gap: Confidence Discrepancy Can Guide Federated Semi-Supervised Learning Across Pseudo-Mismatch
Abstract:
Federated Semi-Supervised Learning (FSSL) aims to leverage unlabeled data across clients with limited labeled data to train a global model with strong generalization ability. Most FSSL methods rely on consistency regularization with pseudo-labels, converting predictions from local or global models into hard pseudo-labels as supervisory signals. However, we discover that the quality of pseudo-label is largely deteriorated by data heterogeneity, an intrinsic facet of federated learning. In this paper, we study the problem of FSSL in-depth and show that (1) heterogeneity exacerbates pseudo-label mismatches, further degrading model performance and convergence, and (2) local and global models' predictive tendencies diverge as heterogeneity increases. Motivated by these findings, we propose a simple and effective method called Semi-supervised Aggregation for Globally-Enhanced Ensemble (SAGE), that can flexibly correct pseudo-labels based on confidence discrepancies. This strategy effectively mitigates performance degradation caused by incorrect pseudo-labels and enhances consensus between local and global models. Experimental results demonstrate that SAGE outperforms existing FSSL methods in both performance and convergence. Our code is available at https://github.com/Jay-Codeman/SAGE
中文: 联邦半监督学习(FSSL)面临数据异构性导致伪标签质量下降的问题,而提出的SAGE方法通过置信度差异灵活修正伪标签,有效提升了模型性能和收敛速度。
English: Federated Semi-Supervised Learning (FSSL) faces challenges from data heterogeneity that degrade pseudo-label quality, but the proposed SAGE method effectively corrects pseudo-labels using confidence discrepancies to enhance model performance and convergence.

Authors:Chi Han, Xin Liu, Haodong Wang, Shiyang Li, Jingfeng Yang, Haoming Jiang, Zhengyang Wang, Qingyu Yin, Liang Qiu, Changlong Yu, Yifan Gao, Zheng Li, Bing Yin, Jingbo Shang, Heng Ji
Title: Can Language Models Follow Multiple Turns of Entangled Instructions?
Abstract:
Despite significant achievements in improving the instruction-following capabilities of large language models (LLMs), the ability to process multiple potentially entangled or conflicting instructions remains a considerable challenge. Real-world scenarios often require consistency across multiple instructions over time, such as secret privacy, personal preferences, and prioritization, which demand sophisticated abilities to integrate multiple turns and carefully balance competing objectives when instructions intersect or conflict. This work presents a systematic investigation of LLMs' capabilities in handling multiple turns of instructions, covering three levels of difficulty: (1) retrieving information from instructions, (2) tracking and reasoning across turns, and (3) resolving conflicts among instructions. We construct MultiTurnInstruct~with $\sim$1.1K high-quality multi-turn conversations through the human-in-the-loop approach and result in nine capability categories, including statics and dynamics, reasoning, and multitasking. Our finding reveals an intriguing trade-off between different capabilities. While GPT models demonstrate superior memorization, they show reduced effectiveness in privacy-protection tasks requiring selective information withholding. Larger models exhibit stronger reasoning capabilities but still struggle with resolving conflicting instructions. Importantly, these performance gaps cannot be attributed solely to information loss, as models demonstrate strong BLEU scores on memorization tasks. Still, their attention mechanisms fail to integrate multiple related instructions effectively. These findings highlight critical areas for improvement in complex real-world tasks involving multi-turn instructions. Data and codes are released at https://github.com/Glaciohound/Multi-Turn-Instruct.
中文: 本研究系统评估了大语言模型处理多轮指令的能力,发现模型存在能力权衡——虽在记忆方面表现优异,但即使大型模型具备更强推理能力,仍在冲突解决和隐私保护任务上存在明显不足。
English: This study systematically evaluates large language models' ability to handle multi-turn instructions, revealing a trade-off where models excel at memorization but struggle with conflict resolution and privacy protection despite strong reasoning capabilities in larger models.

Authors:Jie Huang, Haorui Chen, Jiaxuan Ren, Siran Peng, Liangjian Deng
Title: A General Adaptive Dual-level Weighting Mechanism for Remote Sensing Pansharpening
Abstract:
Currently, deep learning-based methods for remote sensing pansharpening have advanced rapidly. However, many existing methods struggle to fully leverage feature heterogeneity and redundancy, thereby limiting their effectiveness. We use the covariance matrix to model the feature heterogeneity and redundancy and propose Correlation-Aware Covariance Weighting (CACW) to adjust them. CACW captures these correlations through the covariance matrix, which is then processed by a nonlinear function to generate weights for adjustment. Building upon CACW, we introduce a general adaptive dual-level weighting mechanism (ADWM) to address these challenges from two key perspectives, enhancing a wide range of existing deep-learning methods. First, Intra-Feature Weighting (IFW) evaluates correlations among channels within each feature to reduce redundancy and enhance unique information. Second, Cross-Feature Weighting (CFW) adjusts contributions across layers based on inter-layer correlations, refining the final output. Extensive experiments demonstrate the superior performance of ADWM compared to recent state-of-the-art (SOTA) methods. Furthermore, we validate the effectiveness of our approach through generality experiments, redundancy visualization, comparison experiments, key variables and complexity analysis, and ablation studies. Our code is available at https://github.com/Jie-1203/ADWM.
中文摘要:本文提出相关性感知协方差加权(CACW)方法和自适应双级加权机制(ADWM),通过特征内加权和跨特征加权解决遥感图像融合中的特征异构性与冗余问题,实验证明该方法优于现有先进技术。
English Summary: This paper introduces a Correlation-Aware Covariance Weighting (CACW) method and an Adaptive Dual-level Weighting Mechanism (ADWM) to address feature heterogeneity and redundancy in remote sensing pansharpening, demonstrating superior performance over existing methods through comprehensive experiments.

Authors:Corentin Sautier, Gilles Puy, Alexandre Boulch, Renaud Marlet, Vincent Lepetit
Title: Clustering is back: Reaching state-of-the-art LiDAR instance segmentation without training
Abstract:
Panoptic segmentation of LiDAR point clouds is fundamental to outdoor scene understanding, with autonomous driving being a primary application. While state-of-the-art approaches typically rely on end-to-end deep learning architectures and extensive manual annotations of instances, the significant cost and time investment required for labeling large-scale point cloud datasets remains a major bottleneck in this field. In this work, we demonstrate that competitive panoptic segmentation can be achieved using only semantic labels, with instances predicted without any training or annotations. Our method outperforms state-of-the-art supervised methods on standard benchmarks including SemanticKITTI and nuScenes, and outperforms every publicly available method on SemanticKITTI as a drop-in instance head replacement, while running in real-time on a single-threaded CPU and requiring no instance labels. It is fully explainable, and requires no learning or parameter tuning. Alpine combined with state-of-the-art semantic segmentation ranks first on the official panoptic segmentation leaderboard of SemanticKITTI. Code is available at https://github.com/valeoai/Alpine/
Chinese: 本研究提出了一种仅使用语义标签即可实现激光雷达点云全景分割的方法,无需实例标注即可达到先进性能,并能在CPU上实时高效运行。
English: This study presents a method for panoptic segmentation of LiDAR point clouds that achieves competitive performance using only semantic labels, eliminating the need for instance annotations while running efficiently in real-time on a CPU.

Authors:Matteo Sodano, Federico Magistri, Elias Marks, Fares Hosn, Aibek Zurbayev, Rodrigo Marcuzzi, Meher V. R. Malladi, Jens Behley, Cyrill Stachniss
Title: 3D Hierarchical Panoptic Segmentation in Real Orchard Environments Across Different Sensors
Abstract:
Crop yield estimation is a relevant problem in agriculture, because an accurate yield estimate can support farmers' decisions on harvesting or precision intervention. Robots can help to automate this process. To do so, they need to be able to perceive the surrounding environment to identify target objects such as trees and plants. In this paper, we introduce a novel approach to address the problem of hierarchical panoptic segmentation of apple orchards on 3D data from different sensors. Our approach is able to simultaneously provide semantic segmentation, instance segmentation of trunks and fruits, and instance segmentation of trees (a trunk with its fruits). This allows us to identify relevant information such as individual plants, fruits, and trunks, and capture the relationship among them, such as precisely estimate the number of fruits associated to each tree in an orchard. To efficiently evaluate our approach for hierarchical panoptic segmentation, we provide a dataset designed specifically for this task. Our dataset is recorded in Bonn, Germany, in a real apple orchard with a variety of sensors, spanning from a terrestrial laser scanner to a RGB-D camera mounted on different robots platforms. The experiments show that our approach surpasses state-of-the-art approaches in 3D panoptic segmentation in the agricultural domain, while also providing full hierarchical panoptic segmentation. Our dataset is publicly available at https://www.ipb.uni-bonn.de/data/hops/. The open-source implementation of our approach is available at https://github.com/PRBonn/hapt3D.
中文: 本文提出了一种新颖的农业三维数据分层全景分割方法,能同时识别树木、树干和果实并建立其关联关系,在苹果园数据集上的表现优于现有方法,并提供了开源实现。
English: This paper introduces a novel hierarchical panoptic segmentation method for 3D agricultural data that simultaneously identifies trees, trunks, and fruits while capturing their relationships, outperforming existing approaches and providing an open-source dataset from apple orchards.

Authors:Yuanze Li, Shihao Yuan, Haolin Wang, Qizhang Li, Ming Liu, Chen Xu, Guangming Shi, Wangmeng Zuo
Title: Triad: Empowering LMM-based Anomaly Detection with Vision Expert-guided Visual Tokenizer and Manufacturing Process
Abstract:
Although recent methods have tried to introduce large multimodal models (LMMs) into industrial anomaly detection (IAD), their generalization in the IAD field is far inferior to that for general purposes. We summarize the main reasons for this gap into two aspects. On one hand, general-purpose LMMs lack cognition of defects in the visual modality, thereby failing to sufficiently focus on defect areas. Therefore, we propose to modify the AnyRes structure of the LLaVA model, providing the potential anomalous areas identified by existing IAD models to the LMMs. On the other hand, existing methods mainly focus on identifying defects by learning defect patterns or comparing with normal samples, yet they fall short of understanding the causes of these defects. Considering that the generation of defects is closely related to the manufacturing process, we propose a manufacturing-driven IAD paradigm. An instruction-tuning dataset for IAD (InstructIAD) and a data organization approach for Chain-of-Thought with manufacturing (CoT-M) are designed to leverage the manufacturing process for IAD. Based on the above two modifications, we present Triad, a novel LMM-based method incorporating an expert-guided region-of-interest tokenizer and manufacturing process for industrial anomaly detection. Extensive experiments show that our Triad not only demonstrates competitive performance against current LMMs but also achieves further improved accuracy when equipped with manufacturing processes. Source code, training data, and pre-trained models will be publicly available at https://github.com/tzjtatata/Triad.
中文摘要:本研究提出Triad新型大模型,通过结合专家引导的缺陷区域识别与制造工艺认知,显著提升了工业异常检测性能,实验证明其准确率优于现有方法。
English Summary: The study introduces Triad, a novel large multimodal model that enhances industrial anomaly detection by integrating expert-guided defect area identification and manufacturing process insights, achieving superior accuracy over existing methods.

Authors:Chen Zhao, Zhizhou Chen, Yunzhe Xu, Enxuan Gu, Jian Li, Zili Yi, Qian Wang, Jian Yang, Ying Tai
Title: From Zero to Detail: Deconstructing Ultra-High-Definition Image Restoration from Progressive Spectral Perspective
Abstract:
Ultra-high-definition (UHD) image restoration faces significant challenges due to its high resolution, complex content, and intricate details. To cope with these challenges, we analyze the restoration process in depth through a progressive spectral perspective, and deconstruct the complex UHD restoration problem into three progressive stages: zero-frequency enhancement, low-frequency restoration, and high-frequency refinement. Building on this insight, we propose a novel framework, ERR, which comprises three collaborative sub-networks: the zero-frequency enhancer (ZFE), the low-frequency restorer (LFR), and the high-frequency refiner (HFR). Specifically, the ZFE integrates global priors to learn global mapping, while the LFR restores low-frequency information, emphasizing reconstruction of coarse-grained content. Finally, the HFR employs our designed frequency-windowed kolmogorov-arnold networks (FW-KAN) to refine textures and details, producing high-quality image restoration. Our approach significantly outperforms previous UHD methods across various tasks, with extensive ablation studies validating the effectiveness of each component. The code is available at \href{https://github.com/NJU-PCALab/ERR}{here}.
中文: 提出的ERR框架将超高清图像修复分解为零频增强、低频修复和高频细化三个渐进阶段,通过专门设计的子网络协同工作,在多项任务中显著超越了现有方法。
English: The proposed ERR framework addresses ultra-high-definition image restoration by decomposing it into three progressive stages—zero-frequency enhancement, low-frequency restoration, and high-frequency refinement—using specialized sub-networks, significantly outperforming existing methods across multiple tasks.

Authors:Yaxi Chen, Simin Ni, Aleksandra Ivanova, Shaheer U. Saeed, Rikin Hargunani, Jie Huang, Chaozong Liu, Yipeng Hu
Title: Patient-specific radiomic feature selection with reconstructed healthy persona of knee MR images
Abstract:
Classical radiomic features have been designed to describe image appearance and intensity patterns. These features are directly interpretable and readily understood by radiologists. Compared with end-to-end deep learning (DL) models, lower dimensional parametric models that use such radiomic features offer enhanced interpretability but lower comparative performance in clinical tasks. In this study, we propose an approach where a standard logistic regression model performance is substantially improved by learning to select radiomic features for individual patients, from a pool of candidate features. This approach has potentials to maintain the interpretability of such approaches while offering comparable performance to DL. We also propose to expand the feature pool by generating a patient-specific healthy persona via mask-inpainting using a denoising diffusion model trained on healthy subjects. Such a pathology-free baseline feature set allows further opportunity in novel feature discovery and improved condition classification. We demonstrate our method on multiple clinical tasks of classifying general abnormalities, anterior cruciate ligament tears, and meniscus tears. Experimental results demonstrate that our approach achieved comparable or even superior performance than state-of-the-art DL approaches while offering added interpretability by using radiomic features extracted from images and supplemented by generating healthy personas. Example clinical cases are discussed in-depth to demonstrate the intepretability-enabled utilities such as human-explainable feature discovery and patient-specific location/view selection. These findings highlight the potentials of the combination of subject-specific feature selection with generative models in augmenting radiomic analysis for more interpretable decision-making. The codes are available at: https://github.com/YaxiiC/RadiomicsPersona.git
中文摘要:本研究通过逻辑回归模型改进针对个体的影像组学特征选择,并利用扩散模型生成健康基线特征扩展特征库,在临床分类任务中实现了与深度学习相当的性能,同时保持了方法的可解释性。
English Summary: This study enhances radiomic feature selection for individual patients using a logistic regression model and expands the feature pool with healthy personas generated via diffusion models, achieving performance comparable to deep learning while maintaining interpretability in clinical classification tasks.

Authors:Ling-An Zeng, Guohong Huang, Yi-Lin Wei, Shengbo Gu, Yu-Ming Tang, Jingke Meng, Wei-Shi Zheng
Title: ChainHOI: Joint-based Kinematic Chain Modeling for Human-Object Interaction Generation
Abstract:
We propose ChainHOI, a novel approach for text-driven human-object interaction (HOI) generation that explicitly models interactions at both the joint and kinetic chain levels. Unlike existing methods that implicitly model interactions using full-body poses as tokens, we argue that explicitly modeling joint-level interactions is more natural and effective for generating realistic HOIs, as it directly captures the geometric and semantic relationships between joints, rather than modeling interactions in the latent pose space. To this end, ChainHOI introduces a novel joint graph to capture potential interactions with objects, and a Generative Spatiotemporal Graph Convolution Network to explicitly model interactions at the joint level. Furthermore, we propose a Kinematics-based Interaction Module that explicitly models interactions at the kinetic chain level, ensuring more realistic and biomechanically coherent motions. Evaluations on two public datasets demonstrate that ChainHOI significantly outperforms previous methods, generating more realistic, and semantically consistent HOIs. Code is available \href{https://github.com/qinghuannn/ChainHOI}{here}.
中文: ChainHOI提出了一种新颖的文本驱动人-物交互生成方法,通过在关节和运动链层面显式建模交互,相比现有方法能生成更真实且生物力学一致的动作。
English: ChainHOI introduces a novel approach for text-driven human-object interaction generation by explicitly modeling interactions at both joint and kinetic chain levels, outperforming existing methods in producing realistic and biomechanically coherent motions.

Authors:Jing Li, Yihang Fu, Falai Chen
Title: DTGBrepGen: A Novel B-rep Generative Model through Decoupling Topology and Geometry
Abstract:
Boundary representation (B-rep) of geometric models is a fundamental format in Computer-Aided Design (CAD). However, automatically generating valid and high-quality B-rep models remains challenging due to the complex interdependence between the topology and geometry of the models. Existing methods tend to prioritize geometric representation while giving insufficient attention to topological constraints, making it difficult to maintain structural validity and geometric accuracy. In this paper, we propose DTGBrepGen, a novel topology-geometry decoupled framework for B-rep generation that explicitly addresses both aspects. Our approach first generates valid topological structures through a two-stage process that independently models edge-face and edge-vertex adjacency relationships. Subsequently, we employ Transformer-based diffusion models for sequential geometry generation, progressively generating vertex coordinates, followed by edge geometries and face geometries which are represented as B-splines. Extensive experiments on diverse CAD datasets show that DTGBrepGen significantly outperforms existing methods in both topological validity and geometric accuracy, achieving higher validity rates and producing more diverse and realistic B-reps. Our code is publicly available at https://github.com/jinli99/DTGBrepGen.
中文: DTGBrepGen提出了一种拓扑-几何解耦框架,首先生成有效的拓扑结构,然后使用基于Transformer的扩散模型进行顺序几何生成,在有效性和准确性上显著优于现有方法。
English: DTGBrepGen introduces a topology-geometry decoupled framework that first generates valid topological structures and then uses Transformer-based diffusion models for sequential geometry generation, significantly outperforming existing methods in validity and accuracy.

Authors:Zhicheng Zhao, Jinquan Yan, Chenglong Li, Xiao Wang, Jin Tang
Title: DehazeMamba: SAR-guided Optical Remote Sensing Image Dehazing with Adaptive State Space Model
Abstract:
Optical remote sensing image dehazing presents significant challenges due to its extensive spatial scale and highly non-uniform haze distribution, which traditional single-image dehazing methods struggle to address effectively. While Synthetic Aperture Radar (SAR) imagery offers inherently haze-free reference information for large-scale scenes, existing SAR-guided dehazing approaches face two critical limitations: the integration of SAR information often diminishes the quality of haze-free regions, and the instability of feature quality further exacerbates cross-modal domain shift. To overcome these challenges, we introduce DehazeMamba, a novel SAR-guided dehazing network built on a progressive haze decoupling fusion strategy. Our approach incorporates two key innovations: a Haze Perception and Decoupling Module (HPDM) that dynamically identifies haze-affected regions through optical-SAR difference analysis, and a Progressive Fusion Module (PFM) that mitigates domain shift through a two-stage fusion process based on feature quality assessment. To facilitate research in this domain, we present MRSHaze, a large-scale benchmark dataset comprising 8,000 pairs of temporally synchronized, precisely geo-registered SAR-optical images with high resolution and diverse haze conditions. Extensive experiments demonstrate that DehazeMamba significantly outperforms state-of-the-art methods, achieving a 0.73 dB improvement in PSNR and substantial enhancements in downstream tasks such as semantic segmentation. The dataset is available at https://github.com/mmic-lcl/Datasets-and-benchmark-code.
中文摘要:本文提出DehazeMamba网络,通过渐进式雾霾解耦融合策略解决大范围光学遥感图像去雾难题,并发布MRSHaze基准数据集,实验证明其性能显著优于现有先进方法。
English Summary: This paper introduces DehazeMamba, a SAR-guided dehazing network using progressive haze decoupling fusion to address challenges in large-scale optical remote sensing image dehazing, along with the MRSHaze benchmark dataset demonstrating superior performance over existing methods.

Authors:Henghui Du, Guangyao Li, Chang Zhou, Chunjie Zhang, Alan Zhao, Di Hu
Title: Crab: A Unified Audio-Visual Scene Understanding Model with Explicit Cooperation
Abstract:
In recent years, numerous tasks have been proposed to encourage model to develop specified capability in understanding audio-visual scene, primarily categorized into temporal localization, spatial localization, spatio-temporal reasoning, and pixel-level understanding. Instead, human possesses a unified understanding ability for diversified tasks. Therefore, designing an audio-visual model with general capability to unify these tasks is of great value. However, simply joint training for all tasks can lead to interference due to the heterogeneity of audiovisual data and complex relationship among tasks. We argue that this problem can be solved through explicit cooperation among tasks. To achieve this goal, we propose a unified learning method which achieves explicit inter-task cooperation from both the perspectives of data and model thoroughly. Specifically, considering the labels of existing datasets are simple words, we carefully refine these datasets and construct an Audio-Visual Unified Instruction-tuning dataset with Explicit reasoning process (AV-UIE), which clarifies the cooperative relationship among tasks. Subsequently, to facilitate concrete cooperation in learning stage, an interaction-aware LoRA structure with multiple LoRA heads is designed to learn different aspects of audiovisual data interaction. By unifying the explicit cooperation across the data and model aspect, our method not only surpasses existing unified audio-visual model on multiple tasks, but also outperforms most specialized models for certain tasks. Furthermore, we also visualize the process of explicit cooperation and surprisingly find that each LoRA head has certain audio-visual understanding ability. Code and dataset: https://github.com/GeWu-Lab/Crab
中文: 本文提出了一种统一学习方法,通过精炼数据集和交互感知的LoRA结构实现任务间显式协作,在多项视听任务上超越了现有模型。
English: This paper introduces a unified learning method that achieves explicit inter-task cooperation through a refined dataset and an interaction-aware LoRA structure, outperforming existing models across multiple audio-visual tasks.

Authors:Etienne Gauthier, Francis Bach, Michael I. Jordan
Title: E-Values Expand the Scope of Conformal Prediction
Abstract:
Conformal prediction is a powerful framework for distribution-free uncertainty quantification. The standard approach to conformal prediction relies on comparing the ranks of prediction scores: under exchangeability, the rank of a future test point cannot be too extreme relative to a calibration set. This rank-based method can be reformulated in terms of p-values. In this paper, we explore an alternative approach based on e-values, known as conformal e-prediction. E-values offer key advantages that cannot be achieved with p-values, enabling new theoretical and practical capabilities. In particular, we present three applications that leverage the unique strengths of e-values: batch anytime-valid conformal prediction, fixed-size conformal sets with data-dependent coverage, and conformal prediction under ambiguous ground truth. Overall, these examples demonstrate that e-value-based constructions provide a flexible expansion of the toolbox of conformal prediction.
Chinese: 保形电子预测提出了一种基于电子值的替代方法,相较于传统保形预测,它在批量随时有效推断、数据依赖性覆盖和处理模糊真实情况方面具有独特优势,从而扩展了该框架的灵活性和应用范围。
English: Conformal e-prediction introduces an e-value-based alternative to traditional conformal prediction, offering unique advantages such as batch anytime-valid inference, data-dependent coverage, and handling of ambiguous ground truth, thereby expanding the framework's flexibility and applicability.

Authors:Ruiqi Song, Xianda Guo, Hangbin Wu, Qinggong Wei, Long Chen
Title: InsightDrive: Insight Scene Representation for End-to-End Autonomous Driving
Abstract:
Directly generating planning results from raw sensors has become increasingly prevalent due to its adaptability and robustness in complex scenarios. Scene representation, as a key module in the pipeline, has traditionally relied on conventional perception, which focus on the global scene. However, in driving scenarios, human drivers typically focus only on regions that directly impact driving, which often coincide with those required for end-to-end autonomous driving. In this paper, a novel end-to-end autonomous driving method called InsightDrive is proposed, which organizes perception by language-guided scene representation. We introduce an instance-centric scene tokenizer that transforms the surrounding environment into map- and object-aware instance tokens. Scene attention language descriptions, which highlight key regions and obstacles affecting the ego vehicle's movement, are generated by a vision-language model that leverages the cognitive reasoning capabilities of foundation models. We then align scene descriptions with visual features using the vision-language model, guiding visual attention through these descriptions to give effectively scene representation. Furthermore, we employ self-attention and cross-attention mechanisms to model the ego-agents and ego-map relationships to comprehensively build the topological relationships of the scene. Finally, based on scene understanding, we jointly perform motion prediction and planning. Extensive experiments on the widely used nuScenes benchmark demonstrate that the proposed InsightDrive achieves state-of-the-art performance in end-to-end autonomous driving. The code is available at https://github.com/songruiqi/InsightDrive
中文摘要:InsightDrive提出了一种新颖的端到端自动驾驶方法,通过语言引导的场景表征和以实例为中心的标记来聚焦关键驾驶区域,借助集成运动预测与规划实现了最优性能。
English Summary: InsightDrive introduces a novel end-to-end autonomous driving method that uses language-guided scene representation and instance-centric tokens to focus on key driving regions, achieving state-of-the-art performance through integrated motion prediction and planning.

Authors:Gabriele Berton, Kevin Musgrave, Carlo Masone
Title: All You Need to Know About Training Image Retrieval Models
Abstract:
Image retrieval is the task of finding images in a database that are most similar to a given query image. The performance of an image retrieval pipeline depends on many training-time factors, including the embedding model architecture, loss function, data sampler, mining function, learning rate(s), and batch size. In this work, we run tens of thousands of training runs to understand the effect each of these factors has on retrieval accuracy. We also discover best practices that hold across multiple datasets. The code is available at https://github.com/gmberton/image-retrieval
图像检索性能受多种训练因素影响,本研究通过大量实验分析这些因素的作用并建立了跨数据集的最佳实践方案。
Image retrieval performance is influenced by multiple training factors, and this study conducts extensive experiments to analyze their effects and establish cross-dataset best practices.

Authors:Tao Wang, Changxu Cheng, Lingfeng Wang, Senda Chen, Wuyue Zhao
Title: HiMTok: Learning Hierarchical Mask Tokens for Image Segmentation with Large Multimodal Model
Abstract:
The remarkable performance of large multimodal models (LMMs) has attracted significant interest from the image segmentation community. To align with the next-token-prediction paradigm, current LMM-driven segmentation methods either use object boundary points to represent masks or introduce special segmentation tokens, whose hidden states are decoded by a segmentation model requiring the original image as input. However, these approaches often suffer from inadequate mask representation and complex architectures, limiting the potential of LMMs. In this work, we propose the Hierarchical Mask Tokenizer (HiMTok), which represents segmentation masks with up to 32 tokens and eliminates the need for the original image during mask de-tokenization. HiMTok allows for compact and coarse-to-fine mask representations, aligning well with the LLM next-token-prediction paradigm and facilitating the direct acquisition of segmentation capabilities. We develop a 3-stage training recipe for progressive learning of segmentation and visual capabilities, featuring a hierarchical mask loss for effective coarse-to-fine learning. Additionally, we enable bidirectional information flow, allowing conversion between bounding boxes and mask tokens to fully leverage multi-task training potential. Extensive experiments demonstrate that our method achieves state-of-the-art performance across various segmentation tasks,while also enhancing visual grounding and maintaining overall visual understanding.
中文: 本文提出的HiMTok分层掩码标记器使用最多32个标记表示分割掩码,无需原始图像即可解码,实现了与LLM范式对齐的紧凑型由粗到精掩码表示,在取得最先进分割性能的同时增强了视觉定位能力。
English: This paper introduces HiMTok, a hierarchical mask tokenizer that represents segmentation masks with up to 32 tokens, eliminating the need for original images during decoding and enabling compact coarse-to-fine mask representation aligned with LLM paradigms, achieving state-of-the-art segmentation performance while enhancing visual grounding.

Authors:Xingguo Lv, Xingbo Dong, Liwen Wang, Jiewen Yang, Lei Zhao, Bin Pu, Zhe Jin, Xuejun Li
Title: Test-Time Domain Generalization via Universe Learning: A Multi-Graph Matching Approach for Medical Image Segmentation
Abstract:
Despite domain generalization (DG) has significantly addressed the performance degradation of pre-trained models caused by domain shifts, it often falls short in real-world deployment. Test-time adaptation (TTA), which adjusts a learned model using unlabeled test data, presents a promising solution. However, most existing TTA methods struggle to deliver strong performance in medical image segmentation, primarily because they overlook the crucial prior knowledge inherent to medical images. To address this challenge, we incorporate morphological information and propose a framework based on multi-graph matching. Specifically, we introduce learnable universe embeddings that integrate morphological priors during multi-source training, along with novel unsupervised test-time paradigms for domain adaptation. This approach guarantees cycle-consistency in multi-matching while enabling the model to more effectively capture the invariant priors of unseen data, significantly mitigating the effects of domain shifts. Extensive experiments demonstrate that our method outperforms other state-of-the-art approaches on two medical image segmentation benchmarks for both multi-source and single-source domain generalization tasks. The source code is available at https://github.com/Yore0/TTDG-MGM.
The proposed framework integrates morphological priors through multi-graph matching and learnable universe embeddings to enhance test-time adaptation for medical image segmentation, effectively addressing domain shifts and outperforming existing methods on benchmark tasks.
English Summary:

Authors:Ruichuan An, Kai Zeng, Ming Lu, Sihan Yang, Renrui Zhang, Huitong Ji, Qizhe Zhang, Yulin Luo, Hao Liang, Wentao Zhang
Title: Concept-as-Tree: Synthetic Data is All You Need for VLM Personalization
Abstract:
Vision-Language Models (VLMs) have demonstrated exceptional performance in various multi-modal tasks. Recently, there has been an increasing interest in improving the personalization capabilities of VLMs. To better integrate user-provided concepts into VLMs, many methods use positive and negative samples to fine-tune these models. However, the scarcity of user-provided positive samples and the low quality of retrieved negative samples pose challenges for fine-tuning. To reveal the relationship between sample and model performance, we systematically investigate the impact of positive and negative samples (easy and hard) and their diversity on VLM personalization tasks. Based on the detailed analysis, we introduce Concept-as-Tree (CaT), which represents a concept as a tree structure, thereby enabling the data generation of positive and negative samples with varying difficulty and diversity for VLM personalization. With a well-designed data filtering strategy, our CaT framework can ensure the quality of generated data, constituting a powerful pipeline. We perform thorough experiments with various VLM personalization baselines to assess the effectiveness of the pipeline, alleviating the lack of positive samples and the low quality of negative samples. Our results demonstrate that CaT equipped with the proposed data filter significantly enhances the personalization capabilities of VLMs across the MyVLM, Yo'LLaVA, and MC-LLaVA datasets. To our knowledge, this work is the first controllable synthetic data pipeline for VLM personalization. The code is released at $\href{https://github.com/zengkaiya/CaT}{\text{https://github.com/zengkaiya/CaT}}$.
Chinese: 本研究提出概念树框架,通过生成高质量正负样本增强视觉语言模型的个性化能力,以可控合成数据管道有效解决样本稀缺和质量问题。
English: This study introduces Concept-as-Tree (CaT), a framework that generates high-quality positive and negative samples to enhance Vision-Language Model personalization, effectively addressing data scarcity and quality issues through a controllable synthetic data pipeline.

Authors:Junming Liu, Siyuan Meng, Yanting Gao, Song Mao, Pinlong Cai, Guohang Yan, Yirong Chen, Zilin Bian, Ding Wang, Botian Shi
Title: Aligning Vision to Language: Annotation-Free Multimodal Knowledge Graph Construction for Enhanced LLMs Reasoning
Abstract:
Multimodal reasoning in Large Language Models (LLMs) struggles with incomplete knowledge and hallucination artifacts, challenges that textual Knowledge Graphs (KGs) only partially mitigate due to their modality isolation. While Multimodal Knowledge Graphs (MMKGs) promise enhanced cross-modal understanding, their practical construction is impeded by semantic narrowness of manual text annotations and inherent noise in visual-semantic entity linkages. In this paper, we propose Vision-align-to-Language integrated Knowledge Graph (VaLiK), a novel approach for constructing MMKGs that enhances LLMs reasoning through cross-modal information supplementation. Specifically, we cascade pre-trained Vision-Language Models (VLMs) to align image features with text, transforming them into descriptions that encapsulate image-specific information. Furthermore, we developed a cross-modal similarity verification mechanism to quantify semantic consistency, effectively filtering out noise introduced during feature alignment. Even without manually annotated image captions, the refined descriptions alone suffice to construct the MMKG. Compared to conventional MMKGs construction paradigms, our approach achieves substantial storage efficiency gains while maintaining direct entity-to-image linkage capability. Experimental results on multimodal reasoning tasks demonstrate that LLMs augmented with VaLiK outperform previous state-of-the-art models. Our code is published at https://github.com/Wings-Of-Disaster/VaLiK.
中文:VaLiK框架通过跨模态对齐和噪声过滤构建多模态知识图谱,无需人工标注即可增强大语言模型的多模态推理能力,实现更优性能与存储效率。
English: The VaLiK framework enhances multimodal reasoning in LLMs by constructing MMKGs through cross-modal alignment and noise filtering, achieving superior performance and storage efficiency without manual annotations.

Authors:Chaolong Yang, Kai Yao, Yuyao Yan, Chenru Jiang, Weiguang Zhao, Jie Sun, Guangliang Cheng, Yifei Zhang, Bin Dong, Kaizhu Huang
Title: Unlock Pose Diversity: Accurate and Efficient Implicit Keypoint-based Spatiotemporal Diffusion for Audio-driven Talking Portrait
Abstract:
Audio-driven single-image talking portrait generation plays a crucial role in virtual reality, digital human creation, and filmmaking. Existing approaches are generally categorized into keypoint-based and image-based methods. Keypoint-based methods effectively preserve character identity but struggle to capture fine facial details due to the fixed points limitation of the 3D Morphable Model. Moreover, traditional generative networks face challenges in establishing causality between audio and keypoints on limited datasets, resulting in low pose diversity. In contrast, image-based approaches produce high-quality portraits with diverse details using the diffusion network but incur identity distortion and expensive computational costs. In this work, we propose KDTalker, the first framework to combine unsupervised implicit 3D keypoint with a spatiotemporal diffusion model. Leveraging unsupervised implicit 3D keypoints, KDTalker adapts facial information densities, allowing the diffusion process to model diverse head poses and capture fine facial details flexibly. The custom-designed spatiotemporal attention mechanism ensures accurate lip synchronization, producing temporally consistent, high-quality animations while enhancing computational efficiency. Experimental results demonstrate that KDTalker achieves state-of-the-art performance regarding lip synchronization accuracy, head pose diversity, and execution efficiency.Our codes are available at https://github.com/chaolongy/KDTalker.
中文摘要:KDTalker首次将无监督隐式3D关键点与时空扩散模型相结合,能生成唇部同步精准、头部姿态多样且计算高效的高质量动态人像。
English Summary: KDTalker introduces a novel framework combining unsupervised implicit 3D keypoints with a spatiotemporal diffusion model to generate high-quality talking portraits with accurate lip synchronization, diverse head poses, and enhanced computational efficiency.

Authors:Jiahe Zhao, Ruibing Hou, Zejie Tian, Hong Chang, Shiguang Shan
Title: HIS-GPT: Towards 3D Human-In-Scene Multimodal Understanding
Abstract:
We propose a new task to benchmark human-in-scene understanding for embodied agents: Human-In-Scene Question Answering (HIS-QA). Given a human motion within a 3D scene, HIS-QA requires the agent to comprehend human states and behaviors, reason about its surrounding environment, and answer human-related questions within the scene. To support this new task, we present HIS-Bench, a multimodal benchmark that systematically evaluates HIS understanding across a broad spectrum, from basic perception to commonsense reasoning and planning. Our evaluation of various vision-language models on HIS-Bench reveals significant limitations in their ability to handle HIS-QA tasks. To this end, we propose HIS-GPT, the first foundation model for HIS understanding. HIS-GPT integrates 3D scene context and human motion dynamics into large language models while incorporating specialized mechanisms to capture human-scene interactions. Extensive experiments demonstrate that HIS-GPT sets a new state-of-the-art on HIS-QA tasks. We hope this work inspires future research on human behavior analysis in 3D scenes, advancing embodied AI and world models. The codes and data: https://github.com/ZJHTerry18/HumanInScene.
中文摘要:本文提出了人类场景问答(HIS-QA)新任务以评估具身智能体在三维场景中对人类行为的理解能力,并开发了首个基础模型HIS-GPT,通过融合三维场景上下文与人体运动动态实现了最先进的性能表现。
English Summary: This paper introduces Human-In-Scene Question Answering (HIS-QA), a novel task for evaluating embodied agents' understanding of human behavior in 3D environments, and proposes HIS-GPT, a foundation model that achieves state-of-the-art performance by integrating 3D scene context with human motion dynamics.

Authors:Zheyuan Liu, Junyan Wang, Zicheng Duan, Cristian Rodriguez-Opazo, Anton van den Hengel
Title: Frame-wise Conditioning Adaptation for Fine-Tuning Diffusion Models in Text-to-Video Prediction
Abstract:
Text-video prediction (TVP) is a downstream video generation task that requires a model to produce subsequent video frames given a series of initial video frames and text describing the required motion. In practice TVP methods focus on a particular category of videos depicting manipulations of objects carried out by human beings or robot arms. Previous methods adapt models pre-trained on text-to-image tasks, and thus tend to generate video that lacks the required continuity. A natural progression would be to leverage more recent pre-trained text-to-video (T2V) models. This approach is rendered more challenging by the fact that the most common fine-tuning technique, low-rank adaptation (LoRA), yields undesirable results. In this work, we propose an adaptation-based strategy we label Frame-wise Conditioning Adaptation (FCA). Within the module, we devise a sub-module that produces frame-wise text embeddings from the input text, which acts as an additional text condition to aid generation. We use FCA to fine-tune the T2V model, which incorporates the initial frame(s) as an extra condition. We compare and discuss the more effective strategy for injecting such embeddings into the T2V model. We conduct extensive ablation studies on our design choices with quantitative and qualitative performance analysis. Our approach establishes a new state-of-the-art for the task of TVP. The project page is at https://github.com/Cuberick-Orion/FCA .
中文: 本文提出帧级条件自适应(FCA)方法,通过生成帧级文本嵌入来微调文生视频模型,有效提升文本视频预测任务中的运动连续性,实现了最先进的性能。
English: This paper introduces Frame-wise Conditioning Adaptation (FCA), a novel method that fine-tunes text-to-video models by generating frame-wise text embeddings to enhance motion continuity in text-video prediction tasks, achieving state-of-the-art results.

Authors:Haiyang Guo, Fanhu Zeng, Ziwei Xiang, Fei Zhu, Da-Han Wang, Xu-Yao Zhang, Cheng-Lin Liu
Title: HiDe-LLaVA: Hierarchical Decoupling for Continual Instruction Tuning of Multimodal Large Language Model
Abstract:
Instruction tuning is widely used to improve a pre-trained Multimodal Large Language Model (MLLM) by training it on curated task-specific datasets, enabling better comprehension of human instructions. However, it is infeasible to collect all possible instruction datasets simultaneously in real-world scenarios. Thus, enabling MLLM with continual instruction tuning is essential for maintaining their adaptability. However, existing methods often trade off memory efficiency for performance gains, significantly compromising overall efficiency. In this paper, we propose a task-specific expansion and task-general fusion framework based on the variations in Centered Kernel Alignment (CKA) similarity across different model layers when trained on diverse datasets. Furthermore, we analyze the information leakage present in the existing benchmark and propose a new and more challenging benchmark to rationally evaluate the performance of different methods. Comprehensive experiments showcase a significant performance improvement of our method compared to existing state-of-the-art methods. Code and dataset are released at https://github.com/Ghy0501/HiDe-LLaVA.
中文: 本文提出了一种基于层间相似度变化的多模态大语言模型持续指令调优框架,通过任务特定扩展与任务通用融合提升模型适应性,并构建了更严谨的基准测试以解决现有评估中的信息泄露问题。
English: This paper introduces a framework for continual instruction tuning of Multimodal Large Language Models that enhances adaptability by balancing task-specific expansion and task-general fusion, while also proposing a more challenging benchmark to address information leakage in existing evaluations.

Authors:Xuying Zhang, Yupeng Zhou, Kai Wang, Yikai Wang, Zhen Li, Shaohui Jiao, Daquan Zhou, Qibin Hou, Ming-Ming Cheng
Title: AR-1-to-3: Single Image to Consistent 3D Object Generation via Next-View Prediction
Abstract:
Novel view synthesis (NVS) is a cornerstone for image-to-3d creation. However, existing works still struggle to maintain consistency between the generated views and the input views, especially when there is a significant camera pose difference, leading to poor-quality 3D geometries and textures. We attribute this issue to their treatment of all target views with equal priority according to our empirical observation that the target views closer to the input views exhibit higher fidelity. With this inspiration, we propose AR-1-to-3, a novel next-view prediction paradigm based on diffusion models that first generates views close to the input views, which are then utilized as contextual information to progressively synthesize farther views. To encode the generated view subsequences as local and global conditions for the next-view prediction, we accordingly develop a stacked local feature encoding strategy (Stacked-LE) and an LSTM-based global feature encoding strategy (LSTM-GE). Extensive experiments demonstrate that our method significantly improves the consistency between the generated views and the input views, producing high-fidelity 3D assets.
中文摘要:提出的AR-1-to-3方法通过渐进式视图合成和专门设计的编码策略,相比现有方法显著提升了视图一致性并生成高质量3D资源。
English Summary: The proposed AR-1-to-3 method progressively synthesizes novel views using diffusion models and specialized encoding strategies, significantly improving view consistency and 3D asset quality compared to existing approaches.

Authors:Huangwei Chen, Yifei Chen, Zhenyu Yan, Mingyang Ding, Chenlei Li, Zhu Zhu, Feiwei Qin
Title: MMLNB: Multi-Modal Learning for Neuroblastoma Subtyping Classification Assisted with Textual Description Generation
Abstract:
Neuroblastoma (NB), a leading cause of childhood cancer mortality, exhibits significant histopathological variability, necessitating precise subtyping for accurate prognosis and treatment. Traditional diagnostic methods rely on subjective evaluations that are time-consuming and inconsistent. To address these challenges, we introduce MMLNB, a multi-modal learning (MML) model that integrates pathological images with generated textual descriptions to improve classification accuracy and interpretability. The approach follows a two-stage process. First, we fine-tune a Vision-Language Model (VLM) to enhance pathology-aware text generation. Second, the fine-tuned VLM generates textual descriptions, using a dual-branch architecture to independently extract visual and textual features. These features are fused via Progressive Robust Multi-Modal Fusion (PRMF) Block for stable training. Experimental results show that the MMLNB model is more accurate than the single modal model. Ablation studies demonstrate the importance of multi-modal fusion, fine-tuning, and the PRMF mechanism. This research creates a scalable AI-driven framework for digital pathology, enhancing reliability and interpretability in NB subtyping classification. Our source code is available at https://github.com/HovChen/MMLNB.
中文:MMLNB模型通过双分支架构和渐进式融合,整合病理图像与生成文本描述,显著提高了神经母细胞瘤分型的准确性和可解释性,优于单模态方法。
English: The MMLNB model integrates pathological images with generated text descriptions using a dual-branch architecture and progressive fusion to enhance accuracy and interpretability in neuroblastoma subtyping, outperforming single-modal approaches.

Authors:Zhuoqun Su, Huimin Lu, Shuaifeng Jiao, Junhao Xiao, Yaonan Wang, Xieyuanli Chen
Title: Efficient Multimodal 3D Object Detector via Instance-Level Contrastive Distillation
Abstract:
Multimodal 3D object detectors leverage the strengths of both geometry-aware LiDAR point clouds and semantically rich RGB images to enhance detection performance. However, the inherent heterogeneity between these modalities, including unbalanced convergence and modal misalignment, poses significant challenges. Meanwhile, the large size of the detection-oriented feature also constrains existing fusion strategies to capture long-range dependencies for the 3D detection tasks. In this work, we introduce a fast yet effective multimodal 3D object detector, incorporating our proposed Instance-level Contrastive Distillation (ICD) framework and Cross Linear Attention Fusion Module (CLFM). ICD aligns instance-level image features with LiDAR representations through object-aware contrastive distillation, ensuring fine-grained cross-modal consistency. Meanwhile, CLFM presents an efficient and scalable fusion strategy that enhances cross-modal global interactions within sizable multimodal BEV features. Extensive experiments on the KITTI and nuScenes 3D object detection benchmarks demonstrate the effectiveness of our methods. Notably, our 3D object detector outperforms state-of-the-art (SOTA) methods while achieving superior efficiency. The implementation of our method has been released as open-source at: https://github.com/nubot-nudt/ICD-Fusion.
中文摘要:本文提出了一种快速高效的多模态3D目标检测器,通过实例级对比蒸馏框架和交叉线性注意力融合模块解决模态差异并增强全局特征交互,在基准测试中实现最优性能且已开源。
English Summary: This paper introduces a fast and effective multimodal 3D object detector using an Instance-level Contrastive Distillation framework and Cross Linear Attention Fusion Module to address cross-modal heterogeneity and enhance global feature interactions, achieving state-of-the-art performance on benchmarks with open-source implementation.

Authors:Haiyang Guo, Fanhu Zeng, Fei Zhu, Wenzhuo Liu, Da-Han Wang, Jian Xu, Xu-Yao Zhang, Cheng-Lin Liu
Title: Federated Continual Instruction Tuning
Abstract:
A vast amount of instruction tuning data is crucial for the impressive performance of Large Multimodal Models (LMMs), but the associated computational costs and data collection demands during supervised fine-tuning make it impractical for most researchers. Federated learning (FL) has the potential to leverage all distributed data and training resources to reduce the overhead of joint training. However, most existing methods assume a fixed number of tasks, while in real-world scenarios, clients continuously encounter new knowledge and often struggle to retain old tasks due to memory constraints. In this work, we introduce the Federated Continual Instruction Tuning (FCIT) benchmark to model this real-world challenge. Our benchmark includes two realistic scenarios, encompassing four different settings and twelve carefully curated instruction tuning datasets. To address the challenges posed by FCIT, we propose dynamic knowledge organization to effectively integrate updates from different tasks during training and subspace selective activation to allocate task-specific output during inference. Extensive experimental results demonstrate that our proposed method significantly enhances model performance across varying levels of data heterogeneity and catastrophic forgetting. Code and dataset are released at https://github.com/Ghy0501/FCIT.
中文:联邦持续指令调优(FCIT)通过动态整合分布式客户端的知识,并采用子空间选择性激活技术来缓解灾难性遗忘,有效解决了大规模多模态模型训练中的计算成本高和数据异构性问题。
English: Federated Continual Instruction Tuning (FCIT) is proposed to address the challenges of computational costs and data heterogeneity in training Large Multimodal Models by dynamically integrating knowledge from distributed clients while mitigating catastrophic forgetting through subspace activation techniques.

Authors:Siyuan Yao, Yang Guo, Yanyang Yan, Wenqi Ren, Xiaochun Cao
Title: UncTrack: Reliable Visual Object Tracking with Uncertainty-Aware Prototype Memory Network
Abstract:
Transformer-based trackers have achieved promising success and become the dominant tracking paradigm due to their accuracy and efficiency. Despite the substantial progress, most of the existing approaches tackle object tracking as a deterministic coordinate regression problem, while the target localization uncertainty has been greatly overlooked, which hampers trackers' ability to maintain reliable target state prediction in challenging scenarios. To address this issue, we propose UncTrack, a novel uncertainty-aware transformer tracker that predicts the target localization uncertainty and incorporates this uncertainty information for accurate target state inference. Specifically, UncTrack utilizes a transformer encoder to perform feature interaction between template and search images. The output features are passed into an uncertainty-aware localization decoder (ULD) to coarsely predict the corner-based localization and the corresponding localization uncertainty. Then the localization uncertainty is sent into a prototype memory network (PMN) to excavate valuable historical information to identify whether the target state prediction is reliable or not. To enhance the template representation, the samples with high confidence are fed back into the prototype memory bank for memory updating, making the tracker more robust to challenging appearance variations. Extensive experiments demonstrate that our method outperforms other state-of-the-art methods. Our code is available at https://github.com/ManOfStory/UncTrack.
中文: 基于Transformer的跟踪器虽在精度和效率上表现出色,但常忽略目标定位的不确定性,因此UncTrack提出一种不确定性感知方法,通过预测并利用该不确定性来实现更可靠的状态推断,并在复杂场景中展现更强的鲁棒性。
English: Transformer-based trackers excel in accuracy and efficiency but often overlook target localization uncertainty, so UncTrack introduces an uncertainty-aware approach that predicts and utilizes this uncertainty for more reliable state inference and robust performance in challenging scenarios.

Authors:Linzhou Li, Yumeng Li, Yanlin Weng, Youyi Zheng, Kun Zhou
Title: RGBAvatar: Reduced Gaussian Blendshapes for Online Modeling of Head Avatars
Abstract:
We present Reduced Gaussian Blendshapes Avatar (RGBAvatar), a method for reconstructing photorealistic, animatable head avatars at speeds sufficient for on-the-fly reconstruction. Unlike prior approaches that utilize linear bases from 3D morphable models (3DMM) to model Gaussian blendshapes, our method maps tracked 3DMM parameters into reduced blendshape weights with an MLP, leading to a compact set of blendshape bases. The learned compact base composition effectively captures essential facial details for specific individuals, and does not rely on the fixed base composition weights of 3DMM, leading to enhanced reconstruction quality and higher efficiency. To further expedite the reconstruction process, we develop a novel color initialization estimation method and a batch-parallel Gaussian rasterization process, achieving state-of-the-art quality with training throughput of about 630 images per second. Moreover, we propose a local-global sampling strategy that enables direct on-the-fly reconstruction, immediately reconstructing the model as video streams in real time while achieving quality comparable to offline settings. Our source code is available at https://github.com/gapszju/RGBAvatar.
RGBAvatar提出了一种实时重建逼真头部化身的方法,通过MLP将3DMM参数映射为紧凑形变基,结合创新的初始化与光栅化技术,实现了高效高质量的实时重建。
RGBAvatar introduces a method for real-time photorealistic head avatar reconstruction by mapping 3DMM parameters to compact blendshape bases via MLP, achieving high efficiency and quality with innovative initialization and rasterization techniques.

Authors:Xiaojun Jia, Sensen Gao, Simeng Qin, Ke Ma, Xinfeng Li, Yihao Huang, Wei Dong, Yang Liu, Xiaochun Cao
Title: Evolution-based Region Adversarial Prompt Learning for Robustness Enhancement in Vision-Language Models
Abstract:
Large pre-trained vision-language models (VLMs), such as CLIP, demonstrate impressive generalization but remain highly vulnerable to adversarial examples (AEs). Previous work has explored robust text prompts through adversarial training, achieving some improvement in both robustness and generalization. However, they primarily rely on singlegradient direction perturbations (e.g., PGD) to generate AEs, which lack diversity, resulting in limited improvement in adversarial robustness. To address these limitations, we propose an evolution-based region adversarial prompt tuning method called ER-APT, which combines gradient methods with genetic evolution to generate more diverse and challenging AEs. In each training iteration, we first generate AEs using traditional gradient-based methods. Subsequently, a genetic evolution mechanism incorporating selection, mutation, and crossover is applied to optimize the AEs, ensuring a broader and more aggressive perturbation distribution.The final evolved AEs are used for prompt tuning, achieving region-based adversarial optimization instead of conventional single-point adversarial prompt tuning. We also propose a dynamic loss weighting method to adjust prompt learning efficiency for accuracy and robustness. Experimental evaluations on various benchmark datasets demonstrate the superiority of our proposed method, outperforming stateof-the-art APT methods. The code is released at https://github.com/jiaxiaojunQAQ/ER-APT.
中文: 本文提出ER-APT方法,通过结合梯度技术与遗传进化生成更多样化的对抗样本,实现了基于区域的对抗提示优化,在多个基准测试中显著提升了对抗鲁棒性并优于现有先进方法。
English: This paper introduces ER-APT, an evolution-based region adversarial prompt tuning method that combines gradient techniques with genetic evolution to generate more diverse adversarial examples, significantly enhancing adversarial robustness and outperforming existing methods across multiple benchmarks.

Authors:Chenyu Zhang, Kunlun Xu, Zichen Liu, Yuxin Peng, Jiahuan Zhou
Title: SCAP: Transductive Test-Time Adaptation via Supportive Clique-based Attribute Prompting
Abstract:
Vision-language models (VLMs) encounter considerable challenges when adapting to domain shifts stemming from changes in data distribution. Test-time adaptation (TTA) has emerged as a promising approach to enhance VLM performance under such conditions. In practice, test data often arrives in batches, leading to increasing interest in the transductive TTA setting. However, existing TTA methods primarily focus on individual test samples, overlooking crucial cross-sample correlations within a batch. While recent ViT-based TTA methods have introduced batch-level adaptation, they remain suboptimal for VLMs due to inadequate integration of the text modality. To address these limitations, we propose a novel transductive TTA framework, Supportive Clique-based Attribute Prompting (SCAP), which effectively combines visual and textual information to enhance adaptation by generating fine-grained attribute prompts across test batches. SCAP first forms supportive cliques of test samples in an unsupervised manner based on visual similarity and learns an attribute prompt for each clique, capturing shared attributes critical for adaptation. For each test sample, SCAP aggregates attribute prompts from its associated cliques, providing enriched contextual information. To ensure adaptability over time, we incorporate a retention module that dynamically updates attribute prompts and their associated attributes as new data arrives. Comprehensive experiments across multiple benchmarks demonstrate that SCAP outperforms existing state-of-the-art methods, significantly advancing VLM generalization under domain shifts. Our code is available at https://github.com/zhoujiahuan1991/CVPR2025-SCAP.
Chinese: SCAP提出了一种新颖的归纳式测试时适应框架,通过基于视觉相似性构建支持性样本团并生成细粒度属性提示,有效整合跨测试批次的视觉与文本信息,显著提升了视觉语言模型在领域偏移下的泛化性能。
English: SCAP introduces a novel transductive test-time adaptation framework that leverages supportive cliques and fine-grained attribute prompts to enhance vision-language model performance under domain shifts by effectively integrating visual and textual information across test batches.

Authors:Duke Nguyen, Aditya Joshi, Flora Salim
Title: Harnessing Test-time Adaptation for NLU tasks Involving Dialects of English
Abstract:
Test-time adaptation (TTA) is an excellent method which helps generalize models across domains, tasks, and distributions without the use of labeled datasets. Thus, TTA is very useful in natural language processing (NLP) in the dialectal setting, since oftentimes, models are trained on Standard American English (SAE), evaluated on Indian English or Nigerian English, of which distribution differs significantly from the former. This is especially useful since dialectal datasets are scarce. In this paper, we explore one of the most famous TTA techniques, SHOT, in dialectal NLP. We finetune and evaluate SHOT on different combinations of dialectal GLUE. Our findings show that SHOT is a viable technique when labeled datasets are unavailable. We also theoretically propose the concept of dialectal gap and show that it has a positive correlation with the effectiveness of SHOT. We also find that in many cases, finetuning on SAE yields higher performance than finetuning on dialectal data. Our code is available at https://github.com/dukenguyenxyz/dialect-adaptation
中文摘要:测试时适配(TTA)可在无标注数据情况下有效泛化模型至不同方言分布,其中SHOT方法被证明可行且其效果与方言差异正相关,而基于标准美式英语的微调常优于方言数据。
English Summary: Test-time adaptation (TTA) effectively generalizes models across dialectal distributions without labeled data, with SHOT proving viable and its effectiveness positively correlated with dialectal gaps, while SAE-based finetuning often outperforms dialectal data.

Authors:Xian-Rong Zhang, Yue-Jiao Gong, Zhiguang Cao, Jun Zhang
Title: Island-Based Evolutionary Computation with Diverse Surrogates and Adaptive Knowledge Transfer for High-Dimensional Data-Driven Optimization
Abstract:
In recent years, there has been a growing interest in data-driven evolutionary algorithms (DDEAs) employing surrogate models to approximate the objective functions with limited data. However, current DDEAs are primarily designed for lower-dimensional problems and their performance drops significantly when applied to large-scale optimization problems (LSOPs). To address the challenge, this paper proposes an offline DDEA named DSKT-DDEA. DSKT-DDEA leverages multiple islands that utilize different data to establish diverse surrogate models, fostering diverse subpopulations and mitigating the risk of premature convergence. In the intra-island optimization phase, a semi-supervised learning method is devised to fine-tune the surrogates. It not only facilitates data argumentation, but also incorporates the distribution information gathered during the search process to align the surrogates with the evolving local landscapes. Then, in the inter-island knowledge transfer phase, the algorithm incorporates an adaptive strategy that periodically transfers individual information and evaluates the transfer effectiveness in the new environment, facilitating global optimization efficacy. Experimental results demonstrate that our algorithm is competitive with state-of-the-art DDEAs on problems with up to 1000 dimensions, while also exhibiting decent parallelism and scalability. Our DSKT-DDEA is open-source and accessible at: https://github.com/LabGong/DSKT-DDEA.
中文: 本文提出DSKT-DDEA离线数据驱动进化算法,通过多岛多样化代理模型和自适应知识转移策略,有效解决高达1000维的大规模优化问题,性能优于现有方法。
English: This paper introduces DSKT-DDEA, an offline data-driven evolutionary algorithm that uses multiple islands with diverse surrogate models and adaptive knowledge transfer to effectively tackle large-scale optimization problems up to 1000 dimensions, outperforming existing methods.

Authors:Xinyu Ma, Ziyang Ding, Zhicong Luo, Chi Chen, Zonghao Guo, Derek F. Wong, Xiaoyi Feng, Maosong Sun
Title: DeepPerception: Advancing R1-like Cognitive Visual Perception in MLLMs for Knowledge-Intensive Visual Grounding
Abstract:
Human experts excel at fine-grained visual discrimination by leveraging domain knowledge to refine perceptual features, a capability that remains underdeveloped in current Multimodal Large Language Models (MLLMs). Despite possessing vast expert-level knowledge, MLLMs struggle to integrate reasoning into visual perception, often generating direct responses without deeper analysis. To bridge this gap, we introduce knowledge-intensive visual grounding (KVG), a novel visual grounding task that requires both fine-grained perception and domain-specific knowledge integration. To address the challenges of KVG, we propose DeepPerception, an MLLM enhanced with cognitive visual perception capabilities. Our approach consists of (1) an automated data synthesis pipeline that generates high-quality, knowledge-aligned training samples, and (2) a two-stage training framework combining supervised fine-tuning for cognitive reasoning scaffolding and reinforcement learning to optimize perception-cognition synergy. To benchmark performance, we introduce KVG-Bench a comprehensive dataset spanning 10 domains with 1.3K manually curated test cases. Experimental results demonstrate that DeepPerception significantly outperforms direct fine-tuning, achieving +8.08\% accuracy improvements on KVG-Bench and exhibiting +4.60\% superior cross-domain generalization over baseline approaches. Our findings highlight the importance of integrating cognitive processes into MLLMs for human-like visual perception and open new directions for multimodal reasoning research. The data, codes, and models are released at https://github.com/thunlp/DeepPerception.
Chinese Summary: 本文提出DeepPerception模型,通过知识密集型视觉定位任务,结合认知推理框架与强化学习,显著提升了多模态大语言模型在细粒度视觉感知与领域知识融合方面的性能。
English Summary: This paper introduces DeepPerception, a Multimodal Large Language Model enhanced with cognitive visual perception to bridge the gap between expert knowledge and fine-grained visual discrimination through knowledge-intensive visual grounding.

Authors:Jianan Li, Huan Chen, Wangcai Zhao, Rui Chen, Tingfa Xu
Title: Mixed-granularity Implicit Representation for Continuous Hyperspectral Compressive Reconstruction
Abstract:
Hyperspectral Images (HSIs) are crucial across numerous fields but are hindered by the long acquisition times associated with traditional spectrometers. The Coded Aperture Snapshot Spectral Imaging (CASSI) system mitigates this issue through a compression technique that accelerates the acquisition process. However, reconstructing HSIs from compressed data presents challenges due to fixed spatial and spectral resolution constraints. This study introduces a novel method using implicit neural representation for continuous hyperspectral image reconstruction. We propose the Mixed Granularity Implicit Representation (MGIR) framework, which includes a Hierarchical Spectral-Spatial Implicit Encoder for efficient multi-scale implicit feature extraction. This is complemented by a Mixed-Granularity Local Feature Aggregator that adaptively integrates local features across scales, combined with a decoder that merges coordinate information for precise reconstruction. By leveraging implicit neural representations, the MGIR framework enables reconstruction at any desired spatial-spectral resolution, significantly enhancing the flexibility and adaptability of the CASSI system. Extensive experimental evaluations confirm that our model produces reconstructed images at arbitrary resolutions and matches state-of-the-art methods across varying spectral-spatial compression ratios. The code will be released at https://github.com/chh11/MGIR.
中文摘要:本研究提出混合粒度隐式表示(MGIR)框架,利用隐式神经表示实现任意空间-光谱分辨率的连续高光谱图像重建,克服了传统CASSI系统的固定分辨率限制,并在不同压缩比下保持了最先进的性能表现。
English Summary: This study introduces the Mixed Granularity Implicit Representation (MGIR) framework using implicit neural representation to enable continuous hyperspectral image reconstruction at arbitrary spatial-spectral resolutions, overcoming the fixed resolution limitations of traditional CASSI systems while maintaining state-of-the-art performance across various compression ratios.

Authors:Sung-Yeon Park, Can Cui, Yunsheng Ma, Ahmadreza Moradipari, Rohit Gupta, Kyungtae Han, Ziran Wang
Title: NuPlanQA: A Large-Scale Dataset and Benchmark for Multi-View Driving Scene Understanding in Multi-Modal Large Language Models
Abstract:
Recent advances in multi-modal large language models (MLLMs) have demonstrated strong performance across various domains; however, their ability to comprehend driving scenes remains less proven. The complexity of driving scenarios, which includes multi-view information, poses significant challenges for existing MLLMs. In this paper, we introduce NuPlanQA-Eval, a multi-view, multi-modal evaluation benchmark for driving scene understanding. To further support generalization to multi-view driving scenarios, we also propose NuPlanQA-1M, a large-scale dataset comprising 1M real-world visual question-answering (VQA) pairs. For context-aware analysis of traffic scenes, we categorize our dataset into nine subtasks across three core skills: Road Environment Perception, Spatial Relations Recognition, and Ego-Centric Reasoning. Furthermore, we present BEV-LLM, integrating Bird's-Eye-View (BEV) features from multi-view images into MLLMs. Our evaluation results reveal key challenges that existing MLLMs face in driving scene-specific perception and spatial reasoning from ego-centric perspectives. In contrast, BEV-LLM demonstrates remarkable adaptability to this domain, outperforming other models in six of the nine subtasks. These findings highlight how BEV integration enhances multi-view MLLMs while also identifying key areas that require further refinement for effective adaptation to driving scenes. To facilitate further research, we publicly release NuPlanQA at https://github.com/sungyeonparkk/NuPlanQA.
中文: 本文提出了用于驾驶场景理解的多模态评估基准NuPlanQA及其大规模数据集,并开发了集成鸟瞰特征的BEV-LLM模型,该模型在多数子任务中表现优异,同时揭示了该领域仍需解决的关键问题。
English: This paper introduces NuPlanQA, a multi-modal evaluation benchmark and dataset for driving scene understanding, and proposes BEV-LLM, a model integrating Bird's-Eye-View features that outperforms existing models in most subtasks while highlighting remaining challenges.

Authors:Kewei Sui, Anindita Ghosh, Inwoo Hwang, Bing Zhou, Jian Wang, Chuan Guo
Title: A Survey on Human Interaction Motion Generation
Abstract:
Humans inhabit a world defined by interactions -- with other humans, objects, and environments. These interactive movements not only convey our relationships with our surroundings but also demonstrate how we perceive and communicate with the real world. Therefore, replicating these interaction behaviors in digital systems has emerged as an important topic for applications in robotics, virtual reality, and animation. While recent advances in deep generative models and new datasets have accelerated progress in this field, significant challenges remain in modeling the intricate human dynamics and their interactions with entities in the external world. In this survey, we present, for the first time, a comprehensive overview of the literature in human interaction motion generation. We begin by establishing foundational concepts essential for understanding the research background. We then systematically review existing solutions and datasets across three primary interaction tasks -- human-human, human-object, and human-scene interactions -- followed by evaluation metrics. Finally, we discuss open research directions and future opportunities.
Chinese Summary: 本综述首次系统梳理了人体交互动作生成领域,涵盖基础概念、解决方案与数据集,重点分析人-人、人-物、人-场景三类交互任务及评估标准,并展望了未来研究方向。
English Summary: This survey provides a comprehensive overview of human interaction motion generation, covering foundational concepts, existing solutions, datasets, and evaluation metrics across human-human, human-object, and human-scene interactions, while identifying future research directions.

Authors:Zibin Liu, Banglei Guan, Yang Shang, Yifei Bian, Pengju Sun, Qifeng Yu
Title: Stereo Event-based, 6-DOF Pose Tracking for Uncooperative Spacecraft
Abstract:
Pose tracking of uncooperative spacecraft is an essential technology for space exploration and on-orbit servicing, which remains an open problem. Event cameras possess numerous advantages, such as high dynamic range, high temporal resolution, and low power consumption. These attributes hold the promise of overcoming challenges encountered by conventional cameras, including motion blur and extreme illumination, among others. To address the standard on-orbit observation missions, we propose a line-based pose tracking method for uncooperative spacecraft utilizing a stereo event camera. To begin with, we estimate the wireframe model of uncooperative spacecraft, leveraging the spatio-temporal consistency of stereo event streams for line-based reconstruction. Then, we develop an effective strategy to establish correspondences between events and projected lines of uncooperative spacecraft. Using these correspondences, we formulate the pose tracking as a continuous optimization process over 6-DOF motion parameters, achieved by minimizing event-line distances. Moreover, we construct a stereo event-based uncooperative spacecraft motion dataset, encompassing both simulated and real events. The proposed method is quantitatively evaluated through experiments conducted on our self-collected dataset, demonstrating an improvement in terms of effectiveness and accuracy over competing methods. The code will be open-sourced at https://github.com/Zibin6/SE6PT.
中文摘要:本研究提出了一种基于立体事件相机的非合作航天器线型姿态跟踪方法,通过连续优化六自由度运动参数,在精度和效果上优于现有方法。
English Summary: This study introduces a stereo event camera-based line pose tracking method for uncooperative spacecraft, which uses continuous optimization of 6-DOF motion parameters to achieve enhanced accuracy and effectiveness over existing approaches.

Authors:Javier Tirado-Garín, Javier Civera
Title: AnyCalib: On-Manifold Learning for Model-Agnostic Single-View Camera Calibration
Abstract:
We present AnyCalib, a method for calibrating the intrinsic parameters of a camera from a single in-the-wild image, that is agnostic to the camera model. Current methods are predominantly tailored to specific camera models and/or require extrinsic cues, such as the direction of gravity, to be visible in the image. In contrast, we argue that the perspective and distortion cues inherent in images are sufficient for model-agnostic camera calibration. To demonstrate this, we frame the calibration process as the regression of the rays corresponding to each pixel. We show, for the first time, that this intermediate representation allows for a closed-form recovery of the intrinsics for a wide range of camera models, including but not limited to: pinhole, Brown-Conrady and Kannala-Brandt. Our approach also applies to edited -- cropped and stretched -- images. Experimentally, we demonstrate that AnyCalib consistently outperforms alternative methods, including 3D foundation models, despite being trained on orders of magnitude less data. Code is available at https://github.com/javrtg/AnyCalib.
中文: AnyCalib是一种新颖方法,仅通过单张图像中的透视和畸变线索即可校准相机内参,无需依赖特定相机模型或外部参照,在多种相机模型上均优于现有方法且训练数据需求极少。
English: AnyCalib is a novel method that calibrates camera intrinsic parameters from a single image using inherent perspective and distortion cues, outperforming existing approaches across various camera models with minimal training data.

Authors:Alex Bercik, David A. Craig Penner, David W. Zingg
Title: Stable Volume Dissipation for High-Order Finite-Difference and Spectral-Element Methods with the Summation-by-Parts Property
Abstract:
The construction of stable, conservative, and accurate volume dissipation is extended to discretizations that possess a generalized summation-by-parts (SBP) property within a tensor-product framework. The dissipation operators can be applied to any finite-difference or spectral-element scheme that uses the SBP framework, including high-order entropy-stable schemes. Additionally, we clarify the incorporation of a variable coefficient within the operator structure and analyze the impact of a boundary correction matrix on operator structure and accuracy. Following the theoretical development and construction of novel dissipation operators, we relate the presented volume dissipation to the use of upwind SBP operators. When applied to spectral-element methods, the presented approach yields unique dissipation operators that can also be derived through alternative approaches involving orthogonal polynomials. Numerical examples featuring the linear convection, Burgers, and Euler equations verify the properties of the constructed dissipation operators and assess their performance compared to existing upwind SBP schemes, including linear stability behaviour. When applied to entropy-stable schemes, the presented approach results in accurate and robust methods that can solve a broader range of problems where comparable existing methods fail.
中文摘要:本研究将稳定保守的体积耗散构造推广至广义分部求和框架,开发的新型耗散算子提升了熵稳定格式在多类方程求解中的精度与鲁棒性。
English Summary: This study extends stable and conservative volume dissipation to generalized summation-by-parts discretizations, developing novel operators that enhance accuracy and robustness in entropy-stable schemes for various equations.

Authors:Liangyu Wang, Jie Ren, Hang Xu, Junxiao Wang, Huanyi Xie, David E. Keyes, Di Wang
Title: ZO2: Scalable Zeroth-Order Fine-Tuning for Extremely Large Language Models with Limited GPU Memory
Abstract:
Fine-tuning large pre-trained LLMs generally demands extensive GPU memory. Traditional first-order optimizers like SGD encounter substantial difficulties due to increased memory requirements from storing activations and gradients during both the forward and backward phases as the model size expands. Alternatively, zeroth-order (ZO) techniques can compute gradients using just forward operations, eliminating the need to store activations. Furthermore, by leveraging CPU capabilities, it's feasible to enhance both the memory and processing power available to a single GPU. We propose a novel framework, ZO2 (Zeroth-Order Offloading), for efficient zeroth-order fine-tuning of LLMs with only limited GPU memory. Our framework dynamically shifts model parameters between the CPU and GPU as required, optimizing computation flow and maximizing GPU usage by minimizing downtime. This integration of parameter adjustments with ZO's double forward operations reduces unnecessary data movement, enhancing the fine-tuning efficacy. Additionally, our framework supports an innovative low-bit precision approach in AMP mode to streamline data exchanges between the CPU and GPU. Employing this approach allows us to fine-tune extraordinarily large models, such as the OPT-175B with more than 175 billion parameters, on a mere 18GB GPU--achievements beyond the reach of traditional methods. Moreover, our framework achieves these results with almost no additional time overhead and absolutely no accuracy loss compared to standard zeroth-order methods. ZO2's code has been open-sourced in https://github.com/liangyuwang/zo2.
Chinese: ZO2框架通过动态在CPU和GPU间迁移模型参数,实现了大语言模型的高效零阶微调,仅需18GB显存即可训练OPT-175B等超大规模模型,且不增加时间开销或损失精度。
English: The ZO2 framework enables efficient zeroth-order fine-tuning of large language models by dynamically offloading parameters between CPU and GPU, allowing models like OPT-175B to be trained on just 18GB of GPU memory without time or accuracy penalties.

Authors:Jacob Chmura, Jonah Dauvet, Sebastian Sabry
Title: Plausibility Vaccine: Injecting LLM Knowledge for Event Plausibility
Abstract:
Despite advances in language modelling, distributional methods that build semantic representations from co-occurrences fail to discriminate between plausible and implausible events. In this work, we investigate how plausibility prediction can be improved by injecting latent knowledge prompted from large language models using parameter-efficient fine-tuning. We train 12 task adapters to learn various physical properties and association measures and perform adapter fusion to compose latent semantic knowledge from each task on top of pre-trained AlBERT embeddings. We automate auxiliary task data generation, which enables us to scale our approach and fine-tune our learned representations across two plausibility datasets. Our code is available at https://github.com/Jacob-Chmura/plausibility-vaccine.
中文: 本研究通过参数高效微调将大语言模型的潜在知识注入,在预训练的AlBERT嵌入上使用适配器融合技术,并自动生成辅助任务数据,从而提升了两个合理性数据集中的事件合理性预测能力。
English: This study enhances plausibility prediction by integrating latent knowledge from large language models through parameter-efficient fine-tuning, using adapter fusion on pre-trained AlBERT embeddings and automating auxiliary task data generation across two datasets.

Authors:Imran Kabir, Md Alimoor Reza, Syed Billah
Title: Logic-RAG: Augmenting Large Multimodal Models with Visual-Spatial Knowledge for Road Scene Understanding
Abstract:
Large multimodal models (LMMs) are increasingly integrated into autonomous driving systems for user interaction. However, their limitations in fine-grained spatial reasoning pose challenges for system interpretability and user trust. We introduce Logic-RAG, a novel Retrieval-Augmented Generation (RAG) framework that improves LMMs' spatial understanding in driving scenarios. Logic-RAG constructs a dynamic knowledge base (KB) about object-object relationships in first-order logic (FOL) using a perception module, a query-to-logic embedder, and a logical inference engine. We evaluated Logic-RAG on visual-spatial queries using both synthetic and real-world driving videos. When using popular LMMs (GPT-4V, Claude 3.5) as proxies for an autonomous driving system, these models achieved only 55% accuracy on synthetic driving scenes and under 75% on real-world driving scenes. Augmenting them with Logic-RAG increased their accuracies to over 80% and 90%, respectively. An ablation study showed that even without logical inference, the fact-based context constructed by Logic-RAG alone improved accuracy by 15%. Logic-RAG is extensible: it allows seamless replacement of individual components with improved versions and enables domain experts to compose new knowledge in both FOL and natural language. In sum, Logic-RAG addresses critical spatial reasoning deficiencies in LMMs for autonomous driving applications. Code and data are available at https://github.com/Imran2205/LogicRAG.
Chinese: Logic-RAG是一种新颖的检索增强生成框架,通过一阶逻辑构建动态知识库,显著提升多模态大模型在自动驾驶中的空间推理能力,有效解决了现有模型在细粒度空间理解上的不足。
English: Logic-RAG is a novel Retrieval-Augmented Generation framework that enhances large multimodal models' spatial reasoning in autonomous driving by constructing a dynamic knowledge base with first-order logic, significantly improving accuracy on visual-spatial queries.

Authors:Vrushank Ahire, Kunal Shah, Mudasir Nazir Khan, Nikhil Pakhale, Lownish Rai Sookha, M. A. Ganaie, Abhinav Dhall
Title: MAVEN: Multi-modal Attention for Valence-Arousal Emotion Network
Abstract:
Dynamic emotion recognition in the wild remains challenging due to the transient nature of emotional expressions and temporal misalignment of multi-modal cues. Traditional approaches predict valence and arousal and often overlook the inherent correlation between these two dimensions. The proposed Multi-modal Attention for Valence-Arousal Emotion Network (MAVEN) integrates visual, audio, and textual modalities through a bi-directional cross-modal attention mechanism. MAVEN uses modality-specific encoders to extract features from synchronized video frames, audio segments, and transcripts, predicting emotions in polar coordinates following Russell's circumplex model. The evaluation of the Aff-Wild2 dataset using MAVEN achieved a concordance correlation coefficient (CCC) of 0.3061, surpassing the ResNet-50 baseline model with a CCC of 0.22. The multistage architecture captures the subtle and transient nature of emotional expressions in conversational videos and improves emotion recognition in real-world situations. The code is available at: https://github.com/Vrushank-Ahire/MAVEN_8th_ABAW
Chinese: 提出的MAVEN模型通过跨模态注意力机制整合多模态线索,在Aff-Wild2数据集上实现了优于传统方法的动态情绪识别性能。
English: The proposed MAVEN model enhances dynamic emotion recognition in the wild by integrating multi-modal cues through a cross-modal attention mechanism, achieving superior performance on the Aff-Wild2 dataset compared to traditional methods.

Authors:Yitian Shi, Di Wen, Guanqi Chen, Edgar Welte, Sheng Liu, Kunyu Peng, Rainer Stiefelhagen, Rania Rayyes
Title: VISO-Grasp: Vision-Language Informed Spatial Object-centric 6-DoF Active View Planning and Grasping in Clutter and Invisibility
Abstract:
We propose VISO-Grasp, a novel vision-language-informed system designed to systematically address visibility constraints for grasping in severely occluded environments. By leveraging Foundation Models (FMs) for spatial reasoning and active view planning, our framework constructs and updates an instance-centric representation of spatial relationships, enhancing grasp success under challenging occlusions. Furthermore, this representation facilitates active Next-Best-View (NBV) planning and optimizes sequential grasping strategies when direct grasping is infeasible. Additionally, we introduce a multi-view uncertainty-driven grasp fusion mechanism that refines grasp confidence and directional uncertainty in real-time, ensuring robust and stable grasp execution. Extensive real-world experiments demonstrate that VISO-Grasp achieves a success rate of $87.5\%$ in target-oriented grasping with the fewest grasp attempts outperforming baselines. To the best of our knowledge, VISO-Grasp is the first unified framework integrating FMs into target-aware active view planning and 6-DoF grasping in environments with severe occlusions and entire invisibility constraints. Code is available at: https://github.com/YitianShi/vMF-Contact
VISO-Grasp提出了一种新颖的视觉语言系统,利用基础模型进行空间推理和主动视角规划,在严重遮挡环境下通过实时优化抓取策略实现了87.5%的成功率。
VISO-Grasp introduces a vision-language system using Foundation Models for spatial reasoning and active view planning to achieve robust grasping in highly occluded scenes, achieving 87.5% success rate with optimized strategies.

Authors:Yaoting Wang, Shengqiong Wu, Yuecheng Zhang, Shuicheng Yan, Ziwei Liu, Jiebo Luo, Hao Fei
Title: Multimodal Chain-of-Thought Reasoning: A Comprehensive Survey
Abstract:
By extending the advantage of chain-of-thought (CoT) reasoning in human-like step-by-step processes to multimodal contexts, multimodal CoT (MCoT) reasoning has recently garnered significant research attention, especially in the integration with multimodal large language models (MLLMs). Existing MCoT studies design various methodologies and innovative reasoning paradigms to address the unique challenges of image, video, speech, audio, 3D, and structured data across different modalities, achieving extensive success in applications such as robotics, healthcare, autonomous driving, and multimodal generation. However, MCoT still presents distinct challenges and opportunities that require further focus to ensure consistent thriving in this field, where, unfortunately, an up-to-date review of this domain is lacking. To bridge this gap, we present the first systematic survey of MCoT reasoning, elucidating the relevant foundational concepts and definitions. We offer a comprehensive taxonomy and an in-depth analysis of current methodologies from diverse perspectives across various application scenarios. Furthermore, we provide insights into existing challenges and future research directions, aiming to foster innovation toward multimodal AGI.
中文摘要:多模态思维链(MCoT)推理将分步推理扩展至多种数据类型,在各类应用中取得显著成果,但仍需系统性综述以应对当前挑战并推动多模态通用人工智能的发展。
English Summary: Multimodal chain-of-thought (MCoT) reasoning extends step-by-step reasoning to diverse data types, achieving success in various applications while facing challenges that require systematic review and future innovation toward multimodal AGI.

Authors:Xiaoyu Han, Shengping Zhang, Qinglin Liu, Zonglin Li, Chenyang Wang
Title: Progressive Limb-Aware Virtual Try-On
Abstract:
Existing image-based virtual try-on methods directly transfer specific clothing to a human image without utilizing clothing attributes to refine the transferred clothing geometry and textures, which causes incomplete and blurred clothing appearances. In addition, these methods usually mask the limb textures of the input for the clothing-agnostic person representation, which results in inaccurate predictions for human limb regions (i.e., the exposed arm skin), especially when transforming between long-sleeved and short-sleeved garments. To address these problems, we present a progressive virtual try-on framework, named PL-VTON, which performs pixel-level clothing warping based on multiple attributes of clothing and embeds explicit limb-aware features to generate photo-realistic try-on results. Specifically, we design a Multi-attribute Clothing Warping (MCW) module that adopts a two-stage alignment strategy based on multiple attributes to progressively estimate pixel-level clothing displacements. A Human Parsing Estimator (HPE) is then introduced to semantically divide the person into various regions, which provides structural constraints on the human body and therefore alleviates texture bleeding between clothing and limb regions. Finally, we propose a Limb-aware Texture Fusion (LTF) module to estimate high-quality details in limb regions by fusing textures of the clothing and the human body with the guidance of explicit limb-aware features. Extensive experiments demonstrate that our proposed method outperforms the state-of-the-art virtual try-on methods both qualitatively and quantitatively. The code is available at https://github.com/xyhanHIT/PL-VTON.
中文: 现有虚拟试衣方法因未利用服装属性优化几何纹理且忽略肢体区域预测,导致衣物模糊和肢体不准确,而PL-VTON通过多属性渐进式变形和显式肢体特征融合,实现了更高质量的逼真试衣效果。
English: Current virtual try-on methods often produce incomplete and blurred clothing appearances due to neglecting clothing attributes for refinement and inaccurate limb predictions when switching sleeve lengths, which PL-VTON addresses through progressive pixel-level warping and limb-aware feature integration for superior photo-realistic results.

Authors:Zhiwei He, Zhaopeng Tu, Xing Wang, Xingyu Chen, Zhijie Wang, Jiahao Xu, Tian Liang, Wenxiang Jiao, Zhuosheng Zhang, Rui Wang
Title: RaSA: Rank-Sharing Low-Rank Adaptation
Abstract:
Low-rank adaptation (LoRA) has been prominently employed for parameter-efficient fine-tuning of large language models (LLMs). However, the limited expressive capacity of LoRA, stemming from the low-rank constraint, has been recognized as a bottleneck, particularly in rigorous tasks like code generation and mathematical reasoning. To address this limitation, we introduce Rank-Sharing Low-Rank Adaptation (RaSA), an innovative extension that enhances the expressive capacity of LoRA by leveraging partial rank sharing across layers. By forming a shared rank pool and applying layer-specific weighting, RaSA effectively increases the number of ranks without augmenting parameter overhead. Our theoretically grounded and empirically validated approach demonstrates that RaSA not only maintains the core advantages of LoRA but also significantly boosts performance in challenging code and math tasks. Code, data and scripts are available at: https://github.com/zwhe99/RaSA.
中文: RaSA通过跨层部分秩共享增强了LoRA的表达能力,在不增加参数的情况下有效提升秩数,显著提高了代码生成和数学推理等复杂任务的性能。
English: RaSA enhances LoRA's expressive capacity by implementing partial rank sharing across layers, effectively increasing ranks without additional parameters and significantly improving performance in demanding tasks like code generation and mathematical reasoning.

Authors:Yancheng Wang, Changyu Liu, Yingzhen Yang
Title: Diffusion on Graph: Augmentation of Graph Structure for Node Classification
Abstract:
Graph diffusion models have recently been proposed to synthesize entire graphs, such as molecule graphs. Although existing methods have shown great performance in generating entire graphs for graph-level learning tasks, no graph diffusion models have been developed to generate synthetic graph structures, that is, synthetic nodes and associated edges within a given graph, for node-level learning tasks. Inspired by the research in the computer vision literature using synthetic data for enhanced performance, we propose Diffusion on Graph (DoG), which generates synthetic graph structures to boost the performance of GNNs. The synthetic graph structures generated by DoG are combined with the original graph to form an augmented graph for the training of node-level learning tasks, such as node classification and graph contrastive learning (GCL). To improve the efficiency of the generation process, a Bi-Level Neighbor Map Decoder (BLND) is introduced in DoG. To mitigate the adverse effect of the noise introduced by the synthetic graph structures, a low-rank regularization method is proposed for the training of graph neural networks (GNNs) on the augmented graphs. Extensive experiments on various graph datasets for semi-supervised node classification and graph contrastive learning have been conducted to demonstrate the effectiveness of DoG with low-rank regularization. The code of DoG is available at https://github.com/Statistical-Deep-Learning/DoG.
中文: 提出的图扩散方法(DoG)通过生成合成图结构来增强原始图以支持节点级学习任务,采用双级邻居映射解码器提高效率,并引入低秩正则化减少噪声影响,实验在半监督节点分类和图对比学习中验证了其有效性。
English: The proposed Diffusion on Graph (DoG) method generates synthetic graph structures to augment original graphs for node-level learning tasks, incorporating a Bi-Level Neighbor Map Decoder for efficiency and low-rank regularization to reduce noise impact, with experiments validating its effectiveness in semi-supervised node classification and graph contrastive learning.

Authors:Ruopeng Gao, Yuyao Wang, Chunxu Liu, Limin Wang
Title: History-Aware Transformation of ReID Features for Multiple Object Tracking
Abstract:
The aim of multiple object tracking (MOT) is to detect all objects in a video and bind them into multiple trajectories. Generally, this process is carried out in two steps: detecting objects and associating them across frames based on various cues and metrics. Many studies and applications adopt object appearance, also known as re-identification (ReID) features, for target matching through straightforward similarity calculation. However, we argue that this practice is overly naive and thus overlooks the unique characteristics of MOT tasks. Unlike regular re-identification tasks that strive to distinguish all potential targets in a general representation, multi-object tracking typically immerses itself in differentiating similar targets within the same video sequence. Therefore, we believe that seeking a more suitable feature representation space based on the different sample distributions of each sequence will enhance tracking performance. In this paper, we propose using history-aware transformations on ReID features to achieve more discriminative appearance representations. Specifically, we treat historical trajectory features as conditions and employ a tailored Fisher Linear Discriminant (FLD) to find a spatial projection matrix that maximizes the differentiation between different trajectories. Our extensive experiments reveal that this training-free projection can significantly boost feature-only trackers to achieve competitive, even superior tracking performance compared to state-of-the-art methods while also demonstrating impressive zero-shot transfer capabilities. This demonstrates the effectiveness of our proposal and further encourages future investigation into the importance and customization of ReID models in multiple object tracking. The code will be released at https://github.com/HELLORPG/HATReID-MOT.
中文: 本文提出了一种多目标跟踪中重识别特征的历史感知变换方法,通过定制化的Fisher线性判别分析增强轨迹间的特征区分度,无需训练即可实现具有竞争力的跟踪性能。
English: This paper proposes a history-aware transformation method for re-identification features in multiple object tracking, using a tailored Fisher Linear Discriminant to enhance feature discrimination between trajectories and achieve competitive tracking performance without training.

Authors:Xiao Wang, Qingyi Si, Jianlong Wu, Shiyu Zhu, Li Cao, Liqiang Nie
Title: AdaReTaKe: Adaptive Redundancy Reduction to Perceive Longer for Video-language Understanding
Abstract:
Multimodal Large Language Models (MLLMs) have revolutionized video understanding, yet are still limited by context length when processing long videos. Recent methods compress videos by leveraging visual redundancy uniformly, yielding promising results. Nevertheless, our quantitative analysis shows that redundancy varies significantly across time and model layers, necessitating a more flexible compression strategy. We propose AdaReTaKe, a training-free method that flexibly reduces visual redundancy by allocating compression ratios among time and layers with theoretical guarantees. Integrated into state-of-the-art MLLMs, AdaReTaKe improves processing capacity from 256 to 2048 frames while preserving critical information. Experiments on VideoMME, MLVU, LongVideoBench, and LVBench datasets demonstrate that AdaReTaKe outperforms existing methods by 2.3% and 2.8% for 7B and 72B models, respectively, with even greater improvements of 5.9% and 6.0% on the longest LVBench. Our code is available at https://github.com/SCZwangxiao/video-FlexReduc.git.
Chinese: AdaReTaKe是一种无需训练的方法,通过自适应地减少时间和模型层间的视觉冗余,使多模态大语言模型能够处理多达2048帧的视频,并在基准数据集上以高达6.0%的优势超越现有方法。
English: AdaReTaKe is a training-free method that adaptively reduces visual redundancy across time and model layers, enabling multimodal large language models to process up to 2048 frames while outperforming existing methods by up to 6.0% on benchmark datasets.

Authors:Weiguang Zhao, Rui Zhang, Qiufeng Wang, Guangliang Cheng, Kaizhu Huang
Title: BFANet: Revisiting 3D Semantic Segmentation with Boundary Feature Analysis
Abstract:
3D semantic segmentation plays a fundamental and crucial role to understand 3D scenes. While contemporary state-of-the-art techniques predominantly concentrate on elevating the overall performance of 3D semantic segmentation based on general metrics (e.g. mIoU, mAcc, and oAcc), they unfortunately leave the exploration of challenging regions for segmentation mostly neglected. In this paper, we revisit 3D semantic segmentation through a more granular lens, shedding light on subtle complexities that are typically overshadowed by broader performance metrics. Concretely, we have delineated 3D semantic segmentation errors into four comprehensive categories as well as corresponding evaluation metrics tailored to each. Building upon this categorical framework, we introduce an innovative 3D semantic segmentation network called BFANet that incorporates detailed analysis of semantic boundary features. First, we design the boundary-semantic module to decouple point cloud features into semantic and boundary features, and fuse their query queue to enhance semantic features with attention. Second, we introduce a more concise and accelerated boundary pseudo-label calculation algorithm, which is 3.9 times faster than the state-of-the-art, offering compatibility with data augmentation and enabling efficient computation in training. Extensive experiments on benchmark data indicate the superiority of our BFANet model, confirming the significance of emphasizing the four uniquely designed metrics. Code is available at https://github.com/weiguangzhao/BFANet.
中文: 本文提出BFANet这一新型3D语义分割网络,通过分析语义边界特征并设计四种针对性误差指标,有效解决了以往被忽视的困难区域分割问题,实验证明其优越性能。
English: This paper introduces BFANet, a novel 3D semantic segmentation network that addresses overlooked challenging regions by analyzing semantic boundary features and proposing four new error-specific metrics, demonstrating superior performance through extensive experiments.

Authors:Fanbin Lu, Zhisheng Zhong, Ziqin Wei, Shu Liu, Chi-Wing Fu, Jiaya Jia
Title: STEVE: A Step Verification Pipeline for Computer-use Agent Training
Abstract:
Developing AI agents to autonomously manipulate graphical user interfaces is a long challenging task. Recent advances in data scaling law inspire us to train computer-use agents with a scaled instruction set, yet using behavior cloning to train agents still requires immense high-quality trajectories. To meet the scalability need, we designed STEVE, a step verification pipeline for computer-use agent training. First, we establish a large instruction set for computer-use agents and collect trajectory data with some suboptimal agents. GPT-4o is used to verify the correctness of each step in the trajectories based on the screens before and after the action execution, assigning each step with a binary label. Last, we adopt the Kahneman and Tversky Optimization to optimize the agent from the binary stepwise labels. Extensive experiments manifest that our agent outperforms supervised finetuning by leveraging both positive and negative actions within a trajectory. Also, STEVE enables us to train a 7B vision-language model as a computer-use agent, achieving leading performance in the challenging live desktop environment WinAgentArena with great efficiency at a reduced cost. Code and data: https://github.com/FanbinLu/STEVE.
中文: 为解决图形用户界面操作中AI代理训练的可扩展性难题,STEVE流程采用GPT-4o进行轨迹步骤验证,并通过卡尼曼-特沃斯基优化方法训练代理,在实时桌面环境中实现了领先的性能与成本效益。
English: To address the challenges of training scalable AI agents for graphical user interface manipulation, the STEVE pipeline employs a step verification method using GPT-4o to label trajectory steps and optimizes agents with Kahneman and Tversky Optimization, achieving superior performance and efficiency in live desktop environments.

Authors:Yang Yi, Kunqing Wang, Jinpu Zhang, Zhen Tan, Xiangke Wang, Hui Shen, Dewen Hu
Title: A Plug-and-Play Learning-based IMU Bias Factor for Robust Visual-Inertial Odometry
Abstract:
The bias of low-cost Inertial Measurement Units (IMU) is a critical factor affecting the performance of Visual-Inertial Odometry (VIO). In particular, when visual tracking encounters errors, the optimized bias results may deviate significantly from the true values, adversely impacting the system's stability and localization precision. In this paper, we propose a novel plug-and-play framework featuring the Inertial Prior Network (IPNet), which is designed to accurately estimate IMU bias. Recognizing the substantial impact of initial bias errors in low-cost inertial devices on system performance, our network directly leverages raw IMU data to estimate the mean bias, eliminating the dependency on historical estimates in traditional recursive predictions and effectively preventing error propagation. Furthermore, we introduce an iterative approach to calculate the mean value of the bias for network training, addressing the lack of bias labels in many visual-inertial datasets. The framework is evaluated on two public datasets and one self-collected dataset. Extensive experiments demonstrate that our method significantly enhances both localization precision and robustness, with the ATE-RMSE metric improving on average by 46\%. The source code and video will be available at \textcolor{red}{https://github.com/yiyscut/VIO-IPNet.git}.
中文: 本文提出了一种带有惯性先验网络(IPNet)的即插即用框架,可直接从原始数据估计IMU偏差,将视觉惯性里程计的定位精度平均提升46%,并显著增强系统鲁棒性。
English: This paper introduces a plug-and-play framework with an Inertial Prior Network (IPNet) that directly estimates IMU bias from raw data, enhancing VIO localization precision by 46% on average and improving system robustness.

Authors:Patryk Marszałek, Ulvi Movsum-zada, Oleksii Furman, Kamil Książek, Przemysław Spurek, Marek Śmieja
Title: HyConEx: Hypernetwork classifier with counterfactual explanations
Abstract:
In recent years, there has been a growing interest in explainable AI methods. We want not only to make accurate predictions using sophisticated neural networks but also to understand what the model's decision is based on. One of the fundamental levels of interpretability is to provide counterfactual examples explaining the rationale behind the decision and identifying which features, and to what extent, must be modified to alter the model's outcome. To address these requirements, we introduce HyConEx, a classification model based on deep hypernetworks specifically designed for tabular data. Owing to its unique architecture, HyConEx not only provides class predictions but also delivers local interpretations for individual data samples in the form of counterfactual examples that steer a given sample toward an alternative class. While many explainable methods generated counterfactuals for external models, there have been no interpretable classifiers simultaneously producing counterfactual samples so far. HyConEx achieves competitive performance on several metrics assessing classification accuracy and fulfilling the criteria of a proper counterfactual attack. This makes HyConEx a distinctive deep learning model, which combines predictions and explainers as an all-in-one neural network. The code is available at https://github.com/gmum/HyConEx.
Chinese: HyConEx 是一种基于深度超网络的创新分类模型,专为表格数据设计,不仅能提供精确预测,还能生成反事实样本实现局部可解释性,在分类准确性与可解释性指标上均表现出竞争力。
English: HyConEx is a novel deep hypernetwork-based classifier for tabular data that simultaneously delivers accurate predictions and generates counterfactual examples for local interpretability, achieving competitive performance in both classification and explainability metrics.

Authors:Tianyuan Qu, Longxiang Tang, Bohao Peng, Senqiao Yang, Bei Yu, Jiaya Jia
Title: Does Your Vision-Language Model Get Lost in the Long Video Sampling Dilemma?
Abstract:
The rise of Large Vision-Language Models (LVLMs) has significantly advanced video understanding. However, efficiently processing long videos remains a challenge due to the ``Sampling Dilemma'': low-density sampling risks missing critical information, while high-density sampling introduces redundancy. To address this issue, we introduce LSDBench, the first benchmark designed to evaluate LVLMs on long-video tasks by constructing high Necessary Sampling Density (NSD) questions, where NSD represents the minimum sampling density required to accurately answer a given question. LSDBench focuses on dense, short-duration actions to rigorously assess the sampling strategies employed by LVLMs. To tackle the challenges posed by high-NSD questions, we propose a novel Reasoning-Driven Hierarchical Sampling (RHS) framework, which combines global localization of question-relevant cues with local dense sampling for precise inference. Additionally, we develop a lightweight Semantic-Guided Frame Selector to prioritize informative frames, enabling RHS to achieve comparable or superior performance with significantly fewer sampled frames. Together, our LSDBench and RHS framework address the unique challenges of high-NSD long-video tasks, setting a new standard for evaluating and improving LVLMs in this domain. Our benchmark and evaluation codes has been released at: https://github.com/dvlab-research/LSDBench
中文摘要:LSDBench作为首个针对长视频任务评估大型视觉语言模型的基准,通过构建高必要采样密度问题来应对采样困境,同时提出的推理驱动分层采样框架通过结合全局定位与局部密集采样,实现了对长视频的高效处理。
English Summary: LSDBench is introduced as the first benchmark to evaluate Large Vision-Language Models on long-video tasks by creating high Necessary Sampling Density questions, while the Reasoning-Driven Hierarchical Sampling framework is proposed to efficiently process these videos by combining global localization with local dense sampling.

Authors:Ziran Qin, Yuchen Cao, Mingbao Lin, Wen Hu, Shixuan Fan, Ke Cheng, Weiyao Lin, Jianguo Li
Title: CAKE: Cascading and Adaptive KV Cache Eviction with Layer Preferences
Abstract:
Large language models (LLMs) excel at processing long sequences, boosting demand for key-value (KV) caching. While recent efforts to evict KV cache have alleviated the inference burden, they often fail to allocate resources rationally across layers with different attention patterns. In this paper, we introduce Cascading and Adaptive KV cache Eviction (CAKE), a novel approach that frames KV cache eviction as a "cake-slicing problem." CAKE assesses layer-specific preferences by considering attention dynamics in both spatial and temporal dimensions, allocates rational cache size for layers accordingly, and manages memory constraints in a cascading manner. This approach enables a global view of cache allocation, adaptively distributing resources across diverse attention mechanisms while maintaining memory budgets. CAKE also employs a new eviction indicator that considers the shifting importance of tokens over time, addressing limitations in existing methods that overlook temporal dynamics. Comprehensive experiments on LongBench and NeedleBench show that CAKE maintains model performance with only 3.2% of the KV cache and consistently outperforms current baselines across various models and memory constraints, particularly in low-memory settings. Additionally, CAKE achieves over 10x speedup in decoding latency compared to full cache when processing contexts of 128K tokens with FlashAttention-2. Our code is available at https://github.com/antgroup/cakekv.
中文摘要:本文提出CAKE方法,通过空间和时间维度的注意力动态分析,以层间级联方式自适应分配KV缓存资源,在极低缓存条件下保持模型性能,并实现解码延迟的数量级提升。
English Summary: The paper introduces CAKE, an adaptive KV cache eviction method that optimizes memory allocation across layers by considering spatial and temporal attention dynamics, achieving high performance with minimal cache while significantly reducing decoding latency.

Authors:Wenbo Dai, Lijing Lu, Zhihang Li
Title: Diffusion-based Synthetic Data Generation for Visible-Infrared Person Re-Identification
Abstract:
The performance of models is intricately linked to the abundance of training data. In Visible-Infrared person Re-IDentification (VI-ReID) tasks, collecting and annotating large-scale images of each individual under various cameras and modalities is tedious, time-expensive, costly and must comply with data protection laws, posing a severe challenge in meeting dataset requirements. Current research investigates the generation of synthetic data as an efficient and privacy-ensuring alternative to collecting real data in the field. However, a specific data synthesis technique tailored for VI-ReID models has yet to be explored. In this paper, we present a novel data generation framework, dubbed Diffusion-based VI-ReID data Expansion (DiVE), that automatically obtain massive RGB-IR paired images with identity preserving by decoupling identity and modality to improve the performance of VI-ReID models. Specifically, identity representation is acquired from a set of samples sharing the same ID, whereas the modality of images is learned by fine-tuning the Stable Diffusion (SD) on modality-specific data. DiVE extend the text-driven image synthesis to identity-preserving RGB-IR multimodal image synthesis. This approach significantly reduces data collection and annotation costs by directly incorporating synthetic data into ReID model training. Experiments have demonstrated that VI-ReID models trained on synthetic data produced by DiVE consistently exhibit notable enhancements. In particular, the state-of-the-art method, CAJ, trained with synthetic images, achieves an improvement of about $9\%$ in mAP over the baseline on the LLCM dataset. Code: https://github.com/BorgDiven/DiVE
中文: DiVE框架通过解耦身份与模态生成合成的RGB-IR配对图像,在降低数据收集成本的同时显著提升了可见光-红外行人重识别模型的性能。
English: The DiVE framework generates synthetic RGB-IR paired images by decoupling identity and modality, significantly enhancing VI-ReID model performance while reducing data collection costs.

Authors:Han Mei, Kunqian Li, Shuaixin Liu, Chengzhi Ma, Qianli Jiang
Title: DPF-Net: Physical Imaging Model Embedded Data-Driven Underwater Image Enhancement
Abstract:
Due to the complex interplay of light absorption and scattering in the underwater environment, underwater images experience significant degradation. This research presents a two-stage underwater image enhancement network called the Data-Driven and Physical Parameters Fusion Network (DPF-Net), which harnesses the robustness of physical imaging models alongside the generality and efficiency of data-driven methods. We first train a physical parameter estimate module using synthetic datasets to guarantee the trustworthiness of the physical parameters, rather than solely learning the fitting relationship between raw and reference images by the application of the imaging equation, as is common in prior studies. This module is subsequently trained in conjunction with an enhancement network, where the estimated physical parameters are integrated into a data-driven model within the embedding space. To maintain the uniformity of the restoration process amid underwater imaging degradation, we propose a physics-based degradation consistency loss. Additionally, we suggest an innovative weak reference loss term utilizing the entire dataset, which alleviates our model's reliance on the quality of individual reference images. Our proposed DPF-Net demonstrates superior performance compared to other benchmark methods across multiple test sets, achieving state-of-the-art results. The source code and pre-trained models are available on the project home page: https://github.com/OUCVisionGroup/DPF-Net.
Chinese: 本研究提出DPF-Net双阶段水下图像增强网络,通过融合物理成像模型与数据驱动方法,采用基于物理的退化一致性损失和数据集级弱参考损失,实现了最优性能表现。
English: This study introduces DPF-Net, a two-stage underwater image enhancement network that integrates physical imaging models with data-driven methods, achieving state-of-the-art performance by employing physics-based consistency loss and dataset-wide weak reference loss.

Authors:Jiahang Cao, Qiang Zhang, Hanzhong Guo, Jiaxu Wang, Hao Cheng, Renjing Xu
Title: Modality-Composable Diffusion Policy via Inference-Time Distribution-level Composition
Abstract:
Diffusion Policy (DP) has attracted significant attention as an effective method for policy representation due to its capacity to model multi-distribution dynamics. However, current DPs are often based on a single visual modality (e.g., RGB or point cloud), limiting their accuracy and generalization potential. Although training a generalized DP capable of handling heterogeneous multimodal data would enhance performance, it entails substantial computational and data-related costs. To address these challenges, we propose a novel policy composition method: by leveraging multiple pre-trained DPs based on individual visual modalities, we can combine their distributional scores to form a more expressive Modality-Composable Diffusion Policy (MCDP), without the need for additional training. Through extensive empirical experiments on the RoboTwin dataset, we demonstrate the potential of MCDP to improve both adaptability and performance. This exploration aims to provide valuable insights into the flexible composition of existing DPs, facilitating the development of generalizable cross-modality, cross-domain, and even cross-embodiment policies. Our code is open-sourced at https://github.com/AndyCao1125/MCDP.
Chinese: 提出的模态可组合扩散策略(MCDP)无需重新训练即可融合多个预训练的视觉模态扩散策略,在RoboTwin数据集上验证了其提升异构数据适应性与性能的有效性。
English: The proposed Modality-Composable Diffusion Policy (MCDP) combines pre-trained diffusion policies from individual visual modalities without retraining, enhancing adaptability and performance across heterogeneous data as validated on the RoboTwin dataset.

Authors:Alessio Xompero, Andrea Cavallaro
Title: Learning Privacy from Visual Entities
Abstract:
Subjective interpretation and content diversity make predicting whether an image is private or public a challenging task. Graph neural networks combined with convolutional neural networks (CNNs), which consist of 14,000 to 500 millions parameters, generate features for visual entities (e.g., scene and object types) and identify the entities that contribute to the decision. In this paper, we show that using a simpler combination of transfer learning and a CNN to relate privacy with scene types optimises only 732 parameters while achieving comparable performance to that of graph-based methods. On the contrary, end-to-end training of graph-based methods can mask the contribution of individual components to the classification performance. Furthermore, we show that a high-dimensional feature vector, extracted with CNNs for each visual entity, is unnecessary and complexifies the model. The graph component has also negligible impact on performance, which is driven by fine-tuning the CNN to optimise image features for privacy nodes.
中文: 研究表明,仅用732个参数的迁移学习与CNN结合方法即可实现与复杂图神经网络相当的隐私分类性能,同时揭示高维特征和图结构组件对优化结果并非必要。
English: This study demonstrates that a simpler approach using transfer learning and CNNs with only 732 parameters achieves privacy classification performance comparable to complex graph-based methods, while revealing that high-dimensional features and graph components are unnecessary for optimal results.

Authors:Fanhu Zeng, Hao Tang, Yihua Shao, Siyu Chen, Ling Shao, Yan Wang
Title: MambaIC: State Space Models for High-Performance Learned Image Compression
Abstract:
A high-performance image compression algorithm is crucial for real-time information transmission across numerous fields. Despite rapid progress in image compression, computational inefficiency and poor redundancy modeling still pose significant bottlenecks, limiting practical applications. Inspired by the effectiveness of state space models (SSMs) in capturing long-range dependencies, we leverage SSMs to address computational inefficiency in existing methods and improve image compression from multiple perspectives. In this paper, we integrate the advantages of SSMs for better efficiency-performance trade-off and propose an enhanced image compression approach through refined context modeling, which we term MambaIC. Specifically, we explore context modeling to adaptively refine the representation of hidden states. Additionally, we introduce window-based local attention into channel-spatial entropy modeling to reduce potential spatial redundancy during compression, thereby increasing efficiency. Comprehensive qualitative and quantitative results validate the effectiveness and efficiency of our approach, particularly for high-resolution image compression. Code is released at https://github.com/AuroraZengfh/MambaIC.
中文摘要:本文提出MambaIC这一增强型图像压缩方法,通过状态空间模型优化上下文建模并引入局部注意力机制,在提升计算效率的同时有效减少空间冗余。
English Summary: The paper introduces MambaIC, an enhanced image compression method that leverages state space models to improve computational efficiency and reduce redundancy through refined context modeling and window-based local attention.

Authors:Feihong Yan, Qingyan Wei, Jiayi Tang, Jiajun Li, Yulin Wang, Xuming Hu, Huiqi Li, Linfeng Zhang
Title: LazyMAR: Accelerating Masked Autoregressive Models via Feature Caching
Abstract:
Masked Autoregressive (MAR) models have emerged as a promising approach in image generation, expected to surpass traditional autoregressive models in computational efficiency by leveraging the capability of parallel decoding. However, their dependence on bidirectional self-attention inherently conflicts with conventional KV caching mechanisms, creating unexpected computational bottlenecks that undermine their expected efficiency. To address this problem, this paper studies the caching mechanism for MAR by leveraging two types of redundancy: Token Redundancy indicates that a large portion of tokens have very similar representations in the adjacent decoding steps, which allows us to first cache them in previous steps and then reuse them in the later steps. Condition Redundancy indicates that the difference between conditional and unconditional output in classifier-free guidance exhibits very similar values in adjacent steps. Based on these two redundancies, we propose LazyMAR, which introduces two caching mechanisms to handle them one by one. LazyMAR is training-free and plug-and-play for all MAR models. Experimental results demonstrate that our method achieves 2.83 times acceleration with almost no drop in generation quality. Our codes will be released in https://github.com/feihongyan1/LazyMAR.
中文: 掩码自回归(MAR)模型因双向自注意力与KV缓存机制冲突而产生计算瓶颈,本文提出LazyMAR方法,利用标记冗余和条件冗余实现免训练的即插即用优化,在保持生成质量的同时将速度提升2.83倍。
English: Masked Autoregressive (MAR) models face computational bottlenecks due to their reliance on bidirectional self-attention conflicting with KV caching, prompting the development of LazyMAR, a training-free solution that leverages token and condition redundancies to achieve a 2.83x speedup without compromising generation quality.

Authors:Tsz Chung Cheng, Chung Shing Cheng, Chaak Ming Lau, Eugene Tin-Ho Lam, Chun Yat Wong, Hoi On Yu, Cheuk Hei Chong
Title: HKCanto-Eval: A Benchmark for Evaluating Cantonese Language Understanding and Cultural Comprehension in LLMs
Abstract:
The ability of language models to comprehend and interact in diverse linguistic and cultural landscapes is crucial. The Cantonese language used in Hong Kong presents unique challenges for natural language processing due to its rich cultural nuances and lack of dedicated evaluation datasets. The HKCanto-Eval benchmark addresses this gap by evaluating the performance of large language models (LLMs) on Cantonese language understanding tasks, extending to English and Written Chinese for cross-lingual evaluation. HKCanto-Eval integrates cultural and linguistic nuances intrinsic to Hong Kong, providing a robust framework for assessing language models in realistic scenarios. Additionally, the benchmark includes questions designed to tap into the underlying linguistic metaknowledge of the models. Our findings indicate that while proprietary models generally outperform open-weight models, significant limitations remain in handling Cantonese-specific linguistic and cultural knowledge, highlighting the need for more targeted training data and evaluation methods. The code can be accessed at https://github.com/hon9kon9ize/hkeval2025
中文摘要:HKCanto-Eval基准通过融入香港文化特色评估大语言模型的粤语理解能力,研究发现专有模型虽优于开源模型,但在处理粤语特定知识方面仍存在明显不足。
English Summary: The HKCanto-Eval benchmark evaluates large language models' performance on Cantonese language understanding with Hong Kong cultural nuances, revealing that proprietary models outperform open-weight ones but still struggle with Cantonese-specific knowledge.

Authors:Shangheng Du, Jiabao Zhao, Jinxin Shi, Zhentao Xie, Xin Jiang, Yanhong Bai, Liang He
Title: A Survey on the Optimization of Large Language Model-based Agents
Abstract:
With the rapid development of Large Language Models (LLMs), LLM-based agents have been widely adopted in various fields, becoming essential for autonomous decision-making and interactive tasks. However, current work typically relies on prompt design or fine-tuning strategies applied to vanilla LLMs, which often leads to limited effectiveness or suboptimal performance in complex agent-related environments. Although LLM optimization techniques can improve model performance across many general tasks, they lack specialized optimization towards critical agent functionalities such as long-term planning, dynamic environmental interaction, and complex decision-making. Although numerous recent studies have explored various strategies to optimize LLM-based agents for complex agent tasks, a systematic review summarizing and comparing these methods from a holistic perspective is still lacking. In this survey, we provide a comprehensive review of LLM-based agent optimization approaches, categorizing them into parameter-driven and parameter-free methods. We first focus on parameter-driven optimization, covering fine-tuning-based optimization, reinforcement learning-based optimization, and hybrid strategies, analyzing key aspects such as trajectory data construction, fine-tuning techniques, reward function design, and optimization algorithms. Additionally, we briefly discuss parameter-free strategies that optimize agent behavior through prompt engineering and external knowledge retrieval. Finally, we summarize the datasets and benchmarks used for evaluation and tuning, review key applications of LLM-based agents, and discuss major challenges and promising future directions. Our repository for related references is available at https://github.com/YoungDubbyDu/LLM-Agent-Optimization.
中文: 本综述系统梳理了基于大语言模型的智能体优化方法,将其分为参数驱动与无参数策略,并分析了关键技术、评估基准、应用场景及未来挑战。
English: This survey comprehensively reviews optimization methods for LLM-based agents, categorizing them into parameter-driven and parameter-free approaches while analyzing key techniques, evaluation benchmarks, applications, and future challenges.

Authors:Luming Wang, Hao Shi, Xiaoting Yin, Kailun Yang, Kaiwei Wang, Jian Bai
Title: EgoEvGesture: Gesture Recognition Based on Egocentric Event Camera
Abstract:
Egocentric gesture recognition is a pivotal technology for enhancing natural human-computer interaction, yet traditional RGB-based solutions suffer from motion blur and illumination variations in dynamic scenarios. While event cameras show distinct advantages in handling high dynamic range with ultra-low power consumption, existing RGB-based architectures face inherent limitations in processing asynchronous event streams due to their synchronous frame-based nature. Moreover, from an egocentric perspective, event cameras record data that includes events generated by both head movements and hand gestures, thereby increasing the complexity of gesture recognition. To address this, we propose a novel network architecture specifically designed for event data processing, incorporating (1) a lightweight CNN with asymmetric depthwise convolutions to reduce parameters while preserving spatiotemporal features, (2) a plug-and-play state-space model as context block that decouples head movement noise from gesture dynamics, and (3) a parameter-free Bins-Temporal Shift Module (BTSM) that shifts features along bins and temporal dimensions to fuse sparse events efficiently. We further establish the EgoEvGesture dataset, the first large-scale dataset for egocentric gesture recognition using event cameras. Experimental results demonstrate that our method achieves 62.7% accuracy tested on unseen subjects with only 7M parameters, 3.1% higher than state-of-the-art approaches. Notable misclassifications in freestyle motions stem from high inter-personal variability and unseen test patterns differing from training data. Moreover, our approach achieved a remarkable accuracy of 97.0% on the DVS128 Gesture, demonstrating the effectiveness and generalization capability of our method on public datasets. The dataset and models are made available at https://github.com/3190105222/EgoEv_Gesture.
中文: 本文提出了一种专为事件相机设计的轻量级网络架构,通过非对称卷积和状态空间模型分离头部运动噪声,在自定义和公共数据集上以较少参数实现了更高的手势识别准确率。
English: This paper introduces a lightweight network architecture for egocentric gesture recognition using event cameras, featuring asymmetric convolutions and a state-space model to decouple head movement noise, achieving superior accuracy with minimal parameters on both custom and public datasets.

Authors:Shuo Gao, Jingyang Zhang, Jun Xue, Meng Yang, Yang Chen, Guangquan Zhou
Title: A Causality-Inspired Model for Intima-Media Thickening Assessment in Ultrasound Videos
Abstract:
Carotid atherosclerosis represents a significant health risk, with its early diagnosis primarily dependent on ultrasound-based assessments of carotid intima-media thickening. However, during carotid ultrasound screening, significant view variations cause style shifts, impairing content cues related to thickening, such as lumen anatomy, which introduces spurious correlations that hinder assessment. Therefore, we propose a novel causal-inspired method for assessing carotid intima-media thickening in frame-wise ultrasound videos, which focuses on two aspects: eliminating spurious correlations caused by style and enhancing causal content correlations. Specifically, we introduce a novel Spurious Correlation Elimination (SCE) module to remove non-causal style effects by enforcing prediction invariance with style perturbations. Simultaneously, we propose a Causal Equivalence Consolidation (CEC) module to strengthen causal content correlation through adversarial optimization during content randomization. Simultaneously, we design a Causal Transition Augmentation (CTA) module to ensure smooth causal flow by integrating an auxiliary pathway with text prompts and connecting it through contrastive learning. The experimental results on our in-house carotid ultrasound video dataset achieved an accuracy of 86.93\%, demonstrating the superior performance of the proposed method. Code is available at \href{https://github.com/xielaobanyy/causal-imt}{https://github.com/xielaobanyy/causal-imt}.
中文: 本研究提出了一种新颖的因果启发方法,通过消除超声图像风格变化导致的伪相关并增强因果内容关联来评估颈动脉内膜中层厚度,在内部数据集上达到了86.93%的准确率。
English: This study introduces a novel causal-inspired method for assessing carotid intima-media thickening in ultrasound videos, which eliminates spurious correlations from style variations and enhances causal content cues through specialized modules, achieving 86.93% accuracy on an in-house dataset.

Authors:Jiangdong Cai, Yan Chen, Zhenrong Shen, Haotian Jiang, Honglin Xiong, Kai Xuan, Lichi Zhang, Qian Wang
Title: Pathology Image Restoration via Mixture of Prompts
Abstract:
In digital pathology, acquiring all-in-focus images is essential to high-quality imaging and high-efficient clinical workflow. Traditional scanners achieve this by scanning at multiple focal planes of varying depths and then merging them, which is relatively slow and often struggles with complex tissue defocus. Recent prevailing image restoration technique provides a means to restore high-quality pathology images from scans of single focal planes. However, existing image restoration methods are inadequate, due to intricate defocus patterns in pathology images and their domain-specific semantic complexities. In this work, we devise a two-stage restoration solution cascading a transformer and a diffusion model, to benefit from their powers in preserving image fidelity and perceptual quality, respectively. We particularly propose a novel mixture of prompts for the two-stage solution. Given initial prompt that models defocus in microscopic imaging, we design two prompts that describe the high-level image semantics from pathology foundation model and the fine-grained tissue structures via edge extraction. We demonstrate that, by feeding the prompt mixture to our method, we can restore high-quality pathology images from single-focal-plane scans, implying high potentials of the mixture of prompts to clinical usage. Code will be publicly available at https://github.com/caijd2000/MoP.
中文摘要:本研究提出了一种结合Transformer和扩散模型的两阶段方法,通过创新的提示混合技术,从单焦平面扫描中恢复高质量病理图像,有效解决散焦和语义复杂性,具有临床应用潜力。
English Summary: This study introduces a two-stage method combining a transformer and a diffusion model, enhanced by a novel mixture of prompts, to restore high-quality pathology images from single-focal-plane scans, addressing defocus and semantic complexities for clinical applications.

Authors:Heng Zhang, Guoxiang Zhao, Xiaoqiang Ren
Title: TERL: Large-Scale Multi-Target Encirclement Using Transformer-Enhanced Reinforcement Learning
Abstract:
Pursuit-evasion (PE) problem is a critical challenge in multi-robot systems (MRS). While reinforcement learning (RL) has shown its promise in addressing PE tasks, research has primarily focused on single-target pursuit, with limited exploration of multi-target encirclement, particularly in large-scale settings. This paper proposes a Transformer-Enhanced Reinforcement Learning (TERL) framework for large-scale multi-target encirclement. By integrating a transformer-based policy network with target selection, TERL enables robots to adaptively prioritize targets and safely coordinate robots. Results show that TERL outperforms existing RL-based methods in terms of encirclement success rate and task completion time, while maintaining good performance in large-scale scenarios. Notably, TERL, trained on small-scale scenarios (15 pursuers, 4 targets), generalizes effectively to large-scale settings (80 pursuers, 20 targets) without retraining, achieving a 100% success rate. The code and demonstration video are available at https://github.com/ApricityZ/TERL.
中文: 本文提出了一种基于Transformer增强的强化学习框架(TERL),使多机器人系统能够在大规模场景中有效围捕多个目标,展现出更高的成功率和无需重新训练的良好泛化能力。
English: This paper introduces a Transformer-Enhanced Reinforcement Learning (TERL) framework that enables multi-robot systems to efficiently encircle multiple targets in large-scale scenarios, demonstrating superior success rates and generalization without retraining.

Authors:Yutao Hu, Sen Li, Jincheng Yan, Wenqi Shao, Xiaoyan Luo
Title: Car-1000: A New Large Scale Fine-Grained Visual Categorization Dataset
Abstract:
Fine-grained visual categorization (FGVC) is a challenging but significant task in computer vision, which aims to recognize different sub-categories of birds, cars, airplanes, etc. Among them, recognizing models of different cars has significant application value in autonomous driving, traffic surveillance and scene understanding, which has received considerable attention in the past few years. However, Stanford-Car, the most widely used fine-grained dataset for car recognition, only has 196 different categories and only includes vehicle models produced earlier than 2013. Due to the rapid advancements in the automotive industry during recent years, the appearances of various car models have become increasingly intricate and sophisticated. Consequently, the previous Stanford-Car dataset fails to capture this evolving landscape and cannot satisfy the requirements of automotive industry. To address these challenges, in our paper, we introduce Car-1000, a large-scale dataset designed specifically for fine-grained visual categorization of diverse car models. Car-1000 encompasses vehicles from 165 different automakers, spanning a wide range of 1000 distinct car models. Additionally, we have reproduced several state-of-the-art FGVC methods on the Car-1000 dataset, establishing a new benchmark for research in this field. We hope that our work will offer a fresh perspective for future FGVC researchers. Our dataset is available at https://github.com/toggle1995/Car-1000.
Chinese: 为解决斯坦福汽车数据集过时的问题,作者推出了包含1000种车型的大规模数据集Car-1000,为细粒度视觉分类研究建立了新基准。
English: The authors introduce Car-1000, a large-scale dataset with 1000 car models to address the limitations of the outdated Stanford-Car dataset and advance fine-grained visual categorization research, establishing new benchmarks for the field.

Authors:Kang You, Tong Chen, Dandan Ding, M. Salman Asif, Zhan Ma
Title: RENO: Real-Time Neural Compression for 3D LiDAR Point Clouds
Abstract:
Despite the substantial advancements demonstrated by learning-based neural models in the LiDAR Point Cloud Compression (LPCC) task, realizing real-time compression - an indispensable criterion for numerous industrial applications - remains a formidable challenge. This paper proposes RENO, the first real-time neural codec for 3D LiDAR point clouds, achieving superior performance with a lightweight model. RENO skips the octree construction and directly builds upon the multiscale sparse tensor representation. Instead of the multi-stage inferring, RENO devises sparse occupancy codes, which exploit cross-scale correlation and derive voxels' occupancy in a one-shot manner, greatly saving processing time. Experimental results demonstrate that the proposed RENO achieves real-time coding speed, 10 fps at 14-bit depth on a desktop platform (e.g., one RTX 3090 GPU) for both encoding and decoding processes, while providing 12.25% and 48.34% bit-rate savings compared to G-PCCv23 and Draco, respectively, at a similar quality. RENO model size is merely 1MB, making it attractive for practical applications. The source code is available at https://github.com/NJUVISION/RENO.
中文: RENO是首个实时3D激光雷达点云神经编解码器,通过1MB轻量模型在实现10 fps实时处理的同时,显著降低了比特率。
English: RENO is the first real-time neural codec for 3D LiDAR point clouds, achieving significant bit-rate savings with a lightweight 1MB model while enabling real-time processing at 10 fps.

Authors:Syed Rifat Raiyan, Md. Hasanul Kabir
Title: SCReedSolo: A Secure and Robust LSB Image Steganography Framework with Randomized Symmetric Encryption and Reed-Solomon Coding
Abstract:
Image steganography is an information-hiding technique that involves the surreptitious concealment of covert informational content within digital images. In this paper, we introduce ${\rm SCR{\small EED}S{\small OLO}}$, a novel framework for concealing arbitrary binary data within images. Our approach synergistically leverages Random Shuffling, Fernet Symmetric Encryption, and Reed-Solomon Error Correction Codes to encode the secret payload, which is then discretely embedded into the carrier image using LSB (Least Significant Bit) Steganography. The combination of these methods addresses the vulnerability vectors of both security and resilience against bit-level corruption in the resultant stego-images. We show that our framework achieves a data payload of 3 bits per pixel for an RGB image, and mathematically assess the probability of successful transmission for the amalgamated $n$ message bits and $k$ error correction bits. Additionally, we find that ${\rm SCR{\small EED}S{\small OLO}}$ yields good results upon being evaluated with multiple performance metrics, successfully eludes detection by various passive steganalysis tools, and is immune to simple active steganalysis attacks. Our code and data are available at https://github.com/Starscream-11813/SCReedSolo-Steganography.
中文: 本文提出SCReedSolo这一新型图像隐写框架,通过结合加密和纠错技术,利用LSB隐写安全地嵌入二进制数据,实现了高有效载荷并有效抵御检测和攻击。
English: This paper presents SCReedSolo, a novel image steganography framework that combines encryption and error correction to securely embed binary data in images using LSB steganography, achieving high payload capacity and resistance to detection and attacks.

Authors:Kumar Krishna Agrawal, Long Lian, Longchao Liu, Natalia Harguindeguy, Boyi Li, Alexander Bick, Maggie Chung, Trevor Darrell, Adam Yala
Title: Atlas: Multi-Scale Attention Improves Long Context Image Modeling
Abstract:
Efficiently modeling massive images is a long-standing challenge in machine learning. To this end, we introduce Multi-Scale Attention (MSA). MSA relies on two key ideas, (i) multi-scale representations (ii) bi-directional cross-scale communication. MSA creates O(log N) scales to represent the image across progressively coarser features and leverages cross-attention to propagate information across scales. We then introduce Atlas, a novel neural network architecture based on MSA. We demonstrate that Atlas significantly improves the compute-performance tradeoff of long-context image modeling in a high-resolution variant of ImageNet 100. At 1024px resolution, Atlas-B achieves 91.04% accuracy, comparable to ConvNext-B (91.92%) while being 4.3x faster. Atlas is 2.95x faster and 7.38% better than FasterViT, 2.25x faster and 4.96% better than LongViT. In comparisons against MambaVision-S, we find Atlas-S achieves 5%, 16% and 32% higher accuracy at 1024px, 2048px and 4096px respectively, while obtaining similar runtimes. Code for reproducing our experiments and pretrained models is available at https://github.com/yalalab/atlas.
中文: 摘要介绍了多尺度注意力机制及其Atlas架构,该架构在高分辨率图像建模中显著提升了计算效率与准确性,性能超越了ConvNext-B和FasterViT等现有模型。
English: The abstract introduces Multi-Scale Attention (MSA) and the Atlas architecture, which significantly enhances computational efficiency and accuracy in high-resolution image modeling, outperforming existing models like ConvNext-B and FasterViT.

Authors:Wenqing Kuang, Xiongwei Zhao, Yehui Shen, Congcong Wen, Huimin Lu, Zongtan Zhou, Xieyuanli Chen
Title: ResLPR: A LiDAR Data Restoration Network and Benchmark for Robust Place Recognition Against Weather Corruptions
Abstract:
LiDAR-based place recognition (LPR) is a key component for autonomous driving, and its resilience to environmental corruption is critical for safety in high-stakes applications. While state-of-the-art (SOTA) LPR methods perform well in clean weather, they still struggle with weather-induced corruption commonly encountered in driving scenarios. To tackle this, we propose ResLPRNet, a novel LiDAR data restoration network that largely enhances LPR performance under adverse weather by restoring corrupted LiDAR scans using a wavelet transform-based network. ResLPRNet is efficient, lightweight and can be integrated plug-and-play with pretrained LPR models without substantial additional computational cost. Given the lack of LPR datasets under adverse weather, we introduce ResLPR, a novel benchmark that examines SOTA LPR methods under a wide range of LiDAR distortions induced by severe snow, fog, and rain conditions. Experiments on our proposed WeatherKITTI and WeatherNCLT datasets demonstrate the resilience and notable gains achieved by using our restoration method with multiple LPR approaches in challenging weather scenarios. Our code and benchmark are publicly available here: https://github.com/nubot-nudt/ResLPR.
中文: ResLPRNet是一种基于小波变换的轻量级激光雷达修复网络,可显著提升恶劣天气下的地点识别性能,并通过新基准在多个数据集上验证了其显著效果。
English: ResLPRNet is a lightweight LiDAR restoration network using wavelet transforms to enhance place recognition performance under adverse weather, with a new benchmark showing significant gains across multiple datasets.

Authors:Bowen Tan, Zheng Xu, Eric Xing, Zhiting Hu, Shanshan Wu
Title: Synthesizing Privacy-Preserving Text Data via Finetuning without Finetuning Billion-Scale LLMs
Abstract:
Synthetic data offers a promising path to train models while preserving data privacy. Differentially private (DP) finetuning of large language models (LLMs) as data generator is effective, but is impractical when computation resources are limited. Meanwhile, prompt-based methods such as private evolution depend heavily on the manual prompts, and ineffectively use private information in their iterative data selection process. To overcome these limitations, we propose CTCL (Data Synthesis with ConTrollability and CLustering), a novel framework for generating privacy-preserving synthetic data without extensive prompt engineering or billion-scale LLM finetuning. CTCL pretrains a lightweight 140M conditional generator and a clustering-based topic model on large-scale public data. To further adapt to the private domain, the generator is DP finetuned on private data for fine-grained textual information, while the topic model extracts a DP histogram representing distributional information. The DP generator then samples according to the DP histogram to synthesize a desired number of data examples. Evaluation across five diverse domains demonstrates the effectiveness of our framework, particularly in the strong privacy regime. Systematic ablation validates the design of each framework component and highlights the scalability of our approach.
中文:提出的CTCL框架通过结合差分隐私微调的轻量生成器和基于聚类的主题模型,有效生成隐私保护合成数据,无需大量计算或人工提示即可克服现有方法的局限。
English: The proposed CTCL framework generates privacy-preserving synthetic data by combining a differentially private fine-tuned lightweight generator with a clustering-based topic model, effectively overcoming limitations of existing methods without requiring extensive computation or manual prompts.

Authors:Xin Wang, Samiul Alam, Zhongwei Wan, Hui Shen, Mi Zhang
Title: SVD-LLM V2: Optimizing Singular Value Truncation for Large Language Model Compression
Abstract:
Despite significant advancements, the practical deployment of Large Language Models (LLMs) is often hampered by their immense sizes, highlighting the need for effective compression techniques. Singular Value Decomposition (SVD) is a promising LLM compression technique. However, existing SVD-based compression methods fall short in reducing truncation losses, leading to less competitive performance in compressed models. In this work, we introduce SVD-LLM V2, a SVD-based LLM compression method that optimizes singular value truncation in SVD compression with two techniques. First, SVD-LLM V2 proposes to use theoretical truncation loss of weight matrices to assign a unique compression ratio to each weight matrix at different layers to accommodate weight redundancy heterogeneity. Second, SVD-LLM V2 proposes loss-optimized weight truncation to ensure that the truncated singular values result in a lower and more stable truncation loss in practice. We evaluate SVD-LLM V2 on ten datasets and five LLMs at various scales. Our results show SVD-LLM V2 outperforms state-of-the-art SVD-based LLM compression methods. Our code is available at https://github.com/AIoT-MLSys-Lab/SVD-LLM
中文: SVD-LLM V2 通过采用分层压缩比和损失优化截断技术改进SVD压缩,在多个模型与数据集上超越了现有压缩方法。
English: SVD-LLM V2 enhances LLM compression by optimizing singular value truncation with layer-specific ratios and loss-minimizing techniques, outperforming existing methods across multiple models and datasets.

Authors:Zhe Wang, Yanjun Qi
Title: Augmented Adversarial Trigger Learning
Abstract:
Gradient optimization-based adversarial attack methods automate the learning of adversarial triggers to generate jailbreak prompts or leak system prompts. In this work, we take a closer look at the optimization objective of adversarial trigger learning and propose ATLA: Adversarial Trigger Learning with Augmented objectives. ATLA improves the negative log-likelihood loss used by previous studies into a weighted loss formulation that encourages the learned adversarial triggers to optimize more towards response format tokens. This enables ATLA to learn an adversarial trigger from just one query-response pair and the learned trigger generalizes well to other similar queries. We further design a variation to augment trigger optimization with an auxiliary loss that suppresses evasive responses. We showcase how to use ATLA to learn adversarial suffixes jailbreaking LLMs and to extract hidden system prompts. Empirically we demonstrate that ATLA consistently outperforms current state-of-the-art techniques, achieving nearly 100% success in attacking while requiring 80% fewer queries. ATLA learned jailbreak suffixes demonstrate high generalization to unseen queries and transfer well to new LLMs. We released our code https://github.com/QData/ALTA_Augmented_Adversarial_Trigger_Learning
中文: ATLA提出了一种增强的对抗性触发器学习方法,通过优化响应格式令牌并抑制回避性回答,以80%更少的查询实现近100%的攻击成功率,同时在跨模型场景中展现出强大的泛化能力。
English: ATLA introduces an augmented adversarial trigger learning method that enhances attack efficiency by optimizing response format tokens and suppressing evasive responses, achieving nearly 100% success with 80% fewer queries while demonstrating strong generalization across models.

Authors:Tengfei Wang, Xin Wang, Yongmao Hou, Zhaoning Zhang, Yiwei Xu, Zongqian Zhan
Title: GS-I$^{3}$: Gaussian Splatting for Surface Reconstruction from Illumination-Inconsistent Images
Abstract:
Accurate geometric surface reconstruction, providing essential environmental information for navigation and manipulation tasks, is critical for enabling robotic self-exploration and interaction. Recently, 3D Gaussian Splatting (3DGS) has gained significant attention in the field of surface reconstruction due to its impressive geometric quality and computational efficiency. While recent relevant advancements in novel view synthesis under inconsistent illumination using 3DGS have shown promise, the challenge of robust surface reconstruction under such conditions is still being explored. To address this challenge, we propose a method called GS-3I. Specifically, to mitigate 3D Gaussian optimization bias caused by underexposed regions in single-view images, based on Convolutional Neural Network (CNN), a tone mapping correction framework is introduced. Furthermore, inconsistent lighting across multi-view images, resulting from variations in camera settings and complex scene illumination, often leads to geometric constraint mismatches and deviations in the reconstructed surface. To overcome this, we propose a normal compensation mechanism that integrates reference normals extracted from single-view image with normals computed from multi-view observations to effectively constrain geometric inconsistencies. Extensive experimental evaluations demonstrate that GS-3I can achieve robust and accurate surface reconstruction across complex illumination scenarios, highlighting its effectiveness and versatility in this critical challenge. https://github.com/TFwang-9527/GS-3I
中文: 提出的GS-3I方法通过引入基于CNN的色调映射校正和法线补偿机制,增强了3D高斯泼溅技术,实现在不一致光照条件下的鲁棒表面重建。
English: The proposed GS-3I method enhances 3D Gaussian Splatting by incorporating a CNN-based tone mapping correction and a normal compensation mechanism to achieve robust surface reconstruction under inconsistent illumination conditions.

Authors:Yunze Liu, Peiran Wu, Cheng Liang, Junxiao Shen, Limin Wang, Li Yi
Title: VideoMAP: Toward Scalable Mamba-based Video Autoregressive Pretraining
Abstract:
Recent Mamba-based architectures for video understanding demonstrate promising computational efficiency and competitive performance, yet struggle with overfitting issues that hinder their scalability. To overcome this challenge, we introduce VideoMAP, a Hybrid Mamba-Transformer framework featuring a novel pre-training approach. VideoMAP uses a 4:1 Mamba-to-Transformer ratio, effectively balancing computational cost and model capacity. This architecture, combined with our proposed frame-wise masked autoregressive pre-training strategy, delivers significant performance gains when scaling to larger models. Additionally, VideoMAP exhibits impressive sample efficiency, significantly outperforming existing methods with less training data. Experiments show that VideoMAP outperforms existing models across various datasets, including Kinetics-400, Something-Something V2, Breakfast, and COIN. Furthermore, we demonstrate the potential of VideoMAP as a visual encoder for multimodal large language models, highlighting its ability to reduce memory usage and enable the processing of longer video sequences. The code is open-source at https://github.com/yunzeliu/MAP
中文: VideoMAP是一种混合Mamba-Transformer框架,通过创新的预训练策略解决视频理解中的过拟合问题,在多个数据集上实现卓越性能,同时显著提升计算效率和扩展性。
English: VideoMAP is a hybrid Mamba-Transformer framework that addresses overfitting in video understanding through a novel pre-training strategy, achieving superior performance across multiple datasets while enhancing computational efficiency and scalability.

Authors:Jiahao Wu, Rui Peng, Zhiyan Wang, Lu Xiao, Luyang Tang, Jinbo Yan, Kaiqiang Xiong, Ronggang Wang
Title: Swift4D:Adaptive divide-and-conquer Gaussian Splatting for compact and efficient reconstruction of dynamic scene
Abstract:
Novel view synthesis has long been a practical but challenging task, although the introduction of numerous methods to solve this problem, even combining advanced representations like 3D Gaussian Splatting, they still struggle to recover high-quality results and often consume too much storage memory and training time. In this paper we propose Swift4D, a divide-and-conquer 3D Gaussian Splatting method that can handle static and dynamic primitives separately, achieving a good trade-off between rendering quality and efficiency, motivated by the fact that most of the scene is the static primitive and does not require additional dynamic properties. Concretely, we focus on modeling dynamic transformations only for the dynamic primitives which benefits both efficiency and quality. We first employ a learnable decomposition strategy to separate the primitives, which relies on an additional parameter to classify primitives as static or dynamic. For the dynamic primitives, we employ a compact multi-resolution 4D Hash mapper to transform these primitives from canonical space into deformation space at each timestamp, and then mix the static and dynamic primitives to produce the final output. This divide-and-conquer method facilitates efficient training and reduces storage redundancy. Our method not only achieves state-of-the-art rendering quality while being 20X faster in training than previous SOTA methods with a minimum storage requirement of only 30MB on real-world datasets. Code is available at https://github.com/WuJH2001/swift4d.
Chinese Summary: Swift4D提出了一种分治式的3D高斯溅射方法,通过分别处理静态和动态元素,在实现卓越渲染质量的同时,将训练速度提升20倍并大幅降低存储需求。
English Summary: Swift4D is a novel divide-and-conquer 3D Gaussian Splatting method that separately handles static and dynamic elements, achieving superior rendering quality with 20x faster training and minimal storage requirements.

Authors:Negar Shahamiri, Moritz Rempe, Lukas Heine, Jens Kleesiek, Fabian Hörst
Title: Cracking the PUMA Challenge in 24 Hours with CellViT++ and nnU-Net
Abstract:
Automatic tissue segmentation and nuclei detection is an important task in pathology, aiding in biomarker extraction and discovery. The panoptic segmentation of nuclei and tissue in advanced melanoma (PUMA) challenge aims to improve tissue segmentation and nuclei detection in melanoma histopathology. Unlike many challenge submissions focusing on extensive model tuning, our approach emphasizes delivering a deployable solution within a 24-hour development timeframe, using out-of-the-box frameworks. The pipeline combines two models, namely CellViT++ for nuclei detection and nnU-Net for tissue segmentation. Our results demonstrate a significant improvement in tissue segmentation, achieving a Dice score of 0.750, surpassing the baseline score of 0.629. For nuclei detection, we obtained results comparable to the baseline in both challenge tracks. The code is publicly available at https://github.com/TIO-IKIM/PUMA.
中文: 我们通过整合CellViT++细胞核检测与nnU-Net组织分割的可部署方案,在PUMA挑战赛中显著提升组织分割性能(Dice评分0.750),同时保持与基线相当的细胞核检测效果。
English: Our deployable pipeline, combining CellViT++ for nuclei detection and nnU-Net for tissue segmentation, significantly improves tissue segmentation with a Dice score of 0.750 while maintaining comparable nuclei detection results in the PUMA challenge.

Authors:Boyu Chen, Ameenat L. Solebo, Daqian Shi, Jinge Wu, Paul Taylor
Title: Minuscule Cell Detection in AS-OCT Images with Progressive Field-of-View Focusing
Abstract:
Anterior Segment Optical Coherence Tomography (AS-OCT) is an emerging imaging technique with great potential for diagnosing anterior uveitis, a vision-threatening ocular inflammatory condition. A hallmark of this condition is the presence of inflammatory cells in the eye's anterior chamber, and detecting these cells using AS-OCT images has attracted research interest. While recent efforts aim to replace manual cell detection with automated computer vision approaches, detecting extremely small (minuscule) objects in high-resolution images, such as AS-OCT, poses substantial challenges: (1) each cell appears as a minuscule particle, representing less than 0.005\% of the image, making the detection difficult, and (2) OCT imaging introduces pixel-level noise that can be mistaken for cells, leading to false positive detections. To overcome these challenges, we propose a minuscule cell detection framework through a progressive field-of-view focusing strategy. This strategy systematically refines the detection scope from the whole image to a target region where cells are likely to be present, and further to minuscule regions potentially containing individual cells. Our framework consists of two modules. First, a Field-of-Focus module uses a vision foundation model to segment the target region. Subsequently, a Fine-grained Object Detection module introduces a specialized Minuscule Region Proposal followed by a Spatial Attention Network to distinguish individual cells from noise within the segmented region. Experimental results demonstrate that our framework outperforms state-of-the-art methods for cell detection, providing enhanced efficacy for clinical applications. Our code is publicly available at: https://github.com/joeybyc/MCD.
中文: 该框架通过渐进式视野聚焦策略,解决了前段光学相干断层扫描图像中微小炎症细胞检测的难题,借助目标区域分割和抗干扰识别技术,显著提升了细胞检测的临床效能。
English: The proposed framework utilizes a progressive field-of-view focusing strategy to overcome challenges in detecting minuscule inflammatory cells in AS-OCT images, significantly outperforming existing methods through targeted region segmentation and noise-resistant cell identification.

Authors:Yan Jiang, Hao Yu, Mengting Wei, Zhaodong Sun, Haoyu Chen, Xu Cheng, Guoying Zhao
Title: L2RW+: A Comprehensive Benchmark Towards Privacy-Preserved Visible-Infrared Person Re-Identification
Abstract:
Visible-infrared person re-identification (VI-ReID) is a challenging task that aims to match pedestrian images captured under varying lighting conditions, which has drawn intensive research attention and achieved promising results. However, existing methods adopt the centralized training, ignoring the potential privacy concerns as the data is distributed across multiple devices or entities in reality. In this paper, we propose L2RW+, a benchmark that brings VI-ReID closer to real-world applications. The core rationale behind L2RW+ is that incorporating decentralized training into VI-ReID can address privacy concerns in scenarios with limited data-sharing constrains. Specifically, we design protocols and corresponding algorithms for different privacy sensitivity levels. In our new benchmark, we simulate the training under real-world data conditions that: 1) data from each camera is completely isolated, or 2) different data entities (e.g., data controllers of a certain region) can selectively share the data. In this way, we simulate scenarios with strict privacy restrictions, which is closer to real-world conditions. Comprehensive experiments show the feasibility and potential of decentralized VI-ReID training at both image and video levels. In particular, with increasing data scales, the performance gap between decentralized and centralized training decreases, especially in video-level VI-ReID. In unseen domains, decentralized training even achieves performance comparable to SOTA centralized methods. This work offers a novel research entry for deploying VI-ReID into real-world scenarios and can benefit the community. Code is available at: https://github.com/Joey623/L2RW.
中文: 本文提出L2RW+基准,通过模拟现实世界中数据隔离场景下的分散式训练方法解决可见光-红外行人重识别中的隐私问题,实验表明该方法在大规模数据下性能接近集中式训练。
English: This paper introduces L2RW+, a decentralized training benchmark for visible-infrared person re-identification that addresses privacy concerns by simulating real-world data isolation scenarios, demonstrating competitive performance with centralized methods especially at larger data scales.

Authors:Ans Munir, Faisal Z. Qureshi, Muhammad Haris Khan, Mohsen Ali
Title: TLAC: Two-stage LMM Augmented CLIP for Zero-Shot Classification
Abstract:
Contrastive Language-Image Pretraining (CLIP) has shown impressive zero-shot performance on image classification. However, state-of-the-art methods often rely on fine-tuning techniques like prompt learning and adapter-based tuning to optimize CLIP's performance. The necessity for fine-tuning significantly limits CLIP's adaptability to novel datasets and domains. This requirement mandates substantial time and computational resources for each new dataset. To overcome this limitation, we introduce simple yet effective training-free approaches, Single-stage LMM Augmented CLIP (SLAC) and Two-stage LMM Augmented CLIP (TLAC), that leverages powerful Large Multimodal Models (LMMs), such as Gemini, for image classification. The proposed methods leverages the capabilities of pre-trained LMMs, allowing for seamless adaptation to diverse datasets and domains without the need for additional training. Our approaches involve prompting the LMM to identify objects within an image. Subsequently, the CLIP text encoder determines the image class by identifying the dataset class with the highest semantic similarity to the LLM predicted object. Our models achieved superior accuracy on 9 of 11 base-to-novel datasets, including ImageNet, SUN397, and Caltech101, while maintaining a strictly training-free paradigm. Our TLAC model achieved an overall accuracy of 83.44%, surpassing the previous state-of-the-art few-shot methods by a margin of 6.75%. Compared to other training-free approaches, our TLAC method achieved 83.6% average accuracy across 13 datasets, a 9.7% improvement over the previous methods. Our Code is available at https://github.com/ans92/TLAC
中文: 本文提出无需训练的SLAC和TLAC方法,通过大型多模态模型识别图像对象并利用CLIP文本编码器进行语义匹配,无需微调即可在多个数据集上实现最先进的零样本分类精度。
English: This paper introduces training-free methods SLAC and TLAC that enhance CLIP's zero-shot image classification by leveraging Large Multimodal Models to identify objects and CLIP's text encoder for semantic matching, achieving state-of-the-art accuracy without additional training.

Authors:Sándor Battaglini-Fischer, Nishanthi Srinivasan, Bálint László Szarvas, Xiaoyu Chu, Alexandru Iosup
Title: FAILS: A Framework for Automated Collection and Analysis of LLM Service Incidents
Abstract:
Large Language Model (LLM) services such as ChatGPT, DALLE, and Cursor have quickly become essential for society, businesses, and individuals, empowering applications such as chatbots, image generation, and code assistance. The complexity of LLM systems makes them prone to failures and affects their reliability and availability, yet their failure patterns are not fully understood, making it an emerging problem. However, there are limited datasets and studies in this area, particularly lacking an open-access tool for analyzing LLM service failures based on incident reports. Addressing these problems, in this work we propose FAILS, the first open-sourced framework for incident reports collection and analysis on different LLM services and providers. FAILS provides comprehensive data collection, analysis, and visualization capabilities, including:(1) It can automatically collect, clean, and update incident data through its data scraper and processing components;(2) It provides 17 types of failure analysis, allowing users to explore temporal trends of incidents, analyze service reliability metrics, such as Mean Time to Recovery (MTTR) and Mean Time Between Failures (MTBF);(3) It leverages advanced LLM tools to assist in data analysis and interpretation, enabling users to gain observations and insights efficiently. All functions are integrated in the backend, allowing users to easily access them through a web-based frontend interface. FAILS supports researchers, engineers, and general users to understand failure patterns and further mitigate operational incidents and outages in LLM services. The framework is publicly available on https://github.com/atlarge-research/FAILS.
中文: 像ChatGPT这样的大型语言模型服务应用广泛但易出故障,而提出的开源框架FAILS能自动收集、分析和可视化故障数据,帮助用户理解并减少这些故障。
English: LLM services like ChatGPT are widely used but prone to failures, and the proposed open-source framework FAILS enables automated collection, analysis, and visualization of incident data to help users understand and mitigate these failures.

Authors:Enze Liu, Bowen Zheng, Wayne Xin Zhao, Ji-Rong Wen
Title: Bridging Textual-Collaborative Gap through Semantic Codes for Sequential Recommendation
Abstract:
In recent years, substantial research efforts have been devoted to enhancing sequential recommender systems by integrating abundant side information with ID-based collaborative information. This study specifically focuses on leveraging the textual metadata (e.g., titles and brands) associated with items. While existing methods have achieved notable success by combining text and ID representations, they often struggle to strike a balance between textual information embedded in text representations and collaborative information from sequential patterns of user behavior. In light of this, we propose CCFRec, a novel Code-based textual and Collaborative semantic Fusion method for sequential Recommendation. The key idea behind our approach is to bridge the gap between textual and collaborative information using semantic codes. Specifically, we generate fine-grained semantic codes from multi-view text embeddings through vector quantization techniques. Subsequently, we develop a code-guided semantic-fusion module based on the cross-attention mechanism to flexibly extract and integrate relevant information from text representations. In order to further enhance the fusion of textual and collaborative semantics, we introduce an optimization strategy that employs code masking with two specific objectives: masked code modeling and masked sequence alignment. The merit of these objectives lies in leveraging mask prediction tasks and augmented item representations to capture code correlations within individual items and enhance the sequence modeling of the recommendation backbone. Extensive experiments conducted on four public datasets demonstrate the superiority of CCFRec, showing significant improvements over various sequential recommendation models. Our code is available at https://github.com/RUCAIBox/CCFRec.
中文摘要:本文提出CCFRec模型,通过语义编码和跨注意力机制融合文本元数据与协同信息,在多个数据集上验证了其在序列推荐任务中的优越性能。
English Summary: This paper introduces CCFRec, a novel sequential recommendation model that bridges the gap between textual metadata and collaborative information through semantic codes and cross-attention fusion, demonstrating superior performance across multiple datasets.

Authors:Cheng Deng, Luoyang Sun, Jiwen Jiang, Yongcheng Zeng, Xinjian Wu, Wenxin Zhao, Qingfa Xiao, Jiachuan Wang, Haoyang Li, Lei Chen, Lionel M. Ni, Haifeng Zhang, Jun Wang
Title: PLM: Efficient Peripheral Language Models Hardware-Co-Designed for Ubiquitous Computing
Abstract:
While scaling laws have been continuously validated in large language models (LLMs) with increasing model parameters, the inherent tension between the inference demands of LLMs and the limited resources of edge devices poses a critical challenge to the development of edge intelligence. Recently, numerous small language models have emerged, aiming to distill the capabilities of LLMs into smaller footprints. However, these models often retain the fundamental architectural principles of their larger counterparts, still imposing considerable strain on the storage and bandwidth capacities of edge devices. In this paper, we introduce the PLM, a Peripheral Language Model, developed through a co-design process that jointly optimizes model architecture and edge system constraints. The PLM utilizes a Multi-head Latent Attention mechanism and employs the squared ReLU activation function to encourage sparsity, thereby reducing peak memory footprint during inference. During training, we collect and reorganize open-source datasets, implement a multi-phase training strategy, and empirically investigate the Warmup-Stable-Decay-Constant (WSDC) learning rate scheduler. Additionally, we incorporate Reinforcement Learning from Human Feedback (RLHF) by adopting the ARIES preference learning approach. Following a two-phase SFT process, this method yields performance gains of 2% in general tasks, 9% in the GSM8K task, and 11% in coding tasks. In addition to its novel architecture, evaluation results demonstrate that PLM outperforms existing small language models trained on publicly available data while maintaining the lowest number of activated parameters. Furthermore, deployment across various edge devices, including consumer-grade GPUs, mobile phones, and Raspberry Pis, validates PLM's suitability for peripheral applications. The PLM series models are publicly available at https://github.com/plm-team/PLM.
中文: PLM是一种与边缘系统协同设计的新型小型语言模型,采用多头潜在注意力机制和平方法ReLU激活函数来降低内存占用,在保持最少激活参数的同时,在多项任务上超越了现有模型性能。
English: The PLM is a novel small language model co-designed with edge system constraints, featuring a Multi-head Latent Attention mechanism and squared ReLU activation to reduce memory usage while outperforming existing models on multiple tasks with the fewest activated parameters.

Authors:Taichi Murayama, Dongwoo Lim, Akira Matsui, Tsukasa Tanihara
Title: The "recognition," "belief," and "action" regarding conspiracy theories: An empirical study using large-scale samples from Japan and the United States
Abstract:
Conspiracy theories present significant societal challenges, shaping political behavior, eroding public trust, and disrupting social cohesion. Addressing their impact requires recognizing that conspiracy engagement is not a singular act but a multi-stage process involving distinct cognitive and behavioral transitions. In this study, we investigate this sequential progression, "recognition," "belief," and "action" (demonstrative action and diffusion action), using nationally representative surveys from the United States (N=13,578) and Japan (N=16,693). Applying a Bayesian hierarchical model, we identify the key social, political, and economic factors that drive engagement at each stage, providing a structured framework for understanding the mechanisms underlying conspiracy theory adoption and dissemination. We find that recognition serves as a crucial gateway determining who transitions to belief, and that demonstrative and diffusion actions are shaped by distinct factors. Demonstrative actions are more prevalent among younger, higher-status individuals with strong political alignments, whereas diffusion actions occur across broader demographics, particularly among those engaged with diverse media channels. Our findings further reveal that early-life economic and cultural capital significantly influence the shape of conspiratorial engagement, emphasizing the role of life-course experiences. These insights highlight the necessity of distinguishing between different forms of conspiracy engagement and highlight the importance of targeted interventions that account for structural, cultural, and psychological factors to mitigate their spread and societal impact.
中文: 本研究通过美国和日本的全国性调查,分析了阴谋论参与的识别、相信和行动三阶段过程,揭示了不同人口特征和媒体渠道对各阶段的驱动作用,强调需针对结构性、文化和心理因素采取干预措施以减轻其社会影响。
English: This study analyzes the multi-stage process of conspiracy theory engagement—recognition, belief, and action—identifying distinct demographic and media factors driving each phase through U.S. and Japanese surveys, emphasizing the need for targeted interventions to counter their societal impact.

Authors:Hongyu Sun, Qiuhong Ke, Ming Cheng, Yongcai Wang, Deying Li, Chenhui Gou, Jianfei Cai
Title: Point-Cache: Test-time Dynamic and Hierarchical Cache for Robust and Generalizable Point Cloud Analysis
Abstract:
This paper proposes a general solution to enable point cloud recognition models to handle distribution shifts at test time. Unlike prior methods, which rely heavily on training data (often inaccessible during online inference) and are limited to recognizing a fixed set of point cloud classes predefined during training, we explore a more practical and challenging scenario: adapting the model solely based on online test data to recognize both previously seen classes and novel, unseen classes at test time. To this end, we develop \textbf{Point-Cache}, a hierarchical cache model that captures essential clues of online test samples, particularly focusing on the global structure of point clouds and their local-part details. Point-Cache, which serves as a rich 3D knowledge base, is dynamically managed to prioritize the inclusion of high-quality samples. Designed as a plug-and-play module, our method can be flexibly integrated into large multimodal 3D models to support open-vocabulary point cloud recognition. Notably, our solution operates with efficiency comparable to zero-shot inference, as it is entirely training-free. Point-Cache demonstrates substantial gains across 8 challenging benchmarks and 4 representative large 3D models, highlighting its effectiveness. Code is available at https://github.com/auniquesun/Point-Cache.
中文: 本文提出Point-Cache,一种无需训练的层次化缓存模型,使点云识别模型能够仅基于在线测试数据适应分布变化并识别已知和未知类别,在多个基准测试中取得了显著性能提升。
English: This paper introduces Point-Cache, a training-free hierarchical cache model that enables point cloud recognition models to adapt to distribution shifts and recognize both seen and unseen classes using only online test data, achieving significant performance gains across multiple benchmarks.

Authors:Junjie Chen, Xuyang Liu, Subin Huang, Linfeng Zhang, Hang Yu
Title: Seeing Sarcasm Through Different Eyes: Analyzing Multimodal Sarcasm Perception in Large Vision-Language Models
Abstract:
With the advent of large vision-language models (LVLMs) demonstrating increasingly human-like abilities, a pivotal question emerges: do different LVLMs interpret multimodal sarcasm differently, and can a single model grasp sarcasm from multiple perspectives like humans? To explore this, we introduce an analytical framework using systematically designed prompts on existing multimodal sarcasm datasets. Evaluating 12 state-of-the-art LVLMs over 2,409 samples, we examine interpretive variations within and across models, focusing on confidence levels, alignment with dataset labels, and recognition of ambiguous "neutral" cases. We further validate our findings on a diverse 100-sample mini-benchmark, incorporating multiple datasets, expanded prompt variants, and representative commercial LVLMs. Our findings reveal notable discrepancies -- across LVLMs and within the same model under varied prompts. While classification-oriented prompts yield higher internal consistency, models diverge markedly when tasked with interpretive reasoning. These results challenge binary labeling paradigms by highlighting sarcasm's subjectivity. We advocate moving beyond rigid annotation schemes toward multi-perspective, uncertainty-aware modeling, offering deeper insights into multimodal sarcasm comprehension. Our code and data are available at: https://github.com/CoderChen01/LVLMSarcasmAnalysis
Chinese: 本研究探讨不同大型视觉语言模型对多模态讽刺的理解,发现模型之间及同一模型在不同提示下存在显著差异,挑战了二元标注范式,提倡采用多视角建模方法。
English: This study investigates how different large vision-language models interpret multimodal sarcasm, revealing significant variations both across models and within the same model under different prompts, challenging binary labeling and advocating for multi-perspective modeling.

Authors:Tobia Poppi, Tejaswi Kasarla, Pascal Mettes, Lorenzo Baraldi, Rita Cucchiara
Title: Hyperbolic Safety-Aware Vision-Language Models
Abstract:
Addressing the retrieval of unsafe content from vision-language models such as CLIP is an important step towards real-world integration. Current efforts have relied on unlearning techniques that try to erase the model's knowledge of unsafe concepts. While effective in reducing unwanted outputs, unlearning limits the model's capacity to discern between safe and unsafe content. In this work, we introduce a novel approach that shifts from unlearning to an awareness paradigm by leveraging the inherent hierarchical properties of the hyperbolic space. We propose to encode safe and unsafe content as an entailment hierarchy, where both are placed in different regions of hyperbolic space. Our HySAC, Hyperbolic Safety-Aware CLIP, employs entailment loss functions to model the hierarchical and asymmetrical relations between safe and unsafe image-text pairs. This modelling, ineffective in standard vision-language models due to their reliance on Euclidean embeddings, endows the model with awareness of unsafe content, enabling it to serve as both a multimodal unsafe classifier and a flexible content retriever, with the option to dynamically redirect unsafe queries toward safer alternatives or retain the original output. Extensive experiments show that our approach not only enhances safety recognition but also establishes a more adaptable and interpretable framework for content moderation in vision-language models. Our source code is available at https://github.com/aimagelab/HySAC.
中文: 本文提出HySAC模型,通过双曲空间分层编码安全与不安全内容,使视觉语言模型既能有效识别不安全内容,又能灵活控制检索结果,在提升安全性的同时保持模型适应性和可解释性。
English: This paper introduces HySAC, a novel safety-aware CLIP model that uses hyperbolic space to hierarchically encode safe and unsafe content, enabling both effective unsafe content classification and flexible retrieval while maintaining adaptability and interpretability.

Authors:Amir M. Mansourian, Rozhan Ahmadi, Masoud Ghafouri, Amir Mohammad Babaei, Elaheh Badali Golezani, Zeynab Yasamani Ghamchi, Vida Ramezanian, Alireza Taherian, Kimia Dinashi, Amirali Miri, Shohreh Kasaei
Title: A Comprehensive Survey on Knowledge Distillation
Abstract:
Deep Neural Networks (DNNs) have achieved notable performance in the fields of computer vision and natural language processing with various applications in both academia and industry. However, with recent advancements in DNNs and transformer models with a tremendous number of parameters, deploying these large models on edge devices causes serious issues such as high runtime and memory consumption. This is especially concerning with the recent large-scale foundation models, Vision-Language Models (VLMs), and Large Language Models (LLMs). Knowledge Distillation (KD) is one of the prominent techniques proposed to address the aforementioned problems using a teacher-student architecture. More specifically, a lightweight student model is trained using additional knowledge from a cumbersome teacher model. In this work, a comprehensive survey of knowledge distillation methods is proposed. This includes reviewing KD from different aspects: distillation sources, distillation schemes, distillation algorithms, distillation by modalities, applications of distillation, and comparison among existing methods. In contrast to most existing surveys, which are either outdated or simply update former surveys, this work proposes a comprehensive survey with a new point of view and representation structure that categorizes and investigates the most recent methods in knowledge distillation. This survey considers various critically important subcategories, including KD for diffusion models, 3D inputs, foundational models, transformers, and LLMs. Furthermore, existing challenges in KD and possible future research directions are discussed. Github page of the project: https://github.com/IPL-Sharif/KD_Survey
中文: 本文从蒸馏来源、算法机制、模态应用等维度系统综述了知识蒸馏方法,针对大模型部署难题提出创新分类体系,并探讨了扩散模型、大语言模型等前沿领域的应用挑战与发展方向。
English: This comprehensive survey reviews knowledge distillation methods from multiple perspectives, addressing deployment challenges of large models while introducing novel categorization frameworks and examining emerging applications like diffusion models and LLMs.

Authors:Ali Raeisdanaei, Juho Kim, Michael Liao, Sparsh Kochhar
Title: An LLM-Integrated Framework for Completion, Management, and Tracing of STPA
Abstract:
In many safety-critical engineering domains, hazard analysis techniques are an essential part of requirement elicitation. Of the methods proposed for this task, STPA (System-Theoretic Process Analysis) represents a relatively recent development in the field. The completion, management, and traceability of this hazard analysis technique present a time-consuming challenge to the requirements and safety engineers involved. In this paper, we introduce a free, open-source software framework to build STPA models with several automated workflows powered by large language models (LLMs). In past works, LLMs have been successfully integrated into a myriad of workflows across various fields. Here, we demonstrate that LLMs can be used to complete tasks associated with STPA with a high degree of accuracy, saving the time and effort of the human engineers involved. We experimentally validate our method on real-world STPA models built by requirement engineers and researchers. The source code of our software framework is available at the following link: https://github.com/blueskysolarracing/stpa.
中文: 本文介绍了一个免费开源软件框架,利用大语言模型自动化STPA危险分析流程,显著提高了安全工程师的工作效率和准确性。
English: This paper introduces a free, open-source software framework that uses large language models to automate STPA hazard analysis workflows, significantly improving efficiency and accuracy for safety engineers.

Authors:Hang Ni, Jindong Han, Nengjun Zhu, Hao Liu
Title: Unsupervised Graph Anomaly Detection via Multi-Hypersphere Heterophilic Graph Learning
Abstract:
Graph Anomaly Detection (GAD) plays a vital role in various data mining applications such as e-commerce fraud prevention and malicious user detection. Recently, Graph Neural Network (GNN) based approach has demonstrated great effectiveness in GAD by first encoding graph data into low-dimensional representations and then identifying anomalies under the guidance of supervised or unsupervised signals. However, existing GNN-based approaches implicitly follow the homophily principle (i.e., the "like attracts like" phenomenon) and fail to learn discriminative embedding for anomalies that connect vast normal nodes. Moreover, such approaches identify anomalies in a unified global perspective but overlook diversified abnormal patterns conditioned on local graph context, leading to suboptimal performance. To overcome the aforementioned limitations, in this paper, we propose a Multi-hypersphere Heterophilic Graph Learning (MHetGL) framework for unsupervised GAD. Specifically, we first devise a Heterophilic Graph Encoding (HGE) module to learn distinguishable representations for potential anomalies by purifying and augmenting their neighborhood in a fully unsupervised manner. Then, we propose a Multi-Hypersphere Learning (MHL) module to enhance the detection capability for context-dependent anomalies by jointly incorporating critical patterns from both global and local perspectives. Extensive experiments on ten real-world datasets show that MHetGL outperforms 14 baselines. Our code is publicly available at https://github.com/KennyNH/MHetGL.
中文:提出的MHetGL框架通过异质图编码学习可区分的异常表示,并采用多超球面学习从全局和局部视角捕捉上下文相关异常,从而克服了现有基于图神经网络的异常检测方法的局限性。
English: The proposed MHetGL framework overcomes limitations of existing GNN-based anomaly detection methods by introducing heterophilic graph encoding to learn distinguishable anomaly representations and multi-hypersphere learning to capture context-dependent anomalies from both global and local perspectives.

Authors:Zhengyuan Peng, Jinpeng Ma, Zhimin Sun, Ran Yi, Haichuan Song, Xin Tan, Lizhuang Ma
Title: MOS: Modeling Object-Scene Associations in Generalized Category Discovery
Abstract:
Generalized Category Discovery (GCD) is a classification task that aims to classify both base and novel classes in unlabeled images, using knowledge from a labeled dataset. In GCD, previous research overlooks scene information or treats it as noise, reducing its impact during model training. However, in this paper, we argue that scene information should be viewed as a strong prior for inferring novel classes. We attribute the misinterpretation of scene information to a key factor: the Ambiguity Challenge inherent in GCD. Specifically, novel objects in base scenes might be wrongly classified into base categories, while base objects in novel scenes might be mistakenly recognized as novel categories. Once the ambiguity challenge is addressed, scene information can reach its full potential, significantly enhancing the performance of GCD models. To more effectively leverage scene information, we propose the Modeling Object-Scene Associations (MOS) framework, which utilizes a simple MLP-based scene-awareness module to enhance GCD performance. It achieves an exceptional average accuracy improvement of 4% on the challenging fine-grained datasets compared to state-of-the-art methods, emphasizing its superior performance in fine-grained GCD. The code is publicly available at https://github.com/JethroPeng/MOS
中文摘要:本文提出建模对象-场景关联(MOS)框架,通过将场景信息作为先验知识来解决广义类别发现中的模糊性挑战,在细粒度数据集上实现了4%的准确率提升。
English Summary: This paper introduces the Modeling Object-Scene Associations (MOS) framework to address the Ambiguity Challenge in Generalized Category Discovery by leveraging scene information as a prior, achieving a 4% accuracy improvement on fine-grained datasets.

Authors:Zhenxin Li, Shihao Wang, Shiyi Lan, Zhiding Yu, Zuxuan Wu, Jose M. Alvarez
Title: Hydra-NeXt: Robust Closed-Loop Driving with Open-Loop Training
Abstract:
End-to-end autonomous driving research currently faces a critical challenge in bridging the gap between open-loop training and closed-loop deployment. Current approaches are trained to predict trajectories in an open-loop environment, which struggle with quick reactions to other agents in closed-loop environments and risk generating kinematically infeasible plans due to the gap between open-loop training and closed-loop driving. In this paper, we introduce Hydra-NeXt, a novel multi-branch planning framework that unifies trajectory prediction, control prediction, and a trajectory refinement network in one model. Unlike current open-loop trajectory prediction models that only handle general-case planning, Hydra-NeXt further utilizes a control decoder to focus on short-term actions, which enables faster responses to dynamic situations and reactive agents. Moreover, we propose the Trajectory Refinement module to augment and refine the planning decisions by effectively adhering to kinematic constraints in closed-loop environments. This unified approach bridges the gap between open-loop training and closed-loop driving, demonstrating superior performance of 65.89 Driving Score (DS) and 48.20% Success Rate (SR) on the Bench2Drive dataset without relying on external experts for data collection. Hydra-NeXt surpasses the previous state-of-the-art by 22.98 DS and 17.49 SR, marking a significant advancement in autonomous driving. Code will be available at https://github.com/woxihuanjiangguo/Hydra-NeXt.
Chinese: Hydra-NeXt是一种新颖的多分支规划框架,通过统一轨迹预测、控制预测和轨迹优化,弥合了开环训练与闭环部署之间的差距,在自动驾驶中实现了卓越性能。
English: Hydra-NeXt is a novel multi-branch planning framework that unifies trajectory prediction, control prediction, and trajectory refinement to bridge the gap between open-loop training and closed-loop deployment, achieving superior performance in autonomous driving.

Authors:Yebo Wu, Chunlin Tian, Jingguang Li, He Sun, Kahou Tam, Zhanting Zhou, Haicheng Liao, Zhijiang Guo, Li Li, Chengzhong Xu
Title: A Survey on Federated Fine-tuning of Large Language Models
Abstract:
Large Language Models (LLMs) have demonstrated impressive success across various tasks. Integrating LLMs with Federated Learning (FL), a paradigm known as FedLLM, offers a promising avenue for collaborative model adaptation while preserving data privacy. This survey provides a systematic and comprehensive review of FedLLM. We begin by tracing the historical development of both LLMs and FL, summarizing relevant prior research to set the context. Subsequently, we delve into an in-depth analysis of the fundamental challenges inherent in deploying FedLLM. Addressing these challenges often requires efficient adaptation strategies; therefore, we conduct an extensive examination of existing Parameter-Efficient Fine-tuning (PEFT) methods and explore their applicability within the FL framework. To rigorously evaluate the performance of FedLLM, we undertake a thorough review of existing fine-tuning datasets and evaluation benchmarks. Furthermore, we discuss FedLLM's diverse real-world applications across multiple domains. Finally, we identify critical open challenges and outline promising research directions to foster future advancements in FedLLM. This survey aims to serve as a foundational resource for researchers and practitioners, offering valuable insights into the rapidly evolving landscape of federated fine-tuning for LLMs. It also establishes a roadmap for future innovations in privacy-preserving AI. We actively maintain a GitHub repo \href{https://github.com/Clin0212/Awesome-Federated-LLM-Learning}{https://github.com/Clin0212/Awesome-Federated-LLM-Learning} to track cutting-edge advancements in this field.
中文: 本综述系统性地探讨了联邦大语言模型(FedLLM),该模型将大语言模型与联邦学习相结合,在保护数据隐私的同时实现协作模型适配,涵盖了挑战、适配策略、应用及未来研究方向。
English: This survey systematically reviews FedLLM, which integrates Large Language Models with Federated Learning to enable collaborative model adaptation while preserving data privacy, covering challenges, adaptation strategies, applications, and future research directions.

Authors:Donglin Yang, Paul Vicol, Xiaojuan Qi, Renjie Liao, Xiaofan Zhang
Title: QDM: Quadtree-Based Region-Adaptive Sparse Diffusion Models for Efficient Image Super-Resolution
Abstract:
Deep learning-based super-resolution (SR) methods often perform pixel-wise computations uniformly across entire images, even in homogeneous regions where high-resolution refinement is redundant. We propose the Quadtree Diffusion Model (QDM), a region-adaptive diffusion framework that leverages a quadtree structure to selectively enhance detail-rich regions while reducing computations in homogeneous areas. By guiding the diffusion with a quadtree derived from the low-quality input, QDM identifies key regions-represented by leaf nodes-where fine detail is essential and applies minimal refinement elsewhere. This mask-guided, two-stream architecture adaptively balances quality and efficiency, producing high-fidelity outputs with low computational redundancy. Experiments demonstrate QDM's effectiveness in high-resolution SR tasks across diverse image types, particularly in medical imaging (e.g., CT scans), where large homogeneous regions are prevalent. Furthermore, QDM outperforms or is comparable to state-of-the-art SR methods on standard benchmarks while significantly reducing computational costs, highlighting its efficiency and suitability for resource-limited environments. Our code is available at https://github.com/linYDTHU/QDM.
中文: 四叉树扩散模型(QDM)提出了一种区域自适应扩散框架,利用四叉树结构选择性增强细节丰富区域,在保持高保真超分辨率性能的同时显著降低计算冗余,特别适用于医疗影像等大均匀区域图像处理。
English: The Quadtree Diffusion Model (QDM) introduces a region-adaptive diffusion framework that selectively enhances detail-rich areas using a quadtree structure, significantly reducing computational redundancy while maintaining high-fidelity super-resolution performance across diverse image types.

Authors:Xiaoyu Wu, Yifei Pang, Terrance Liu, Steven Wu
Title: Winning the MIDST Challenge: New Membership Inference Attacks on Diffusion Models for Tabular Data Synthesis
Abstract:
Tabular data synthesis using diffusion models has gained significant attention for its potential to balance data utility and privacy. However, existing privacy evaluations often rely on heuristic metrics or weak membership inference attacks (MIA), leaving privacy risks inadequately assessed. In this work, we conduct a rigorous MIA study on diffusion-based tabular synthesis, revealing that state-of-the-art attacks designed for image models fail in this setting. We identify noise initialization as a key factor influencing attack efficacy and propose a machine-learning-driven approach that leverages loss features across different noises and time steps. Our method, implemented with a lightweight MLP, effectively learns membership signals, eliminating the need for manual optimization. Experimental results from the MIDST Challenge @ SaTML 2025 demonstrate the effectiveness of our approach, securing first place across all tracks. Code is available at https://github.com/Nicholas0228/Tartan_Federer_MIDST.
Chinese: 本研究通过开发一种基于机器学习的成员推断攻击方法,严格评估了基于扩散模型的表格数据合成中的隐私风险,该方法优于现有技术,并在MIDST Challenge @ SaTML 2025中荣获第一名。
English: This study rigorously evaluates privacy risks in diffusion-based tabular data synthesis by developing a machine learning-driven membership inference attack that outperforms existing methods, winning first place in the MIDST Challenge @ SaTML 2025.

Authors:Zhe Shan, Yang Liu, Lei Zhou, Cheng Yan, Heng Wang, Xia Xie
Title: ROS-SAM: High-Quality Interactive Segmentation for Remote Sensing Moving Object
Abstract:
The availability of large-scale remote sensing video data underscores the importance of high-quality interactive segmentation. However, challenges such as small object sizes, ambiguous features, and limited generalization make it difficult for current methods to achieve this goal. In this work, we propose ROS-SAM, a method designed to achieve high-quality interactive segmentation while preserving generalization across diverse remote sensing data. The ROS-SAM is built upon three key innovations: 1) LoRA-based fine-tuning, which enables efficient domain adaptation while maintaining SAM's generalization ability, 2) Enhancement of deep network layers to improve the discriminability of extracted features, thereby reducing misclassifications, and 3) Integration of global context with local boundary details in the mask decoder to generate high-quality segmentation masks. Additionally, we design the data pipeline to ensure the model learns to better handle objects at varying scales during training while focusing on high-quality predictions during inference. Experiments on remote sensing video datasets show that the redesigned data pipeline boosts the IoU by 6%, while ROS-SAM increases the IoU by 13%. Finally, when evaluated on existing remote sensing object tracking datasets, ROS-SAM demonstrates impressive zero-shot capabilities, generating masks that closely resemble manual annotations. These results confirm ROS-SAM as a powerful tool for fine-grained segmentation in remote sensing applications. Code is available at https://github.com/ShanZard/ROS-SAM.
中文: 本文提出ROS-SAM方法,通过领域自适应、增强特征区分度和上下文感知掩码生成,实现遥感视频的高质量交互式分割,IoU提升13%并展现出色的零样本能力。
English: This paper introduces ROS-SAM, a method that achieves high-quality interactive segmentation for remote sensing videos through domain adaptation, enhanced feature discriminability, and context-aware mask generation, improving IoU by 13% and demonstrating strong zero-shot performance.

Authors:Dhruv Kudale, Badri Vishal Kasuba, Venkatapathy Subramanian, Parag Chaudhuri, Ganesh Ramakrishnan
Title: SPRINT: Script-agnostic Structure Recognition in Tables
Abstract:
Table Structure Recognition (TSR) is vital for various downstream tasks like information retrieval, table reconstruction, and document understanding. While most state-of-the-art (SOTA) research predominantly focuses on TSR in English documents, the need for similar capabilities in other languages is evident, considering the global diversity of data. Moreover, creating substantial labeled data in non-English languages and training these SOTA models from scratch is costly and time-consuming. We propose TSR as a language-agnostic cell arrangement prediction and introduce SPRINT, Script-agnostic Structure Recognition in Tables. SPRINT uses recently introduced Optimized Table Structure Language (OTSL) sequences to predict table structures. We show that when coupled with a pre-trained table grid estimator, SPRINT can improve the overall tree edit distance-based similarity structure scores of tables even for non-English documents. We experimentally evaluate our performance across benchmark TSR datasets including PubTabNet, FinTabNet, and PubTables-1M. Our findings reveal that SPRINT not only matches SOTA models in performance on standard datasets but also demonstrates lower latency. Additionally, SPRINT excels in accurately identifying table structures in non-English documents, surpassing current leading models by showing an absolute average increase of 11.12%. We also present an algorithm for converting valid OTSL predictions into a widely used HTML-based table representation. To encourage further research, we release our code and Multilingual Scanned and Scene Table Structure Recognition Dataset, MUSTARD labeled with OTSL sequences for 1428 tables in thirteen languages encompassing several scripts at https://github.com/IITB-LEAP-OCR/SPRINT
中文:SPRINT框架将表格结构识别视为与语言无关的单元格布局预测,不仅性能超越现有最优模型且延迟更低,在非英文文档处理上平均提升11.12%,同时发布了多语言数据集以推动相关研究。
English: The proposed SPRINT framework treats table structure recognition as language-agnostic cell arrangement prediction, demonstrating superior performance over state-of-the-art models with lower latency and achieving an 11.12% average improvement for non-English documents while releasing a multilingual dataset to advance research.

Authors:Eduard Tulchinskii, Daria Voronkova, Ilya Trofimov, Evgeny Burnaev, Serguei Barannikov
Title: RTD-Lite: Scalable Topological Analysis for Comparing Weighted Graphs in Learning Tasks
Abstract:
Topological methods for comparing weighted graphs are valuable in various learning tasks but often suffer from computational inefficiency on large datasets. We introduce RTD-Lite, a scalable algorithm that efficiently compares topological features, specifically connectivity or cluster structures at arbitrary scales, of two weighted graphs with one-to-one correspondence between vertices. Using minimal spanning trees in auxiliary graphs, RTD-Lite captures topological discrepancies with $O(n^2)$ time and memory complexity. This efficiency enables its application in tasks like dimensionality reduction and neural network training. Experiments on synthetic and real-world datasets demonstrate that RTD-Lite effectively identifies topological differences while significantly reducing computation time compared to existing methods. Moreover, integrating RTD-Lite into neural network training as a loss function component enhances the preservation of topological structures in learned representations. Our code is publicly available at https://github.com/ArGintum/RTD-Lite
中文摘要:RTD-Lite是一种高效算法,能以O(n²)复杂度比较加权图的拓扑特征,在显著减少计算时间的同时,可应用于降维和神经网络训练等任务。
English summary: RTD-Lite is an efficient algorithm that compares topological features of weighted graphs with O(n²) complexity, enabling applications in dimensionality reduction and neural network training while significantly reducing computation time.

Authors:Bhiman Kumar Baghel, Emma Jordan, Zheyuan Ryan Shi, Xiang Lorraine Li
Title: Resolving UnderEdit & OverEdit with Iterative & Neighbor-Assisted Model Editing
Abstract:
Large Language Models (LLMs) are widely deployed in downstream tasks, but keeping their knowledge up-to-date via retraining or fine-tuning is often computationally expensive. Model editing provides a more efficient alternative by updating a targeted subset of parameters, which often follows the locate-and-edit paradigm. Despite this efficiency, existing methods are limited: edits may fail to inject knowledge (UnderEdit) or unintentionally disrupt unrelated neighboring knowledge (OverEdit). To address these challenges, we propose two complementary methods: iterative model editing, which applies successive edits to mitigate UnderEdit, and neighbor-assisted model editing, which incorporates neighboring knowledge during editing to reduce OverEdit. Our extensive experiments show that these techniques improve editing performance across multiple LLMs, algorithms, and benchmarks, reducing UnderEdit by up to 38 percentage points and OverEdit by up to 6, while remaining broadly applicable to any locate-and-edit method. We release our code at https://github.com/bhimanbaghel/ResolveUnderOverEdit.
中文: 我们提出的迭代式和邻域辅助模型编辑方法有效缓解了大语言模型中的知识注入不足和过度干扰问题,显著提升了多种模型和基准测试中的编辑性能。
English: Our proposed iterative and neighbor-assisted model editing methods effectively reduce both UnderEdit and OverEdit issues in large language models, enhancing editing performance across various models and benchmarks.

Authors:Md Abu Bakr Siddique, Vaishnav Ramesh, Junliang Liu, Piyush Singh, Md Jahidul Islam
Title: UStyle: Waterbody Style Transfer of Underwater Scenes by Depth-Guided Feature Synthesis
Abstract:
The concept of waterbody style transfer remains largely unexplored in the underwater imaging and vision literature. Traditional image style transfer (STx) methods primarily focus on artistic and photorealistic blending, often failing to preserve object and scene geometry in images captured in high-scattering mediums such as underwater. The wavelength-dependent nonlinear attenuation and depth-dependent backscattering artifacts further complicate learning underwater image STx from unpaired data. This paper introduces UStyle, the first data-driven learning framework for transferring waterbody styles across underwater images without requiring prior reference images or scene information. We propose a novel depth-aware whitening and coloring transform (DA-WCT) mechanism that integrates physics-based waterbody synthesis to ensure perceptually consistent stylization while preserving scene structure. To enhance style transfer quality, we incorporate carefully designed loss functions that guide UStyle to maintain colorfulness, lightness, structural integrity, and frequency-domain characteristics, as well as high-level content in VGG and CLIP (contrastive language-image pretraining) feature spaces. By addressing domain-specific challenges, UStyle provides a robust framework for no-reference underwater image STx, surpassing state-of-the-art (SOTA) methods that rely solely on end-to-end reconstruction loss. Furthermore, we introduce the UF7D dataset, a curated collection of high-resolution underwater images spanning seven distinct waterbody styles, establishing a benchmark to support future research in underwater image STx. The UStyle inference pipeline and UF7D dataset are released at: https://github.com/uf-robopi/UStyle.
中文摘要:本文提出首个数据驱动的水体风格迁移框架UStyle,通过结合基于物理的水体合成和专门设计的损失函数,在无需参考图像的情况下克服传统方法的局限,既能保持场景几何结构又能实现卓越的风格化效果。
English Summary: This paper introduces UStyle, the first data-driven framework for underwater image style transfer that overcomes traditional methods' limitations by integrating physics-based waterbody synthesis and specialized loss functions to preserve scene geometry while achieving superior stylization without requiring reference images.

Authors:Chengxuan Qian, Shuo Xing, Shawn Li, Yue Zhao, Zhengzhong Tu
Title: DecAlign: Hierarchical Cross-Modal Alignment for Decoupled Multimodal Representation Learning
Abstract:
Multimodal representation learning aims to capture both shared and complementary semantic information across multiple modalities. However, the intrinsic heterogeneity of diverse modalities presents substantial challenges to achieve effective cross-modal collaboration and integration. To address this, we introduce DecAlign, a novel hierarchical cross-modal alignment framework designed to decouple multimodal representations into modality-unique (heterogeneous) and modality-common (homogeneous) features. For handling heterogeneity, we employ a prototype-guided optimal transport alignment strategy leveraging gaussian mixture modeling and multi-marginal transport plans, thus mitigating distribution discrepancies while preserving modality-unique characteristics. To reinforce homogeneity, we ensure semantic consistency across modalities by aligning latent distribution matching with Maximum Mean Discrepancy regularization. Furthermore, we incorporate a multimodal transformer to enhance high-level semantic feature fusion, thereby further reducing cross-modal inconsistencies. Our extensive experiments on four widely used multimodal benchmarks demonstrate that DecAlign consistently outperforms existing state-of-the-art methods across five metrics. These results highlight the efficacy of DecAlign in enhancing superior cross-modal alignment and semantic consistency while preserving modality-unique features, marking a significant advancement in multimodal representation learning scenarios. Our project page is at https://taco-group.github.io/DecAlign and the code is available at https://github.com/taco-group/DecAlign.
中文摘要:DecAlign框架通过将多模态表征解耦为独有与共享特征,采用原型引导的最优传输对齐策略,在保留模态特性的同时显著提升了跨模态语义一致性,并在多项基准测试中超越现有最优方法。
English Summary: The DecAlign framework effectively separates multimodal representations into unique and shared features, using advanced alignment strategies to enhance cross-modal collaboration and achieve state-of-the-art performance across multiple benchmarks.

Authors:Yi Wang, Zhitong Xiong, Chenying Liu, Adam J. Stewart, Thomas Dujardin, Nikolaos Ioannis Bountos, Angelos Zavras, Franziska Gerken, Ioannis Papoutsis, Laura Leal-Taixé, Xiao Xiang Zhu
Title: Towards a Unified Copernicus Foundation Model for Earth Vision
Abstract:
Advances in Earth observation (EO) foundation models have unlocked the potential of big satellite data to learn generic representations from space, benefiting a wide range of downstream applications crucial to our planet. However, most existing efforts remain limited to fixed spectral sensors, focus solely on the Earth's surface, and overlook valuable metadata beyond imagery. In this work, we take a step towards next-generation EO foundation models with three key components: 1) Copernicus-Pretrain, a massive-scale pretraining dataset that integrates 18.7M aligned images from all major Copernicus Sentinel missions, spanning from the Earth's surface to its atmosphere; 2) Copernicus-FM, a unified foundation model capable of processing any spectral or non-spectral sensor modality using extended dynamic hypernetworks and flexible metadata encoding; and 3) Copernicus-Bench, a systematic evaluation benchmark with 15 hierarchical downstream tasks ranging from preprocessing to specialized applications for each Sentinel mission. Our dataset, model, and benchmark greatly improve the scalability, versatility, and multimodal adaptability of EO foundation models, while also creating new opportunities to connect EO, weather, and climate research. Codes, datasets and models are available at https://github.com/zhu-xlab/Copernicus-FM.
中文摘要:哥白尼项目通过整合多传感器卫星数据并采用统一架构,推出了新一代地球观测基础模型,大幅提升了可扩展性和跨领域应用能力。
English Summary: The Copernicus project introduces a next-generation Earth observation foundation model that integrates multi-sensor satellite data through a unified architecture, significantly enhancing scalability and cross-domain applications.

Authors:Alexander Weers, Alexander H. Berger, Laurin Lux, Peter Schüffler, Daniel Rueckert, Johannes C. Paetzold
Title: From Pixels to Histopathology: A Graph-Based Framework for Interpretable Whole Slide Image Analysis
Abstract:
The histopathological classification of whole-slide images (WSIs) is a fundamental task in digital pathology; yet it requires extensive time and expertise from specialists. While deep learning methods show promising results, they typically process WSIs by dividing them into artificial patches, which inherently prevents a network from learning from the entire image context, disregards natural tissue structures and compromises interpretability. Our method overcomes this limitation through a novel graph-based framework that constructs WSI graph representations. The WSI-graph efficiently captures essential histopathological information in a compact form. We build tissue representations (nodes) that follow biological boundaries rather than arbitrary patches all while providing interpretable features for explainability. Through adaptive graph coarsening guided by learned embeddings, we progressively merge regions while maintaining discriminative local features and enabling efficient global information exchange. In our method's final step, we solve the diagnostic task through a graph attention network. We empirically demonstrate strong performance on multiple challenging tasks such as cancer stage classification and survival prediction, while also identifying predictive factors using Integrated Gradients. Our implementation is publicly available at https://github.com/HistoGraph31/pix2pathology
中文摘要:本研究提出了一种基于生物信息的图框架,将全切片图像转化为可解释的表示形式用于癌症诊断,在显著减少资源使用的同时保持竞争力,并具备完全可解释性。
English Summary: This study introduces a biologically-informed graph framework that transforms whole-slide images into interpretable representations for cancer diagnosis, achieving competitive performance with significantly fewer resources while maintaining full interpretability.

Authors:Alexander Weers, Alexander H. Berger, Laurin Lux, Peter Schüffler, Daniel Rueckert, Johannes C. Paetzold
Title: A Graph-Based Framework for Interpretable Whole Slide Image Analysis
Abstract:
The histopathological analysis of whole-slide images (WSIs) is fundamental to cancer diagnosis but is a time-consuming and expert-driven process. While deep learning methods show promising results, dominant patch-based methods artificially fragment tissue, ignore biological boundaries, and produce black-box predictions. We overcome these limitations with a novel framework that transforms gigapixel WSIs into biologically-informed graph representations and is interpretable by design. Our approach builds graph nodes from tissue regions that respect natural structures, not arbitrary grids. We introduce an adaptive graph coarsening technique, guided by learned embeddings, to efficiently merge homogeneous regions while preserving diagnostically critical details in heterogeneous areas. Each node is enriched with a compact, interpretable feature set capturing clinically-motivated priors. A graph attention network then performs diagnosis on this compact representation. We demonstrate strong performance on challenging cancer staging and survival prediction tasks. Crucially, our resource-efficient model ($>$13x fewer parameters and $>$300x less data) achieves results competitive with a massive foundation model, while offering full interpretability through feature attribution. Our code is publicly available at https://github.com/HistoGraph31/pix2pathology.
中文摘要:本研究提出了一种基于生物信息的图框架,将全切片图像转化为可解释的表示形式用于癌症诊断,在显著减少资源使用的同时保持竞争力,并具备完全可解释性。
English Summary: This study introduces a biologically-informed graph framework that transforms whole-slide images into interpretable representations for cancer diagnosis, achieving competitive performance with significantly fewer resources while maintaining full interpretability.

Authors:Haoxin Liu, Harshavardhan Kamarthi, Zhiyuan Zhao, Shangqing Xu, Shiyu Wang, Qingsong Wen, Tom Hartvigsen, Fei Wang, B. Aditya Prakash
Title: How Can Time Series Analysis Benefit From Multiple Modalities? A Survey and Outlook
Abstract:
Time series analysis (TSA) is a longstanding research topic in the data mining community and has wide real-world significance. Compared to "richer" modalities such as language and vision, which have recently experienced explosive development and are densely connected, the time-series modality remains relatively underexplored and isolated. We notice that many recent TSA works have formed a new research field, i.e., Multiple Modalities for TSA (MM4TSA). In general, these MM4TSA works follow a common motivation: how TSA can benefit from multiple modalities. This survey is the first to offer a comprehensive review and a detailed outlook for this emerging field. Specifically, we systematically discuss three benefits: (1) reusing foundation models of other modalities for efficient TSA, (2) multimodal extension for enhanced TSA, and (3) cross-modality interaction for advanced TSA. We further group the works by the introduced modality type, including text, images, audio, tables, and others, within each perspective. Finally, we identify the gaps with future opportunities, including the reused modalities selections, heterogeneous modality combinations, and unseen tasks generalizations, corresponding to the three benefits. We release an up-to-date GitHub repository that includes key papers and resources.
中文摘要:本调查首次系统综述了多模态时间序列分析这一新兴领域,详细阐述了利用其他模态提升时间序列分析的三大优势,并指出了未来研究方向。
English Summary: This survey introduces MM4TSA as an emerging field exploring how time series analysis can benefit from integrating multiple modalities, systematically reviewing three key benefits and outlining future research opportunities.

Authors:Haoxin Liu, Harshavardhan Kamarthi, Zhiyuan Zhao, Shangqing Xu, Shiyu Wang, Qingsong Wen, Tom Hartvigsen, Fei Wang, B. Aditya Prakash
Title: How Can Time Series Analysis Benefit From Multiple Modalities? A Survey and Outlook
Abstract:
Time series analysis (TSA) is a longstanding research topic in the data mining community and has wide real-world significance. Compared to "richer" modalities such as language and vision, which have recently experienced explosive development and are densely connected, the time-series modality remains relatively underexplored and isolated. We notice that many recent TSA works have formed a new research field, i.e., Multiple Modalities for TSA (MM4TSA). In general, these MM4TSA works follow a common motivation: how TSA can benefit from multiple modalities. This survey is the first to offer a comprehensive review and a detailed outlook for this emerging field. Specifically, we systematically discuss three benefits: (1) reusing foundation models of other modalities for efficient TSA, (2) multimodal extension for enhanced TSA, and (3) cross-modality interaction for advanced TSA. We further group the works by the introduced modality type, including text, images, audio, tables, and others, within each perspective. Finally, we identify the gaps with future opportunities, including the reused modalities selections, heterogeneous modality combinations, and unseen tasks generalizations, corresponding to the three benefits. We release an up-to-date GitHub repository that includes key papers and resources.
中文摘要:本调查首次系统综述了多模态时间序列分析这一新兴领域,详细阐述了利用其他模态提升时间序列分析的三大优势,并指出了未来研究方向。
English Summary: This survey introduces MM4TSA as an emerging field exploring how time series analysis can benefit from integrating multiple modalities, systematically reviewing three key benefits and outlining future research opportunities.

Authors:Yiwei Chen, Yuguang Yao, Yihua Zhang, Bingquan Shen, Gaowen Liu, Sijia Liu
Title: Safety Mirage: How Spurious Correlations Undermine VLM Safety Fine-tuning
Abstract:
Recent vision-language models (VLMs) have made remarkable strides in generative modeling with multimodal inputs, particularly text and images. However, their susceptibility to generating harmful content when exposed to unsafe queries raises critical safety concerns. While current alignment strategies primarily rely on supervised safety fine-tuning with curated datasets, we identify a fundamental limitation we call the "safety mirage" where supervised fine-tuning inadvertently reinforces spurious correlations between superficial textual patterns and safety responses, rather than fostering deep, intrinsic mitigation of harm. We show that these spurious correlations leave fine-tuned VLMs vulnerable even to a simple one-word modification-based attack, where substituting a single word in text queries with a spurious correlation-inducing alternative can effectively bypass safeguards. Additionally, these correlations contribute to the over prudence, causing fine-tuned VLMs to refuse benign queries unnecessarily. To address this issue, we show machine unlearning (MU) as a powerful alternative to supervised safety fine-tuning as it avoids biased feature-label mappings and directly removes harmful knowledge from VLMs while preserving their general capabilities. Extensive evaluations across safety benchmarks show that under one-word attacks, MU-based alignment reduces the attack success rate by up to 60.17% and cuts unnecessary rejections by over 84.20%. Codes are available at https://github.com/OPTML-Group/VLM-Safety-MU. WARNING: There exist AI generations that may be offensive in nature.
中文: 近期视觉语言模型因监督安全微调产生的伪相关性易生成有害内容,而机器遗忘技术通过直接消除有害知识并保留模型能力,有效解决了这一问题。
English: Recent vision language models show vulnerability to generating harmful content due to spurious correlations from supervised safety fine-tuning, but machine unlearning effectively mitigates these risks by removing harmful knowledge while preserving model capabilities.

Authors:Peizhi Yan, Rabab K. Ward, Dan Wang, Qiang Tang, Shan Du
Title: StyleMorpheus: A Style-Based 3D-Aware Morphable Face Model
Abstract:
For 3D face modeling, the recently developed 3D-aware neural rendering methods are able to render photorealistic face images with arbitrary viewing directions. The training of the parametric controllable 3D-aware face models, however, still relies on a large-scale dataset that is lab-collected. To address this issue, this paper introduces "StyleMorpheus", the first style-based neural 3D Morphable Face Model (3DMM) that is trained on in-the-wild images. It inherits 3DMM's disentangled controllability (over face identity, expression, and appearance) but without the need for accurately reconstructed explicit 3D shapes. StyleMorpheus employs an auto-encoder structure. The encoder aims at learning a representative disentangled parametric code space and the decoder improves the disentanglement using shape and appearance-related style codes in the different sub-modules of the network. Furthermore, we fine-tune the decoder through style-based generative adversarial learning to achieve photorealistic 3D rendering quality. The proposed style-based design enables StyleMorpheus to achieve state-of-the-art 3D-aware face reconstruction results, while also allowing disentangled control of the reconstructed face. Our model achieves real-time rendering speed, allowing its use in virtual reality applications. We also demonstrate the capability of the proposed style-based design in face editing applications such as style mixing and color editing. Project homepage: https://github.com/ubc-3d-vision-lab/StyleMorpheus.
中文: 本文提出StyleMorpheus,首个基于风格且能在非受控图像上训练的3D可变形人脸模型,无需精确三维重建即可实现照片级真实感渲染,并保持对人脸属性的解耦控制。
English: This paper introduces StyleMorpheus, a style-based 3D Morphable Face Model trained on in-the-wild images that achieves photorealistic rendering with disentangled control over facial attributes without requiring explicit 3D shapes.

Authors:Artem Nikonorov, Georgy Perevozchikov, Andrei Korepanov, Nancy Mehta, Mahmoud Afifi, Egor Ershov, Radu Timofte
Title: Color Matching Using Hypernetwork-Based Kolmogorov-Arnold Networks
Abstract:
We present cmKAN, a versatile framework for color matching. Given an input image with colors from a source color distribution, our method effectively and accurately maps these colors to match a target color distribution in both supervised and unsupervised settings. Our framework leverages the spline capabilities of Kolmogorov-Arnold Networks (KANs) to model the color matching between source and target distributions. Specifically, we developed a hypernetwork that generates spatially varying weight maps to control the nonlinear splines of a KAN, enabling accurate color matching. As part of this work, we introduce a first large-scale dataset of paired images captured by two distinct cameras and evaluate the efficacy of our and existing methods in matching colors. We evaluated our approach across various color-matching tasks, including: (1) raw-to-raw mapping, where the source color distribution is in one camera's raw color space and the target in another camera's raw space; (2) raw-to-sRGB mapping, where the source color distribution is in a camera's raw space and the target is in the display sRGB space, emulating the color rendering of a camera ISP; and (3) sRGB-to-sRGB mapping, where the goal is to transfer colors from a source sRGB space (e.g., produced by a source camera ISP) to a target sRGB space (e.g., from a different camera ISP). The results show that our method outperforms existing approaches by 37.3% on average for supervised and unsupervised cases while remaining lightweight compared to other methods. The codes, dataset, and pre-trained models are available at: https://github.com/gosha20777/cmKAN
中文:cmKAN框架利用基于超网络生成样条的Kolmogorov-Arnold网络,在多种色彩匹配场景中实现卓越性能,以轻量化设计超越现有方法37.3%的显著优势。
English: The cmKAN framework utilizes Kolmogorov-Arnold Networks with hypernetwork-generated splines to achieve superior color matching across multiple scenarios, outperforming existing methods by 37.3% while maintaining a lightweight design.

Authors:Tianyi Zhao, Boyang Liu, Yanglei Gao, Yiming Sun, Maoxun Yuan, Xingxing Wei
Title: Rethinking Multi-Modal Object Detection from the Perspective of Mono-Modality Feature Learning
Abstract:
Multi-Modal Object Detection (MMOD), due to its stronger adaptability to various complex environments, has been widely applied in various applications. Extensive research is dedicated to the RGB-IR object detection, primarily focusing on how to integrate complementary features from RGB-IR modalities. However, they neglect the mono-modality insufficient learning problem, which arises from decreased feature extraction capability in multi-modal joint learning. This leads to a prevalent but unreasonable phenomenon\textemdash Fusion Degradation, which hinders the performance improvement of the MMOD model. Motivated by this, in this paper, we introduce linear probing evaluation to the multi-modal detectors and rethink the multi-modal object detection task from the mono-modality learning perspective. Therefore, we construct a novel framework called M$^2$D-LIF, which consists of the Mono-Modality Distillation (M$^2$D) method and the Local Illumination-aware Fusion (LIF) module. The M$^2$D-LIF framework facilitates the sufficient learning of mono-modality during multi-modal joint training and explores a lightweight yet effective feature fusion manner to achieve superior object detection performance. Extensive experiments conducted on three MMOD datasets demonstrate that our M$^2$D-LIF effectively mitigates the Fusion Degradation phenomenon and outperforms the previous SOTA detectors. The codes are available at https://github.com/Zhao-Tian-yi/M2D-LIF.
中文: 本文针对RGB-IR多模态目标检测中的融合退化问题,提出了M²D-LIF框架,通过增强单模态学习和实现有效特征融合来提升检测性能。
English: This paper addresses the Fusion Degradation issue in RGB-IR multi-modal object detection by proposing the M²D-LIF framework, which enhances mono-modality learning and enables effective feature fusion to improve detection performance.

Authors:Hyunwoo Park, Baekryun Seong, Sang-Ki Ko
Title: SPECTra: Scalable Multi-Agent Reinforcement Learning with Permutation-Free Networks
Abstract:
In cooperative multi-agent reinforcement learning (MARL), the permutation problem where the state space grows exponentially with the number of agents reduces sample efficiency. Additionally, many existing architectures struggle with scalability, relying on a fixed structure tied to a specific number of agents, limiting their applicability to environments with a variable number of entities. While approaches such as graph neural networks (GNNs) and self-attention mechanisms have progressed in addressing these challenges, they have significant limitations as dense GNNs and self-attention mechanisms incur high computational costs. To overcome these limitations, we propose a novel agent network and a non-linear mixing network that ensure permutation-equivariance and scalability, allowing them to generalize to environments with various numbers of agents. Our agent network significantly reduces computational complexity, and our scalable hypernetwork enables efficient weight generation for non-linear mixing. Additionally, we introduce curriculum learning to improve training efficiency. Experiments on SMACv2 and Google Research Football (GRF) demonstrate that our approach achieves superior learning performance compared to existing methods. By addressing both permutation-invariance and scalability in MARL, our work provides a more efficient and adaptable framework for cooperative MARL. Our code is available at https://github.com/funny-rl/SPECTra.
中文: 合作多智能体强化学习面临状态空间指数增长和可扩展性受限的挑战,我们提出的置换等变且可扩展的网络通过降低计算成本和提高训练效率解决了这些问题,在基准测试中实现了优越性能。
English: Cooperative multi-agent reinforcement learning faces challenges with exponential state space growth and limited scalability, which our proposed permutation-equivariant and scalable networks overcome by reducing computational costs and improving training efficiency, achieving superior performance in benchmarks.

Authors:Hanyang Zhao, Haoxian Chen, Yucheng Guo, Genta Indra Winata, Tingting Ou, Ziyu Huang, David D. Yao, Wenpin Tang
Title: Fine-Tuning Diffusion Generative Models via Rich Preference Optimization
Abstract:
We introduce Rich Preference Optimization (RPO), a novel pipeline that leverages rich feedback signals to improve the curation of preference pairs for fine-tuning text-to-image diffusion models. Traditional methods, like Diffusion-DPO, often rely solely on reward model labeling, which can be opaque, offer limited insights into the rationale behind preferences, and are prone to issues such as reward hacking or overfitting. In contrast, our approach begins with generating detailed critiques of synthesized images, from which we extract reliable and actionable image editing instructions. By implementing these instructions, we create refined images, resulting in synthetic, informative preference pairs that serve as enhanced tuning datasets. We demonstrate the effectiveness of our pipeline and the resulting datasets in fine-tuning state-of-the-art diffusion models. Our code is available at https://github.com/Diffusion-RLHF/RPO.
中文摘要:Rich Preference Optimization (RPO)提出了一种创新流程,通过详细图像评析生成可操作的编辑指令,创建出增强型偏好对,相比传统基于奖励的方法能更有效地微调文生图扩散模型。
English Summary: Rich Preference Optimization (RPO) introduces a novel pipeline that uses detailed image critiques to generate actionable editing instructions, creating enhanced preference pairs for fine-tuning text-to-image diffusion models more effectively than traditional reward-based methods.

Authors:Shunyu Liu, Wenkai Fang, Zetian Hu, Junjie Zhang, Yang Zhou, Kongcheng Zhang, Rongcheng Tu, Ting-En Lin, Fei Huang, Mingli Song, Yongbin Li, Dacheng Tao
Title: A Survey of Direct Preference Optimization
Abstract:
Large Language Models (LLMs) have demonstrated unprecedented generative capabilities, yet their alignment with human values remains critical for ensuring helpful and harmless deployments. While Reinforcement Learning from Human Feedback (RLHF) has emerged as a powerful paradigm for aligning LLMs with human preferences, its reliance on complex reward modeling introduces inherent trade-offs in computational efficiency and training stability. In this context, Direct Preference Optimization (DPO) has recently gained prominence as a streamlined alternative that directly optimizes LLMs using human preferences, thereby circumventing the need for explicit reward modeling. Owing to its theoretical elegance and computational efficiency, DPO has rapidly attracted substantial research efforts exploring its various implementations and applications. However, this field currently lacks systematic organization and comparative analysis. In this survey, we conduct a comprehensive overview of DPO and introduce a novel taxonomy, categorizing previous works into four key dimensions: data strategy, learning framework, constraint mechanism, and model property. We further present a rigorous empirical analysis of DPO variants across standardized benchmarks. Additionally, we discuss real-world applications, open challenges, and future directions for DPO. This work delivers both a conceptual framework for understanding DPO and practical guidance for practitioners, aiming to advance robust and generalizable alignment paradigms. All collected resources are available and will be continuously updated at https://github.com/liushunyu/awesome-direct-preference-optimization.
中文摘要:本综述系统梳理了直接偏好优化(DPO)方法,提出了新颖的分类框架并对其变体进行实证分析,同时探讨了这种高效语言模型对齐技术的实际应用与发展前景。
English Summary: This survey provides a comprehensive overview of Direct Preference Optimization (DPO), presenting a novel taxonomy and empirical analysis of its variants while discussing applications and future directions for this streamlined LLM alignment method.

Authors:Zhenyu Wang
Title: LogitLens4LLMs: Extending Logit Lens Analysis to Modern Large Language Models
Abstract:
This paper introduces LogitLens4LLMs, a toolkit that extends the Logit Lens technique to modern large language models. While Logit Lens has been a crucial method for understanding internal representations of language models, it was previously limited to earlier model architectures. Our work overcomes the limitations of existing implementations, enabling the technique to be applied to state-of-the-art architectures (such as Qwen-2.5 and Llama-3.1) while automating key analytical workflows. By developing component-specific hooks to capture both attention mechanisms and MLP outputs, our implementation achieves full compatibility with the HuggingFace transformer library while maintaining low inference overhead. The toolkit provides both interactive exploration and batch processing capabilities, supporting large-scale layer-wise analyses. Through open-sourcing our implementation, we aim to facilitate deeper investigations into the internal mechanisms of large-scale language models. The toolkit is openly available at https://github.com/zhenyu-02/LogitLens4LLMs.
中文: 本文介绍了LogitLens4LLMs工具包,它将Logit Lens技术扩展到现代大语言模型,支持对Qwen-2.5和Llama-3.1等先进架构的内部机制进行交互式和批量分析。
English: This paper presents LogitLens4LLMs, a toolkit that extends the Logit Lens technique to modern large language models like Qwen-2.5 and Llama-3.1, enabling comprehensive analysis of internal mechanisms with minimal performance impact.

Authors:Jianyuan Wang, Minghao Chen, Nikita Karaev, Andrea Vedaldi, Christian Rupprecht, David Novotny
Title: VGGT: Visual Geometry Grounded Transformer
Abstract:
We present VGGT, a feed-forward neural network that directly infers all key 3D attributes of a scene, including camera parameters, point maps, depth maps, and 3D point tracks, from one, a few, or hundreds of its views. This approach is a step forward in 3D computer vision, where models have typically been constrained to and specialized for single tasks. It is also simple and efficient, reconstructing images in under one second, and still outperforming alternatives that require post-processing with visual geometry optimization techniques. The network achieves state-of-the-art results in multiple 3D tasks, including camera parameter estimation, multi-view depth estimation, dense point cloud reconstruction, and 3D point tracking. We also show that using pretrained VGGT as a feature backbone significantly enhances downstream tasks, such as non-rigid point tracking and feed-forward novel view synthesis. Code and models are publicly available at https://github.com/facebookresearch/vggt.
Chinese: VGGT是一种前馈神经网络,能够从单张或多张图像中高效推断完整的3D场景属性,无需复杂后处理即可在多项任务中实现最先进的性能。
English: VGGT is a feed-forward neural network that efficiently infers comprehensive 3D scene attributes from varying numbers of views, achieving state-of-the-art performance across multiple tasks without complex post-processing.

Authors:Jianhong Bai, Menghan Xia, Xiao Fu, Xintao Wang, Lianrui Mu, Jinwen Cao, Zuozhu Liu, Haoji Hu, Xiang Bai, Pengfei Wan, Di Zhang
Title: ReCamMaster: Camera-Controlled Generative Rendering from A Single Video
Abstract:
Camera control has been actively studied in text or image conditioned video generation tasks. However, altering camera trajectories of a given video remains under-explored, despite its importance in the field of video creation. It is non-trivial due to the extra constraints of maintaining multiple-frame appearance and dynamic synchronization. To address this, we present ReCamMaster, a camera-controlled generative video re-rendering framework that reproduces the dynamic scene of an input video at novel camera trajectories. The core innovation lies in harnessing the generative capabilities of pre-trained text-to-video models through a simple yet powerful video conditioning mechanism--its capability is often overlooked in current research. To overcome the scarcity of qualified training data, we construct a comprehensive multi-camera synchronized video dataset using Unreal Engine 5, which is carefully curated to follow real-world filming characteristics, covering diverse scenes and camera movements. It helps the model generalize to in-the-wild videos. Lastly, we further improve the robustness to diverse inputs through a meticulously designed training strategy. Extensive experiments show that our method substantially outperforms existing state-of-the-art approaches. Our method also finds promising applications in video stabilization, super-resolution, and outpainting. Our code and dataset are publicly available at: https://github.com/KwaiVGI/ReCamMaster.
Chinese: ReCamMaster 是一种创新的视频重渲染框架,通过利用预训练文本-视频模型和精心构建的合成数据集,能够为输入视频生成新的摄像机轨迹,在视频生成任务中显著优于现有方法。
English: ReCamMaster is a novel framework that re-renders input videos with new camera trajectories by leveraging pre-trained text-to-video models and a curated synthetic dataset, significantly outperforming existing methods in video generation tasks.

Authors:Stefan Lionar, Jiabin Liang, Gim Hee Lee
Title: TreeMeshGPT: Artistic Mesh Generation with Autoregressive Tree Sequencing
Abstract:
We introduce TreeMeshGPT, an autoregressive Transformer designed to generate high-quality artistic meshes aligned with input point clouds. Instead of the conventional next-token prediction in autoregressive Transformer, we propose a novel Autoregressive Tree Sequencing where the next input token is retrieved from a dynamically growing tree structure that is built upon the triangle adjacency of faces within the mesh. Our sequencing enables the mesh to extend locally from the last generated triangular face at each step, and therefore reduces training difficulty and improves mesh quality. Our approach represents each triangular face with two tokens, achieving a compression rate of approximately 22% compared to the naive face tokenization. This efficient tokenization enables our model to generate highly detailed artistic meshes with strong point cloud conditioning, surpassing previous methods in both capacity and fidelity. Furthermore, our method generates mesh with strong normal orientation constraints, minimizing flipped normals commonly encountered in previous methods. Our experiments show that TreeMeshGPT enhances the mesh generation quality with refined details and normal orientation consistency.
中文: TreeMeshGPT是一种自回归Transformer,通过创新的自回归树序列和高效面片标记化,从输入点云生成高质量艺术网格,在细节、保真度和法向一致性方面优于现有方法。
English: TreeMeshGPT is an autoregressive Transformer that generates high-quality artistic meshes from input point clouds using a novel Autoregressive Tree Sequencing and efficient face tokenization, improving detail, fidelity, and normal orientation consistency over previous methods.

Authors:Zhiliang Chen, Xinyuan Niu, Chuan-Sheng Foo, Bryan Kian Hsiang Low
Title: Broaden your SCOPE! Efficient Multi-turn Conversation Planning for LLMs with Semantic Space
Abstract:
Large language models (LLMs) are used in chatbots or AI assistants to hold conversations with a human user. In such applications, the quality (e.g., user engagement, safety) of a conversation is important and can only be exactly known at the end of the conversation. To maximize its expected quality, conversation planning reasons about the stochastic transitions within a conversation to select the optimal LLM response at each turn. Existing simulation-based conversation planning algorithms typically select the optimal response by simulating future conversations with a large number of LLM queries at every turn. However, this process is extremely time-consuming and hence impractical for real-time conversations. This paper presents a novel approach called Semantic space COnversation Planning with improved Efficiency (SCOPE) that exploits the dense semantic representation of conversations to perform conversation planning efficiently. In particular, SCOPE models the stochastic transitions in conversation semantics and their associated rewards to plan entirely within the semantic space. This allows us to select the optimal LLM response at every conversation turn without needing additional LLM queries for simulation. As a result, SCOPE can perform conversation planning 70 times faster than conventional simulation-based planning algorithms when applied to a wide variety of conversation starters and two reward functions seen in the real world, yet achieving a higher reward within a practical planning budget. Our code can be found at: https://github.com/chenzhiliang94/convo-plan-SCOPE.
Chinese: SCOPE提出了一种在语义空间内进行高效对话规划的新方法,通过建模对话语义的随机转换,无需额外的大语言模型查询模拟,实现了比传统算法快70倍的速度,同时获得更高的奖励收益。
English: SCOPE introduces a novel method for efficient conversation planning by modeling transitions in semantic space, eliminating the need for time-consuming LLM simulations and achieving 70 times faster performance while improving reward outcomes.

Authors:Jonas Belouadi, Eddy Ilg, Margret Keuper, Hideki Tanaka, Masao Utiyama, Raj Dabre, Steffen Eger, Simone Paolo Ponzetto
Title: TikZero: Zero-Shot Text-Guided Graphics Program Synthesis
Abstract:
Automatically synthesizing figures from text captions is a compelling capability. However, achieving high geometric precision and editability requires representing figures as graphics programs in languages like TikZ, and aligned training data (i.e., graphics programs with captions) remains scarce. Meanwhile, large amounts of unaligned graphics programs and captioned raster images are more readily available. We reconcile these disparate data sources by presenting TikZero, which decouples graphics program generation from text understanding by using image representations as an intermediary bridge. It enables independent training on graphics programs and captioned images and allows for zero-shot text-guided graphics program synthesis during inference. We show that our method substantially outperforms baselines that can only operate with caption-aligned graphics programs. Furthermore, when leveraging caption-aligned graphics programs as a complementary training signal, TikZero matches or exceeds the performance of much larger models, including commercial systems like GPT-4o. Our code, datasets, and select models are publicly available.
中文摘要:TikZero通过图像表示将文本理解与图形程序生成解耦,实现了零样本文本到图形程序的合成,其性能超越基线模型,并在补充对齐数据时达到或超过大型商业系统的水平。
English Summary: TikZero enables zero-shot text-to-graphics program synthesis by decoupling text understanding from program generation through image representations, outperforming baseline models and matching larger commercial systems when supplemented with aligned data.

Authors:Piotr Bialas, Piotr Korcyl, Tomasz Stebel, Dawid Zapolski
Title: NeuMC -- a package for neural sampling for lattice field theories
Abstract:
We present the \texttt{NeuMC} software package, based on \pytorch, aimed at facilitating the research on neural samplers in lattice field theories. Neural samplers based on normalizing flows are becoming increasingly popular in the context of Monte-Carlo simulations as they can effectively approximate target probability distributions, possibly alleviating some shortcomings of the Markov chain Monte-Carlo methods. Our package provides tools to create such samplers for two-dimensional field theories.
Chinese: \texttt{NeuMC} 软件包基于 \pytorch,为二维场论中的神经采样器研究提供工具,利用归一化流有效逼近目标概率分布,以改进传统蒙特卡洛方法的不足。
English: The \texttt{NeuMC} software package, built on \pytorch, supports neural sampler research for lattice field theories by offering tools to develop normalizing flow-based samplers that approximate target distributions and address limitations of traditional Monte-Carlo methods.

Authors:Piotr Bialas, Piotr Korcyl, Tomasz Stebel, Dawid Zapolski
Title: NeuMC -- a package for neural sampling for lattice field theories
Abstract:
We present the \texttt{NeuMC} software package, based on \pytorch, aimed at facilitating the research on neural samplers in lattice field theories. Neural samplers based on normalizing flows are becoming increasingly popular in the context of Monte-Carlo simulations as they can effectively approximate target probability distributions, possibly alleviating some shortcomings of the Markov chain Monte-Carlo methods. Our package provides tools to create such samplers for two-dimensional field theories.
Chinese: \texttt{NeuMC} 软件包基于 \pytorch,为二维场论中的神经采样器研究提供工具,利用归一化流有效逼近目标概率分布,以改进传统蒙特卡洛方法的不足。
English: The \texttt{NeuMC} software package, built on \pytorch, supports neural sampler research for lattice field theories by offering tools to develop normalizing flow-based samplers that approximate target distributions and address limitations of traditional Monte-Carlo methods.

Authors:Seyed Mohammad Hadi Hosseini, Amir Mohammad Izadi, Ali Abdollahi, Armin Saghafian, Mahdieh Soleymani Baghshah
Title: T2I-FineEval: Fine-Grained Compositional Metric for Text-to-Image Evaluation
Abstract:
Although recent text-to-image generative models have achieved impressive performance, they still often struggle with capturing the compositional complexities of prompts including attribute binding, and spatial relationships between different entities. This misalignment is not revealed by common evaluation metrics such as CLIPScore. Recent works have proposed evaluation metrics that utilize Visual Question Answering (VQA) by decomposing prompts into questions about the generated image for more robust compositional evaluation. Although these methods align better with human evaluations, they still fail to fully cover the compositionality within the image. To address this, we propose a novel metric that breaks down images into components, and texts into fine-grained questions about the generated image for evaluation. Our method outperforms previous state-of-the-art metrics, demonstrating its effectiveness in evaluating text-to-image generative models. Code is available at https://github.com/hadi-hosseini/ T2I-FineEval.
中文: 当前文本到图像生成模型常难以准确呈现复杂组合提示,现有评估指标亦不足以衡量这些方面,为此我们提出了一种新方法,通过分解图像和文本来进行更精确的评估,其性能优于现有技术。
English: Current text-to-image models often fail to accurately represent complex compositional prompts, and existing metrics inadequately assess these aspects, prompting the development of a new evaluation method that decomposes images and texts for more precise analysis, which outperforms previous approaches.

Authors:Balaji Rama, Kai Mei, Yongfeng Zhang
Title: Cerebrum (AIOS SDK): A Platform for Agent Development, Deployment, Distribution, and Discovery
Abstract:
Autonomous LLM-based agents have emerged as a powerful paradigm for complex task execution, yet the field lacks standardized tools for development, deployment, distribution and discovery of agents. We present Cerebrum, an Agent SDK for AIOS that addresses this gap through three key components: (1) a comprehensive SDK featuring a modular four-layer architecture for agent development, encompassing LLM, memory, storage, and tool management; (2) a community-driven Agent Hub for sharing and discovering agents, complete with version control and dependency management; (3) an interactive web interface for testing and evaluating agents. The platform's effectiveness is demonstrated through implementations of various agent architectures, including Chain of Thought (CoT), ReAct, and tool-use agents. Cerebrum advances the field by providing a unified framework that standardizes agent development while maintaining flexibility for researchers and developers to innovate and distribute their agents. The live website is at https://app.aios.foundation, the code is at https://github.com/agiresearch/Cerebrum, and video is at https://app.aios.foundation/video-demo.
中文: Cerebrum作为AIOS的智能体SDK,通过模块化架构、社区平台和交互界面,为自主LLM智能体的开发、共享与评估提供了标准化解决方案,推动了该领域的统一发展。
English: Cerebrum is an Agent SDK for AIOS that standardizes autonomous LLM-based agent development through a modular architecture, community hub, and interactive interface, advancing the field with a unified framework.

Authors:Sanghyun Jo, Seo Jin Lee, Seungwoo Lee, Seohyung Hong, Hyungseok Seo, Kyungsu Kim
Title: COIN: Confidence Score-Guided Distillation for Annotation-Free Cell Segmentation
Abstract:
Cell instance segmentation (CIS) is crucial for identifying individual cell morphologies in histopathological images, providing valuable insights for biological and medical research. While unsupervised CIS (UCIS) models aim to reduce the heavy reliance on labor-intensive image annotations, they fail to accurately capture cell boundaries, causing missed detections and poor performance. Recognizing the absence of error-free instances as a key limitation, we present COIN (COnfidence score-guided INstance distillation), a novel annotation-free framework with three key steps: (1) Increasing the sensitivity for the presence of error-free instances via unsupervised semantic segmentation with optimal transport, leveraging its ability to discriminate spatially minor instances, (2) Instance-level confidence scoring to measure the consistency between model prediction and refined mask and identify highly confident instances, offering an alternative to ground truth annotations, and (3) Progressive expansion of confidence with recursive self-distillation. Extensive experiments across six datasets show COIN outperforming existing UCIS methods, even surpassing semi- and weakly-supervised approaches across all metrics on the MoNuSeg and TNBC datasets. The code is available at https://github.com/shjo-april/COIN.
中文摘要:COIN是一种无需标注的细胞实例分割框架,通过置信度引导的实例蒸馏技术逐步识别无错误实例,有效克服了无监督模型的局限性,在多个数据集上实现了超越现有方法的性能表现。
English Summary: COIN is an annotation-free cell instance segmentation framework that uses confidence-guided instance distillation to overcome the limitations of unsupervised models by progressively identifying error-free instances, achieving superior performance across multiple datasets.

Authors:Shuaifeng Jiao, Zhiwen Zeng, Zhuoqun Su, Xieyuanli Chen, Zongtan Zhou, Huimin Lu
Title: LuSeg: Efficient Negative and Positive Obstacles Segmentation via Contrast-Driven Multi-Modal Feature Fusion on the Lunar
Abstract:
As lunar exploration missions grow increasingly complex, ensuring safe and autonomous rover-based surface exploration has become one of the key challenges in lunar exploration tasks. In this work, we have developed a lunar surface simulation system called the Lunar Exploration Simulator System (LESS) and the LunarSeg dataset, which provides RGB-D data for lunar obstacle segmentation that includes both positive and negative obstacles. Additionally, we propose a novel two-stage segmentation network called LuSeg. Through contrastive learning, it enforces semantic consistency between the RGB encoder from Stage I and the depth encoder from Stage II. Experimental results on our proposed LunarSeg dataset and additional public real-world NPO road obstacle dataset demonstrate that LuSeg achieves state-of-the-art segmentation performance for both positive and negative obstacles while maintaining a high inference speed of approximately 57\,Hz. We have released the implementation of our LESS system, LunarSeg dataset, and the code of LuSeg at:https://github.com/nubot-nudt/LuSeg.
中文: 本研究开发了月球表面模拟系统、LunarSeg数据集及新型两阶段分割网络LuSeg,在保持高推理速度的同时,实现了障碍物分割的最先进性能。
English: This study introduces a lunar surface simulation system, the LunarSeg dataset, and a novel two-stage segmentation network, LuSeg, which achieves state-of-the-art performance in obstacle segmentation while maintaining high inference speed.

Authors:Tobias Morocutti, Florian Schmid, Jonathan Greif, Francesco Foscarin, Gerhard Widmer
Title: Exploring Performance-Complexity Trade-Offs in Sound Event Detection Models
Abstract:
We target the problem of developing new low-complexity networks for the sound event detection task. Our goal is to meticulously analyze the performance-complexity trade-off, aiming to be competitive with the large state-of-the-art models, at a fraction of the computational requirements. We find that low-complexity convolutional models previously proposed for audio tagging can be effectively adapted for event detection (which requires frame-wise prediction) by adjusting convolutional strides, removing the global pooling, and, importantly, adding a sequence model before the (now frame-wise) classification heads. Systematic experiments reveal that the best choice for the sequence model type depends on which complexity metric is most important for the given application. We also investigate the impact of enhanced training strategies such as knowledge distillation. In the end, we show that combined with an optimized training strategy, we can reach event detection performance comparable to state-of-the-art transformers while requiring only around 5% of the parameters. We release all our pre-trained models and the code for reproducing this work to support future research in low-complexity sound event detection at https://github.com/theMoro/EfficientSED.
中文摘要:本研究通过调整卷积步幅、移除全局池化及添加序列模型等结构改进,将音频标记模型有效适配于声音事件检测任务,结合优化训练策略后仅用约5%参数量即可达到与顶尖Transformer模型相当的性能。
English Summary: This study develops low-complexity networks for sound event detection by adapting audio tagging models through architectural modifications and training optimizations, achieving performance comparable to state-of-the-art transformers with only 5% of parameters.

Authors:Ziyue Wang, Chenghao Shi, Neng Wang, Qinghua Yu, Xieyuanli Chen, Huimin Lu
Title: BEVDiffLoc: End-to-End LiDAR Global Localization in BEV View based on Diffusion Model
Abstract:
Localization is one of the core parts of modern robotics. Classic localization methods typically follow the retrieve-then-register paradigm, achieving remarkable success. Recently, the emergence of end-to-end localization approaches has offered distinct advantages, including a streamlined system architecture and the elimination of the need to store extensive map data. Although these methods have demonstrated promising results, current end-to-end localization approaches still face limitations in robustness and accuracy. Bird's-Eye-View (BEV) image is one of the most widely adopted data representations in autonomous driving. It significantly reduces data complexity while preserving spatial structure and scale consistency, making it an ideal representation for localization tasks. However, research on BEV-based end-to-end localization remains notably insufficient. To fill this gap, we propose BEVDiffLoc, a novel framework that formulates LiDAR localization as a conditional generation of poses. Leveraging the properties of BEV, we first introduce a specific data augmentation method to significantly enhance the diversity of input data. Then, the Maximum Feature Aggregation Module and Vision Transformer are employed to learn robust features while maintaining robustness against significant rotational view variations. Finally, we incorporate a diffusion model that iteratively refines the learned features to recover the absolute pose. Extensive experiments on the Oxford Radar RobotCar and NCLT datasets demonstrate that BEVDiffLoc outperforms the baseline methods. Our code is available at https://github.com/nubot-nudt/BEVDiffLoc.
中文: BEVDiffLoc是一种新颖的端到端激光雷达定位框架,它利用鸟瞰图表示和扩散模型迭代优化位姿估计,在基准数据集上展现出优于基线方法的性能。
English: BEVDiffLoc is a novel end-to-end LiDAR localization framework that uses Bird's-Eye-View representation and a diffusion model to iteratively refine pose estimation, demonstrating superior performance over baseline methods on benchmark datasets.

Authors:Insu Jang, Runyu Lu, Nikhil Bansal, Ang Chen, Mosharaf Chowdhury
Title: Cornstarch: Distributed Multimodal Training Must Be Multimodality-Aware
Abstract:
Multimodal large language models (MLLMs) extend the capabilities of large language models (LLMs) by combining heterogeneous model architectures to handle diverse modalities like images and audio. However, this inherent heterogeneity in MLLM model structure and data types makes makeshift extensions to existing LLM training frameworks unsuitable for efficient MLLM training. In this paper, we present Cornstarch, the first general-purpose distributed MLLM training framework. Cornstarch facilitates modular MLLM construction, enables composable parallelization of constituent models, and introduces MLLM-specific optimizations to pipeline and context parallelism for efficient distributed MLLM training. Our evaluation shows that Cornstarch outperforms state-of-the-art solutions by up to $1.57\times$ in terms of training throughput. Cornstarch is an open-source project available at https://github.com/cornstarch-org/Cornstarch.
Chinese: Cornstarch 是首个通用的分布式多模态大语言模型训练框架,通过模块化构建、可组合并行化及针对性优化,将训练吞吐量提升至现有方案的1.57倍,并已开源发布。
English: Cornstarch is a pioneering distributed training framework for multimodal large language models (MLLMs), offering modular construction, composable parallelization, and specialized optimizations that boost training throughput by up to 1.57 times compared to existing solutions.

Authors:Samuel Mallick, Gianpietro Battocletti, Qizhang Dong, Azita Dabiri, Bart De Schutter
Title: Learning-Based MPC for Fuel Efficient Control of Autonomous Vehicles with Discrete Gear Selection
Abstract:
Co-optimization of both vehicle speed and gear position via model predictive control (MPC) has been shown to offer benefits for fuel-efficient autonomous driving. However, optimizing both the vehicle's continuous dynamics and discrete gear positions may be too computationally intensive for a real-time implementation. This work proposes a learning-based MPC scheme to address this issue. A policy is trained to select and fix the gear positions across the prediction horizon of the MPC controller, leaving a significantly simpler continuous optimization problem to be solved online. In simulation, the proposed approach is shown to have a significantly lower computation burden and a comparable performance, with respect to pure MPC-based co-optimization.
Chinese: 本研究提出了一种基于学习的模型预测控制方法,通过预先选择档位来简化优化过程,在保持与传统协同优化方法相当性能的同时,显著降低了计算负担。
English: This study introduces a learning-based model predictive control (MPC) method that pre-selects gear positions to simplify the optimization process, reducing computational demands while maintaining performance comparable to traditional co-optimization approaches.

Authors:Fengyu Li, Yilin Li, Junhao Zhu, Lu Chen, Yanfei Zhang, Jia Zhou, Hui Zu, Jingwen Zhao, Yunjun Gao
Title: AIstorian lets AI be a historian: A KG-powered multi-agent system for accurate biography generation
Abstract:
Huawei has always been committed to exploring the AI application in historical research. Biography generation, as a specialized form of abstractive summarization, plays a crucial role in historical research but faces unique challenges that existing large language models (LLMs) struggle to address. These challenges include maintaining stylistic adherence to historical writing conventions, ensuring factual fidelity, and handling fragmented information across multiple documents. We present AIstorian, a novel end-to-end agentic system featured with a knowledge graph (KG)-powered retrieval-augmented generation (RAG) and anti-hallucination multi-agents. Specifically, AIstorian introduces an in-context learning based chunking strategy and a KG-based index for accurate and efficient reference retrieval. Meanwhile, AIstorian orchestrates multi-agents to conduct on-the-fly hallucination detection and error-type-aware correction. Additionally, to teach LLMs a certain language style, we finetune LLMs based on a two-step training approach combining data augmentation-enhanced supervised fine-tuning with stylistic preference optimization. Extensive experiments on a real-life historical Jinshi dataset demonstrate that AIstorian achieves a 3.8x improvement in factual accuracy and a 47.6% reduction in hallucination rate compared to existing baselines. The data and code are available at: https://github.com/ZJU-DAILY/AIstorian.
Chinese: 华为开发的AIstorian系统通过基于知识图谱的检索增强生成与多智能体抗幻觉框架,在历史传记生成中实现了事实准确性提升3.8倍、幻觉率降低47.6%的突破性进展。
English: Huawei's AIstorian system enhances historical biography generation by employing a knowledge graph-based RAG and multi-agent anti-hallucination framework, achieving a 3.8x boost in factual accuracy and a 47.6% reduction in hallucinations compared to existing methods.

Authors:M. Akın Yılmaz, Ahmet Bilican, A. Murat Tekalp
Title: FG-DFPN: Flow Guided Deformable Frame Prediction Network
Abstract:
Video frame prediction remains a fundamental challenge in computer vision with direct implications for autonomous systems, video compression, and media synthesis. We present FG-DFPN, a novel architecture that harnesses the synergy between optical flow estimation and deformable convolutions to model complex spatio-temporal dynamics. By guiding deformable sampling with motion cues, our approach addresses the limitations of fixed-kernel networks when handling diverse motion patterns. The multi-scale design enables FG-DFPN to simultaneously capture global scene transformations and local object movements with remarkable precision. Our experiments demonstrate that FG-DFPN achieves state-of-the-art performance on eight diverse MPEG test sequences, outperforming existing methods by 1dB PSNR while maintaining competitive inference speeds. The integration of motion cues with adaptive geometric transformations makes FG-DFPN a promising solution for next-generation video processing systems that require high-fidelity temporal predictions. The model and instructions to reproduce our results will be released at: https://github.com/KUIS-AI-Tekalp-Research Group/frame-prediction
中文: FG-DFPN提出了一种新颖的视频帧预测架构,通过将光流与可变形卷积相结合来有效建模复杂时空动态,在多个测试序列上实现了最优性能,同时保持了高效的推理速度。
English: FG-DFPN introduces a novel video frame prediction architecture that integrates optical flow with deformable convolutions to effectively model complex spatio-temporal dynamics, achieving state-of-the-art performance on multiple test sequences while maintaining efficient inference speeds.

Authors:Moein Sorkhei, Emir Konuk, Kevin Smith, Christos Matsoukas
Title: APLA: A Simple Adaptation Method for Vision Transformers
Abstract:
Existing adaptation techniques typically require architectural modifications or added parameters, leading to high computational costs and complexity. We introduce Attention Projection Layer Adaptation (APLA), a simple approach to adapt vision transformers (ViTs) without altering the architecture or adding parameters. Through a systematic analysis, we find that the layer immediately after the attention mechanism is crucial for adaptation. By updating only this projection layer, or even just a random subset of this layer's weights, APLA achieves state-of-the-art performance while reducing GPU memory usage by up to 52.63% and training time by up to 43.0%, with no extra cost at inference. Across 46 datasets covering a variety of tasks including scene classification, medical imaging, satellite imaging, and fine-grained classification, APLA consistently outperforms 17 other leading adaptation methods, including full fine-tuning, on classification, segmentation, and detection tasks. The code is available at https://github.com/MoeinSorkhei/APLA.
中文: APLA提出了一种新颖的视觉变换器适配方法,仅更新注意力机制后的投影层,无需改变架构或增加参数即可显著降低计算成本和内存使用,同时获得卓越性能。
English: APLA introduces a novel adaptation method for vision transformers by updating only the post-attention projection layer, achieving superior performance while significantly reducing computational costs and memory usage without architectural changes or added parameters.

Authors:Jeong Hun Yeo, Hyeongseop Rha, Se Jin Park, Yong Man Ro
Title: MMS-LLaMA: Efficient LLM-based Audio-Visual Speech Recognition with Minimal Multimodal Speech Tokens
Abstract:
Audio-Visual Speech Recognition (AVSR) achieves robust speech recognition in noisy environments by combining auditory and visual information. However, recent Large Language Model (LLM) based AVSR systems incur high computational costs due to the high temporal resolution of audio-visual speech processed by LLMs. In this work, we introduce an efficient multimodal speech LLM framework that minimizes token length while preserving essential linguistic content. Our approach employs an early AV-fusion module for streamlined feature integration, an audio-visual speech Q-Former that dynamically allocates tokens based on input duration, and a refined query allocation strategy with a speech rate predictor to adjust token allocation according to speaking speed of each audio sample. Extensive experiments on the LRS3 dataset show that our method achieves state-of-the-art performance with a WER of 0.72% while using only 3.5 tokens per second. Moreover, our approach not only reduces token usage by 86% compared to the previous multimodal speech LLM framework, but also improves computational efficiency by reducing FLOPs by 35.7%.
中文: 本研究提出了一种高效的多模态语音大模型框架,通过早期融合和自适应令牌分配策略,在LRS3数据集上实现了0.72%的词错误率,同时将令牌使用量减少86%并降低35.7%的计算量。
English: This study introduces an efficient multimodal speech LLM framework that reduces token usage by 86% and computational load by 35.7% while achieving state-of-the-art 0.72% WER on LRS3 through early fusion and adaptive token allocation strategies.

Authors:Michael Hanna, Yonatan Belinkov, Sandro Pezzelle
Title: Are formal and functional linguistic mechanisms dissociated in language models?
Abstract:
Although large language models (LLMs) are increasingly capable, these capabilities are unevenly distributed: they excel at formal linguistic tasks, such as producing fluent, grammatical text, but struggle more with functional linguistic tasks like reasoning and consistent fact retrieval. Inspired by neuroscience, recent work suggests that to succeed on both formal and functional linguistic tasks, LLMs should use different mechanisms for each; such localization could either be built-in or emerge spontaneously through training. In this paper, we ask: do current models, with fast-improving functional linguistic abilities, exhibit distinct localization of formal and functional linguistic mechanisms? We answer this by finding and comparing the "circuits", or minimal computational subgraphs, responsible for various formal and functional tasks. Comparing 5 LLMs across 10 distinct tasks, we find that while there is indeed little overlap between circuits for formal and functional tasks, there is also little overlap between formal linguistic tasks, as exists in the human brain. Thus, a single formal linguistic network, unified and distinct from functional task circuits, remains elusive. However, in terms of cross-task faithfulness - the ability of one circuit to solve another's task - we observe a separation between formal and functional mechanisms, suggesting that shared mechanisms between formal tasks may exist.
中文: 大语言模型在形式与功能语言任务间的处理回路重叠甚少,尚未形成统一的形式网络,但跨任务忠实性表明形式任务可能存在共享机制。
English: Large language models exhibit minimal overlap between circuits for formal and functional linguistic tasks, with no unified formal network emerging, yet cross-task faithfulness suggests potential shared mechanisms for formal tasks.

Authors:Yuanshuo Zhang, Yuchen Hou, Bohan Tang, Shuo Chen, Muhan Zhang, Xiaowen Dong, Siheng Chen
Title: GNNs as Predictors of Agentic Workflow Performances
Abstract:
Agentic workflows invoked by Large Language Models (LLMs) have achieved remarkable success in handling complex tasks. However, optimizing such workflows is costly and inefficient in real-world applications due to extensive invocations of LLMs. To fill this gap, this position paper formulates agentic workflows as computational graphs and advocates Graph Neural Networks (GNNs) as efficient predictors of agentic workflow performances, avoiding repeated LLM invocations for evaluation. To empirically ground this position, we construct FLORA-Bench, a unified platform for benchmarking GNNs for predicting agentic workflow performances. With extensive experiments, we arrive at the following conclusion: GNNs are simple yet effective predictors. This conclusion supports new applications of GNNs and a novel direction towards automating agentic workflow optimization. All codes, models, and data are available at https://github.com/youngsoul0731/Flora-Bench.
中文: 本文提出将智能体工作流建模为计算图,并利用图神经网络(GNN)作为高效性能预测器,通过FLORA-Bench平台验证了该方法能有效避免重复调用大语言模型,为工作流优化开辟了新方向。
English: This paper proposes using Graph Neural Networks (GNNs) to efficiently predict the performance of agentic workflows modeled as computational graphs, avoiding costly repeated LLM invocations and demonstrating their effectiveness through the FLORA-Bench platform.

Authors:Jonas Utz, Stefan Vocht, Anne Tjorven Buessen, Dennis Possart, Fabian Wagner, Mareike Thies, Mingxuan Gu, Stefan Uderhardt, Katharina Breininger
Title: CyclePose -- Leveraging Cycle-Consistency for Annotation-Free Nuclei Segmentation in Fluorescence Microscopy
Abstract:
In recent years, numerous neural network architectures specifically designed for the instance segmentation of nuclei in microscopic images have been released. These models embed nuclei-specific priors to outperform generic architectures like U-Nets; however, they require large annotated datasets, which are often not available. Generative models (GANs, diffusion models) have been used to compensate for this by synthesizing training data. These two-stage approaches are computationally expensive, as first a generative model and then a segmentation model has to be trained. We propose CyclePose, a hybrid framework integrating synthetic data generation and segmentation training. CyclePose builds on a CycleGAN architecture, which allows unpaired translation between microscopy images and segmentation masks. We embed a segmentation model into CycleGAN and leverage a cycle consistency loss for self-supervision. Without annotated data, CyclePose outperforms other weakly or unsupervised methods on two public datasets. Code is available at https://github.com/jonasutz/CyclePose
中文摘要:近年来,针对显微图像中细胞核实例分割的神经网络需大量标注数据,而生成模型虽能合成数据但计算成本高;CyclePose作为集成合成数据生成与分割训练的混合框架,在无需标注数据的情况下优于其他弱监督或无监督方法。
English Summary: Recent neural networks for nuclei segmentation require large annotated datasets, but generative models can compensate by synthesizing data, though they are computationally expensive; CyclePose, a hybrid framework integrating synthetic data generation and segmentation training, outperforms other weakly or unsupervised methods without annotated data.

Authors:Sahil Kale, Vijaykant Nadadur
Title: Line of Duty: Evaluating LLM Self-Knowledge via Consistency in Feasibility Boundaries
Abstract:
As LLMs grow more powerful, their most profound achievement may be recognising when to say "I don't know". Existing studies on LLM self-knowledge have been largely constrained by human-defined notions of feasibility, often neglecting the reasons behind unanswerability by LLMs and failing to study deficient types of self-knowledge. This study aims to obtain intrinsic insights into different types of LLM self-knowledge with a novel methodology: allowing them the flexibility to set their own feasibility boundaries and then analysing the consistency of these limits. We find that even frontier models like GPT-4o and Mistral Large are not sure of their own capabilities more than 80% of the time, highlighting a significant lack of trustworthiness in responses. Our analysis of confidence balance in LLMs indicates that models swing between overconfidence and conservatism in feasibility boundaries depending on task categories and that the most significant self-knowledge weaknesses lie in temporal awareness and contextual understanding. These difficulties in contextual comprehension additionally lead models to question their operational boundaries, resulting in considerable confusion within the self-knowledge of LLMs. We make our code and results available publicly at https://github.com/knowledge-verse-ai/LLM-Self_Knowledge_Eval
中文: 研究表明即使如GPT-4o和Mistral Large等前沿大语言模型也普遍缺乏可靠的自我认知能力,超过80%的情况下无法准确判断自身能力边界,尤其在时间感知和语境理解方面存在显著缺陷。
English: This study reveals that even advanced LLMs like GPT-4o and Mistral Large lack reliable self-knowledge, frequently struggling with temporal awareness and contextual understanding which causes them to question their own operational boundaries over 80% of the time.

Authors:Haoyang Huang, Guoqing Ma, Nan Duan, Xing Chen, Changyi Wan, Ranchen Ming, Tianyu Wang, Bo Wang, Zhiying Lu, Aojie Li, Xianfang Zeng, Xinhao Zhang, Gang Yu, Yuhe Yin, Qiling Wu, Wen Sun, Kang An, Xin Han, Deshan Sun, Wei Ji, Bizhu Huang, Brian Li, Chenfei Wu, Guanzhe Huang, Huixin Xiong, Jiaxin He, Jianchang Wu, Jianlong Yuan, Jie Wu, Jiashuai Liu, Junjing Guo, Kaijun Tan, Liangyu Chen, Qiaohui Chen, Ran Sun, Shanshan Yuan, Shengming Yin, Sitong Liu, Wei Chen, Yaqi Dai, Yuchu Luo, Zheng Ge, Zhisheng Guan, Xiaoniu Song, Yu Zhou, Binxing Jiao, Jiansheng Chen, Jing Li, Shuchang Zhou, Xiangyu Zhang, Yi Xiu, Yibo Zhu, Heung-Yeung Shum, Daxin Jiang
Title: Step-Video-TI2V Technical Report: A State-of-the-Art Text-Driven Image-to-Video Generation Model
Abstract:
We present Step-Video-TI2V, a state-of-the-art text-driven image-to-video generation model with 30B parameters, capable of generating videos up to 102 frames based on both text and image inputs. We build Step-Video-TI2V-Eval as a new benchmark for the text-driven image-to-video task and compare Step-Video-TI2V with open-source and commercial TI2V engines using this dataset. Experimental results demonstrate the state-of-the-art performance of Step-Video-TI2V in the image-to-video generation task. Both Step-Video-TI2V and Step-Video-TI2V-Eval are available at https://github.com/stepfun-ai/Step-Video-TI2V.
中文摘要:Step-Video-TI2V作为先进的300亿参数模型,能够根据文本和图像输入生成视频,在新基准测试中相比现有方案展现了最优性能。
English Summary: Step-Video-TI2V is a state-of-the-art 30B-parameter model that generates videos from text and image inputs, achieving top performance on a new benchmark compared to existing solutions.

Authors:Suchanun Piriyasatit, Ercan Engin Kuruoglu, Mehmet Sinan Ozeren
Title: Spatio-Temporal Graph Structure Learning for Earthquake Detection
Abstract:
Earthquake detection is essential for earthquake early warning (EEW) systems. Traditional methods struggle with low signal-to-noise ratios and single-station reliance, limiting their effectiveness. We propose a Spatio-Temporal Graph Convolutional Network (GCN) using Spectral Structure Learning Convolution (Spectral SLC) to model static and dynamic relationships across seismic stations. Our approach processes multi-station waveform data and generates station-specific detection probabilities. Experiments show superior performance over a conventional GCN baseline in terms of true positive rate (TPR) and false positive rate (FPR), highlighting its potential for robust multi-station earthquake detection. The code repository for this study is available at https://github.com/SuchanunP/eq_detector.
Chinese: 本研究提出了一种采用谱结构学习卷积的时空图卷积网络,通过建模多台站关系来改进地震检测,实验证明其在真阳性率和假阳性率上优于传统方法。
English: The study introduces a Spatio-Temporal Graph Convolutional Network with Spectral Structure Learning Convolution to enhance earthquake detection by modeling multi-station relationships, demonstrating superior performance in true and false positive rates compared to traditional methods.

Authors:Giacomo Camposampiero, Michael Hersche, Roger Wattenhofer, Abu Sebastian, Abbas Rahimi
Title: Can Large Reasoning Models do Analogical Reasoning under Perceptual Uncertainty?
Abstract:
This work presents a first evaluation of two state-of-the-art Large Reasoning Models (LRMs), OpenAI's o3-mini and DeepSeek R1, on analogical reasoning, focusing on well-established nonverbal human IQ tests based on Raven's progressive matrices. We benchmark with the I-RAVEN dataset and its extension, I-RAVEN-X, which tests the ability to generalize to longer reasoning rules and ranges of the attribute values. To assess the influence of visual uncertainties on these symbolic analogical reasoning tests, we extend the I-RAVEN-X dataset, which otherwise assumes an oracle perception. We adopt a two-fold strategy to simulate this imperfect visual perception: 1) we introduce confounding attributes which, being sampled at random, do not contribute to the prediction of the correct answer of the puzzles, and 2) we smoothen the distributions of the input attributes' values. We observe a sharp decline in OpenAI's o3-mini task accuracy, dropping from 86.6% on the original I-RAVEN to just 17.0% -- approaching random chance -- on the more challenging I-RAVEN-X, which increases input length and range and emulates perceptual uncertainty. This drop occurred despite spending 3.4x more reasoning tokens. A similar trend is also observed for DeepSeek R1: from 80.6% to 23.2%. On the other hand, a neuro-symbolic probabilistic abductive model, ARLC, that achieves state-of-the-art performances on I-RAVEN, can robustly reason under all these out-of-distribution tests, maintaining strong accuracy with only a modest accuracy reduction from 98.6% to 88.0%. Our code is available at https://github.com/IBM/raven-large-language-models.
中文摘要:本研究基于瑞文推理测验评估OpenAI的o3-mini和DeepSeek R1的类比推理能力,发现在视觉不确定性条件下两者准确率急剧下降,而神经符号模型ARLC仍保持稳定表现。
English summary: This study evaluates OpenAI's o3-mini and DeepSeek R1 on analogical reasoning using Raven's matrices, finding their accuracy drops sharply under visual uncertainty conditions while the neuro-symbolic ARLC model maintains robust performance.

Authors:Gang Li, Jizhong Liu, Heinrich Dinkel, Yadong Niu, Junbo Zhang, Jian Luan
Title: Reinforcement Learning Outperforms Supervised Fine-Tuning: A Case Study on Audio Question Answering
Abstract:
Recently, reinforcement learning (RL) has been shown to greatly enhance the reasoning capabilities of large language models (LLMs), and RL-based approaches have been progressively applied to visual multimodal tasks. However, the audio modality has largely been overlooked in these developments. Thus, we conduct a series of RL explorations in audio understanding and reasoning, specifically focusing on the audio question answering (AQA) task. We leverage the group relative policy optimization (GRPO) algorithm to Qwen2-Audio-7B-Instruct, and our experiments demonstrated state-of-the-art performance on the MMAU Test-mini benchmark, achieving an accuracy rate of 64.5%. The main findings in this technical report are as follows: 1) The GRPO algorithm can be effectively applied to large audio language models (LALMs), even when the model has only 8.2B parameters; 2) With only 38k post-training samples, RL significantly outperforms supervised fine-tuning (SFT), indicating that RL-based approaches can be effective without large datasets; 3) The explicit reasoning process has not shown significant benefits for AQA tasks, and how to efficiently utilize deep thinking remains an open question for further research; 4) LALMs still lag far behind humans auditory-language reasoning, suggesting that the RL-based approaches warrant further exploration. Our project is available at https://github.com/xiaomi-research/r1-aqa and https://huggingface.co/mispeech/r1-aqa.
中文:强化学习显著提升了大型音频语言模型的理解能力,通过GRPO算法在MMAU基准测试中取得领先性能,但在听觉推理方面仍远逊于人类水平。
English: Reinforcement learning significantly enhances audio understanding in large audio language models, achieving state-of-the-art performance on the MMAU benchmark with the GRPO algorithm, yet still lags behind human auditory reasoning.

Authors:Leqi Shen, Guoqiang Gong, Tao He, Yifeng Zhang, Pengzhang Liu, Sicheng Zhao, Guiguang Ding
Title: FastVID: Dynamic Density Pruning for Fast Video Large Language Models
Abstract:
Video Large Language Models have demonstrated strong video understanding capabilities, yet their practical deployment is hindered by substantial inference costs caused by redundant video tokens. Existing pruning techniques fail to fully exploit the spatiotemporal redundancy inherent in video data. To bridge this gap, we perform a systematic analysis of video redundancy from two perspectives: temporal context and visual context. Leveraging these insights, we propose Dynamic Density Pruning for Fast Video LLMs termed FastVID. Specifically, FastVID dynamically partitions videos into temporally ordered segments to preserve temporal structure and applies a density-based token pruning strategy to maintain essential visual information. Our method significantly reduces computational overhead while maintaining temporal and visual integrity. Extensive evaluations show that FastVID achieves state-of-the-art performance across various short- and long-video benchmarks on leading Video LLMs, including LLaVA-OneVision and LLaVA-Video. Notably, on LLaVA-OneVision-7B, FastVID effectively prunes $\textbf{90.3%}$ of video tokens, reduces FLOPs to $\textbf{8.3%}$, and accelerates the prefilling stage by $\textbf{7.1}\times$, while maintaining $\textbf{98.0%}$ of the original accuracy. The code is available at https://github.com/LunarShen/FastVID.
中文摘要:FastVID提出动态密度剪枝方法,通过保留视频时空结构的关键信息,在减少90.3%视频令牌的同时保持98%原始准确率,大幅提升视频大语言模型的推理效率。
English Summary: FastVID introduces a dynamic density pruning method that significantly reduces video token redundancy, cutting computational costs by over 90% while maintaining near-original accuracy in Video LLMs.

Authors:Leideng Shi, Juan Zhang
Title: Multimodal-Aware Fusion Network for Referring Remote Sensing Image Segmentation
Abstract:
Referring remote sensing image segmentation (RRSIS) is a novel visual task in remote sensing images segmentation, which aims to segment objects based on a given text description, with great significance in practical application. Previous studies fuse visual and linguistic modalities by explicit feature interaction, which fail to effectively excavate useful multimodal information from dual-branch encoder. In this letter, we design a multimodal-aware fusion network (MAFN) to achieve fine-grained alignment and fusion between the two modalities. We propose a correlation fusion module (CFM) to enhance multi-scale visual features by introducing adaptively noise in transformer, and integrate cross-modal aware features. In addition, MAFN employs multi-scale refinement convolution (MSRC) to adapt to the various orientations of objects at different scales to boost their representation ability to enhances segmentation accuracy. Extensive experiments have shown that MAFN is significantly more effective than the state of the art on RRSIS-D datasets. The source code is available at https://github.com/Roaxy/MAFN.
中文摘要:本文提出了一种多模态感知融合网络(MAFN),通过自适应噪声注入和多尺度优化有效对齐并融合视觉与语言特征,显著提升了遥感图像参照分割的精度,在基准数据集上实现了最优性能。
English Summary: This paper introduces a Multimodal-Aware Fusion Network (MAFN) that enhances remote sensing image segmentation by effectively aligning and fusing visual and linguistic features through adaptive noise injection and multi-scale refinement, achieving state-of-the-art performance on benchmark datasets.

Authors:Haonan Wang, Qixiang Zhang, Lehan Wang, Xuanqi Huang, Xiaomeng Li
Title: Neurons: Emulating the Human Visual Cortex Improves Fidelity and Interpretability in fMRI-to-Video Reconstruction
Abstract:
Decoding visual stimuli from neural activity is essential for understanding the human brain. While fMRI methods have successfully reconstructed static images, fMRI-to-video reconstruction faces challenges due to the need for capturing spatiotemporal dynamics like motion and scene transitions. Recent approaches have improved semantic and perceptual alignment but struggle to integrate coarse fMRI data with detailed visual features. Inspired by the hierarchical organization of the visual system, we propose NEURONS, a novel framework that decouples learning into four correlated sub-tasks: key object segmentation, concept recognition, scene description, and blurry video reconstruction. This approach simulates the visual cortex's functional specialization, allowing the model to capture diverse video content. In the inference stage, NEURONS generates robust conditioning signals for a pre-trained text-to-video diffusion model to reconstruct the videos. Extensive experiments demonstrate that NEURONS outperforms state-of-the-art baselines, achieving solid improvements in video consistency (26.6%) and semantic-level accuracy (19.1%). Notably, NEURONS shows a strong functional correlation with the visual cortex, highlighting its potential for brain-computer interfaces and clinical applications. Code and model weights are available at: https://github.com/xmed-lab/NEURONS.
中文摘要:NEURONS框架受视觉皮层启发,将fMRI到视频的重建解耦为四个层次化子任务,通过与扩散模型结合显著提升视频连贯性26.6%和语义准确率19.1%。
English Summary: The NEURONS framework decouples fMRI-to-video reconstruction into four hierarchical sub-tasks inspired by the visual cortex, significantly improving video consistency by 26.6% and semantic accuracy by 19.1% through integration with a diffusion model.

Authors:Neng Wang, Huimin Lu, Zhiqiang Zheng, Hesheng Wang, Yun-Hui Liu, Xieyuanli Chen
Title: Leveraging Semantic Graphs for Efficient and Robust LiDAR SLAM
Abstract:
Accurate and robust simultaneous localization and mapping (SLAM) is crucial for autonomous mobile systems, typically achieved by leveraging the geometric features of the environment. Incorporating semantics provides a richer scene representation that not only enhances localization accuracy in SLAM but also enables advanced cognitive functionalities for downstream navigation and planning tasks. Existing point-wise semantic LiDAR SLAM methods often suffer from poor efficiency and generalization, making them less robust in diverse real-world scenarios. In this paper, we propose a semantic graph-enhanced SLAM framework, named SG-SLAM, which effectively leverages the geometric, semantic, and topological characteristics inherent in environmental structures. The semantic graph serves as a fundamental component that facilitates critical functionalities of SLAM, including robust relocalization during odometry failures, accurate loop closing, and semantic graph map construction. Our method employs a dual-threaded architecture, with one thread dedicated to online odometry and relocalization, while the other handles loop closure, pose graph optimization, and map update. This design enables our method to operate in real time and generate globally consistent semantic graph maps and point cloud maps. We extensively evaluate our method across the KITTI, MulRAN, and Apollo datasets, and the results demonstrate its superiority compared to state-of-the-art methods. Our method has been released at https://github.com/nubot-nudt/SG-SLAM.
中文:提出的SG-SLAM框架通过将几何、语义和拓扑特征整合到语义图中,增强了同步定位与地图构建,在多种数据集上实现了实时性能和卓越的鲁棒性。
English: The proposed SG-SLAM framework enhances simultaneous localization and mapping by integrating geometric, semantic, and topological features into a semantic graph, enabling real-time performance and superior robustness across diverse datasets.

Authors:Rachel S. Y. Teo, Tan M. Nguyen
Title: MoLEx: Mixture of Layer Experts for Finetuning with Sparse Upcycling
Abstract:
Large-scale pre-training of deep models, followed by fine-tuning them, has become the cornerstone of natural language processing (NLP). The prevalence of data coupled with computational resources has led to large models with a considerable number of parameters. While the massive size of these models has led to remarkable success in many NLP tasks, a detriment is the expense required to retrain all the base model's parameters for the adaptation to each task or domain. Parameter Efficient Fine-Tuning (PEFT) provides an effective solution for this challenge by minimizing the number of parameters required to be fine-tuned while maintaining the quality of the model. While existing methods have achieved impressive results, they mainly focus on adapting a subset of parameters, weight reparameterization, and prompt engineering. In this paper, we study layers as extractors of different types of linguistic information that are valuable when used in conjunction. We then propose the Mixture of Layer Experts (MoLEx), a novel sparse mixture of experts (SMoE) whose experts are layers in the pre-trained model. It performs a conditional computation of a mixture of layers during fine-tuning to provide the model with more structural knowledge about the data. By providing an avenue for information exchange between layers, MoLEx enables the model to make a more well-informed prediction for the downstream task, leading to better fine-tuning results with the same number of effective parameters. As experts can be processed in parallel, MoLEx introduces minimal additional computational overhead. We empirically corroborate the advantages of MoLEx when combined with popular PEFT baseline methods on a variety of downstream fine-tuning tasks, including the popular GLUE benchmark as well as the End-to-End Challenge (E2E). The code is publicly available at https://github.com/rachtsy/molex.
中文: 大规模预训练模型在任务适应时面临高昂的重训练成本,而提出的层专家混合方法通过将模型层作为专家组合,以最小计算开销增强结构知识交换,实现了高效微调。
English: Large-scale pre-trained models face high retraining costs for task adaptation, but the proposed Mixture of Layer Experts (MoLEx) method enables efficient fine-tuning by combining layers as experts to enhance structural knowledge exchange with minimal computational overhead.

Authors:Zichen Tang, Yuan Yao, Miaomiao Cui, Liefeng Bo, Hongyu Yang
Title: GaussianIP: Identity-Preserving Realistic 3D Human Generation via Human-Centric Diffusion Prior
Abstract:
Text-guided 3D human generation has advanced with the development of efficient 3D representations and 2D-lifting methods like Score Distillation Sampling (SDS). However, current methods suffer from prolonged training times and often produce results that lack fine facial and garment details. In this paper, we propose GaussianIP, an effective two-stage framework for generating identity-preserving realistic 3D humans from text and image prompts. Our core insight is to leverage human-centric knowledge to facilitate the generation process. In stage 1, we propose a novel Adaptive Human Distillation Sampling (AHDS) method to rapidly generate a 3D human that maintains high identity consistency with the image prompt and achieves a realistic appearance. Compared to traditional SDS methods, AHDS better aligns with the human-centric generation process, enhancing visual quality with notably fewer training steps. To further improve the visual quality of the face and clothes regions, we design a View-Consistent Refinement (VCR) strategy in stage 2. Specifically, it produces detail-enhanced results of the multi-view images from stage 1 iteratively, ensuring the 3D texture consistency across views via mutual attention and distance-guided attention fusion. Then a polished version of the 3D human can be achieved by directly perform reconstruction with the refined images. Extensive experiments demonstrate that GaussianIP outperforms existing methods in both visual quality and training efficiency, particularly in generating identity-preserving results. Our code is available at: https://github.com/silence-tang/GaussianIP.
中文: GaussianIP提出了一种两阶段框架,通过自适应人体蒸馏采样和视角一致优化,从文本和图像高效生成保留身份特征的逼真3D人体,显著提升了训练速度和面部服装细节表现。
English: GaussianIP introduces a two-stage framework using Adaptive Human Distillation Sampling and View-Consistent Refinement to efficiently generate detailed, identity-preserving 3D humans from text and images, significantly improving training speed and visual quality.

Authors:Hao Liu, Pengyu Guo, Siyuan Yang, Zeqing Jiang, Qinglei Hu, Dongyu Li
Title: SpaceSeg: A High-Precision Intelligent Perception Segmentation Method for Multi-Spacecraft On-Orbit Targets
Abstract:
With the continuous advancement of human exploration into deep space, intelligent perception and high-precision segmentation technology for on-orbit multi-spacecraft targets have become critical factors for ensuring the success of modern space missions. However, the complex deep space environment, diverse imaging conditions, and high variability in spacecraft morphology pose significant challenges to traditional segmentation methods. This paper proposes SpaceSeg, an innovative vision foundation model-based segmentation framework with four core technical innovations: First, the Multi-Scale Hierarchical Attention Refinement Decoder (MSHARD) achieves high-precision feature decoding through cross-resolution feature fusion via hierarchical attention. Second, the Multi-spacecraft Connected Component Analysis (MS-CCA) effectively resolves topological structure confusion in dense targets. Third, the Spatial Domain Adaptation Transform framework (SDAT) eliminates cross-domain disparities and resist spatial sensor perturbations through composite enhancement strategies. Finally, a custom Multi-Spacecraft Segmentation Task Loss Function is created to significantly improve segmentation robustness in deep space scenarios. To support algorithm validation, we construct the first multi-scale on-orbit multi-spacecraft semantic segmentation dataset SpaceES, which covers four types of spatial backgrounds and 17 typical spacecraft targets. In testing, SpaceSeg achieves state-of-the-art performance with 89.87$\%$ mIoU and 99.98$\%$ mAcc, surpassing existing best methods by 5.71 percentage points. The dataset and code are open-sourced at https://github.com/Akibaru/SpaceSeg to provide critical technical support for next-generation space situational awareness systems.
中文: 本文提出基于视觉基础模型的SpaceSeg分割框架,通过四项核心技术突破实现了多航天器分割的最优性能,并开源了SpaceES数据集和代码为空间态势感知提供技术支持。
English: This paper introduces SpaceSeg, a vision foundation model-based framework with four key innovations that achieves state-of-the-art performance in multi-spacecraft segmentation, supported by the open-sourced SpaceES dataset and code.

Authors:Guihong Li, Mehdi Rezagholizadeh, Mingyu Yang, Vikram Appia, Emad Barsoum
Title: X-EcoMLA: Upcycling Pre-Trained Attention into MLA for Efficient and Extreme KV Compression
Abstract:
Multi-head latent attention (MLA) is designed to optimize KV cache memory through low-rank key-value joint compression. Rather than caching keys and values separately, MLA stores their compressed latent representations, reducing memory overhead while maintaining the performance. While MLA improves memory efficiency without compromising language model accuracy, its major limitation lies in its integration during the pre-training phase, requiring models to be trained from scratch. This raises a key question: can we use MLA's benefits fully or partially in models that have already been pre-trained with different attention mechanisms? In this paper, we propose X-EcoMLA to deploy post training distillation to enable the upcycling of Transformer-based attention into an efficient hybrid MLA variant through lightweight post-training adaptation, bypassing the need for extensive pre-training. We demonstrate that leveraging the dark knowledge of a well-trained model can enhance training accuracy and enable extreme KV cache compression in MLA without compromising model performance. The experimental results show that our proposed method can effectively compress the KV cache while preserving the performance on the benchmarks; specifically, for Llama3.2-1B-Instruct baseline, a 6.4x compression achieves the same average score by using only 3.6B training tokens and 70 GPU hours on AMD MI300, whereas a 10.6x compression have less than 0.1% average score drop with 7B training tokens and 140 GPU hours. The code for this work is available at https://github.com/AMD-AGI/AMD-Hybrid-Models.
中文: X-EcoMLA通过轻量级后训练蒸馏,将预训练的Transformer模型升级为混合多头潜在注意力变体,在保持性能的同时实现了显著的KV缓存压缩。
English: X-EcoMLA enables efficient post-training adaptation of pre-trained Transformer models into a hybrid multi-head latent attention variant, achieving significant KV cache compression without performance loss through lightweight distillation.

Authors:Hongbin Lin, Zilu Guo, Yifan Zhang, Shuaicheng Niu, Yafeng Li, Ruimao Zhang, Shuguang Cui, Zhen Li
Title: DriveGEN: Generalized and Robust 3D Detection in Driving via Controllable Text-to-Image Diffusion Generation
Abstract:
In autonomous driving, vision-centric 3D detection aims to identify 3D objects from images. However, high data collection costs and diverse real-world scenarios limit the scale of training data. Once distribution shifts occur between training and test data, existing methods often suffer from performance degradation, known as Out-of-Distribution (OOD) problems. To address this, controllable Text-to-Image (T2I) diffusion offers a potential solution for training data enhancement, which is required to generate diverse OOD scenarios with precise 3D object geometry. Nevertheless, existing controllable T2I approaches are restricted by the limited scale of training data or struggle to preserve all annotated 3D objects. In this paper, we present DriveGEN, a method designed to improve the robustness of 3D detectors in Driving via Training-Free Controllable Text-to-Image Diffusion Generation. Without extra diffusion model training, DriveGEN consistently preserves objects with precise 3D geometry across diverse OOD generations, consisting of 2 stages: 1) Self-Prototype Extraction: We empirically find that self-attention features are semantic-aware but require accurate region selection for 3D objects. Thus, we extract precise object features via layouts to capture 3D object geometry, termed self-prototypes. 2) Prototype-Guided Diffusion: To preserve objects across various OOD scenarios, we perform semantic-aware feature alignment and shallow feature alignment during denoising. Extensive experiments demonstrate the effectiveness of DriveGEN in improving 3D detection. The code is available at https://github.com/Hongbin98/DriveGEN.
中文: DriveGEN通过无需训练的可控文本到图像扩散方法,在多样化分布外场景中保持精确三维几何特征,有效提升了自动驾驶三维检测的鲁棒性。
English: DriveGEN enhances autonomous driving 3D detection robustness through training-free controllable text-to-image diffusion that preserves precise 3D geometry across diverse out-of-distribution scenarios.

Authors:Wenbang Deng, Xieyuanli Chen, Qinghua Yu, Yunze He, Junhao Xiao, Huimin Lu
Title: A Novel Decomposed Feature-Oriented Framework for Open-Set Semantic Segmentation on LiDAR Data
Abstract:
Semantic segmentation is a key technique that enables mobile robots to understand and navigate surrounding environments autonomously. However, most existing works focus on segmenting known objects, overlooking the identification of unknown classes, which is common in real-world applications. In this paper, we propose a feature-oriented framework for open-set semantic segmentation on LiDAR data, capable of identifying unknown objects while retaining the ability to classify known ones. We design a decomposed dual-decoder network to simultaneously perform closed-set semantic segmentation and generate distinctive features for unknown objects. The network is trained with multi-objective loss functions to capture the characteristics of known and unknown objects. Using the extracted features, we introduce an anomaly detection mechanism to identify unknown objects. By integrating the results of close-set semantic segmentation and anomaly detection, we achieve effective feature-driven LiDAR open-set semantic segmentation. Evaluations on both SemanticKITTI and nuScenes datasets demonstrate that our proposed framework significantly outperforms state-of-the-art methods. The source code will be made publicly available at https://github.com/nubot-nudt/DOSS.
中文: 本文提出了一种面向特征的激光雷达开放集语义分割框架,通过双解码器网络和异常检测机制,在识别已知类别的同时有效检测未知物体,在多个数据集上验证了其优越性能。
English: This paper introduces a feature-oriented framework for LiDAR open-set semantic segmentation, which identifies unknown objects while maintaining known class recognition through a dual-decoder network and anomaly detection, demonstrating superior performance on benchmark datasets.

Authors:Weichen Zhang, Zile Zhou, Zhiheng Zheng, Chen Gao, Jinqiang Cui, Yong Li, Xinlei Chen, Xiao-Ping Zhang
Title: Open3DVQA: A Benchmark for Comprehensive Spatial Reasoning with Multimodal Large Language Model in Open Space
Abstract:
Spatial reasoning is a fundamental capability of embodied agents and has garnered widespread attention in the field of multimodal large language models (MLLMs). In this work, we propose a novel benchmark, Open3DVQA, to comprehensively evaluate the spatial reasoning capacities of current state-of-the-art (SOTA) foundation models in open 3D space. Open3DVQA consists of 9k VQA samples, collected using an efficient semi-automated tool in a high-fidelity urban simulator. We evaluate several SOTA MLLMs across various aspects of spatial reasoning, such as relative and absolute spatial relationships, situational reasoning, and object-centric spatial attributes. Our results reveal that: 1) MLLMs perform better at answering questions regarding relative spatial relationships than absolute spatial relationships, 2) MLLMs demonstrate similar spatial reasoning abilities for both egocentric and allocentric perspectives, and 3) Fine-tuning large models significantly improves their performance across different spatial reasoning tasks. We believe that our open-source data collection tools and in-depth analyses will inspire further research on MLLM spatial reasoning capabilities. The benchmark is available at https://github.com/WeichenZh/Open3DVQA.
中文:本文提出Open3DVQA基准,包含9千个视觉问答样本,用于评估多模态大语言模型在开放3D空间中的空间推理能力,发现模型在相对空间关系方面表现更佳且微调能显著提升性能。
English: This paper introduces Open3DVQA, a benchmark with 9k VQA samples to evaluate multimodal large language models' spatial reasoning in open 3D environments, revealing their strengths in relative spatial relationships and the benefits of fine-tuning.

Authors:Haihong Zhao, Chenyi Zi, Aochuan Chen, Jia Li
Title: A Survey of Cross-domain Graph Learning: Progress and Future Directions
Abstract:
Graph learning plays a vital role in mining and analyzing complex relationships involved in graph data, which is widely used in many real-world applications like transaction networks and communication networks. Foundation models in CV and NLP have shown powerful cross-domain capabilities that are also significant in graph domains. However, existing graph learning approaches struggle with cross-domain tasks. Inspired by successes in CV and NLP, cross-domain graph learning has once again become a focal point of attention to realizing true graph foundation models. In this survey, we present a comprehensive review and analysis of existing works on cross-domain graph learning. Concretely, we first propose a new taxonomy, categorizing existing approaches based on the learned cross-domain information: structure, feature, and structure-feature mixture. Next, we systematically survey representative methods in these categories. Finally, we discuss the remaining limitations of existing studies and highlight promising avenues for future research. Relevant papers are summarized and will be consistently updated at: https://github.com/cshhzhao/Awesome-Cross-Domain-Graph-Learning.
中文: 本文系统综述了跨领域图学习方法,按学习信息类型进行分类,并探讨了推动图基础模型发展的潜力与未来方向。
English: This survey comprehensively reviews cross-domain graph learning, categorizing methods by learned information types and analyzing their potential to advance graph foundation models.

Authors:Wuwei Huang, Renren Jin, Wen Zhang, Jian Luan, Bin Wang, Deyi Xiong
Title: Joint Training And Decoding for Multilingual End-to-End Simultaneous Speech Translation
Abstract:
Recent studies on end-to-end speech translation(ST) have facilitated the exploration of multilingual end-to-end ST and end-to-end simultaneous ST. In this paper, we investigate end-to-end simultaneous speech translation in a one-to-many multilingual setting which is closer to applications in real scenarios. We explore a separate decoder architecture and a unified architecture for joint synchronous training in this scenario. To further explore knowledge transfer across languages, we propose an asynchronous training strategy on the proposed unified decoder architecture. A multi-way aligned multilingual end-to-end ST dataset was curated as a benchmark testbed to evaluate our methods. Experimental results demonstrate the effectiveness of our models on the collected dataset. Our codes and data are available at: https://github.com/XiaoMi/TED-MMST.
中文摘要:本文研究多语言端到端同声传译,提出分离式与统一式解码器架构及同步与异步训练策略,并在自建多语言数据集上验证了模型有效性。
English Summary: This paper explores end-to-end simultaneous speech translation in multilingual settings, proposing separate and unified decoder architectures with joint synchronous and asynchronous training strategies, validated on a curated multilingual dataset.

Authors:Hongyang Wei, Shuaizheng Liu, Chun Yuan, Lei Zhang
Title: Perceive, Understand and Restore: Real-World Image Super-Resolution with Autoregressive Multimodal Generative Models
Abstract:
By leveraging the generative priors from pre-trained text-to-image diffusion models, significant progress has been made in real-world image super-resolution (Real-ISR). However, these methods tend to generate inaccurate and unnatural reconstructions in complex and/or heavily degraded scenes, primarily due to their limited perception and understanding capability of the input low-quality image. To address these limitations, we propose, for the first time to our knowledge, to adapt the pre-trained autoregressive multimodal model such as Lumina-mGPT into a robust Real-ISR model, namely PURE, which Perceives and Understands the input low-quality image, then REstores its high-quality counterpart. Specifically, we implement instruction tuning on Lumina-mGPT to perceive the image degradation level and the relationships between previously generated image tokens and the next token, understand the image content by generating image semantic descriptions, and consequently restore the image by generating high-quality image tokens autoregressively with the collected information. In addition, we reveal that the image token entropy reflects the image structure and present a entropy-based Top-k sampling strategy to optimize the local structure of the image during inference. Experimental results demonstrate that PURE preserves image content while generating realistic details, especially in complex scenes with multiple objects, showcasing the potential of autoregressive multimodal generative models for robust Real-ISR. The model and code will be available at https://github.com/nonwhy/PURE.
中文摘要:PURE模型通过指令调优改造Lumina-mGPT多模态框架,创新性地感知图像退化程度、理解语义内容,并采用基于熵的采样策略自回归生成高质量图像,在复杂场景中实现了保真度与细节还原的突破。
English Summary: The proposed PURE model adapts the Lumina-mGPT autoregressive multimodal framework to enhance real-world image super-resolution by perceiving degradation levels, understanding image semantics, and restoring high-quality details through entropy-optimized token generation.

Authors:Kelu Yao, Nuo Xu, Rong Yang, Yingying Xu, Zhuoyan Gao, Titinunt Kitrungrotsakul, Yi Ren, Pu Zhang, Jin Wang, Ning Wei, Chao Li
Title: Falcon: A Remote Sensing Vision-Language Foundation Model
Abstract:
This paper introduces a holistic vision-language foundation model tailored for remote sensing, named Falcon. Falcon offers a unified, prompt-based paradigm that effectively executes comprehensive and complex remote sensing tasks. Falcon demonstrates powerful understanding and reasoning abilities at the image, region, and pixel levels. Specifically, given simple natural language instructions and remote sensing images, Falcon can produce impressive results in text form across 14 distinct tasks, i.e., image classification, object detection, segmentation, image captioning, and etc. To facilitate Falcon's training and empower its representation capacity to encode rich spatial and semantic information, we developed Falcon_SFT, a large-scale, multi-task, instruction-tuning dataset in the field of remote sensing. The Falcon_SFT dataset consists of approximately 78 million high-quality data samples, covering 5.6 million multi-spatial resolution and multi-view remote sensing images with diverse instructions. It features hierarchical annotations and undergoes manual sampling verification to ensure high data quality and reliability. Extensive comparative experiments are conducted, which verify that Falcon achieves remarkable performance over 67 datasets and 14 tasks, despite having only 0.7B parameters. We release the complete dataset, code, and model weights at https://github.com/TianHuiLab/Falcon, hoping to help further develop the open-source community.
中文: 本文提出了Falcon,一种面向遥感领域的统一视觉语言基础模型,通过简单自然语言指令在14项任务中表现卓越,并基于包含约7800万样本的大规模数据集进行训练。
English: This paper introduces Falcon, a unified vision-language foundation model for remote sensing that achieves impressive performance across 14 tasks using simple natural language instructions, supported by a large-scale dataset of approximately 78 million samples.

Authors:Bin Liu, Xiaohong Liu, Qin Luo, Ziqiao Shang, Jielei Chu, Lin Ma, Zhaoyu Li, Fei Teng, Guangtao Zhai, Tianrui Li
Title: Variational Bayesian Personalized Ranking
Abstract:
Recommendation systems have found extensive applications across diverse domains. However, the training data available typically comprises implicit feedback, manifested as user clicks and purchase behaviors, rather than explicit declarations of user preferences. This type of training data presents three main challenges for accurate ranking prediction: First, the unobservable nature of user preferences makes likelihood function modeling inherently difficult. Second, the resulting false positives (FP) and false negatives (FN) introduce noise into the learning process, disrupting parameter learning. Third, data bias arises as observed interactions tend to concentrate on a few popular items, exacerbating the feedback loop of popularity bias. To address these issues, we propose Variational BPR, a novel and easily implementable learning objective that integrates key components for enhancing collaborative filtering: likelihood optimization, noise reduction, and popularity debiasing. Our approach involves decomposing the pairwise loss under the ELBO-KL framework and deriving its variational lower bound to establish a manageable learning objective for approximate inference. Within this bound, we introduce an attention-based latent interest prototype contrastive mechanism, replacing instance-level contrastive learning, to effectively reduce noise from problematic samples. The process of deriving interest prototypes implicitly incorporates a flexible hard sample mining strategy, capable of simultaneously identifying hard positive and hard negative samples. Furthermore, we demonstrate that this hard sample mining strategy promotes feature distribution uniformity, thereby alleviating popularity bias. Empirically, we demonstrate the effectiveness of Variational BPR on popular backbone recommendation models. The code and data are available at: https://github.com/liubin06/VariationalBPR
推荐系统面临隐式反馈数据的挑战,包括用户偏好建模困难、误报和漏报导致的噪声以及流行度偏差,而提出的变分BPR方法通过似然优化、噪声消除和去偏机制有效解决了这些问题。
Recommendation systems face challenges from implicit feedback data, including difficulty in modeling user preferences, noise from false positives and negatives, and popularity bias, which are addressed by the proposed Variational BPR method through likelihood optimization, noise reduction, and debiasing mechanisms.

Authors:Worameth Chinchuthakun, Tossaporn Saengja, Nontawat Tritrong, Pitchaporn Rewatbowornwong, Pramook Khungurn, Supasorn Suwajanakorn
Title: LUSD: Localized Update Score Distillation for Text-Guided Image Editing
Abstract:
While diffusion models show promising results in image editing given a target prompt, achieving both prompt fidelity and background preservation remains difficult. Recent works have introduced score distillation techniques that leverage the rich generative prior of text-to-image diffusion models to solve this task without additional fine-tuning. However, these methods often struggle with tasks such as object insertion. Our investigation of these failures reveals significant variations in gradient magnitude and spatial distribution, making hyperparameter tuning highly input-specific or unsuccessful. To address this, we propose two simple yet effective modifications: attention-based spatial regularization and gradient filtering-normalization, both aimed at reducing these variations during gradient updates. Experimental results show our method outperforms state-of-the-art score distillation techniques in prompt fidelity, improving successful edits while preserving the background. Users also preferred our method over state-of-the-art techniques across three metrics, and by 58-64% overall.
中文摘要:本研究通过引入基于注意力的空间正则化和梯度过滤归一化两项改进,有效解决了分数蒸馏技术中的梯度变化问题,在图像编辑任务中实现了比现有最优方法更优的提示匹配度和背景保持效果。
English Summary: The proposed method introduces attention-based spatial regularization and gradient filtering-normalization to address gradient variations in score distillation techniques, achieving superior prompt fidelity and background preservation in image editing compared to state-of-the-art approaches.

Authors:Lilin Zhang, Chengpei Wu, Ning Yang
Title: Weakly Supervised Contrastive Adversarial Training for Learning Robust Features from Semi-supervised Data
Abstract:
Existing adversarial training (AT) methods often suffer from incomplete perturbation, meaning that not all non-robust features are perturbed when generating adversarial examples (AEs). This results in residual correlations between non-robust features and labels, leading to suboptimal learning of robust features. However, achieving complete perturbation, i.e., perturbing as many non-robust features as possible, is challenging due to the difficulty in distinguishing robust and non-robust features and the sparsity of labeled data. To address these challenges, we propose a novel approach called Weakly Supervised Contrastive Adversarial Training (WSCAT). WSCAT ensures complete perturbation for improved learning of robust features by disrupting correlations between non-robust features and labels through complete AE generation over partially labeled data, grounded in information theory. Extensive theoretical analysis and comprehensive experiments on widely adopted benchmarks validate the superiority of WSCAT. Our code is available at https://github.com/zhang-lilin/WSCAT.
中文: 现有对抗训练方法常因无法扰动所有非鲁棒特征而导致鲁棒性不足,而提出的弱监督对比对抗训练(WSCAT)通过弱监督对比学习确保完全扰动,从而提升鲁棒特征学习效果。
English: Existing adversarial training methods often fail to perturb all non-robust features, leading to suboptimal robustness, but the proposed WSCAT approach ensures complete perturbation through weakly supervised contrastive learning to enhance robust feature learning.

Authors:Ming Deng, Sijin Sun, Zihao Li, Xiaochuan Hu, Xing Wu
Title: FMNet: Frequency-Assisted Mamba-Like Linear Attention Network for Camouflaged Object Detection
Abstract:
Camouflaged Object Detection (COD) is challenging due to the strong similarity between camouflaged objects and their surroundings, which complicates identification. Existing methods mainly rely on spatial local features, failing to capture global information, while Transformers increase computational costs. To address this, the Frequency-Assisted Mamba-Like Linear Attention Network (FMNet) is proposed, which leverages frequency-domain learning to efficiently capture global features and mitigate ambiguity between objects and the background. FMNet introduces the Multi-Scale Frequency-Assisted Mamba-Like Linear Attention (MFM) module, integrating frequency and spatial features through a multi-scale structure to handle scale variations while reducing computational complexity. Additionally, the Pyramidal Frequency Attention Extraction (PFAE) module and the Frequency Reverse Decoder (FRD) enhance semantics and reconstruct features. Experimental results demonstrate that FMNet outperforms existing methods on multiple COD datasets, showcasing its advantages in both performance and efficiency. Code available at https://github.com/Chranos/FMNet.
中文: 提出的FMNet通过融合频域学习和多尺度空间特征,有效解决了伪装物体检测的难题,在多个数据集上实现了优越的性能和效率。
English: The proposed FMNet effectively addresses camouflaged object detection challenges by integrating frequency-domain learning with multi-scale spatial features, achieving superior performance and efficiency across multiple datasets.

Authors:Sungwoo Cho, Jeongsoo Choi, Sungnyun Kim, Se-Young Yun
Title: MAVFlow: Preserving Paralinguistic Elements with Conditional Flow Matching for Zero-Shot AV2AV Multilingual Translation
Abstract:
Despite recent advances in text-to-speech (TTS) models, audio-visual-to-audio-visual (AV2AV) translation still faces a critical challenge: maintaining speaker consistency between the original and translated vocal and facial features. To address this issue, we propose a conditional flow matching (CFM) zero-shot audio-visual renderer that utilizes strong dual guidance from both audio and visual modalities. By leveraging multimodal guidance with CFM, our model robustly preserves speaker-specific characteristics and enhances zero-shot AV2AV translation abilities. For the audio modality, we enhance the CFM process by integrating robust speaker embeddings with x-vectors, which serve to bolster speaker consistency. Additionally, we convey emotional nuances to the face rendering module. The guidance provided by both audio and visual cues remains independent of semantic or linguistic content, allowing our renderer to effectively handle zero-shot translation tasks for monolingual speakers in different languages. We empirically demonstrate that the inclusion of high-quality mel-spectrograms conditioned on facial information not only enhances the quality of the synthesized speech but also positively influences facial generation, leading to overall performance improvements in LSE and FID score. Our code is available at https://github.com/Peter-SungwooCho/MAVFlow.
中文摘要:本研究提出一种条件流匹配零样本视听渲染器,通过双模态引导保持说话人特征一致性,有效提升跨语言视听翻译性能。
English Summary: This study introduces a conditional flow matching zero-shot audio-visual renderer that uses dual audio-visual guidance to maintain speaker consistency and enhance translation capabilities across languages.

Authors:Aashish Anantha Ramakrishnan, Aadarsh Anantha Ramakrishnan, Dongwon Lee
Title: RONA: Pragmatically Diverse Image Captioning with Coherence Relations
Abstract:
Writing Assistants (e.g., Grammarly, Microsoft Copilot) traditionally generate diverse image captions by employing syntactic and semantic variations to describe image components. However, human-written captions prioritize conveying a central message alongside visual descriptions using pragmatic cues. To enhance caption diversity, it is essential to explore alternative ways of communicating these messages in conjunction with visual content. We propose RONA, a novel prompting strategy for Multi-modal Large Language Models (MLLM) that leverages Coherence Relations as a controllable axis for pragmatic variations. We demonstrate that RONA generates captions with better overall diversity and ground-truth alignment, compared to MLLM baselines across multiple domains. Our code is available at: https://github.com/aashish2000/RONA
中文: 传统写作助手通过句法和语义变化生成多样的图像描述,而人类描述则注重利用语用线索传达核心信息与视觉细节;为此提出的RONA策略,通过连贯关系作为可控轴,使多模态大语言模型生成的描述在多样性和真实性上优于基线模型。
English: Traditional writing assistants create diverse image captions through syntactic and semantic variations, but human captions emphasize conveying a central message with visual details using pragmatic cues, leading to the development of RONA, a prompting strategy for MLLMs that uses coherence relations to improve caption diversity and alignment with ground truth.

Authors:Gaotang Li, Yuzhong Chen, Hanghang Tong
Title: Taming Knowledge Conflicts in Language Models
Abstract:
Language Models (LMs) often encounter knowledge conflicts when parametric memory contradicts contextual knowledge. Previous works attribute this conflict to the interplay between "memory heads" and "context heads", attention heads assumed to promote either memory or context exclusively. In this study, we go beyond this fundamental assumption by uncovering a critical phenomenon we term the superposition of contextual information and parametric memory, where highly influential attention heads simultaneously contribute to both memory and context. Building upon this insight, we propose Just Run Twice (JuICE), a test-time attention intervention method that steers LMs toward either parametric beliefs or contextual knowledge without requiring fine-tuning. JuICE identifies a set of reliable attention heads and leverages a dual-run approach to mitigate the superposition effects. Extensive experiments across 11 datasets and 6 model architectures demonstrate that JuICE sets the new state-of-the-art performance and robust generalization, achieving significant and consistent improvement across different domains under various conflict types. Finally, we theoretically analyze knowledge conflict and the superposition of contextual information and parametric memory in attention heads, which further elucidates the effectiveness of JuICE in these settings. Our code is available at https://github.com/GaotangLi/JUICE.
中文: 本研究揭示了语言模型中注意力头能同时处理上下文与参数记忆,产生叠加效应,并提出了JuICE这一无需微调的测试时干预方法,有效解决知识冲突,在多种数据集和模型上实现了最优性能。
English: This study reveals that attention heads in language models can simultaneously handle both contextual and parametric knowledge, leading to a superposition effect, and introduces JuICE, a test-time intervention method that effectively resolves knowledge conflicts without fine-tuning, achieving state-of-the-art performance across diverse datasets and models.

Authors:Yanjie Xu, Handing Xu, Tianmu Wang, Yaguan Li, Yunzhi Chen, Zhenguo Nie
Title: Rethinking Rotation-Invariant Recognition of Fine-grained Shapes from the Perspective of Contour Points
Abstract:
Rotation-invariant recognition of shapes is a common challenge in computer vision. Recent approaches have significantly improved the accuracy of rotation-invariant recognition by encoding the rotational invariance of shapes as hand-crafted image features and introducing deep neural networks. However, the methods based on pixels have too much redundant information, and the critical geometric information is prone to early leakage, resulting in weak rotation-invariant recognition of fine-grained shapes. In this paper, we reconsider the shape recognition problem from the perspective of contour points rather than pixels. We propose an anti-noise rotation-invariant convolution module based on contour geometric aware for fine-grained shape recognition. The module divides the shape contour into multiple local geometric regions(LGA), where we implement finer-grained rotation-invariant coding in terms of point topological relations. We provide a deep network composed of five such cascaded modules for classification and retrieval experiments. The results show that our method exhibits excellent performance in rotation-invariant recognition of fine-grained shapes. In addition, we demonstrate that our method is robust to contour noise and the rotation centers. The source code is available at https://github.com/zhenguonie/ANRICN_CGA.
Chinese: 本文提出了一种基于轮廓的抗噪声旋转不变卷积模块,通过局部几何区域中的点拓扑关系实现更精细的旋转不变编码,在细粒度形状识别中展现出卓越性能,并对轮廓噪声和旋转中心具有强鲁棒性。
English: This paper introduces a contour-based anti-noise rotation-invariant convolution module that enhances fine-grained shape recognition by encoding topological relations in local geometric regions, demonstrating superior performance and robustness to noise and rotation centers.

Authors:Zhicheng Feng, Xieyuanli Chen, Chenghao Shi, Lun Luo, Zhichao Chen, Yun-Hui Liu, Huimin Lu
Title: Image-Goal Navigation Using Refined Feature Guidance and Scene Graph Enhancement
Abstract:
In this paper, we introduce a novel image-goal navigation approach, named RFSG. Our focus lies in leveraging the fine-grained connections between goals, observations, and the environment within limited image data, all the while keeping the navigation architecture simple and lightweight. To this end, we propose the spatial-channel attention mechanism, enabling the network to learn the importance of multi-dimensional features to fuse the goal and observation features. In addition, a selfdistillation mechanism is incorporated to further enhance the feature representation capabilities. Given that the navigation task needs surrounding environmental information for more efficient navigation, we propose an image scene graph to establish feature associations at both the image and object levels, effectively encoding the surrounding scene information. Crossscene performance validation was conducted on the Gibson and HM3D datasets, and the proposed method achieved stateof-the-art results among mainstream methods, with a speed of up to 53.5 frames per second on an RTX3080. This contributes to the realization of end-to-end image-goal navigation in realworld scenarios. The implementation and model of our method have been released at: https://github.com/nubot-nudt/RFSG.
中文: 本文提出了一种名为RFSG的新型图像目标导航方法,通过空间-通道注意力机制和自蒸馏机制增强特征学习能力,并利用图像场景图进行环境信息编码,在Gibson和HM3D数据集上取得了最优性能,同时保持了实时运行效率。
English: This paper presents RFSG, a novel image-goal navigation method that employs spatial-channel attention and self-distillation mechanisms to enhance feature learning while using image scene graphs for environmental encoding, achieving state-of-the-art performance on Gibson and HM3D datasets with real-time efficiency.

Authors:Shanghua Gao, Richard Zhu, Zhenglun Kong, Ayush Noori, Xiaorui Su, Curtis Ginder, Theodoros Tsiligkaridis, Marinka Zitnik
Title: TxAgent: An AI Agent for Therapeutic Reasoning Across a Universe of Tools
Abstract:
Precision therapeutics require multimodal adaptive models that generate personalized treatment recommendations. We introduce TxAgent, an AI agent that leverages multi-step reasoning and real-time biomedical knowledge retrieval across a toolbox of 211 tools to analyze drug interactions, contraindications, and patient-specific treatment strategies. TxAgent evaluates how drugs interact at molecular, pharmacokinetic, and clinical levels, identifies contraindications based on patient comorbidities and concurrent medications, and tailors treatment strategies to individual patient characteristics. It retrieves and synthesizes evidence from multiple biomedical sources, assesses interactions between drugs and patient conditions, and refines treatment recommendations through iterative reasoning. It selects tools based on task objectives and executes structured function calls to solve therapeutic tasks that require clinical reasoning and cross-source validation. The ToolUniverse consolidates 211 tools from trusted sources, including all US FDA-approved drugs since 1939 and validated clinical insights from Open Targets. TxAgent outperforms leading LLMs, tool-use models, and reasoning agents across five new benchmarks: DrugPC, BrandPC, GenericPC, TreatmentPC, and DescriptionPC, covering 3,168 drug reasoning tasks and 456 personalized treatment scenarios. It achieves 92.1% accuracy in open-ended drug reasoning tasks, surpassing GPT-4o and outperforming DeepSeek-R1 (671B) in structured multi-step reasoning. TxAgent generalizes across drug name variants and descriptions. By integrating multi-step inference, real-time knowledge grounding, and tool-assisted decision-making, TxAgent ensures that treatment recommendations align with established clinical guidelines and real-world evidence, reducing the risk of adverse events and improving therapeutic decision-making.
中文: TxAgent是一种人工智能系统,通过多步骤推理和实时检索211种工具的医学知识,提供个性化药物相互作用分析和治疗建议,在药物推理任务中达到92.1%的准确率,性能超过GPT-4o等领先模型。
English: TxAgent is an AI system that utilizes multi-step reasoning and real-time biomedical data from 211 tools to provide personalized drug interaction analysis and treatment recommendations, achieving 92.1% accuracy in drug reasoning tasks while outperforming leading models like GPT-4o.

Authors:Pedro Pessoa, Paul Campitelli, Douglas P. Shepherd, S. Banu Ozkan, Steve Pressé
Title: Mamba time series forecasting with uncertainty quantification
Abstract:
State space models, such as Mamba, have recently garnered attention in time series forecasting due to their ability to capture sequence patterns. However, in electricity consumption benchmarks, Mamba forecasts exhibit a mean error of approximately 8\%. Similarly, in traffic occupancy benchmarks, the mean error reaches 18\%. This discrepancy leaves us to wonder whether the prediction is simply inaccurate or falls within error given spread in historical data. To address this limitation, we propose a method to quantify the predictive uncertainty of Mamba forecasts. Here, we propose a dual-network framework based on the Mamba architecture for probabilistic forecasting, where one network generates point forecasts while the other estimates predictive uncertainty by modeling variance. We abbreviate our tool, Mamba with probabilistic time series forecasting, as Mamba-ProbTSF and the code for its implementation is available on GitHub (https://github.com/PessoaP/Mamba-ProbTSF). Evaluating this approach on synthetic and real-world benchmark datasets, we find Kullback-Leibler divergence between the learned distributions and the data--which, in the limit of infinite data, should converge to zero if the model correctly captures the underlying probability distribution--reduced to the order of $10^{-3}$ for synthetic data and $10^{-1}$ for real-world benchmark, demonstrating its effectiveness. We find that in both the electricity consumption and traffic occupancy benchmark, the true trajectory stays within the predicted uncertainty interval at the two-sigma level about 95\% of the time. We end with a consideration of potential limitations, adjustments to improve performance, and considerations for applying this framework to processes for purely or largely stochastic dynamics where the stochastic changes accumulate, as observed for example in pure Brownian motion or molecular dynamics trajectories.
Chinese: 作者提出了Mamba-ProbTSF,一种基于Mamba架构的双网络框架,既能生成点预测又能量化预测不确定性,在合成和真实世界基准测试中表现出色,KL散度显著降低且不确定性区间准确覆盖真实轨迹。
English: The authors introduce Mamba-ProbTSF, a dual-network framework based on the Mamba architecture that generates point forecasts and quantifies predictive uncertainty, demonstrating effectiveness in synthetic and real-world benchmarks with reduced Kullback-Leibler divergence and accurate uncertainty intervals.

Authors:Avinash Paliwal, Xilong Zhou, Wei Ye, Jinhui Xiong, Rakesh Ranjan, Nima Khademi Kalantari
Title: RI3D: Few-Shot Gaussian Splatting With Repair and Inpainting Diffusion Priors
Abstract:
In this paper, we propose RI3D, a novel 3DGS-based approach that harnesses the power of diffusion models to reconstruct high-quality novel views given a sparse set of input images. Our key contribution is separating the view synthesis process into two tasks of reconstructing visible regions and hallucinating missing regions, and introducing two personalized diffusion models, each tailored to one of these tasks. Specifically, one model ('repair') takes a rendered image as input and predicts the corresponding high-quality image, which in turn is used as a pseudo ground truth image to constrain the optimization. The other model ('inpainting') primarily focuses on hallucinating details in unobserved areas. To integrate these models effectively, we introduce a two-stage optimization strategy: the first stage reconstructs visible areas using the repair model, and the second stage reconstructs missing regions with the inpainting model while ensuring coherence through further optimization. Moreover, we augment the optimization with a novel Gaussian initialization method that obtains per-image depth by combining 3D-consistent and smooth depth with highly detailed relative depth. We demonstrate that by separating the process into two tasks and addressing them with the repair and inpainting models, we produce results with detailed textures in both visible and missing regions that outperform state-of-the-art approaches on a diverse set of scenes with extremely sparse inputs.
Chinese: 本文提出RI3D方法,通过将视图合成拆分为可见区域修复和缺失区域生成两个任务,并采用两个定制化扩散模型分别处理,结合两阶段优化策略,实现了从稀疏图像生成高质量新视图的效果。
English: This paper introduces RI3D, a 3DGS-based method that uses two specialized diffusion models—one for repairing visible areas and another for inpainting missing regions—to achieve high-quality novel view synthesis from sparse images through a two-stage optimization process.

Authors:Kai Zhang, Jianwei Yang, Jeevana Priya Inala, Chandan Singh, Jianfeng Gao, Yu Su, Chenglong Wang
Title: Towards Understanding Graphical Perception in Large Multimodal Models
Abstract:
Despite the promising results of large multimodal models (LMMs) in complex vision-language tasks that require knowledge, reasoning, and perception abilities together, we surprisingly found that these models struggle with simple tasks on infographics that require perception only. As existing benchmarks primarily focus on end tasks that require various abilities, they provide limited, fine-grained insights into the limitations of the models' perception abilities. To address this gap, we leverage the theory of graphical perception, an approach used to study how humans decode visual information encoded on charts and graphs, to develop an evaluation framework for analyzing gaps in LMMs' perception abilities in charts. With automated task generation and response evaluation designs, our framework enables comprehensive and controlled testing of LMMs' graphical perception across diverse chart types, visual elements, and task types. We apply our framework to evaluate and diagnose the perception capabilities of state-of-the-art LMMs at three granularity levels (chart, visual element, and pixel). Our findings underscore several critical limitations of current state-of-the-art LMMs, including GPT-4o: their inability to (1) generalize across chart types, (2) understand fundamental visual elements, and (3) cross reference values within a chart. These insights provide guidance for future improvements in perception abilities of LMMs. The evaluation framework and labeled data are publicly available at https://github.com/microsoft/lmm-graphical-perception.
Large multimodal models surprisingly struggle with simple perception tasks on infographics, prompting the development of an evaluation framework based on graphical perception theory that reveals critical limitations in their ability to generalize across chart types and understand visual elements.
English Summary:

Authors:Leonard Waldmann, Ando Shah, Yi Wang, Nils Lehmann, Adam J. Stewart, Zhitong Xiong, Xiao Xiang Zhu, Stefan Bauer, John Chuang
Title: Panopticon: Advancing Any-Sensor Foundation Models for Earth Observation
Abstract:
Earth observation (EO) data features diverse sensing platforms with varying spectral bands, spatial resolutions, and sensing modalities. While most prior work has constrained inputs to fixed sensors, a new class of any-sensor foundation models able to process arbitrary sensors has recently emerged. Contributing to this line of work, we propose Panopticon, an any-sensor foundation model built on the DINOv2 framework. We extend DINOv2 by (1) treating images of the same geolocation across sensors as natural augmentations, (2) subsampling channels to diversify spectral input, and (3) adding a cross attention over channels as a flexible patch embedding mechanism. By encoding the wavelength and modes of optical and synthetic aperture radar sensors, respectively, Panopticon can effectively process any combination of arbitrary channels. In extensive evaluations, we achieve state-of-the-art performance on GEO-Bench, especially on the widely-used Sentinel-1 and Sentinel-2 sensors, while out-competing other any-sensor models, as well as domain adapted fixed-sensor models on unique sensor configurations. Panopticon enables immediate generalization to both existing and future satellite platforms, advancing sensor-agnostic EO.
中文摘要:Panopticon是基于DINOv2框架构建的多传感器基础模型,通过光谱通道子采样和跨注意力机制处理任意组合的遥感数据,在多种传感器上实现最优性能,并能泛化至未来卫星平台。
English Summary: Panopticon is an any-sensor foundation model based on DINOv2 that processes diverse Earth observation data through spectral channel subsampling and cross-attention mechanisms, achieving state-of-the-art performance across multiple sensors while enabling generalization to future satellite platforms.

Authors:Ju He, Qihang Yu, Qihao Liu, Liang-Chieh Chen
Title: FlowTok: Flowing Seamlessly Across Text and Image Tokens
Abstract:
Bridging different modalities lies at the heart of cross-modality generation. While conventional approaches treat the text modality as a conditioning signal that gradually guides the denoising process from Gaussian noise to the target image modality, we explore a much simpler paradigm-directly evolving between text and image modalities through flow matching. This requires projecting both modalities into a shared latent space, which poses a significant challenge due to their inherently different representations: text is highly semantic and encoded as 1D tokens, whereas images are spatially redundant and represented as 2D latent embeddings. To address this, we introduce FlowTok, a minimal framework that seamlessly flows across text and images by encoding images into a compact 1D token representation. Compared to prior methods, this design reduces the latent space size by 3.3x at an image resolution of 256, eliminating the need for complex conditioning mechanisms or noise scheduling. Moreover, FlowTok naturally extends to image-to-text generation under the same formulation. With its streamlined architecture centered around compact 1D tokens, FlowTok is highly memory-efficient, requires significantly fewer training resources, and achieves much faster sampling speeds-all while delivering performance comparable to state-of-the-art models. Code will be available at https://github.com/bytedance/1d-tokenizer.
Chinese: 本文提出FlowTok框架,通过将图像编码为紧凑的一维标记并在共享空间中直接进行流匹配来实现文本与图像间的跨模态转换,无需复杂条件机制即可实现高效生成并保持优异性能。
English: This paper introduces FlowTok, a novel framework that simplifies cross-modality generation by directly transforming between text and images through flow matching in a shared 1D token space, achieving high efficiency and performance without complex conditioning mechanisms.

Authors:Yafei Zhang, Murray Wang, Yu Wang, Xiaohui Wang
Title: RankPO: Preference Optimization for Job-Talent Matching
Abstract:
Matching job descriptions (JDs) with suitable talent requires models capable of understanding not only textual similarities between JDs and candidate resumes but also contextual factors such as geographical location and academic seniority. To address this challenge, we propose a two-stage training framework for large language models (LLMs). In the first stage, a contrastive learning approach is used to train the model on a dataset constructed from real-world matching rules, such as geographical alignment and research area overlap. While effective, this model primarily learns patterns that defined by the matching rules. In the second stage, we introduce a novel preference-based fine-tuning method inspired by Direct Preference Optimization (DPO), termed Rank Preference Optimization (RankPO), to align the model with AI-curated pairwise preferences emphasizing textual understanding. Our experiments show that while the first-stage model achieves strong performance on rule-based data (nDCG@20 = 0.706), it lacks robust textual understanding (alignment with AI annotations = 0.46). By fine-tuning with RankPO, we achieve a balanced model that retains relatively good performance in the original tasks while significantly improving the alignment with AI preferences. The code and data are available at https://github.com/yflyzhang/RankPO.
中文: 本文提出一个两阶段训练框架,先通过基于现实匹配规则的对比学习训练大语言模型,再采用新型排序偏好优化方法强化文本理解能力,最终获得既能保持规则匹配性能、又显著提升与AI偏好对齐度的平衡模型。
English: This paper introduces a two-stage training framework for large language models that first uses contrastive learning based on real-world matching rules and then applies a novel Rank Preference Optimization method to enhance textual understanding, resulting in a balanced model that maintains rule-based performance while significantly improving alignment with AI preferences.

Authors:Tsan-Tsung Yang, I-Wei Chen, Kuan-Ting Chen, Shang-Hsuan Chiang, Wen-Chih Peng
Title: Team NYCU at Defactify4: Robust Detection and Source Identification of AI-Generated Images Using CNN and CLIP-Based Models
Abstract:
With the rapid advancement of generative AI, AI-generated images have become increasingly realistic, raising concerns about creativity, misinformation, and content authenticity. Detecting such images and identifying their source models has become a critical challenge in ensuring the integrity of digital media. This paper tackles the detection of AI-generated images and identifying their source models using CNN and CLIP-ViT classifiers. For the CNN-based classifier, we leverage EfficientNet-B0 as the backbone and feed with RGB channels, frequency features, and reconstruction errors, while for CLIP-ViT, we adopt a pretrained CLIP image encoder to extract image features and SVM to perform classification. Evaluated on the Defactify 4 dataset, our methods demonstrate strong performance in both tasks, with CLIP-ViT showing superior robustness to image perturbations. Compared to baselines like AEROBLADE and OCC-CLIP, our approach achieves competitive results. Notably, our method ranked Top-3 overall in the Defactify 4 competition, highlighting its effectiveness and generalizability. All of our implementations can be found in https://github.com/uuugaga/Defactify_4
Chinese: 本文采用CNN和CLIP-ViT分类器检测AI生成图像并识别其来源模型,在Defactify 4竞赛中表现优异,总体排名前三。
English: This paper presents a method using CNN and CLIP-ViT classifiers to detect AI-generated images and identify their source models, achieving strong performance and ranking Top-3 in the Defactify 4 competition.

Authors:Xin Liu, Pei Liu, Guoming Tang
Title: ZSMerge: Zero-Shot KV Cache Compression for Memory-Efficient Long-Context LLMs
Abstract:
The linear growth of key-value (KV) cache memory and quadratic computational in attention mechanisms complexity pose significant bottlenecks for large language models (LLMs) in long-context processing. While existing KV cache optimization methods address these challenges through token pruning or feature merging, they often incur irreversible information loss or require costly parameter retraining. To this end, we propose ZSMerge, a dynamic KV cache compression framework designed for efficient cache management, featuring three key operations: (1) fine-grained memory allocation guided by multi-dimensional token importance metrics at head-level granularity, (2) a residual merging mechanism that preserves critical context through compensated attention scoring, and (3) a zero-shot adaptation mechanism compatible with diverse LLM architectures without requiring retraining. ZSMerge significantly enhances memory efficiency and inference speed with negligible performance degradation across LLMs. When applied to LLaMA2-7B, it demonstrates a 20:1 compression ratio for key-value cache retention (reducing memory footprint to 5\% of baseline) while sustaining comparable generation quality, coupled with triple throughput gains at extreme 54k-token contexts that eliminate out-of-memory failures. The code is available at https://github.com/SusCom-Lab/ZSMerge.
中文摘要:ZSMerge是一种动态KV缓存压缩框架,通过细粒度令牌重要性评估、残差合并和零样本自适应机制,在保持生成质量的同时实现20:1的压缩比,显著提升大语言模型的内存效率和推理速度。
English Summary: ZSMerge is a dynamic KV cache compression framework that enhances memory efficiency and inference speed in large language models through fine-grained token importance evaluation, residual merging, and zero-shot adaptation, achieving a 20:1 compression ratio with minimal performance loss.

Authors:Xin Liu, Xudong Wang, Pei Liu, Guoming Tang
Title: ZSMerge: Zero-Shot KV Cache Compression for Memory-Efficient Long-Context LLMs
Abstract:
The linear growth of key-value (KV) cache memory and quadratic computational in attention mechanisms complexity pose significant bottlenecks for large language models (LLMs) in long-context processing. While existing KV cache optimization methods address these challenges through token pruning or feature merging, they often incur irreversible information loss or require costly parameter retraining. To this end, we propose ZSMerge, a dynamic KV cache compression framework designed for efficient cache management, featuring three key operations: (1) fine-grained memory allocation guided by multi-dimensional token importance metrics at head-level granularity, (2) a residual merging mechanism that preserves critical context through compensated attention scoring, and (3) a zero-shot adaptation mechanism compatible with diverse LLM architectures without requiring retraining. ZSMerge significantly enhances memory efficiency and inference speed with negligible performance degradation across LLMs. When applied to LLaMA2-7B, it demonstrates a 20:1 compression ratio for key-value cache retention (reducing memory footprint to 5\% of baseline) while sustaining comparable generation quality, coupled with triple throughput gains at extreme 54k-token contexts that eliminate out-of-memory failures. The code is available at https://github.com/SusCom-Lab/ZSMerge.
中文摘要:ZSMerge是一种动态KV缓存压缩框架,通过细粒度令牌重要性评估、残差合并和零样本自适应机制,在保持生成质量的同时实现20:1的压缩比,显著提升大语言模型的内存效率和推理速度。
English Summary: ZSMerge is a dynamic KV cache compression framework that enhances memory efficiency and inference speed in large language models through fine-grained token importance evaluation, residual merging, and zero-shot adaptation, achieving a 20:1 compression ratio with minimal performance loss.

Authors:Fan Lyu, Tianle Liu, Zhang Zhang, Fuyuan Hu, Liang Wang
Title: Test-Time Discovery via Hashing Memory
Abstract:
We introduce Test-Time Discovery (TTD) as a novel task that addresses class shifts during testing, requiring models to simultaneously identify emerging categories while preserving previously learned ones. A key challenge in TTD is distinguishing newly discovered classes from those already identified. To address this, we propose a training-free, hash-based memory mechanism that enhances class discovery through fine-grained comparisons with past test samples. Leveraging the characteristics of unknown classes, our approach introduces hash representation based on feature scale and directions, utilizing Locality-Sensitive Hashing (LSH) for efficient grouping of similar samples. This enables test samples to be easily and quickly compared with relevant past instances. Furthermore, we design a collaborative classification strategy, combining a prototype classifier for known classes with an LSH-based classifier for novel ones. To enhance reliability, we incorporate a self-correction mechanism that refines memory labels through hash-based neighbor retrieval, ensuring more stable and accurate class assignments. Experimental results demonstrate that our method achieves good discovery of novel categories while maintaining performance on known classes, establishing a new paradigm in model testing. Our code is available at https://github.com/fanlyu/ttd.
中文摘要:本文提出测试时发现(TTD)方法,通过基于哈希的内存机制和协同分类策略,在测试阶段有效识别新类别同时保持已知类别性能,无需重新训练模型。
English Summary: The paper introduces Test-Time Discovery (TTD), a training-free method using hash-based memory and collaborative classification to identify new classes during testing while maintaining known class performance.

Authors:Yefei He, Yuanyu He, Shaoxuan He, Feng Chen, Hong Zhou, Kaipeng Zhang, Bohan Zhuang
Title: Neighboring Autoregressive Modeling for Efficient Visual Generation
Abstract:
Visual autoregressive models typically adhere to a raster-order ``next-token prediction" paradigm, which overlooks the spatial and temporal locality inherent in visual content. Specifically, visual tokens exhibit significantly stronger correlations with their spatially or temporally adjacent tokens compared to those that are distant. In this paper, we propose Neighboring Autoregressive Modeling (NAR), a novel paradigm that formulates autoregressive visual generation as a progressive outpainting procedure, following a near-to-far ``next-neighbor prediction" mechanism. Starting from an initial token, the remaining tokens are decoded in ascending order of their Manhattan distance from the initial token in the spatial-temporal space, progressively expanding the boundary of the decoded region. To enable parallel prediction of multiple adjacent tokens in the spatial-temporal space, we introduce a set of dimension-oriented decoding heads, each predicting the next token along a mutually orthogonal dimension. During inference, all tokens adjacent to the decoded tokens are processed in parallel, substantially reducing the model forward steps for generation. Experiments on ImageNet$256\times 256$ and UCF101 demonstrate that NAR achieves 2.4$\times$ and 8.6$\times$ higher throughput respectively, while obtaining superior FID/FVD scores for both image and video generation tasks compared to the PAR-4X approach. When evaluating on text-to-image generation benchmark GenEval, NAR with 0.8B parameters outperforms Chameleon-7B while using merely 0.4 of the training data. Code is available at https://github.com/ThisisBillhe/NAR.
Chinese: 本文提出邻近自回归建模(NAR),通过将视觉生成重构为基于时空邻近性的渐进式外延过程,在图像和视频生成任务中实现了比现有方法更高的吞吐量和更优的性能指标。
English: The paper introduces Neighboring Autoregressive Modeling (NAR), a novel visual generation paradigm that replaces raster-order token prediction with a progressive outpainting approach based on spatial-temporal proximity, achieving higher throughput and superior performance in image and video generation tasks compared to existing methods.

Authors:Yibin Ye, Xichao Teng, Shuo Chen, Zhang Li, Leqi Liu, Qifeng Yu, Tao Tan
Title: Exploring the best way for UAV visual localization under Low-altitude Multi-view Observation Condition: a Benchmark
Abstract:
Absolute Visual Localization (AVL) enables Unmanned Aerial Vehicle (UAV) to determine its position in GNSS-denied environments by establishing geometric relationships between UAV images and geo-tagged reference maps. While many previous works have achieved AVL with image retrieval and matching techniques, research in low-altitude multi-view scenarios still remains limited. Low-altitude Multi-view condition presents greater challenges due to extreme viewpoint changes. To explore the best UAV AVL approach in such condition, we proposed this benchmark. Firstly, a large-scale Low-altitude Multi-view dataset called AnyVisLoc was constructed. This dataset includes 18,000 images captured at multiple scenes and altitudes, along with 2.5D reference maps containing aerial photogrammetry maps and historical satellite maps. Secondly, a unified framework was proposed to integrate the state-of-the-art AVL approaches and comprehensively test their performance. The best combined method was chosen as the baseline and the key factors that influencing localization accuracy are thoroughly analyzed based on it. This baseline achieved a 74.1% localization accuracy within 5m under Low-altitude, Multi-view conditions. In addition, a novel retrieval metric called PDM@K was introduced to better align with the characteristics of the UAV AVL task. Overall, this benchmark revealed the challenges of Low-altitude, Multi-view UAV AVL and provided valuable guidance for future research. The dataset and codes are available at https://github.com/UAV-AVL/Benchmark
中文: 该基准针对低空多视角无人机绝对视觉定位任务,构建了大规模数据集并提出了统一评估框架,在5米误差范围内达到74.1%定位精度,同时揭示了该场景的特殊挑战并为后续研究提供了方向。
English: This benchmark introduces a large-scale dataset and unified framework for absolute visual localization in low-altitude multi-view UAV scenarios, achieving 74.1% accuracy within 5 meters while highlighting unique challenges and providing future research guidance.

Authors:Qiji Zhou, Yifan Gong, Guangsheng Bao, Hongjie Qiu, Jinqiang Li, Xiangrong Zhu, Huajian Zhang, Yue Zhang
Title: Reasoning is All You Need for Video Generalization: A Counterfactual Benchmark with Sub-question Evaluation
Abstract:
Counterfactual reasoning is crucial for robust video understanding but remains underexplored in existing multimodal benchmarks. In this paper, we introduce \textbf{COVER} (\textbf{\underline{CO}}unterfactual \textbf{\underline{V}}id\textbf{\underline{E}}o \textbf{\underline{R}}easoning), a multidimensional multimodal benchmark that systematically evaluates MLLMs across the abstract-concrete and perception-cognition dimensions. Beyond prior multimodal benchmarks, COVER decomposes complex queries into structured sub-questions, enabling fine-grained reasoning analysis. Experiments on commercial and open-source models reveal a strong correlation between sub-question accuracy and counterfactual reasoning performance, highlighting the role of structured inference in video understanding. Furthermore, our results suggest a key insight: enhancing the reasoning capability of models is essential for improving the robustness of video understanding. COVER establishes a new standard for assessing MLLMs' logical reasoning abilities in dynamic environments. Our work is available at https://github.com/gongyifan-hash/COVER-Benchmark.
中文: 本文提出的COVER基准通过结构化子问题分解系统评估多模态大模型在视频理解中的反事实推理能力,实验表明提升模型推理能力对增强视频理解鲁棒性至关重要。
English: The COVER benchmark is introduced to systematically evaluate multimodal large language models' counterfactual reasoning abilities in video understanding through structured sub-question decomposition, revealing that enhanced reasoning capabilities are crucial for robust performance.

Authors:Khawar Islam, Naveed Akhtar
Title: Context-guided Responsible Data Augmentation with Diffusion Models
Abstract:
Generative diffusion models offer a natural choice for data augmentation when training complex vision models. However, ensuring reliability of their generative content as augmentation samples remains an open challenge. Despite a number of techniques utilizing generative images to strengthen model training, it remains unclear how to utilize the combination of natural and generative images as a rich supervisory signal for effective model induction. In this regard, we propose a text-to-image (T2I) data augmentation method, named DiffCoRe-Mix, that computes a set of generative counterparts for a training sample with an explicitly constrained diffusion model that leverages sample-based context and negative prompting for a reliable augmentation sample generation. To preserve key semantic axes, we also filter out undesired generative samples in our augmentation process. To that end, we propose a hard-cosine filtration in the embedding space of CLIP. Our approach systematically mixes the natural and generative images at pixel and patch levels. We extensively evaluate our technique on ImageNet-1K,Tiny ImageNet-200, CIFAR-100, Flowers102, CUB-Birds, Stanford Cars, and Caltech datasets, demonstrating a notable increase in performance across the board, achieving up to $\sim 3\%$ absolute gain for top-1 accuracy over the state-of-the-art methods, while showing comparable computational overhead. Our code is publicly available at https://github.com/khawar-islam/DiffCoRe-Mix
中文:DiffCoRe-Mix方法通过约束性扩散和语义过滤系统性地融合自然与生成图像,在保持计算效率的同时将视觉模型准确率最高提升3%。
English: The proposed DiffCoRe-Mix method enhances vision model training by systematically combining natural and generative images through constrained diffusion and semantic filtering, achieving up to 3% accuracy improvement with minimal computational overhead.

Authors:Lehan Yang, Jincen Song, Tianlong Wang, Daiqing Qi, Weili Shi, Yuheng Liu, Sheng Li
Title: VRMDiff: Text-Guided Video Referring Matting Generation of Diffusion
Abstract:
We propose a new task, video referring matting, which obtains the alpha matte of a specified instance by inputting a referring caption. We treat the dense prediction task of matting as video generation, leveraging the text-to-video alignment prior of video diffusion models to generate alpha mattes that are temporally coherent and closely related to the corresponding semantic instances. Moreover, we propose a new Latent-Constructive loss to further distinguish different instances, enabling more controllable interactive matting. Additionally, we introduce a large-scale video referring matting dataset with 10,000 videos. To the best of our knowledge, this is the first dataset that concurrently contains captions, videos, and instance-level alpha mattes. Extensive experiments demonstrate the effectiveness of our method. The dataset and code are available at https://github.com/Hansxsourse/VRMDiff.
中文: 本文提出了视频参考抠图新任务,通过利用视频扩散模型和新型潜在构造损失,根据参考描述生成时间一致的指定实例阿尔法蒙版,并发布了包含1万个视频的大规模数据集。
English: This paper introduces video referring matting, a novel task that generates temporally consistent alpha mattes for specified instances using referring captions by leveraging video diffusion models and a new Latent-Constructive loss, supported by a large-scale dataset of 10,000 videos.

Authors:Yang Xiao, Wang Lu, Jie Ji, Ruimeng Ye, Gen Li, Xiaolong Ma, Bo Hui
Title: Optimal Transport for Brain-Image Alignment: Unveiling Redundancy and Synergy in Neural Information Processing
Abstract:
The design of artificial neural networks (ANNs) is inspired by the structure of the human brain, and in turn, ANNs offer a potential means to interpret and understand brain signals. Existing methods primarily align brain signals with stimulus signals using Mean Squared Error (MSE), which focuses only on local point-wise alignment and ignores global matching, leading to coarse interpretations and inaccuracies in brain signal decoding. In this paper, we address these issues through optimal transport (OT) and theoretically demonstrate why OT provides a more effective alignment strategy than MSE. Specifically, we construct a transport plan between brain voxel embeddings and image embeddings, enabling more precise matching. By controlling the amount of transport, we mitigate the influence of redundant information. We apply our alignment model directly to the Brain Captioning task by feeding brain signals into a large language model (LLM) instead of images. Our approach achieves state-of-the-art performance across ten evaluation metrics, surpassing the previous best method by an average of 6.11\% in single-subject training and 3.81\% in cross-subject training. Additionally, we have uncovered several insightful conclusions that align with existing brain research. We unveil the redundancy and synergy of brain information processing through region masking and data dimensionality reduction visualization experiments. We believe our approach paves the way for a more precise understanding of brain signals in the future. The code is available at https://github.com/NKUShaw/OT-Alignment4brain-to-image.
中文总结:本文提出了一种基于最优传输的方法,用于对齐脑信号与图像嵌入,通过全局匹配和减少冗余信息,在脑信号解码和描述任务中超越了传统均方误差方法,取得了领先的性能。
English Summary: This paper introduces an optimal transport-based method for aligning brain signals with image embeddings, which outperforms traditional MSE approaches by enabling global matching and reducing redundancy, achieving state-of-the-art results in brain signal decoding and captioning tasks.

Authors:Zhongzhan Huang, Guoming Ling, Yupei Lin, Yandong Chen, Shanshan Zhong, Hefeng Wu, Liang Lin
Title: RouterEval: A Comprehensive Benchmark for Routing LLMs to Explore Model-level Scaling Up in LLMs
Abstract:
Routing large language models (LLMs) is a new paradigm that uses a router to recommend the best LLM from a pool of candidates for a given input. In this paper, our comprehensive analysis with more than 8,500 LLMs reveals a novel model-level scaling up phenomenon in Routing LLMs, i.e., a capable router can significantly enhance the performance of this paradigm as the number of candidates increases. This improvement can even surpass the performance of the best single model in the pool and many existing strong LLMs, confirming it a highly promising paradigm. However, the lack of comprehensive and open-source benchmarks for Routing LLMs has hindered the development of routers. In this paper, we introduce RouterEval, a benchmark tailored for router research, which includes over 200,000,000 performance records for 12 popular LLM evaluations across various areas such as commonsense reasoning, semantic understanding, etc., based on over 8,500 various LLMs. Using RouterEval, extensive evaluations of existing Routing LLM methods reveal that most still have significant room for improvement. See https://github.com/MilkThink-Lab/RouterEval for all data, code and tutorial.
中文: 路由大语言模型通过从候选池中选择最优模型来提升性能,新推出的RouterEval基准测试揭示了该方法仍有巨大改进空间,并为路由器的开发提供了海量数据支持。
English: Routing large language models enhances performance by selecting the best model from a pool, with the new RouterEval benchmark revealing significant improvement potential and providing extensive data for router development.

Authors:Rongyao Fang, Chengqi Duan, Kun Wang, Linjiang Huang, Hao Li, Shilin Yan, Hao Tian, Xingyu Zeng, Rui Zhao, Jifeng Dai, Xihui Liu, Hongsheng Li
Title: GoT: Unleashing Reasoning Capability of Multimodal Large Language Model for Visual Generation and Editing
Abstract:
Current image generation and editing methods primarily process textual prompts as direct inputs without reasoning about visual composition and explicit operations. We present Generation Chain-of-Thought (GoT), a novel paradigm that enables generation and editing through an explicit language reasoning process before outputting images. This approach transforms conventional text-to-image generation and editing into a reasoning-guided framework that analyzes semantic relationships and spatial arrangements. We define the formulation of GoT and construct large-scale GoT datasets containing over 9M samples with detailed reasoning chains capturing semantic-spatial relationships. To leverage the advantages of GoT, we implement a unified framework that integrates Qwen2.5-VL for reasoning chain generation with an end-to-end diffusion model enhanced by our novel Semantic-Spatial Guidance Module. Experiments show our GoT framework achieves excellent performance on both generation and editing tasks, with significant improvements over baselines. Additionally, our approach enables interactive visual generation, allowing users to explicitly modify reasoning steps for precise image adjustments. GoT pioneers a new direction for reasoning-driven visual generation and editing, producing images that better align with human intent. To facilitate future research, we make our datasets, code, and pretrained models publicly available at https://github.com/rongyaofang/GoT.
中文: GoT框架通过语言推理链分析语义空间关系,将图像生成与编辑转化为推理引导的过程,在实现精确用户控制的同时显著超越了基线方法的性能。
English: The GoT framework introduces a reasoning-guided approach to image generation and editing by analyzing semantic-spatial relationships through explicit language reasoning chains before producing images, significantly outperforming baseline methods and enabling precise user control.

Authors:Zhaoyi Li, Xiaohan Zhao, Dong-Dong Wu, Jiacheng Cui, Zhiqiang Shen
Title: A Frustratingly Simple Yet Highly Effective Attack Baseline: Over 90% Success Rate Against the Strong Black-box Models of GPT-4.5/4o/o1
Abstract:
Despite promising performance on open-source large vision-language models (LVLMs), transfer-based targeted attacks often fail against black-box commercial LVLMs. Analyzing failed adversarial perturbations reveals that the learned perturbations typically originate from a uniform distribution and lack clear semantic details, resulting in unintended responses. This critical absence of semantic information leads commercial LVLMs to either ignore the perturbation entirely or misinterpret its embedded semantics, thereby causing the attack to fail. To overcome these issues, we notice that identifying core semantic objects is a key objective for models trained with various datasets and methodologies. This insight motivates our approach that refines semantic clarity by encoding explicit semantic details within local regions, thus ensuring interoperability and capturing finer-grained features, and by concentrating modifications on semantically rich areas rather than applying them uniformly. To achieve this, we propose a simple yet highly effective solution: at each optimization step, the adversarial image is cropped randomly by a controlled aspect ratio and scale, resized, and then aligned with the target image in the embedding space. Experimental results confirm our hypothesis. Our adversarial examples crafted with local-aggregated perturbations focused on crucial regions exhibit surprisingly good transferability to commercial LVLMs, including GPT-4.5, GPT-4o, Gemini-2.0-flash, Claude-3.5-sonnet, Claude-3.7-sonnet, and even reasoning models like o1, Claude-3.7-thinking and Gemini-2.0-flash-thinking. Our approach achieves success rates exceeding 90% on GPT-4.5, 4o, and o1, significantly outperforming all prior state-of-the-art attack methods. Our optimized adversarial examples under different configurations and training code are available at https://github.com/VILA-Lab/M-Attack.
中文: 现有针对黑盒商业大视觉语言模型的定向攻击常因扰动缺乏语义细节而失败,而我们的方法通过局部裁剪和对齐将对抗性修改集中于语义丰富区域,显著提升了迁移性,在GPT-4.5和Claude-3.7等模型上成功率超90%。
English: Current targeted attacks on black-box commercial large vision-language models often fail due to their uniform perturbations lacking semantic clarity, but our method enhances transferability by focusing adversarial modifications on semantically rich regions through local cropping and alignment, achieving over 90% success on models like GPT-4.5 and Claude-3.7.

Authors:Hashmat Shadab Malik, Shahina Kunhimon, Muzammal Naseer, Fahad Shahbaz Khan, Salman Khan
Title: Hierarchical Self-Supervised Adversarial Training for Robust Vision Models in Histopathology
Abstract:
Adversarial attacks pose significant challenges for vision models in critical fields like healthcare, where reliability is essential. Although adversarial training has been well studied in natural images, its application to biomedical and microscopy data remains limited. Existing self-supervised adversarial training methods overlook the hierarchical structure of histopathology images, where patient-slide-patch relationships provide valuable discriminative signals. To address this, we propose Hierarchical Self-Supervised Adversarial Training (HSAT), which exploits these properties to craft adversarial examples using multi-level contrastive learning and integrate it into adversarial training for enhanced robustness. We evaluate HSAT on multiclass histopathology dataset OpenSRH and the results show that HSAT outperforms existing methods from both biomedical and natural image domains. HSAT enhances robustness, achieving an average gain of 54.31% in the white-box setting and reducing performance drops to 3-4% in the black-box setting, compared to 25-30% for the baseline. These results set a new benchmark for adversarial training in this domain, paving the way for more robust models. Our Code for training and evaluation is available at https://github.com/HashmatShadab/HSAT.
中文: HSAT提出了一种分层自监督对抗训练方法,通过利用多级对比学习来增强组织病理学图像的鲁棒性,在白盒和黑盒场景下均取得了显著性能提升。
English: HSAT introduces hierarchical self-supervised adversarial training to enhance robustness in histopathology images by leveraging multi-level contrastive learning, achieving significant gains in white-box and black-box settings.

Authors:Boqian Li, Haiwen Feng, Zeyu Cai, Michael J. Black, Yuliang Xiu
Title: ETCH: Generalizing Body Fitting to Clothed Humans via Equivariant Tightness
Abstract:
Fitting a body to a 3D clothed human point cloud is a common yet challenging task. Traditional optimization-based approaches use multi-stage pipelines that are sensitive to pose initialization, while recent learning-based methods often struggle with generalization across diverse poses and garment types. We propose Equivariant Tightness Fitting for Clothed Humans, or ETCH, a novel pipeline that estimates cloth-to-body surface mapping through locally approximate SE(3) equivariance, encoding tightness as displacement vectors from the cloth surface to the underlying body. Following this mapping, pose-invariant body features regress sparse body markers, simplifying clothed human fitting into an inner-body marker fitting task. Extensive experiments on CAPE and 4D-Dress show that ETCH significantly outperforms state-of-the-art methods -- both tightness-agnostic and tightness-aware -- in body fitting accuracy on loose clothing (16.7% ~ 69.5%) and shape accuracy (average 49.9%). Our equivariant tightness design can even reduce directional errors by (67.2% ~ 89.8%) in one-shot (or out-of-distribution) settings (~ 1% data). Qualitative results demonstrate strong generalization of ETCH, regardless of challenging poses, unseen shapes, loose clothing, and non-rigid dynamics. We will release the code and models soon for research purposes at https://boqian-li.github.io/ETCH/.
中文: ETCH通过等变紧密度拟合将衣物表面映射至人体,在多样姿态和服装类型下显著提升了人体拟合精度与泛化能力。
English: ETCH is a novel pipeline that uses equivariant tightness fitting to map cloth surfaces to underlying bodies, significantly improving body fitting accuracy and generalization across diverse poses and clothing types.

Authors:Ayesha Ishaq, Jean Lahoud, Ketan More, Omkar Thawakar, Ritesh Thawkar, Dinura Dissanayake, Noor Ahsan, Yuhao Li, Fahad Shahbaz Khan, Hisham Cholakkal, Ivan Laptev, Rao Muhammad Anwer, Salman Khan
Title: DriveLMM-o1: A Step-by-Step Reasoning Dataset and Large Multimodal Model for Driving Scenario Understanding
Abstract:
While large multimodal models (LMMs) have demonstrated strong performance across various Visual Question Answering (VQA) tasks, certain challenges require complex multi-step reasoning to reach accurate answers. One particularly challenging task is autonomous driving, which demands thorough cognitive processing before decisions can be made. In this domain, a sequential and interpretive understanding of visual cues is essential for effective perception, prediction, and planning. Nevertheless, common VQA benchmarks often focus on the accuracy of the final answer while overlooking the reasoning process that enables the generation of accurate responses. Moreover, existing methods lack a comprehensive framework for evaluating step-by-step reasoning in realistic driving scenarios. To address this gap, we propose DriveLMM-o1, a new dataset and benchmark specifically designed to advance step-wise visual reasoning for autonomous driving. Our benchmark features over 18k VQA examples in the training set and more than 4k in the test set, covering diverse questions on perception, prediction, and planning, each enriched with step-by-step reasoning to ensure logical inference in autonomous driving scenarios. We further introduce a large multimodal model that is fine-tuned on our reasoning dataset, demonstrating robust performance in complex driving scenarios. In addition, we benchmark various open-source and closed-source methods on our proposed dataset, systematically comparing their reasoning capabilities for autonomous driving tasks. Our model achieves a +7.49% gain in final answer accuracy, along with a 3.62% improvement in reasoning score over the previous best open-source model. Our framework, dataset, and model are available at https://github.com/ayesha-ishaq/DriveLMM-o1.
Chinese: 大型多模态模型在视觉问答任务中表现出色,但在自动驾驶的复杂多步推理上存在不足,为此我们提出了DriveLMM-o1数据集和模型,以提升自动驾驶场景中的推理能力和准确性。
English: Large multimodal models excel in Visual Question Answering but struggle with complex multi-step reasoning in autonomous driving, prompting the introduction of DriveLMM-o1, a dataset and model that enhances reasoning and accuracy in driving scenarios.

Authors:Jinyang Li, En Yu, Sijia Chen, Wenbing Tao
Title: OVTR: End-to-End Open-Vocabulary Multiple Object Tracking with Transformer
Abstract:
Open-vocabulary multiple object tracking aims to generalize trackers to unseen categories during training, enabling their application across a variety of real-world scenarios. However, the existing open-vocabulary tracker is constrained by its framework structure, isolated frame-level perception, and insufficient modal interactions, which hinder its performance in open-vocabulary classification and tracking. In this paper, we propose OVTR (End-to-End Open-Vocabulary Multiple Object Tracking with TRansformer), the first end-to-end open-vocabulary tracker that models motion, appearance, and category simultaneously. To achieve stable classification and continuous tracking, we design the CIP (Category Information Propagation) strategy, which establishes multiple high-level category information priors for subsequent frames. Additionally, we introduce a dual-branch structure for generalization capability and deep multimodal interaction, and incorporate protective strategies in the decoder to enhance performance. Experimental results show that our method surpasses previous trackers on the open-vocabulary MOT benchmark while also achieving faster inference speeds and significantly reducing preprocessing requirements. Moreover, the experiment transferring the model to another dataset demonstrates its strong adaptability. Models and code are released at https://github.com/jinyanglii/OVTR.
中文摘要:OVTR是首个端到端的开放词汇多目标跟踪器,通过类别信息传播策略和深度多模态交互设计,在保持高效推理的同时显著提升了未知类别的跟踪性能。
English Summary: OVTR is the first end-to-end open-vocabulary multiple object tracker that simultaneously models motion, appearance, and categories, achieving superior performance and faster inference through innovative strategies like category information propagation and deep multimodal interaction.

Authors:Yi Yang, Xiaoxuan He, Hongkun Pan, Xiyan Jiang, Yan Deng, Xingtao Yang, Haoyu Lu, Dacheng Yin, Fengyun Rao, Minfeng Zhu, Bo Zhang, Wei Chen
Title: R1-Onevision: Advancing Generalized Multimodal Reasoning through Cross-Modal Formalization
Abstract:
Large Language Models have demonstrated remarkable reasoning capability in complex textual tasks. However, multimodal reasoning, which requires integrating visual and textual information, remains a significant challenge. Existing visual-language models often struggle to effectively analyze and reason visual content, resulting in suboptimal performance on complex reasoning tasks. Moreover, the absence of comprehensive benchmarks hinders the accurate assessment of multimodal reasoning capabilities. In this paper, we introduce R1-Onevision, a multimodal reasoning model designed to bridge the gap between visual perception and deep reasoning. To achieve this, we propose a cross-modal reasoning pipeline that transforms images into formal textural representations, enabling precise language-based reasoning. Leveraging this pipeline, we construct the R1-Onevision dataset which provides detailed, step-by-step multimodal reasoning annotations across diverse domains. We further develop the R1-Onevision model through supervised fine-tuning and reinforcement learning to cultivate advanced reasoning and robust generalization abilities. To comprehensively evaluate multimodal reasoning performance across different grades, we introduce R1-Onevision-Bench, a benchmark aligned with human educational stages, covering exams from junior high school to university and beyond. Experimental results show that R1-Onevision achieves state-of-the-art performance, outperforming models such as GPT-4o and Qwen2.5-VL on multiple challenging multimodal reasoning benchmarks.
Chinese: R1-Onevision多模态推理模型通过跨模态管道将图像转化为文本表示,弥合了视觉感知与深度推理之间的鸿沟,在多个复杂基准测试中超越GPT-4o等模型,实现了最先进的性能。
English: The R1-Onevision multimodal reasoning model bridges visual and textual analysis through a cross-modal pipeline that converts images into textual representations, achieving state-of-the-art performance on complex benchmarks by outperforming models like GPT-4o.

Authors:Severin Heidrich, Till Beemelmanns, Alexey Nekrasov, Bastian Leibe, Lutz Eckstein
Title: OCCUQ: Exploring Efficient Uncertainty Quantification for 3D Occupancy Prediction
Abstract:
Autonomous driving has the potential to significantly enhance productivity and provide numerous societal benefits. Ensuring robustness in these safety-critical systems is essential, particularly when vehicles must navigate adverse weather conditions and sensor corruptions that may not have been encountered during training. Current methods often overlook uncertainties arising from adversarial conditions or distributional shifts, limiting their real-world applicability. We propose an efficient adaptation of an uncertainty estimation technique for 3D occupancy prediction. Our method dynamically calibrates model confidence using epistemic uncertainty estimates. Our evaluation under various camera corruption scenarios, such as fog or missing cameras, demonstrates that our approach effectively quantifies epistemic uncertainty by assigning higher uncertainty values to unseen data. We introduce region-specific corruptions to simulate defects affecting only a single camera and validate our findings through both scene-level and region-level assessments. Our results show superior performance in Out-of-Distribution (OoD) detection and confidence calibration compared to common baselines such as Deep Ensembles and MC-Dropout. Our approach consistently demonstrates reliable uncertainty measures, indicating its potential for enhancing the robustness of autonomous driving systems in real-world scenarios. Code and dataset are available at https://github.com/ika-rwth-aachen/OCCUQ .
中文: 本研究提出了一种用于自动驾驶中3D占据预测的高效不确定性估计方法,通过动态校准模型置信度,在相机损坏等恶劣条件下展现出优异的异常数据检测能力和系统鲁棒性提升。
English: This study introduces an efficient uncertainty estimation method for 3D occupancy prediction in autonomous driving, which dynamically calibrates model confidence and demonstrates superior performance in detecting out-of-distribution data and improving system robustness under adverse conditions like camera corruptions.

Authors:Jinhao Duan, Fei Kong, Hao Cheng, James Diffenderfer, Bhavya Kailkhura, Lichao Sun, Xiaofeng Zhu, Xiaoshuang Shi, Kaidi Xu
Title: TruthPrInt: Mitigating LVLM Object Hallucination Via Latent Truthful-Guided Pre-Intervention
Abstract:
Object Hallucination (OH) has been acknowledged as one of the major trustworthy challenges in Large Vision-Language Models (LVLMs). Recent advancements in Large Language Models (LLMs) indicate that internal states, such as hidden states, encode the "overall truthfulness" of generated responses. However, it remains under-explored how internal states in LVLMs function and whether they could serve as "per-token" hallucination indicators, which is essential for mitigating OH. In this paper, we first conduct an in-depth exploration of LVLM internal states in relation to OH issues and discover that (1) LVLM internal states are high-specificity per-token indicators of hallucination behaviors. Moreover, (2) different LVLMs encode universal patterns of hallucinations in common latent subspaces, indicating that there exist "generic truthful directions" shared by various LVLMs. Based on these discoveries, we propose Truthful-Guided Pre-Intervention (TruthPrInt) that first learns the truthful direction of LVLM decoding and then applies truthful-guided inference-time intervention during LVLM decoding. We further propose ComnHallu to enhance both cross-LVLM and cross-data hallucination detection transferability by constructing and aligning hallucination latent subspaces. We evaluate TruthPrInt in extensive experimental settings, including in-domain and out-of-domain scenarios, over popular LVLMs and OH benchmarks. Experimental results indicate that TruthPrInt significantly outperforms state-of-the-art methods. Codes will be available at https://github.com/jinhaoduan/TruthPrInt.
Chinese: 物体幻觉是大视觉语言模型中的主要可信挑战,但内部状态可作为逐令牌指示器并揭示通用模式,由此提出的TruthPrInt方法通过真实性引导干预显著优于现有方法。
English: Object Hallucination is a major trust issue in Large Vision-Language Models, but internal states can serve as per-token indicators and reveal universal patterns, leading to the proposed TruthPrInt method that significantly outperforms existing approaches through truthful-guided intervention.

Authors:Rui Hu, Lianghui Zhu, Yuxuan Zhang, Tianheng Cheng, Lei Liu, Heng Liu, Longjin Ran, Xiaoxin Chen, Wenyu Liu, Xinggang Wang
Title: GroundingSuite: Measuring Complex Multi-Granular Pixel Grounding
Abstract:
Pixel grounding, encompassing tasks such as Referring Expression Segmentation (RES), has garnered considerable attention due to its immense potential for bridging the gap between vision and language modalities. However, advancements in this domain are currently constrained by limitations inherent in existing datasets, including limited object categories, insufficient textual diversity, and a scarcity of high-quality annotations. To mitigate these limitations, we introduce GroundingSuite, which comprises: (1) an automated data annotation framework leveraging multiple Vision-Language Model (VLM) agents; (2) a large-scale training dataset encompassing 9.56 million diverse referring expressions and their corresponding segmentations; and (3) a meticulously curated evaluation benchmark consisting of 3,800 images. The GroundingSuite training dataset facilitates substantial performance improvements, enabling models trained on it to achieve state-of-the-art results. Specifically, a cIoU of 68.9 on gRefCOCO and a gIoU of 55.3 on RefCOCOm. Moreover, the GroundingSuite annotation framework demonstrates superior efficiency compared to the current leading data annotation method, i.e., $4.5 \times$ faster than GLaMM.
中文: 针对像素接地任务中现有数据集在对象类别、文本多样性和标注质量上的局限,本研究提出了GroundingSuite自动化数据标注框架,构建了大规模训练数据集和精选评估基准,不仅显著提升了模型性能达到最优水平,其标注效率也比现有领先方法快4.5倍。
English: To address the limitations of existing datasets in pixel grounding, such as restricted object categories and insufficient annotations, this study introduces GroundingSuite—an automated data annotation framework that generates a large-scale training dataset and a curated evaluation benchmark, achieving state-of-the-art performance and significantly faster annotation efficiency compared to current methods.

Authors:Egor Zverev, Evgenii Kortukov, Alexander Panfilov, Alexandra Volkova, Soroush Tabesh, Sebastian Lapuschkin, Wojciech Samek, Christoph H. Lampert
Title: ASIDE: Architectural Separation of Instructions and Data in Language Models
Abstract:
Despite their remarkable performance, large language models lack elementary safety features, making them susceptible to numerous malicious attacks. In particular, previous work has identified the absence of an intrinsic separation between instructions and data as a root cause of the success of prompt injection attacks. In this work, we propose a new architectural element, ASIDE, that allows language models to clearly separate instructions and data at the level of embeddings. ASIDE applies an orthogonal rotation to the embeddings of data tokens, thus creating clearly distinct representations of instructions and data tokens without introducing any additional parameters. As we demonstrate experimentally across a range of models, instruction-tuning LLMs with ASIDE (1) leads to highly increased instruction-data separation without a loss in model utility and (2) makes the models more robust to prompt injection benchmarks, even without dedicated safety training. Additionally, we provide insights into the mechanism underlying our method through an analysis of the model representations. The source code and training scripts are openly accessible at https://github.com/egozverev/aside.
中文摘要:本研究提出ASIDE架构改进方案,通过对数据令牌嵌入进行正交旋转实现指令与数据的清晰分离,在不影响模型性能的前提下显著增强语言模型对提示注入攻击的防御能力。
English Summary: The study introduces ASIDE, an architectural enhancement that improves language model security by orthogonally rotating data token embeddings to clearly separate instructions from data, thereby increasing robustness against prompt injection attacks without compromising performance.

Authors:Quoc-Tien Nguyen, Hong-Hai Nguyen, Van-Thong Huynh
Title: Lightweight Models for Emotional Analysis in Video
Abstract:
In this study, we present an approach for efficient spatiotemporal feature extraction using MobileNetV4 and a multi-scale 3D MLP-Mixer-based temporal aggregation module. MobileNetV4, with its Universal Inverted Bottleneck (UIB) blocks, serves as the backbone for extracting hierarchical feature representations from input image sequences, ensuring both computational efficiency and rich semantic encoding. To capture temporal dependencies, we introduce a three-level MLP-Mixer module, which processes spatial features at multiple resolutions while maintaining structural integrity. Experimental results on the ABAW 8th competition demonstrate the effectiveness of our approach, showing promising performance in affective behavior analysis. By integrating an efficient vision backbone with a structured temporal modeling mechanism, the proposed framework achieves a balance between computational efficiency and predictive accuracy, making it well-suited for real-time applications in mobile and embedded computing environments.
中文: 本研究提出一种高效框架,结合MobileNetV4进行空间特征提取和多尺度3D MLP-Mixer进行时序建模,在ABAW第八届竞赛中验证了其在情感行为分析中兼顾计算效率与预测精度的优势。
English: This study introduces an efficient framework combining MobileNetV4 for spatial feature extraction with a multi-scale 3D MLP-Mixer for temporal modeling, achieving balanced computational efficiency and accuracy in affective behavior analysis as validated on the ABAW 8th competition.

Authors:Zengrong Lin, Zheng Wang, Tianwen Qian, Pan Mu, Sixian Chan, Cong Bai
Title: NeighborRetr: Balancing Hub Centrality in Cross-Modal Retrieval
Abstract:
Cross-modal retrieval aims to bridge the semantic gap between different modalities, such as visual and textual data, enabling accurate retrieval across them. Despite significant advancements with models like CLIP that align cross-modal representations, a persistent challenge remains: the hubness problem, where a small subset of samples (hubs) dominate as nearest neighbors, leading to biased representations and degraded retrieval accuracy. Existing methods often mitigate hubness through post-hoc normalization techniques, relying on prior data distributions that may not be practical in real-world scenarios. In this paper, we directly mitigate hubness during training and introduce NeighborRetr, a novel method that effectively balances the learning of hubs and adaptively adjusts the relations of various kinds of neighbors. Our approach not only mitigates the hubness problem but also enhances retrieval performance, achieving state-of-the-art results on multiple cross-modal retrieval benchmarks. Furthermore, NeighborRetr demonstrates robust generalization to new domains with substantial distribution shifts, highlighting its effectiveness in real-world applications. We make our code publicly available at: https://github.com/zzezze/NeighborRetr .
中文: NeighborRetr是一种新颖的训练方法,通过自适应平衡中心点学习和邻居关系,直接缓解跨模态检索中的中心点问题,实现了最优性能并展现出强大的跨领域泛化能力。
English: NeighborRetr is a novel training method that directly mitigates the hubness problem in cross-modal retrieval by adaptively balancing hub learning and neighbor relations, achieving state-of-the-art performance and robust generalization across domains.

Authors:Florian Eichin, Yang Janet Liu, Barbara Plank, Michael A. Hedderich
Title: Probing LLMs for Multilingual Discourse Generalization Through a Unified Label Set
Abstract:
Discourse understanding is essential for many NLP tasks, yet most existing work remains constrained by framework-dependent discourse representations. This work investigates whether large language models (LLMs) capture discourse knowledge that generalizes across languages and frameworks. We address this question along two dimensions: (1) developing a unified discourse relation label set to facilitate cross-lingual and cross-framework discourse analysis, and (2) probing LLMs to assess whether they encode generalizable discourse abstractions. Using multilingual discourse relation classification as a testbed, we examine a comprehensive set of 23 LLMs of varying sizes and multilingual capabilities. Our results show that LLMs, especially those with multilingual training corpora, can generalize discourse information across languages and frameworks. Further layer-wise analyses reveal that language generalization at the discourse level is most salient in the intermediate layers. Lastly, our error analysis provides an account of challenging relation classes.
Chinese: 本研究探讨大型语言模型如何跨语言和框架捕获可泛化的话语知识,发现多语言模型在跨语言话语关系分类中表现优异,其中间层展现出最强的泛化能力。
English: This study explores how large language models (LLMs) capture generalizable discourse knowledge across languages and frameworks, finding that multilingual models excel in cross-lingual discourse relation classification with intermediate layers showing the strongest generalization.

Authors:Xudong Tan, Peng Ye, Chongjun Tu, Jianjian Cao, Yaoxin Yang, Lin Zhang, Dongzhan Zhou, Tao Chen
Title: TokenCarve: Information-Preserving Visual Token Compression in Multimodal Large Language Models
Abstract:
Multimodal Large Language Models (MLLMs) are becoming increasingly popular, while the high computational cost associated with multimodal data input, particularly from visual tokens, poses a significant challenge. Existing training-based token compression methods improve inference efficiency but require costly retraining, while training-free methods struggle to maintain performance when aggressively reducing token counts. In this study, we reveal that the performance degradation of MLLM closely correlates with the accelerated loss of information in the attention output matrix. This insight introduces a novel information-preserving perspective, making it possible to maintain performance even under extreme token compression. Based on this finding, we propose TokenCarve, a training-free, plug-and-play, two-stage token compression framework. The first stage employs an Information-Preservation-Guided Selection (IPGS) strategy to prune low-information tokens, while the second stage further leverages IPGS to guide token merging, minimizing information loss. Extensive experiments on 11 datasets and 2 model variants demonstrate the effectiveness of TokenCarve. It can even reduce the number of visual tokens to 22.2% of the original count, achieving a 1.23x speedup in inference, a 64% reduction in KV cache storage, and only a 1.54% drop in accuracy. Our code is available at https://github.com/ShawnTan86/TokenCarve.
Chinese: TokenCarve是一种无需训练的令牌压缩框架,通过减少注意力输出中的信息损失来保持性能,在显著提升效率的同时仅造成微小的精度下降。
English: TokenCarve is a training-free token compression framework that preserves performance by minimizing information loss in attention outputs, achieving significant efficiency gains with minimal accuracy drop.

Authors:Jiali Yao, Xinran Deng, Xin Gu, Mengrui Dai, Bing Fan, Zhipeng Zhang, Yan Huang, Heng Fan, Libo Zhang
Title: OmniSTVG: Toward Spatio-Temporal Omni-Object Video Grounding
Abstract:
In this paper, we propose spatio-temporal omni-object video grounding, dubbed OmniSTVG, a new STVG task that aims at localizing spatially and temporally all targets mentioned in the textual query from videos. Compared to classic STVG locating only a single target, OmniSTVG enables localization of not only an arbitrary number of text-referred targets but also their interacting counterparts in the query from the video, making it more flexible and practical in real scenarios for comprehensive understanding. In order to facilitate exploration of OmniSTVG, we introduce BOSTVG, a large-scale benchmark dedicated to OmniSTVG. Specifically, our BOSTVG consists of 10,018 videos with 10.2M frames and covers a wide selection of 287 classes from diverse scenarios. Each sequence in BOSTVG, paired with a free-form textual query, encompasses a varying number of targets ranging from 1 to 10. To ensure high quality, each video is manually annotated with meticulous inspection and refinement. To our best knowledge, BOSTVG is to date the first and the largest benchmark for OmniSTVG. To encourage future research, we introduce a simple yet effective approach, named OmniTube, which, drawing inspiration from Transformer-based STVG methods, is specially designed for OmniSTVG and demonstrates promising results. By releasing BOSTVG, we hope to go beyond classic STVG by locating every object appearing in the query for more comprehensive understanding, opening up a new direction for STVG. Our benchmark, model, and results will be released at https://github.com/JellyYao3000/OmniSTVG.
中文: 本文提出OmniSTVG,一种新颖的时空视频定位任务,可在视频中同时定位文本查询中的所有目标对象,并发布了首个大规模基准BOSTVG和简单有效的OmniTube方法,为视频理解开辟了新方向。
English: This paper introduces OmniSTVG, a novel spatio-temporal video grounding task that localizes all objects mentioned in text queries across both space and time in videos, along with BOSTVG, the first large-scale benchmark for this task, and a simple yet effective method called OmniTube that shows promising results.

Authors:Eirik Høyheim, Lars Skaaret-Lund, Solve Sæbø, Aliaksandr Hubin
Title: Explainable Bayesian deep learning through input-skip Latent Binary Bayesian Neural Networks
Abstract:
Modeling natural phenomena with artificial neural networks (ANNs) often provides highly accurate predictions. However, ANNs often suffer from over-parameterization, complicating interpretation and raising uncertainty issues. Bayesian neural networks (BNNs) address the latter by representing weights as probability distributions, allowing for predictive uncertainty evaluation. Latent binary Bayesian neural networks (LBBNNs) further handle structural uncertainty and sparsify models by removing redundant weights. This article advances LBBNNs by enabling covariates to skip to any succeeding layer or be excluded, simplifying networks and clarifying input impacts on predictions. Ultimately, a linear model or even a constant can be found to be optimal for a specific problem at hand. Furthermore, the input-skip LBBNN approach reduces network density significantly compared to standard LBBNNs, achieving over 99% reduction for small networks and over 99.9% for larger ones, while still maintaining high predictive accuracy and uncertainty measurement. For example, on MNIST, we reached 97% accuracy and great calibration with just 935 weights, reaching state-of-the-art for compression of neural networks. Furthermore, the proposed method accurately identifies the true covariates and adjusts for system non-linearity. The main contribution is the introduction of active paths, enhancing directly designed global and local explanations within the LBBNN framework, that have theoretical guarantees and do not require post hoc external tools for explanations.
中文: 本文提出一种改进的潜在二元贝叶斯神经网络(LBBNN),允许协变量跨层连接或排除,在保持高精度和不确定性量化的同时实现超过99%的网络稀疏度,并提供无需外部工具的理论可解释性。
English: The article introduces an enhanced Latent Binary Bayesian Neural Network (LBBNN) that allows covariates to skip layers or be excluded, achieving over 99% network sparsity while maintaining high accuracy and uncertainty quantification, and providing built-in theoretical explanations without external tools.

Authors:Liang Wen, Yunke Cai, Fenrui Xiao, Xin He, Qi An, Zhenyu Duan, Yimin Du, Junchen Liu, Lifu Tang, Xiaowei Lv, Haosheng Zou, Yongchao Deng, Shousheng Jia, Xiangzheng Zhang
Title: Light-R1: Curriculum SFT, DPO and RL for Long COT from Scratch and Beyond
Abstract:
This paper introduces Light-R1, an open-source suite for training long reasoning models using reproducible and cost-effective methodology. Given the proprietary nature of data used in the DeepSeek-R1 series, we develop an alternative approach leveraging exclusively public data and models. Our curriculum training progressively increases data difficulty, combined with multi-staged post-training. Our Light-R1-32B model, trained from Qwen2.5-32B-Instruct, outperforms DeepSeek-R1-Distill-Qwen-32B in math reasoning. Experimental results show that this curriculum approach becomes more effective when distinct, diverse datasets are available for different training stages: fine-tuning DeepSeek-R1-Distilled models (pre-tuned by DeepSeek team on proprietary data) with 3,000 challenging examples from our curriculum dataset yielded state-of-the-art 7B and 14B models, while the 32B model, Light-R1-32B-DS performed comparably to QwQ-32B and DeepSeek-R1. Furthermore, we extend our work by applying GRPO on long reasoning models. Our final Light-R1-14B-DS achieves SOTA performance among 14B models in math, with AIME24 & 25 scores of 74.0 and 60.2 respectively, surpassing many 32B models and DeepSeek-R1-Distill-Llama-70B. Despite math-focused training, Light-R1-14B-DS demonstrates strong cross-domain generalization. Light-R1 represents a significant advancement in making sophisticated reasoning models more accessible and implementable in real-world applications. Our models, training data and code have been made available at https://github.com/Qihoo360/Light-R1.
中文: 本文介绍Light-R1开源框架,它采用经济高效的课程学习方法和公开数据训练长推理模型,在数学推理上达到先进水平并展现强大跨领域泛化能力。
English: This paper presents Light-R1, an open-source framework for training long reasoning models using a cost-effective curriculum approach with public data, achieving state-of-the-art performance in math reasoning and strong cross-domain generalization.

Authors:Wenhao Hu, Jinhao Duan, Chunchen Wei, Li Zhang, Yue Zhang, Kaidi Xu
Title: DynaCode: A Dynamic Complexity-Aware Code Benchmark for Evaluating Large Language Models in Code Generation
Abstract:
The rapid advancement of large language models (LLMs) has significantly improved their performance in code generation tasks. However, existing code benchmarks remain static, consisting of fixed datasets with predefined problems. This makes them vulnerable to memorization during training, where LLMs recall specific test cases instead of generalizing to new problems, leading to data contamination and unreliable evaluation results. To address these issues, we introduce DynaCode, a dynamic, complexity-aware benchmark that overcomes the limitations of static datasets. DynaCode evaluates LLMs systematically using a complexity-aware metric, incorporating both code complexity and call-graph structures. DynaCode achieves large-scale diversity, generating up to 189 million unique nested code problems across four distinct levels of code complexity, referred to as units, and 16 types of call graphs. Results on 12 latest LLMs show an average performance drop of 16.8% to 45.7% compared to MBPP+, a static code generation benchmark, with performance progressively decreasing as complexity increases. This demonstrates DynaCode's ability to effectively differentiate LLMs. Additionally, by leveraging call graphs, we gain insights into LLM behavior, particularly their preference for handling subfunction interactions within nested code. Our benchmark and evaluation code are available at https://github.com/HWH-2000/DynaCode.
中文: DynaCode 是一种动态、复杂度感知的代码基准,通过结合代码复杂度和调用图结构来系统评估大语言模型,有效克服静态数据集的局限性,揭示了模型性能显著下降并深入解析其行为特征。
English: DynaCode is a dynamic, complexity-aware benchmark designed to address the limitations of static code datasets by evaluating large language models using code complexity and call-graph structures, revealing significant performance drops and providing insights into model behavior.

Authors:Yuwen Du, Anning Hu, Zichen Chao, Yifan Lu, Junhao Ge, Genjia Liu, Weitao Wu, Lanjun Wang, Siheng Chen
Title: RoCo-Sim: Enhancing Roadside Collaborative Perception through Foreground Simulation
Abstract:
Roadside Collaborative Perception refers to a system where multiple roadside units collaborate to pool their perceptual data, assisting vehicles in enhancing their environmental awareness. Existing roadside perception methods concentrate on model design but overlook data issues like calibration errors, sparse information, and multi-view consistency, leading to poor performance on recent published datasets. To significantly enhance roadside collaborative perception and address critical data issues, we present the first simulation framework RoCo-Sim for road-side collaborative perception. RoCo-Sim is capable of generating diverse, multi-view consistent simulated roadside data through dynamic foreground editing and full-scene style transfer of a single image. RoCo-Sim consists of four components: (1) Camera Extrinsic Optimization ensures accurate 3D to 2D projection for roadside cameras; (2) A novel Multi-View Occlusion-Aware Sampler (MOAS) determines the placement of diverse digital assets within 3D space; (3) DepthSAM innovatively models foreground-background relationships from single-frame fixed-view images, ensuring multi-view consistency of foreground; and (4) Scalable Post-Processing Toolkit generates more realistic and enriched scenes through style transfer and other enhancements. RoCo-Sim significantly improves roadside 3D object detection, outperforming SOTA methods by 83.74 on Rcooper-Intersection and 83.12 on TUMTraf-V2X for AP70. RoCo-Sim fills a critical gap in roadside perception simulation. Code and pre-trained models will be released soon: https://github.com/duyuwen-duen/RoCo-Sim
中文: RoCo-Sim是首个用于路边协同感知的仿真框架,通过动态前景编辑和全场景风格迁移生成多样且多视角一致的数据,显著提升了三维物体检测性能。
English: RoCo-Sim is the first simulation framework designed to enhance roadside collaborative perception by generating diverse, multi-view consistent data through dynamic foreground editing and full-scene style transfer, significantly improving 3D object detection performance.

Authors:Yijing Lin, Mengqi Huang, Shuhan Zhuang, Zhendong Mao
Title: RealGeneral: Unifying Visual Generation via Temporal In-Context Learning with Video Models
Abstract:
Unifying diverse image generation tasks within a single framework remains a fundamental challenge in visual generation. While large language models (LLMs) achieve unification through task-agnostic data and generation, existing visual generation models fail to meet these principles. Current approaches either rely on per-task datasets and large-scale training or adapt pre-trained image models with task-specific modifications, limiting their generalizability. In this work, we explore video models as a foundation for unified image generation, leveraging their inherent ability to model temporal correlations. We introduce RealGeneral, a novel framework that reformulates image generation as a conditional frame prediction task, analogous to in-context learning in LLMs. To bridge the gap between video models and condition-image pairs, we propose (1) a Unified Conditional Embedding module for multi-modal alignment and (2) a Unified Stream DiT Block with decoupled adaptive LayerNorm and attention mask to mitigate cross-modal interference. RealGeneral demonstrates effectiveness in multiple important visual generation tasks, e.g., it achieves a 14.5% improvement in subject similarity for customized generation and a 10% enhancement in image quality for canny-to-image task. Project page: https://lyne1.github.io/realgeneral_web/; GitHub Link: https://github.com/Lyne1/RealGeneral
中文: RealGeneral提出了一种新颖框架,通过将图像生成任务重构为基于视频模型的帧预测任务,实现了多任务统一,并在主体相似性和图像质量方面取得显著提升。
English: RealGeneral introduces a novel framework that unifies diverse image generation tasks by reformulating them as conditional frame prediction using video models, achieving significant improvements in subject similarity and image quality.

Authors:Matteo Gambella, Fabrizio Pittorino, Manuel Roveri
Title: Architecture-Aware Minimization (A$^2$M): How to Find Flat Minima in Neural Architecture Search
Abstract:
Neural Architecture Search (NAS) has become an essential tool for designing effective and efficient neural networks. In this paper, we investigate the geometric properties of neural architecture spaces commonly used in differentiable NAS methods, specifically NAS-Bench-201 and DARTS. By defining flatness metrics such as neighborhoods and loss barriers along paths in architecture space, we reveal locality and flatness characteristics analogous to the well-known properties of neural network loss landscapes in weight space. In particular, we find that highly accurate architectures cluster together in flat regions, while suboptimal architectures remain isolated, unveiling the detailed geometrical structure of the architecture search landscape. Building on these insights, we propose Architecture-Aware Minimization (A$^2$M), a novel analytically derived algorithmic framework that explicitly biases, for the first time, the gradient of differentiable NAS methods towards flat minima in architecture space. A$^2$M consistently improves generalization over state-of-the-art DARTS-based algorithms on benchmark datasets including CIFAR-10, CIFAR-100, and ImageNet16-120, across both NAS-Bench-201 and DARTS search spaces. Notably, A$^2$M is able to increase the test accuracy, on average across different differentiable NAS methods, by +3.60\% on CIFAR-10, +4.60\% on CIFAR-100, and +3.64\% on ImageNet16-120, demonstrating its superior effectiveness in practice. A$^2$M can be easily integrated into existing differentiable NAS frameworks, offering a versatile tool for future research and applications in automated machine learning. We open-source our code at https://github.com/AI-Tech-Research-Lab/AsquaredM.
中文: 本文提出架构感知最小化(A²M)这一创新框架,通过引导梯度朝向架构空间中的平坦最小值来增强神经架构搜索,在多个基准数据集上显著提升了泛化性能。
English: This paper introduces Architecture-Aware Minimization (A²M), a novel framework that enhances neural architecture search by guiding gradients toward flat minima in architecture space, significantly improving generalization across benchmark datasets.

Authors:Fengxiang Wang, Hongzhen Wang, Yulin Wang, Di Wang, Mingshuo Chen, Haiyan Zhao, Yangang Sun, Shuo Wang, Long Lan, Wenjing Yang, Jing Zhang
Title: RoMA: Scaling up Mamba-based Foundation Models for Remote Sensing
Abstract:
Recent advances in self-supervised learning for Vision Transformers (ViTs) have fueled breakthroughs in remote sensing (RS) foundation models. However, the quadratic complexity of self-attention poses a significant barrier to scalability, particularly for large models and high-resolution images. While the linear-complexity Mamba architecture offers a promising alternative, existing RS applications of Mamba remain limited to supervised tasks on small, domain-specific datasets. To address these challenges, we propose RoMA, a framework that enables scalable self-supervised pretraining of Mamba-based RS foundation models using large-scale, diverse, unlabeled data. RoMA enhances scalability for high-resolution images through a tailored auto-regressive learning strategy, incorporating two key innovations: 1) a rotation-aware pretraining mechanism combining adaptive cropping with angular embeddings to handle sparsely distributed objects with arbitrary orientations, and 2) multi-scale token prediction objectives that address the extreme variations in object scales inherent to RS imagery. Systematic empirical studies validate that Mamba adheres to RS data and parameter scaling laws, with performance scaling reliably as model and data size increase. Furthermore, experiments across scene classification, object detection, and semantic segmentation tasks demonstrate that RoMA-pretrained Mamba models consistently outperform ViT-based counterparts in both accuracy and computational efficiency. The source code and pretrained models will be released at https://github.com/MiliLab/RoMA.
中文:RoMA框架通过旋转感知预训练机制和多尺度标记预测,实现了基于Mamba的遥感基础模型的可扩展自监督预训练,在精度和计算效率上持续优于基于ViT的模型。
English: The RoMA framework enables scalable self-supervised pretraining of Mamba-based remote sensing foundation models, overcoming ViT's quadratic complexity through rotation-aware mechanisms and multi-scale token prediction, consistently outperforming ViT counterparts in accuracy and efficiency.

Authors:Shuvro Chowdhury, Navid Anjum Aadit, Andrea Grimaldi, Eleonora Raimondo, Atharva Raut, P. Aaron Lott, Johan H. Mentink, Marek M. Rams, Federico Ricci-Tersenghi, Massimo Chiappini, Luke S. Theogarajan, Tathagata Srimani, Giovanni Finocchio, Masoud Mohseni, Kerem Y. Camsari
Title: Pushing the Boundary of Quantum Advantage in Hard Combinatorial Optimization with Probabilistic Computers
Abstract:
Recent demonstrations on specialized benchmarks have reignited excitement for quantum computers, yet whether they can deliver an advantage for practical real-world problems remains an open question. Here, we show that probabilistic computers (p-computers), when co-designed with hardware to implement powerful Monte Carlo algorithms, provide a compelling and scalable classical pathway for solving hard optimization problems. We focus on two key algorithms applied to 3D spin glasses: discrete-time simulated quantum annealing (DT-SQA) and adaptive parallel tempering (APT). We benchmark these methods against the performance of a leading quantum annealer on the same problem instances. For DT-SQA, we find that increasing the number of replicas improves residual energy scaling, in line with expectations from extreme value theory. We then show that APT, when supported by non-local isoenergetic cluster moves, exhibits a more favorable scaling and ultimately outperforms DT-SQA. We demonstrate these algorithms are readily implementable in modern hardware, projecting that custom Field Programmable Gate Arrays (FPGA) or specialized chips can leverage massive parallelism to accelerate these algorithms by orders of magnitude while drastically improving energy efficiency. Our results establish a new, rigorous classical baseline, clarifying the landscape for assessing a practical quantum advantage and presenting p-computers as a scalable platform for real-world optimization challenges.
中文: 概率计算机与硬件协同设计,通过实现强大的蒙特卡洛算法,为复杂优化问题提供了可扩展的经典解决方案,其性能超越量子退火机,为评估实际量子优势设立了新的基准。
English: P-computers, co-designed with hardware to implement Monte Carlo algorithms, offer a scalable classical solution for hard optimization problems, outperforming quantum annealers and setting a new benchmark for assessing practical quantum advantage.

Authors:Zhen Zhang, Meihan Liu, Bingsheng He
Title: PyGDA: A Python Library for Graph Domain Adaptation
Abstract:
Graph domain adaptation has emerged as a promising approach to facilitate knowledge transfer across different domains. Recently, numerous models have been proposed to enhance their generalization capabilities in this field. However, there is still no unified library that brings together existing techniques and simplifies their implementation. To fill this gap, we introduce PyGDA, an open-source Python library tailored for graph domain adaptation. As the first comprehensive library in this area, PyGDA covers more than 20 widely used graph domain adaptation methods together with different types of graph datasets. Specifically, PyGDA offers modular components, enabling users to seamlessly build custom models with a variety of commonly used utility functions. To handle large-scale graphs, PyGDA includes support for features such as sampling and mini-batch processing, ensuring efficient computation. In addition, PyGDA also includes comprehensive performance benchmarks and well-documented user-friendly API for both researchers and practitioners. To foster convenient accessibility, PyGDA is released under the MIT license at https://github.com/pygda-team/pygda, and the API documentation is https://pygda.readthedocs.io/en/stable/.
中文: PyGDA作为首个开源图域自适应综合库,集成了20多种主流方法并提供模块化组件与可扩展功能,填补了该领域统一工具库的空白。
English: PyGDA is an open-source Python library that unifies over 20 graph domain adaptation methods with modular components and scalable features to bridge the gap in standardized implementations for cross-domain knowledge transfer.

Authors:Zexuan Yan, Yue Ma, Chang Zou, Wenteng Chen, Qifeng Chen, Linfeng Zhang
Title: EEdit: Rethinking the Spatial and Temporal Redundancy for Efficient Image Editing
Abstract:
Inversion-based image editing is rapidly gaining momentum while suffering from significant computation overhead, hindering its application in real-time interactive scenarios. In this paper, we rethink that the redundancy in inversion-based image editing exists in both the spatial and temporal dimensions, such as the unnecessary computation in unedited regions and the redundancy in the inversion progress. To tackle these challenges, we propose a practical framework, named EEdit, to achieve efficient image editing. Specifically, we introduce three techniques to solve them one by one. For spatial redundancy, spatial locality caching is introduced to compute the edited region and its neighboring regions while skipping the unedited regions, and token indexing preprocessing is designed to further accelerate the caching. For temporal redundancy, inversion step skipping is proposed to reuse the latent for efficient editing. Our experiments demonstrate an average of 2.46 $\times$ acceleration without performance drop in a wide range of editing tasks including prompt-guided image editing, dragging and image composition. Our codes are available at https://github.com/yuriYanZeXuan/EEdit
中文:EEdit框架通过空间局部缓存和反转步骤跳跃技术,解决了基于反转的图像编辑中的空间与时间冗余问题,在多种编辑任务中实现了2.46倍加速且性能无损。
English: The EEdit framework addresses spatial and temporal redundancies in inversion-based image editing through spatial locality caching and inversion step skipping, achieving a 2.46× speedup without performance loss across various editing tasks.

Authors:Yunpeng Qu, Kun Yuan, Qizhi Xie, Ming Sun, Chao Zhou, Jian Wang
Title: KVQ: Boosting Video Quality Assessment via Saliency-guided Local Perception
Abstract:
Video Quality Assessment (VQA), which intends to predict the perceptual quality of videos, has attracted increasing attention. Due to factors like motion blur or specific distortions, the quality of different regions in a video varies. Recognizing the region-wise local quality within a video is beneficial for assessing global quality and can guide us in adopting fine-grained enhancement or transcoding strategies. Due to the heavy cost of annotating region-wise quality, the lack of ground truth constraints from relevant datasets further complicates the utilization of local perception. Inspired by the Human Visual System (HVS) that links global quality to the local texture of different regions and their visual saliency, we propose a Kaleidoscope Video Quality Assessment (KVQ) framework, which aims to effectively assess both saliency and local texture, thereby facilitating the assessment of global quality. Our framework extracts visual saliency and allocates attention using Fusion-Window Attention (FWA) while incorporating a Local Perception Constraint (LPC) to mitigate the reliance of regional texture perception on neighboring areas. KVQ obtains significant improvements across multiple scenarios on five VQA benchmarks compared to SOTA methods. Furthermore, to assess local perception, we establish a new Local Perception Visual Quality (LPVQ) dataset with region-wise annotations. Experimental results demonstrate the capability of KVQ in perceiving local distortions. KVQ models and the LPVQ dataset will be available at https://github.com/qyp2000/KVQ.
中文: KVQ框架通过融合窗口注意力和局部感知约束,结合视觉显著性与局部纹理分析来提升视频质量评估,在多个基准测试中表现优异,并发布了用于局部感知评估的新数据集。
English: The KVQ framework enhances video quality assessment by integrating visual saliency and local texture analysis through Fusion-Window Attention and Local Perception Constraint, achieving superior performance on benchmarks and introducing a new dataset for local distortion evaluation.

Authors:Zhi Chen, Zecheng Zhao, Jingcai Guo, Jingjing Li, Zi Huang
Title: SVIP: Semantically Contextualized Visual Patches for Zero-Shot Learning
Abstract:
Zero-shot learning (ZSL) aims to recognize unseen classes without labeled training examples by leveraging class-level semantic descriptors such as attributes. A fundamental challenge in ZSL is semantic misalignment, where semantic-unrelated information involved in visual features introduce ambiguity to visual-semantic interaction. Unlike existing methods that suppress semantic-unrelated information post hoc either in the feature space or the model space, we propose addressing this issue at the input stage, preventing semantic-unrelated patches from propagating through the network. To this end, we introduce Semantically contextualized VIsual Patches (SVIP) for ZSL, a transformer-based framework designed to enhance visual-semantic alignment. Specifically, we propose a self-supervised patch selection mechanism that preemptively learns to identify semantic-unrelated patches in the input space. This is trained with the supervision from aggregated attention scores across all transformer layers, which estimate each patch's semantic score. As removing semantic-unrelated patches from the input sequence may disrupt object structure, we replace them with learnable patch embeddings. With initialization from word embeddings, we can ensure they remain semantically meaningful throughout feature extraction. Extensive experiments on ZSL benchmarks demonstrate that SVIP achieves state-of-the-art performance results while providing more interpretable and semantically rich feature representations. Code is available at https://github.com/uqzhichen/SVIP.
中文: SVIP是一种基于Transformer的框架,通过在输入阶段预先识别并替换语义无关的图像块来增强零样本学习中的视觉-语义对齐,实现了最先进的性能并提供了更具解释性的特征表示。
English: SVIP is a transformer-based framework that enhances zero-shot learning by preemptively identifying and replacing semantic-unrelated patches in the input stage, achieving state-of-the-art performance with more interpretable feature representations.

Authors:Zhijie Zhu, Lei Fan, Maurice Pagnucco, Yang Song
Title: Interpretable Image Classification via Non-parametric Part Prototype Learning
Abstract:
Classifying images with an interpretable decision-making process is a long-standing problem in computer vision. In recent years, Prototypical Part Networks has gained traction as an approach for self-explainable neural networks, due to their ability to mimic human visual reasoning by providing explanations based on prototypical object parts. However, the quality of the explanations generated by these methods leaves room for improvement, as the prototypes usually focus on repetitive and redundant concepts. Leveraging recent advances in prototype learning, we present a framework for part-based interpretable image classification that learns a set of semantically distinctive object parts for each class, and provides diverse and comprehensive explanations. The core of our method is to learn the part-prototypes in a non-parametric fashion, through clustering deep features extracted from foundation vision models that encode robust semantic information. To quantitatively evaluate the quality of explanations provided by ProtoPNets, we introduce Distinctiveness Score and Comprehensiveness Score. Through evaluation on CUB-200-2011, Stanford Cars and Stanford Dogs datasets, we show that our framework compares favourably against existing ProtoPNets while achieving better interpretability. Code is available at: https://github.com/zijizhu/proto-non-param.
中文: 本文提出了一种基于部件可解释图像分类框架,通过深度特征的非参数聚类学习具有语义区分度的物体部件,在提升解释多样性的同时,在可解释性评估指标上优于现有方法。
English: This paper introduces a framework for interpretable image classification that learns distinctive object parts through non-parametric clustering of deep features, providing diverse explanations while outperforming existing methods in interpretability metrics.

Authors:Julian Schelb, Orr Borin, David Garcia, Andreas Spitz
Title: R.U.Psycho? Robust Unified Psychometric Testing of Language Models
Abstract:
Generative language models are increasingly being subjected to psychometric questionnaires intended for human testing, in efforts to establish their traits, as benchmarks for alignment, or to simulate participants in social science experiments. While this growing body of work sheds light on the likeness of model responses to those of humans, concerns are warranted regarding the rigour and reproducibility with which these experiments may be conducted. Instabilities in model outputs, sensitivity to prompt design, parameter settings, and a large number of available model versions increase documentation requirements. Consequently, generalization of findings is often complex and reproducibility is far from guaranteed. In this paper, we present R.U.Psycho, a framework for designing and running robust and reproducible psychometric experiments on generative language models that requires limited coding expertise. We demonstrate the capability of our framework on a variety of psychometric questionnaires, which lend support to prior findings in the literature. R.U.Psycho is available as a Python package at https://github.com/julianschelb/rupsycho.
中文摘要:本文提出了R.U.Psycho框架,旨在提高生成语言模型心理测量实验的稳健性和可重复性,通过多种问卷验证了先前研究结果,并解决了输出不稳定性和提示敏感性等挑战。
English Summary: The paper introduces R.U.Psycho, a framework designed to enhance the robustness and reproducibility of psychometric experiments on generative language models, addressing challenges like output instability and prompt sensitivity while validating prior findings through various questionnaires.

Authors:Kaixiang Yang, Xin Li, Qiang Li, Zhiwei Wang
Title: CoStoDet-DDPM: Collaborative Training of Stochastic and Deterministic Models Improves Surgical Workflow Anticipation and Recognition
Abstract:
Anticipating and recognizing surgical workflows are critical for intelligent surgical assistance systems. However, existing methods rely on deterministic decision-making, struggling to generalize across the large anatomical and procedural variations inherent in real-world surgeries.In this paper, we introduce an innovative framework that incorporates stochastic modeling through a denoising diffusion probabilistic model (DDPM) into conventional deterministic learning for surgical workflow analysis. At the heart of our approach is a collaborative co-training paradigm: the DDPM branch captures procedural uncertainties to enrich feature representations, while the task branch focuses on predicting surgical phases and instrument usage.Theoretically, we demonstrate that this mutual refinement mechanism benefits both branches: the DDPM reduces prediction errors in uncertain scenarios, and the task branch directs the DDPM toward clinically meaningful representations. Notably, the DDPM branch is discarded during inference, enabling real-time predictions without sacrificing accuracy.Experiments on the Cholec80 dataset show that for the anticipation task, our method achieves a 16% reduction in eMAE compared to state-of-the-art approaches, and for phase recognition, it improves the Jaccard score by 1.0%. Additionally, on the AutoLaparo dataset, our method achieves a 1.5% improvement in the Jaccard score for phase recognition, while also exhibiting robust generalization to patient-specific variations. Our code and weight are available at https://github.com/kk42yy/CoStoDet-DDPM.
Chinese: 本文提出了一种结合去噪扩散概率模型(DDPM)随机建模与确定性学习的协同训练框架,显著提升了手术工作流分析的预测和识别性能,在基准数据集上实现了关键指标突破并保持实时推理能力。
English: This paper introduces a collaborative co-training framework that integrates stochastic modeling via a denoising diffusion probabilistic model (DDPM) with deterministic learning to enhance surgical workflow analysis, achieving significant improvements in anticipation and recognition tasks on benchmark datasets while ensuring real-time inference.

Authors:Boyu Chen, Zhengrong Yue, Siran Chen, Zikang Wang, Yang Liu, Peng Li, Yali Wang
Title: LVAgent: Long Video Understanding by Multi-Round Dynamical Collaboration of MLLM Agents
Abstract:
Existing MLLMs encounter significant challenges in modeling the temporal context within long videos. Currently, mainstream Agent-based methods use external tools to assist a single MLLM in answering long video questions. Despite such tool-based support, a solitary MLLM still offers only a partial understanding of long videos, resulting in limited performance. In order to better address long video tasks, we introduce LVAgent, the first framework enabling multi-round dynamic collaboration of MLLM agents in long video understanding. Our method consists of four key steps: 1) Selection: We pre-select appropriate agents from the model library to form optimal agent teams based on different tasks. 2) Perception: We design an effective retrieval scheme for long videos to improve the coverage of critical temporal segments while maintaining computational efficiency. 3) Action: Agents answer long video questions and exchange reasons. 4) Reflection: We evaluate each agent's performance in each round of discussion and optimize the agent team for dynamic collaboration. The agents iteratively refine their answers by multi-round dynamical collaboration of MLLM agents. LVAgent is the first agent system method that outperforms all closed-source models (like GPT-4o) and open-source models (like InternVL-2.5 and Qwen2-VL) in the long video understanding tasks. Our LVAgent achieves an accuracy of 80\% on four mainstream long video understanding tasks. Notably, LVAgent improves accuracy by 13.3\% on LongVideoBench. Code is available at https://github.com/64327069/LVAgent.
中文: 现有MLLM在长视频时序建模上存在局限,而LVAgent通过多智能体动态协作框架,在四大主流任务中实现80%准确率,并在LongVideoBench上提升13.3%,性能超越所有开源与闭源模型。
English: Current MLLMs struggle with temporal context in long videos, but LVAgent introduces a multi-agent collaboration framework that outperforms existing models by 13.3% on benchmarks through dynamic team optimization and iterative reasoning.

Authors:Jianheng Liu, Yunfei Wan, Bowen Wang, Chunran Zheng, Jiarong Lin, Fu Zhang
Title: GS-SDF: LiDAR-Augmented Gaussian Splatting and Neural SDF for Geometrically Consistent Rendering and Reconstruction
Abstract:
Digital twins are fundamental to the development of autonomous driving and embodied artificial intelligence. However, achieving high-granularity surface reconstruction and high-fidelity rendering remains a challenge. Gaussian splatting offers efficient photorealistic rendering but struggles with geometric inconsistencies due to fragmented primitives and sparse observational data in robotics applications. Existing regularization methods, which rely on render-derived constraints, often fail in complex environments. Moreover, effectively integrating sparse LiDAR data with Gaussian splatting remains challenging. We propose a unified LiDAR-visual system that synergizes Gaussian splatting with a neural signed distance field. The accurate LiDAR point clouds enable a trained neural signed distance field to offer a manifold geometry field. This motivates us to offer an SDF-based Gaussian initialization for physically grounded primitive placement and a comprehensive geometric regularization for geometrically consistent rendering and reconstruction. Experiments demonstrate superior reconstruction accuracy and rendering quality across diverse trajectories. To benefit the community, the codes are released at https://github.com/hku-mars/GS-SDF.
中文摘要:该研究提出的统一激光雷达视觉系统将高斯泼溅与神经符号距离场相结合,通过基于SDF的初始化和几何正则化方法,有效解决了自动驾驶应用中几何不一致的问题,实现了卓越的重建精度与渲染质量。
English Summary: The proposed unified LiDAR-visual system integrates Gaussian splatting with neural signed distance fields to overcome geometric inconsistencies in autonomous driving applications, achieving superior reconstruction and rendering through SDF-based initialization and geometric regularization.

Authors:Thomas Sanchez, Vladyslav Zalevskyi, Angeline Mihailov, Gerard Martí-Juan, Elisenda Eixarch, Andras Jakab, Vincent Dunet, Mériam Koob, Guillaume Auzias, Meritxell Bach Cuadra
Title: Automatic quality control in multi-centric fetal brain MRI super-resolution reconstruction
Abstract:
Quality control (QC) has long been considered essential to guarantee the reliability of neuroimaging studies. It is particularly important for fetal brain MRI, where acquisitions and image processing techniques are less standardized than in adult imaging. In this work, we focus on automated quality control of super-resolution reconstruction (SRR) volumes of fetal brain MRI, an important processing step where multiple stacks of thick 2D slices are registered together and combined to build a single, isotropic and artifact-free T2 weighted volume. We propose FetMRQC$_{SR}$, a machine-learning method that extracts more than 100 image quality metrics to predict image quality scores using a random forest model. This approach is well suited to a problem that is high dimensional, with highly heterogeneous data and small datasets. We validate FetMRQC$_{SR}$ in an out-of-domain (OOD) setting and report high performance (ROC AUC = 0.89), even when faced with data from an unknown site or SRR method. We also investigate failure cases and show that they occur in $45\%$ of the images due to ambiguous configurations for which the rating from the expert is arguable. These results are encouraging and illustrate how a non deep learning-based method like FetMRQC$_{SR}$ is well suited to this multifaceted problem. Our tool, along with all the code used to generate, train and evaluate the model are available at https://github.com/Medical-Image-Analysis-Laboratory/fetmrqc_sr/ .
中文: 本研究提出了FetMRQC$_{SR}$,一种利用100多个图像指标的机器学习工具,用于自动评估胎儿脑部超分辨率MRI重建的质量,在跨域测试中表现出色,有效应对了这一复杂领域的挑战。
English: This study introduces FetMRQC$_{SR}$, a machine-learning tool using over 100 image metrics to automatically assess the quality of super-resolution fetal brain MRI reconstructions, achieving high out-of-domain performance and addressing challenges in this complex field.

Authors:Zhenxuan Zeng, Qiao Wu, Xiyu Zhang, Lin Yuanbo Wu, Pei An, Jiaqi Yang, Ji Wang, Peng Wang
Title: Unlocking Generalization Power in LiDAR Point Cloud Registration
Abstract:
In real-world environments, a LiDAR point cloud registration method with robust generalization capabilities (across varying distances and datasets) is crucial for ensuring safety in autonomous driving and other LiDAR-based applications. However, current methods fall short in achieving this level of generalization. To address these limitations, we propose UGP, a pruned framework designed to enhance generalization power for LiDAR point cloud registration. The core insight in UGP is the elimination of cross-attention mechanisms to improve generalization, allowing the network to concentrate on intra-frame feature extraction. Additionally, we introduce a progressive self-attention module to reduce ambiguity in large-scale scenes and integrate Bird's Eye View (BEV) features to incorporate semantic information about scene elements. Together, these enhancements significantly boost the network's generalization performance. We validated our approach through various generalization experiments in multiple outdoor scenes. In cross-distance generalization experiments on KITTI and nuScenes, UGP achieved state-of-the-art mean Registration Recall rates of 94.5% and 91.4%, respectively. In cross-dataset generalization from nuScenes to KITTI, UGP achieved a state-of-the-art mean Registration Recall of 90.9%. Code will be available at https://github.com/peakpang/UGP.
中文: 提出的UGP框架通过消除交叉注意力机制并引入渐进式自注意力与鸟瞰图特征,显著提升了激光雷达点云配准的泛化能力,在跨距离和跨数据集实验中均取得了最优的召回率。
English: The proposed UGP framework enhances LiDAR point cloud registration generalization by removing cross-attention and incorporating progressive self-attention with BEV features, achieving state-of-the-art recall rates in cross-distance and cross-dataset experiments.

Authors:Linzuo Zhang, Yu Hu, Yang Deng, Feng Yu, Danping Zou
Title: Mapless Collision-Free Flight via MPC using Dual KD-Trees in Cluttered Environments
Abstract:
Collision-free flight in cluttered environments is a critical capability for autonomous quadrotors. Traditional methods often rely on detailed 3D map construction, trajectory generation, and tracking. However, this cascade pipeline can introduce accumulated errors and computational delays, limiting flight agility and safety. In this paper, we propose a novel method for enabling collision-free flight in cluttered environments without explicitly constructing 3D maps or generating and tracking collision-free trajectories. Instead, we leverage Model Predictive Control (MPC) to directly produce safe actions from sparse waypoints and point clouds from a depth camera. These sparse waypoints are dynamically adjusted online based on nearby obstacles detected from point clouds. To achieve this, we introduce a dual KD-Tree mechanism: the Obstacle KD-Tree quickly identifies the nearest obstacle for avoidance, while the Edge KD-Tree provides a robust initial guess for the MPC solver, preventing it from getting stuck in local minima during obstacle avoidance. We validate our approach through extensive simulations and real-world experiments. The results show that our approach significantly outperforms the mapping-based methods and is also superior to imitation learning-based methods, demonstrating reliable obstacle avoidance at up to 12 m/s in simulations and 6 m/s in real-world tests. Our method provides a simple and robust alternative to existing methods. The code is publicly available at https://github.com/SJTU-ViSYS-team/avoid-mpc.
中文摘要:本文提出了一种新颖的模型预测控制方法,通过深度相机点云和稀疏路径点直接生成安全动作,无需构建三维地图或规划轨迹,即可实现四旋翼无人机在复杂环境中的无碰撞飞行。
English Summary: This paper introduces a novel Model Predictive Control (MPC) method that enables autonomous quadrotors to achieve collision-free flight in cluttered environments by directly generating safe actions from sparse waypoints and depth camera point clouds, eliminating the need for explicit 3D mapping or trajectory planning.

Authors:Jinze Li, Yixing Xu, Haiduo Huang, Xuanwu Yin, Dong Li, Edith C. H. Ngai, Emad Barsoum
Title: Gumiho: A Hybrid Architecture to Prioritize Early Tokens in Speculative Decoding
Abstract:
Speculative decoding (SPD) aims to accelerate the auto-regressive token generation process of a target Large Language Model (LLM). Some approaches employ a draft model with multiple heads to predict a sequence of future tokens, where each head handles a token in the sequence. The target LLM verifies the predicted sequence and accepts aligned tokens, enabling efficient multi-token generation. However, existing methods assume that all tokens within a sequence are equally important, employing identical head structures and relying on a single-generation paradigm, either serial or parallel. To this end, we theoretically demonstrate that initial tokens in the draft sequence are more important than later ones. Building on this insight, we propose Gumiho, a hybrid model combining serial and parallel heads. Specifically, given the critical importance of early tokens, we employ a sophisticated Transformer architecture for the early draft heads in a serial configuration to improve accuracy. For later tokens, we utilize multiple lightweight MLP heads operating in parallel to enhance efficiency. By allocating more advanced model structures and longer running times to the early heads, Gumiho achieves improved overall performance. The experimental results demonstrate that our method outperforms existing approaches, fully validating its effectiveness.
中文: Gumiho是一种混合推测解码模型,结合了串行与并行头部,通过为早期令牌使用复杂Transformer提升准确性,后期令牌采用轻量级MLP提高效率,从而在性能上超越现有方法。
English: Gumiho is a hybrid speculative decoding model that combines serial and parallel heads, using a sophisticated Transformer for early tokens to boost accuracy and lightweight MLPs for later tokens to enhance efficiency, outperforming existing methods.

Authors:Yanfeng Li, Kahou Chan, Yue Sun, Chantong Lam, Tong Tong, Zitong Yu, Keren Fu, Xiaohong Liu, Tao Tan
Title: MoEdit: On Learning Quantity Perception for Multi-object Image Editing
Abstract:
Multi-object images are prevalent in various real-world scenarios, including augmented reality, advertisement design, and medical imaging. Efficient and precise editing of these images is critical for these applications. With the advent of Stable Diffusion (SD), high-quality image generation and editing have entered a new era. However, existing methods often struggle to consider each object both individually and part of the whole image editing, both of which are crucial for ensuring consistent quantity perception, resulting in suboptimal perceptual performance. To address these challenges, we propose MoEdit, an auxiliary-free multi-object image editing framework. MoEdit facilitates high-quality multi-object image editing in terms of style transfer, object reinvention, and background regeneration, while ensuring consistent quantity perception between inputs and outputs, even with a large number of objects. To achieve this, we introduce the Feature Compensation (FeCom) module, which ensures the distinction and separability of each object attribute by minimizing the in-between interlacing. Additionally, we present the Quantity Attention (QTTN) module, which perceives and preserves quantity consistency by effective control in editing, without relying on auxiliary tools. By leveraging the SD model, MoEdit enables customized preservation and modification of specific concepts in inputs with high quality. Experimental results demonstrate that our MoEdit achieves State-Of-The-Art (SOTA) performance in multi-object image editing. Data and codes will be available at https://github.com/Tear-kitty/MoEdit.
中文摘要:MoEdit是一种无需辅助工具的多对象图像编辑框架,通过特征补偿模块和数量注意力模块确保对象间的区分度与数量一致性,在多对象风格迁移和背景重建等任务中实现了最先进的编辑效果。
English Summary: MoEdit is a novel framework that enables high-quality multi-object image editing by introducing Feature Compensation and Quantity Attention modules to maintain object distinction and quantity consistency without auxiliary tools, achieving state-of-the-art performance.

Authors:Zecheng Zhao, Zhi Chen, Zi Huang, Shazia Sadiq, Tong Chen
Title: Continual Text-to-Video Retrieval with Frame Fusion and Task-Aware Routing
Abstract:
Text-to-Video Retrieval (TVR) aims to retrieve relevant videos based on textual queries. However, as video content evolves continuously, adapting TVR systems to new data remains a critical yet under-explored challenge. In this paper, we introduce the first benchmark for Continual Text-to-Video Retrieval (CTVR) to address the limitations of existing approaches. Current Pre-Trained Model (PTM)-based TVR methods struggle with maintaining model plasticity when adapting to new tasks, while existing Continual Learning (CL) methods suffer from catastrophic forgetting, leading to semantic misalignment between historical queries and stored video features. To address these two challenges, we propose FrameFusionMoE, a novel CTVR framework that comprises two key components: (1) the Frame Fusion Adapter (FFA), which captures temporal video dynamics while preserving model plasticity, and (2) the Task-Aware Mixture-of-Experts (TAME), which ensures consistent semantic alignment between queries across tasks and the stored video features. Thus, FrameFusionMoE enables effective adaptation to new video content while preserving historical text-video relevance to mitigate catastrophic forgetting. We comprehensively evaluate FrameFusionMoE on two benchmark datasets under various task settings. Results demonstrate that FrameFusionMoE outperforms existing CL and TVR methods, achieving superior retrieval performance with minimal degradation on earlier tasks when handling continuous video streams. Our code is available at: https://github.com/JasonCodeMaker/CTVR.
中文摘要:本文提出了一种名为FrameFusionMoE的新型连续文本-视频检索框架,通过其帧融合适配器和任务感知专家混合组件,在适应新视频内容的同时有效防止灾难性遗忘,相比现有方法展现出更优越的检索性能。
English Summary: This paper introduces a novel Continual Text-to-Video Retrieval framework called FrameFusionMoE, which effectively adapts to new video content while preventing catastrophic forgetting through its Frame Fusion Adapter and Task-Aware Mixture-of-Experts components, demonstrating superior performance over existing methods.

Authors:Shu-Xun Yang, Cunxiang Wang, Yidong Wang, Xiaotao Gu, Minlie Huang, Jie Tang
Title: StepMathAgent: A Step-Wise Agent for Evaluating Mathematical Processes through Tree-of-Error
Abstract:
Evaluating mathematical capabilities is critical for assessing the overall performance of large language models (LLMs). However, existing evaluation methods often focus solely on final answers, resulting in highly inaccurate and uninterpretable evaluation outcomes, as well as their failure to assess proof or open-ended problems. To address these issues, we propose a novel mathematical process evaluation agent based on Tree-of-Error, called StepMathAgent. This agent incorporates four internal core operations: logical step segmentation, step scoring, score aggregation and error tree generation, along with four external extension modules: difficulty calibration, simplicity evaluation, completeness validation and format assessment. Furthermore, we introduce StepMathBench, a benchmark comprising 1,000 step-divided process evaluation instances, derived from 200 high-quality math problems grouped by problem type, subject category and difficulty level. Experiments on StepMathBench show that our proposed StepMathAgent outperforms all state-of-the-art methods, demonstrating human-aligned evaluation preferences and broad applicability to various scenarios. Our data and code are available at https://github.com/SHU-XUN/StepMathAgent.
中文: 本文提出StepMathAgent这一基于错误树的新型数学过程评估框架,通过分析解题步骤而非仅关注最终答案来改进现有大语言模型评估方法的不足,实验表明其在StepMathBench基准测试中表现出优越性能且与人类评估标准高度一致。
English: This paper introduces StepMathAgent, a novel mathematical process evaluation framework using Tree-of-Error methodology to address limitations in current LLM assessments by analyzing solution steps rather than just final answers, with experiments demonstrating superior performance and human-aligned evaluation on the StepMathBench benchmark.

Authors:Yuheng Liang, Zheyu Wang, Feng Liu, Mingzhou Liu, Yu Yao
Title: Mamba-VA: A Mamba-based Approach for Continuous Emotion Recognition in Valence-Arousal Space
Abstract:
Continuous Emotion Recognition (CER) plays a crucial role in intelligent human-computer interaction, mental health monitoring, and autonomous driving. Emotion modeling based on the Valence-Arousal (VA) space enables a more nuanced representation of emotional states. However, existing methods still face challenges in handling long-term dependencies and capturing complex temporal dynamics. To address these issues, this paper proposes a novel emotion recognition model, Mamba-VA, which leverages the Mamba architecture to efficiently model sequential emotional variations in video frames. First, the model employs a Masked Autoencoder (MAE) to extract deep visual features from video frames, enhancing the robustness of temporal information. Then, a Temporal Convolutional Network (TCN) is utilized for temporal modeling to capture local temporal dependencies. Subsequently, Mamba is applied for long-sequence modeling, enabling the learning of global emotional trends. Finally, a fully connected (FC) layer performs regression analysis to predict continuous valence and arousal values. Experimental results on the Valence-Arousal (VA) Estimation task of the 8th competition on Affective Behavior Analysis in-the-wild (ABAW) demonstrate that the proposed model achieves valence and arousal scores of 0.5362 (0.5036) and 0.4310 (0.4119) on the validation (test) set, respectively, outperforming the baseline. The source code is available on GitHub:https://github.com/FreedomPuppy77/Charon.
中文: 本文提出Mamba-VA模型,通过结合掩码自编码器、时序卷积网络和Mamba架构,有效捕捉视频中情感识别的局部与全局时序依赖关系,在效价-唤醒度估计任务上取得了优于基准的性能表现。
English: This paper introduces Mamba-VA, a novel model that combines Masked Autoencoder, Temporal Convolutional Network, and Mamba architecture to effectively capture both local and global temporal dependencies for continuous emotion recognition in videos, achieving superior performance on valence-arousal estimation tasks.

Authors:Jiawei Zhang, Ziyuan Liu, Leon Yan, Gen Li, Yuantao Gu
Title: Improving Diffusion-based Inverse Algorithms under Few-Step Constraint via Learnable Linear Extrapolation
Abstract:
Diffusion models have demonstrated remarkable performance in modeling complex data priors, catalyzing their widespread adoption in solving various inverse problems. However, the inherently iterative nature of diffusion-based inverse algorithms often requires hundreds to thousands of steps, with performance degradation occurring under fewer steps which limits their practical applicability. While high-order diffusion ODE solvers have been extensively explored for efficient diffusion sampling without observations, their application to inverse problems remains underexplored due to the diverse forms of inverse algorithms and their need for repeated trajectory correction based on observations. To address this gap, we first introduce a canonical form that decomposes existing diffusion-based inverse algorithms into three modules to unify their analysis. Inspired by the linear subspace search strategy in the design of high-order diffusion ODE solvers, we propose the Learnable Linear Extrapolation (LLE) method, a lightweight approach that universally enhances the performance of any diffusion-based inverse algorithm that fits the proposed canonical form. Extensive experiments demonstrate consistent improvements of the proposed LLE method across multiple algorithms and tasks, indicating its potential for more efficient solutions and boosted performance of diffusion-based inverse algorithms with limited steps. Codes for reproducing our experiments are available at https://github.com/weigerzan/LLE_inverse_problem.
中文摘要:本文提出了一种规范形式来统一基于扩散的逆问题算法,并设计了可学习线性外推(LLE)方法,能够普遍提升这些算法在有限步数下的性能表现。
English Summary: This paper introduces a canonical form to unify diffusion-based inverse algorithms and proposes the Learnable Linear Extrapolation (LLE) method to universally enhance their performance with limited computational steps.

Authors:Zhen Qu, Xian Tao, Xinyi Gong, Shichen Qu, Qiyu Chen, Zhengtao Zhang, Xingang Wang, Guiguang Ding
Title: Bayesian Prompt Flow Learning for Zero-Shot Anomaly Detection
Abstract:
Recently, vision-language models (e.g. CLIP) have demonstrated remarkable performance in zero-shot anomaly detection (ZSAD). By leveraging auxiliary data during training, these models can directly perform cross-category anomaly detection on target datasets, such as detecting defects on industrial product surfaces or identifying tumors in organ tissues. Existing approaches typically construct text prompts through either manual design or the optimization of learnable prompt vectors. However, these methods face several challenges: 1) handcrafted prompts require extensive expert knowledge and trial-and-error; 2) single-form learnable prompts struggle to capture complex anomaly semantics; and 3) an unconstrained prompt space limits generalization to unseen categories. To address these issues, we propose Bayesian Prompt Flow Learning (Bayes-PFL), which models the prompt space as a learnable probability distribution from a Bayesian perspective. Specifically, a prompt flow module is designed to learn both image-specific and image-agnostic distributions, which are jointly utilized to regularize the text prompt space and improve the model's generalization on unseen categories. These learned distributions are then sampled to generate diverse text prompts, effectively covering the prompt space. Additionally, a residual cross-model attention (RCA) module is introduced to better align dynamic text embeddings with fine-grained image features. Extensive experiments on 15 industrial and medical datasets demonstrate our method's superior performance. The code is available at https://github.com/xiaozhen228/Bayes-PFL.
中文:近期视觉语言模型在零样本异常检测中表现突出,但面临提示设计的挑战;贝叶斯提示流学习通过将提示空间建模为可学习概率分布并引入残差跨模态注意力模块,在工业和医疗数据集上实现了卓越性能。
English: Recent vision-language models like CLIP show strong zero-shot anomaly detection capabilities but face challenges in prompt design, which Bayes-PFL addresses by modeling prompts as learnable distributions and introducing a residual attention module for better cross-modal alignment, achieving superior results across industrial and medical datasets.

Authors:Chunyi Li, Xiaozhe Li, Zicheng Zhang, Yuan Tian, Ziheng Jia, Xiaohong Liu, Xiongkuo Min, Jia Wang, Haodong Duan, Kai Chen, Guangtao Zhai
Title: Information Density Principle for MLLM Benchmarks
Abstract:
With the emergence of Multimodal Large Language Models (MLLMs), hundreds of benchmarks have been developed to ensure the reliability of MLLMs in downstream tasks. However, the evaluation mechanism itself may not be reliable. For developers of MLLMs, questions remain about which benchmark to use and whether the test results meet their requirements. Therefore, we propose a critical principle of Information Density, which examines how much insight a benchmark can provide for the development of MLLMs. We characterize it from four key dimensions: (1) Fallacy, (2) Difficulty, (3) Redundancy, (4) Diversity. Through a comprehensive analysis of more than 10,000 samples, we measured the information density of 19 MLLM benchmarks. Experiments show that using the latest benchmarks in testing can provide more insight compared to previous ones, but there is still room for improvement in their information density. We hope this principle can promote the development and application of future MLLM benchmarks. Project page: https://github.com/lcysyzxdxc/bench4bench
Chinese: 本研究提出信息密度原则,从四个维度评估多模态大语言模型基准的有效性,分析了19个基准后发现,尽管新基准提供更多洞见,但其信息密度仍有提升空间。
English: The study introduces the principle of Information Density to evaluate the effectiveness of Multimodal Large Language Model benchmarks, analyzing 19 benchmarks across four dimensions and finding that while newer ones offer more insights, their information density still needs enhancement.

Authors:Chunyi Li, Yuan Tian, Xiaoyue Ling, Zicheng Zhang, Haodong Duan, Haoning Wu, Ziheng Jia, Xiaohong Liu, Xiongkuo Min, Guo Lu, Weisi Lin, Guangtao Zhai
Title: Image Quality Assessment: From Human to Machine Preference
Abstract:
Image Quality Assessment (IQA) based on human subjective preferences has undergone extensive research in the past decades. However, with the development of communication protocols, the visual data consumption volume of machines has gradually surpassed that of humans. For machines, the preference depends on downstream tasks such as segmentation and detection, rather than visual appeal. Considering the huge gap between human and machine visual systems, this paper proposes the topic: Image Quality Assessment for Machine Vision for the first time. Specifically, we (1) defined the subjective preferences of machines, including downstream tasks, test models, and evaluation metrics; (2) established the Machine Preference Database (MPD), which contains 2.25M fine-grained annotations and 30k reference/distorted image pair instances; (3) verified the performance of mainstream IQA algorithms on MPD. Experiments show that current IQA metrics are human-centric and cannot accurately characterize machine preferences. We sincerely hope that MPD can promote the evolution of IQA from human to machine preferences. Project page is on: https://github.com/lcysyzxdxc/MPD.
中文摘要:本文首次提出面向机器视觉的图像质量评估主题,建立了包含大量标注的机器偏好数据库(MPD),验证了现有以人为中心的图像质量评估标准无法准确表征机器偏好,旨在推动该领域从人类偏好向机器偏好的演进。
English Summary: This paper introduces the concept of Image Quality Assessment for Machine Vision, establishing a Machine Preference Database (MPD) to evaluate IQA algorithms based on machine task performance rather than human visual appeal, revealing that current human-centric metrics fail to accurately capture machine preferences.

Authors:Xinran Ling, Chen Zhu, Meiqi Wu, Hangyu Li, Xiaokun Feng, Cundian Yang, Aiming Hao, Jiashu Zhu, Jiahong Wu, Xiangxiang Chu
Title: VMBench: A Benchmark for Perception-Aligned Video Motion Generation
Abstract:
Video generation has advanced rapidly, improving evaluation methods, yet assessing video's motion remains a major challenge. Specifically, there are two key issues: 1) current motion metrics do not fully align with human perceptions; 2) the existing motion prompts are limited. Based on these findings, we introduce VMBench--a comprehensive Video Motion Benchmark that has perception-aligned motion metrics and features the most diverse types of motion. VMBench has several appealing properties: 1) Perception-Driven Motion Evaluation Metrics, we identify five dimensions based on human perception in motion video assessment and develop fine-grained evaluation metrics, providing deeper insights into models' strengths and weaknesses in motion quality. 2) Meta-Guided Motion Prompt Generation, a structured method that extracts meta-information, generates diverse motion prompts with LLMs, and refines them through human-AI validation, resulting in a multi-level prompt library covering six key dynamic scene dimensions. 3) Human-Aligned Validation Mechanism, we provide human preference annotations to validate our benchmarks, with our metrics achieving an average 35.3% improvement in Spearman's correlation over baseline methods. This is the first time that the quality of motion in videos has been evaluated from the perspective of human perception alignment. Additionally, we will soon release VMBench at https://github.com/GD-AIGC/VMBench, setting a new standard for evaluating and advancing motion generation models.
中文:VMBench作为首个从人类感知角度评估视频运动质量的综合性基准,提出了感知对齐的运动指标和多样化运动提示生成方法,通过人工验证显著提升了评估相关性。
English: VMBench is introduced as a comprehensive video motion benchmark featuring perception-aligned metrics and diverse motion prompts to address current limitations in motion evaluation, validated by human preferences with significantly improved correlation.

Authors:Han Liu, Riqiang Gao, Sasa Grbic
Title: AI-assisted Early Detection of Pancreatic Ductal Adenocarcinoma on Contrast-enhanced CT
Abstract:
Pancreatic ductal adenocarcinoma (PDAC) is one of the most common and aggressive types of pancreatic cancer. However, due to the lack of early and disease-specific symptoms, most patients with PDAC are diagnosed at an advanced disease stage. Consequently, early PDAC detection is crucial for improving patients' quality of life and expanding treatment options. In this work, we develop a coarse-to-fine approach to detect PDAC on contrast-enhanced CT scans. First, we localize and crop the region of interest from the low-resolution images, and then segment the PDAC-related structures at a finer scale. Additionally, we introduce two strategies to further boost detection performance: (1) a data-splitting strategy for model ensembling, and (2) a customized post-processing function. We participated in the PANORAMA challenge and ranked 1st place for PDAC detection with an AUROC of 0.9263 and an AP of 0.7243. Our code and models are publicly available at https://github.com/han-liu/PDAC_detection.
Chinese: 本研究提出一种从粗到精的方法,通过模型集成和定制后处理策略,在CT扫描中检测胰腺导管腺癌(PDAC),并在PANORAMA挑战赛中取得最佳性能。
English: This study presents a coarse-to-fine approach for detecting pancreatic ductal adenocarcinoma (PDAC) on CT scans, achieving top performance in the PANORAMA challenge through model ensembling and customized post-processing.

Authors:Minje Kim, Minjun Kim, Xu Yang
Title: DTA: Dual Temporal-channel-wise Attention for Spiking Neural Networks
Abstract:
Spiking Neural Networks (SNNs) present a more energy-efficient alternative to Artificial Neural Networks (ANNs) by harnessing spatio-temporal dynamics and event-driven spikes. Effective utilization of temporal information is crucial for SNNs, leading to the exploration of attention mechanisms to enhance this capability. Conventional attention operations either apply identical operation or employ non-identical operations across target dimensions. We identify that these approaches provide distinct perspectives on temporal information. To leverage the strengths of both operations, we propose a novel Dual Temporal-channel-wise Attention (DTA) mechanism that integrates both identical/non-identical attention strategies. To the best of our knowledge, this is the first attempt to concentrate on both the correlation and dependency of temporal-channel using both identical and non-identical attention operations. Experimental results demonstrate that the DTA mechanism achieves state-of-the-art performance on both static datasets (CIFAR10, CIFAR100, ImageNet-1k) and dynamic dataset (CIFAR10-DVS), elevating spike representation and capturing complex temporal-channel relationship. We open-source our code: https://github.com/MnJnKIM/DTA-SNN.
中文: 提出的双时序通道注意力机制通过融合相同与非相同注意力操作,有效提升了脉冲表征能力并捕获复杂时序通道关系,在多个数据集上实现了最优性能。
English: The proposed Dual Temporal-channel-wise Attention (DTA) mechanism synergistically combines identical and non-identical attention operations to enhance spike representation and capture complex temporal-channel relationships, achieving state-of-the-art performance across multiple datasets.

Authors:Bharat Srikishan, Daniel O'Malley, Mohamed Mehana, Nicholas Lubbers, Nikhil Muralidhar
Title: Model-Agnostic Knowledge Guided Correction for Improved Neural Surrogate Rollout
Abstract:
Modeling the evolution of physical systems is critical to many applications in science and engineering. As the evolution of these systems is governed by partial differential equations (PDEs), there are a number of computational simulations which resolve these systems with high accuracy. However, as these simulations incur high computational costs, they are infeasible to be employed for large-scale analysis. A popular alternative to simulators are neural network surrogates which are trained in a data-driven manner and are much more computationally efficient. However, these surrogate models suffer from high rollout error when used autoregressively, especially when confronted with training data paucity. Existing work proposes to improve surrogate rollout error by either including physical loss terms directly in the optimization of the model or incorporating computational simulators as `differentiable layers' in the neural network. Both of these approaches have their challenges, with physical loss functions suffering from slow convergence for stiff PDEs and simulator layers requiring gradients which are not always available, especially in legacy simulators. We propose the Hybrid PDE Predictor with Reinforcement Learning (HyPER) model: a model-agnostic, RL based, cost-aware model which combines a neural surrogate, RL decision model, and a physics simulator (with or without gradients) to reduce surrogate rollout error significantly. In addition to reducing in-distribution rollout error by 47%-78%, HyPER learns an intelligent policy that is adaptable to changing physical conditions and resistant to noise corruption. Code available at https://github.com/scailab/HyPER.
中文:HyPER模型通过结合神经代理、强化学习和物理模拟器,显著降低了物理系统建模中的滚动误差,实现了47%-78%的误差减少,并能适应变化条件及抵抗噪声干扰。
English: The HyPER model combines neural surrogates with reinforcement learning and physics simulators to significantly reduce rollout errors in physical system modeling, achieving 47%-78% error reduction while adapting to changing conditions and resisting noise.

Authors:Wenjie Li, Heng Guo, Yuefeng Hou, Guangwei Gao, Zhanyu Ma
Title: Dual-domain Modulation Network for Lightweight Image Super-Resolution
Abstract:
Lightweight image super-resolution (SR) aims to reconstruct high-resolution images from low-resolution images under limited computational costs. We find that existing frequency-based SR methods cannot balance the reconstruction of overall structures and high-frequency parts. Meanwhile, these methods are inefficient for handling frequency features and unsuitable for lightweight SR. In this paper, we show that introducing both wavelet and Fourier information allows our model to consider both high-frequency features and overall SR structure reconstruction while reducing costs. Specifically, we propose a Dual-domain Modulation Network that integrates both wavelet and Fourier information for enhanced frequency modeling. Unlike existing methods that rely on a single frequency representation, our design combines wavelet-domain modulation via a Wavelet-domain Modulation Transformer (WMT) with global Fourier supervision, enabling complementary spectral learning well-suited for lightweight SR. Experimental results show that our method achieves a comparable PSNR to SRFormer and MambaIR while with less than 50\% and 60\% of their FLOPs and achieving inference speeds 15.4x and 5.4x faster, respectively, demonstrating the effectiveness of our method on SR quality and lightweight. Code link: https://github.com/24wenjie-li/DMNet
中文摘要:该研究提出的双域调制网络通过整合小波和傅里叶信息,在轻量化图像超分辨率任务中实现了高频细节与整体结构的平衡重建,在显著降低计算成本的同时获得了更优的恢复质量与推理速度。
English Summary: The proposed Dual-domain Modulation Network integrates wavelet and Fourier information to achieve superior lightweight image super-resolution, balancing high-frequency detail reconstruction with computational efficiency while outperforming existing methods in speed and resource usage.

Authors:Shiwon Kim, Dongjun Hwang, Sungwon Woo, Rita Singh
Title: Does Prior Data Matter? Exploring Joint Training in the Context of Few-Shot Class-Incremental Learning
Abstract:
Class-incremental learning (CIL) aims to adapt to continuously emerging new classes while preserving knowledge of previously learned ones. Few-shot class-incremental learning (FSCIL) presents a greater challenge that requires the model to learn new classes from only a limited number of samples per class. While incremental learning typically assumes restricted access to past data, it often remains available in many real-world scenarios. This raises a practical question: should one retrain the model on the full dataset (i.e., joint training), or continue updating it solely with new data? In CIL, joint training is considered an ideal benchmark that provides a reference for evaluating the trade-offs between performance and computational cost. However, in FSCIL, joint training becomes less reliable due to severe imbalance between base and incremental classes. This results in the absence of a practical baseline, making it unclear which strategy is preferable for practitioners. To this end, we revisit joint training in the context of FSCIL by incorporating imbalance mitigation techniques, and suggest a new imbalance-aware joint training benchmark for FSCIL. We then conduct extensive comparisons between this benchmark and FSCIL methods to analyze which approach is most suitable when prior data is accessible. Our analysis offers realistic insights and guidance for selecting training strategies in real-world FSCIL scenarios. Code is available at: https://github.com/shiwonkim/Joint_FSCIL
中文: 本研究通过引入不平衡缓解技术重新审视了少样本类增量学习中的联合训练,提出了新的基准来比较策略,并为实际场景中可获取历史数据时提供实用指导。
English: The study revisits joint training in few-shot class-incremental learning by incorporating imbalance mitigation techniques, proposing a new benchmark to compare strategies and provide practical guidance when prior data is accessible.

Authors:Shu Wang, Yanbo Gao, Shuai Li, Chong Lv, Xun Cai, Chuankun Li, Hui Yuan, Jinglin Zhang
Title: MetricGrids: Arbitrary Nonlinear Approximation with Elementary Metric Grids based Implicit Neural Representation
Abstract:
This paper presents MetricGrids, a novel grid-based neural representation that combines elementary metric grids in various metric spaces to approximate complex nonlinear signals. While grid-based representations are widely adopted for their efficiency and scalability, the existing feature grids with linear indexing for continuous-space points can only provide degenerate linear latent space representations, and such representations cannot be adequately compensated to represent complex nonlinear signals by the following compact decoder. To address this problem while keeping the simplicity of a regular grid structure, our approach builds upon the standard grid-based paradigm by constructing multiple elementary metric grids as high-order terms to approximate complex nonlinearities, following the Taylor expansion principle. Furthermore, we enhance model compactness with hash encoding based on different sparsities of the grids to prevent detrimental hash collisions, and a high-order extrapolation decoder to reduce explicit grid storage requirements. experimental results on both 2D and 3D reconstructions demonstrate the superior fitting and rendering accuracy of the proposed method across diverse signal types, validating its robustness and generalizability. Code is available at https://github.com/wangshu31/MetricGrids}{https://github.com/wangshu31/MetricGrids.
Chinese: 本文提出MetricGrids,一种基于网格的神经表示方法,通过组合多个基本度量网格来近似复杂非线性信号,并利用优化的哈希编码和高阶解码器在2D和3D重建中实现了卓越的精度。
English: This paper introduces MetricGrids, a grid-based neural representation that combines multiple elementary metric grids to approximate complex nonlinear signals, achieving superior accuracy in 2D and 3D reconstructions through enhanced hash encoding and a high-order decoder.

Authors:Zijian Zhao, Xuming Zhang, Jiayu Wen, Mingwen Liu, Xiaoteng Ma
Title: Label Unbalance in High-frequency Trading
Abstract:
In financial trading, return prediction is one of the foundation for a successful trading system. By the fast development of the deep learning in various areas such as graphical processing, natural language, it has also demonstrate significant edge in handling with financial data. While the success of the deep learning relies on huge amount of labeled sample, labeling each time/event as profitable or unprofitable, under the transaction cost, especially in the high-frequency trading world, suffers from serious label imbalance issue.In this paper, we adopts rigurious end-to-end deep learning framework with comprehensive label imbalance adjustment methods and succeed in predicting in high-frequency return in the Chinese future market. The code for our method is publicly available at https://github.com/RS2002/Label-Unbalance-in-High-Frequency-Trading .
中文: 本文采用严谨的端到端深度学习框架,结合全面的标签不平衡调整方法,成功实现了中国期货市场高频收益预测,解决了交易成本下标签不平衡的难题。
English: This paper introduces a rigorous end-to-end deep learning framework with comprehensive label imbalance adjustment methods to successfully predict high-frequency returns in the Chinese futures market, addressing the challenge of label imbalance under transaction costs.

Authors:Lin Tian, Sean I. Young, Jonathan Williams Ramirez, Dina Zemlyanker, Lucas Jacob Deden Binder, Rogeny Herisse, Theresa R. Connors, Derek H. Oakley, Bradley T. Hyman, Oula Puonti, Matthew S. Rosen, Juan Eugenio Iglesias
Title: Reference-Free 3D Reconstruction of Brain Dissection Photographs with Machine Learning
Abstract:
Correlation of neuropathology with MRI has the potential to transfer microscopic signatures of pathology to invivo scans. Recently, a classical registration method has been proposed, to build these correlations from 3D reconstructed stacks of dissection photographs, which are routinely taken at brain banks. These photographs bypass the need for exvivo MRI, which is not widely accessible. However, this method requires a full stack of brain slabs and a reference mask (e.g., acquired with a surface scanner), which severely limits the applicability of the technique. Here we propose RefFree, a dissection photograph reconstruction method without external reference. RefFree is a learning approach that estimates the 3D coordinates in the atlas space for every pixel in every photograph; simple least-squares fitting can then be used to compute the 3D reconstruction. As a by-product, RefFree also produces an atlas-based segmentation of the reconstructed stack. RefFree is trained on synthetic photographs generated from digitally sliced 3D MRI data, with randomized appearance for enhanced generalization ability. Experiments on simulated and real data show that RefFree achieves performance comparable to the baseline method without an explicit reference while also enabling reconstruction of partial stacks. Our code is available at https://github.com/lintian-a/reffree.
中文:RefFree方法无需外部参考即可实现脑解剖照片的三维重建,它通过基于合成MRI数据训练的学习方法达到相近性能,并能处理不完整的图像序列。
English: The proposed RefFree method enables 3D reconstruction of brain dissection photographs without requiring external references, using a learning approach trained on synthetic MRI data to achieve comparable performance while supporting partial stacks.

Authors:Jiayu Jiang, Changxing Ding, Wentao Tan, Junhong Wang, Jin Tao, Xiangmin Xu
Title: Modeling Thousands of Human Annotators for Generalizable Text-to-Image Person Re-identification
Abstract:
Text-to-image person re-identification (ReID) aims to retrieve the images of an interested person based on textual descriptions. One main challenge for this task is the high cost in manually annotating large-scale databases, which affects the generalization ability of ReID models. Recent works handle this problem by leveraging Multi-modal Large Language Models (MLLMs) to describe pedestrian images automatically. However, the captions produced by MLLMs lack diversity in description styles. To address this issue, we propose a Human Annotator Modeling (HAM) approach to enable MLLMs to mimic the description styles of thousands of human annotators. Specifically, we first extract style features from human textual descriptions and perform clustering on them. This allows us to group textual descriptions with similar styles into the same cluster. Then, we employ a prompt to represent each of these clusters and apply prompt learning to mimic the description styles of different human annotators. Furthermore, we define a style feature space and perform uniform sampling in this space to obtain more diverse clustering prototypes, which further enriches the diversity of the MLLM-generated captions. Finally, we adopt HAM to automatically annotate a massive-scale database for text-to-image ReID. Extensive experiments on this database demonstrate that it significantly improves the generalization ability of ReID models.
中文: 本文提出了一种人类标注者建模方法,通过聚类风格特征并应用提示学习,使多模态大语言模型能够模仿多样的人类描述风格,从而通过自动大规模数据库标注显著提升了文本到图像行人重识别模型的泛化能力。
English: This paper introduces a Human Annotator Modeling (HAM) approach that enables Multi-modal Large Language Models to mimic diverse human description styles by clustering style features and using prompt learning, thereby enhancing the generalization ability of text-to-image person re-identification models through automated large-scale database annotation.

Authors:Zhenyu Liu, Dongfang Li, Xinshuo Hu, Xinping Zhao, Yibin Chen, Baotian Hu, Min Zhang
Title: Take Off the Training Wheels Progressive In-Context Learning for Effective Alignment
Abstract:
Recent studies have explored the working mechanisms of In-Context Learning (ICL). However, they mainly focus on classification and simple generation tasks, limiting their broader application to more complex generation tasks in practice. To address this gap, we investigate the impact of demonstrations on token representations within the practical alignment tasks. We find that the transformer embeds the task function learned from demonstrations into the separator token representation, which plays an important role in the generation of prior response tokens. Once the prior response tokens are determined, the demonstrations become redundant.Motivated by this finding, we propose an efficient Progressive In-Context Alignment (PICA) method consisting of two stages. In the first few-shot stage, the model generates several prior response tokens via standard ICL while concurrently extracting the ICL vector that stores the task function from the separator token representation. In the following zero-shot stage, this ICL vector guides the model to generate responses without further demonstrations.Extensive experiments demonstrate that our PICA not only surpasses vanilla ICL but also achieves comparable performance to other alignment tuning methods. The proposed training-free method reduces the time cost (e.g., 5.45+) with improved alignment performance (e.g., 6.57+). Consequently, our work highlights the application of ICL for alignment and calls for a deeper understanding of ICL for complex generations. The code will be available at https://github.com/HITsz-TMG/PICA.
Chinese: 本研究提出渐进式上下文对齐方法(PICA),通过两阶段设计先从示例中提取任务函数再用于零样本生成,在降低时间成本的同时实现了比标准上下文学习更优的对齐性能。
English: This study introduces Progressive In-Context Alignment (PICA), a two-stage method that first extracts task functions from demonstrations and then uses them for zero-shot generation, achieving superior alignment performance while reducing time costs compared to standard in-context learning.

Authors:Yuanxin Liu, Rui Zhu, Shuhuai Ren, Jiacong Wang, Haoyuan Guo, Xu Sun, Lu Jiang
Title: UVE: Are MLLMs Unified Evaluators for AI-Generated Videos?
Abstract:
With the rapid growth of video generative models (VGMs), it is essential to develop reliable and comprehensive automatic metrics for AI-generated videos (AIGVs). Existing methods either use off-the-shelf models optimized for other tasks or rely on human assessment data to train specialized evaluators. These approaches are constrained to specific evaluation aspects and are difficult to scale with the increasing demands for finer-grained and more comprehensive evaluations. To address this issue, this work investigates the feasibility of using multimodal large language models (MLLMs) as a unified evaluator for AIGVs, leveraging their strong visual perception and language understanding capabilities. To evaluate the performance of automatic metrics in unified AIGV evaluation, we introduce a benchmark called UVE-Bench. UVE-Bench collects videos generated by state-of-the-art VGMs and provides pairwise human preference annotations across 15 evaluation aspects. Using UVE-Bench, we extensively evaluate 16 MLLMs. Our empirical results suggest that while advanced MLLMs (e.g., Qwen2VL-72B and InternVL2.5-78B) still lag behind human evaluators, they demonstrate promising ability in unified AIGV evaluation, significantly surpassing existing specialized evaluation methods. Additionally, we conduct an in-depth analysis of key design choices that impact the performance of MLLM-driven evaluators, offering valuable insights for future research on AIGV evaluation. The code is available at https://github.com/bytedance/UVE.
中文: 本研究提出采用多模态大语言模型作为AI生成视频的统一评估器,通过UVE-Bench基准测试证明其虽不及人工评估,但已显著超越现有专业评估方法。
English: This study proposes using multimodal large language models (MLLMs) as unified evaluators for AI-generated videos, introducing UVE-Bench benchmark to demonstrate their superior performance over specialized methods while still trailing human assessment.

Authors:Allison Andreyev
Title: Quantization for OpenAI's Whisper Models: A Comparative Analysis
Abstract:
Automated speech recognition (ASR) models have gained prominence for applications such as captioning, speech translation, and live transcription. This paper studies Whisper and two model variants: one optimized for live speech streaming and another for offline transcription. Notably, these models have been found to generate hallucinated content, reducing transcription reliability. Furthermore, larger model variants exhibit increased latency and pose challenges for deployment on resource-constrained devices. This study analyzes the similarities and differences between three Whisper models, qualitatively examining their distinct capabilities. Next, this study quantifies the impact of model quantization on latency and evaluates its viability for edge deployment. Using the open source LibriSpeech dataset, this paper evaluates the word error rate (WER) along with latency analysis of whispercpp using 3 quantization methods (INT4, INT5, INT8). Results show that quantization reduces latency by 19\% and model size by 45\%, while preserving transcription accuracy. These findings provide insights into the optimal use cases of different Whisper models and edge device deployment possibilities. All code, datasets, and implementation details are available in a public GitHub repository: https://github.com/allisonandreyev/WhisperQuantization.git
中文: 本研究分析了三种Whisper语音识别模型,发现量化技术在保持精度的同时显著降低了延迟和模型体积,为边缘设备部署提供了可行方案。
English: This study analyzes three Whisper ASR models, revealing that quantization techniques significantly reduce latency and model size while maintaining accuracy, offering practical solutions for edge device deployment.

Authors:Zahra Abbasiantaeb, Simon Lupart, Leif Azzopardi, Jeffery Dalton, Mohammad Aliannejadi
Title: Conversational Gold: Evaluating Personalized Conversational Search System using Gold Nuggets
Abstract:
The rise of personalized conversational search systems has been driven by advancements in Large Language Models (LLMs), enabling these systems to retrieve and generate answers for complex information needs. However, the automatic evaluation of responses generated by Retrieval Augmented Generation (RAG) systems remains an understudied challenge. In this paper, we introduce a new resource for assessing the retrieval effectiveness and relevance of response generated by RAG systems, using a nugget-based evaluation framework. Built upon the foundation of TREC iKAT 2023, our dataset extends to the TREC iKAT 2024 collection, which includes 17 conversations and 20,575 relevance passage assessments, together with 2,279 extracted gold nuggets, and 62 manually written gold answers from NIST assessors. While maintaining the core structure of its predecessor, this new collection enables a deeper exploration of generation tasks in conversational settings. Key improvements in iKAT 2024 include: (1) ``gold nuggets'' -- concise, essential pieces of information extracted from relevant passages of the collection -- which serve as a foundation for automatic response evaluation; (2) manually written answers to provide a gold standard for response evaluation; (3) unanswerable questions to evaluate model hallucination; (4) expanded user personas, providing richer contextual grounding; and (5) a transition from Personal Text Knowledge Base (PTKB) ranking to PTKB classification and selection. Built on this resource, we provide a framework for long-form answer generation evaluation, involving nuggets extraction and nuggets matching, linked to retrieval. This establishes a solid resource for advancing research in personalized conversational search and long-form answer generation. Our resources are publicly available at https://github.com/irlabamsterdam/CONE-RAG.
中文: 本文基于TREC iKAT 2024构建了新的评估资源,通过金块信息和人工撰写答案来评估检索增强生成系统的响应相关性,并针对对话搜索中的模型幻觉问题进行改进。
English: This paper introduces a new resource built on TREC iKAT 2024 for evaluating Retrieval Augmented Generation systems, featuring gold nuggets and manual answers to assess response relevance and combat model hallucination in conversational search.

Authors:Abhipsha Das, Nicholas Lourie, Siavash Golkar, Mariel Pettee
Title: What's In Your Field? Mapping Scientific Research with Knowledge Graphs and Large Language Models
Abstract:
The scientific literature's exponential growth makes it increasingly challenging to navigate and synthesize knowledge across disciplines. Large language models (LLMs) are powerful tools for understanding scientific text, but they fail to capture detailed relationships across large bodies of work. Unstructured approaches, like retrieval augmented generation, can sift through such corpora to recall relevant facts; however, when millions of facts influence the answer, unstructured approaches become cost prohibitive. Structured representations offer a natural complement -- enabling systematic analysis across the whole corpus. Recent work enhances LLMs with unstructured or semistructured representations of scientific concepts; to complement this, we try extracting structured representations using LLMs. By combining LLMs' semantic understanding with a schema of scientific concepts, we prototype a system that answers precise questions about the literature as a whole. Our schema applies across scientific fields and we extract concepts from it using only 20 manually annotated abstracts. To demonstrate the system, we extract concepts from 30,000 papers on arXiv spanning astrophysics, fluid dynamics, and evolutionary biology. The resulting database highlights emerging trends and, by visualizing the knowledge graph, offers new ways to explore the ever-growing landscape of scientific knowledge. Demo: abby101/surveyor-0 on HF Spaces. Code: https://github.com/chiral-carbon/kg-for-science.
中文摘要:科学文献的爆炸式增长使跨学科知识整合愈发困难,本研究通过将大语言模型与结构化概念框架相结合,构建了一个能够从数万篇论文中提取精确关联并揭示新兴趋势的知识图谱系统。
English Summary: The exponential growth of scientific literature challenges knowledge synthesis, but this work introduces a system combining large language models with structured concept schemas to extract precise relationships and emerging trends from thousands of papers across multiple disciplines.

Authors:Daniel Syomichev, Padmini Gopinath, Guang-Lin Wei, Eric Chang, Ian Gordon, Amanuel Seifu, Rahul Pemmaraju, Neehar Peri, James Purtilo
Title: QuickDraw: Fast Visualization, Analysis and Active Learning for Medical Image Segmentation
Abstract:
Analyzing CT scans, MRIs and X-rays is pivotal in diagnosing and treating diseases. However, detecting and identifying abnormalities from such medical images is a time-intensive process that requires expert analysis and is prone to interobserver variability. To mitigate such issues, machine learning-based models have been introduced to automate and significantly reduce the cost of image segmentation. Despite significant advances in medical image analysis in recent years, many of the latest models are never applied in clinical settings because state-of-the-art models do not easily interface with existing medical image viewers. To address these limitations, we propose QuickDraw, an open-source framework for medical image visualization and analysis that allows users to upload DICOM images and run off-the-shelf models to generate 3D segmentation masks. In addition, our tool allows users to edit, export, and evaluate segmentation masks to iteratively improve state-of-the-art models through active learning. In this paper, we detail the design of our tool and present survey results that highlight the usability of our software. Notably, we find that QuickDraw reduces the time to manually segment a CT scan from four hours to six minutes and reduces machine learning-assisted segmentation time by 10\% compared to prior work. Our code and documentation are available at https://github.com/qd-seg/quickdraw
中文:QuickDraw 是一个开源框架,支持医学影像可视化和分析,用户可上传 DICOM 图像运行模型生成 3D 分割掩码,并通过主动学习迭代优化结果,将 CT 扫描分割时间从四小时缩短至六分钟。
English: QuickDraw is an open-source framework that enables medical image visualization and analysis by allowing users to upload DICOM images, run models for 3D segmentation, and iteratively improve results through active learning, reducing CT scan segmentation time from four hours to six minutes.

Authors:Nahid Ul Islam, DongAo Ma, Jiaxuan Pang, Shivasakthi Senthil Velan, Michael Gotway, Jianming Liang
Title: Foundation X: Integrating Classification, Localization, and Segmentation through Lock-Release Pretraining Strategy for Chest X-ray Analysis
Abstract:
Developing robust and versatile deep-learning models is essential for enhancing diagnostic accuracy and guiding clinical interventions in medical imaging, but it requires a large amount of annotated data. The advancement of deep learning has facilitated the creation of numerous medical datasets with diverse expert-level annotations. Aggregating these datasets can maximize data utilization and address the inadequacy of labeled data. However, the heterogeneity of expert-level annotations across tasks such as classification, localization, and segmentation presents a significant challenge for learning from these datasets. To this end, we introduce nFoundation X, an end-to-end framework that utilizes diverse expert-level annotations from numerous public datasets to train a foundation model capable of multiple tasks including classification, localization, and segmentation. To address the challenges of annotation and task heterogeneity, we propose a Lock-Release pretraining strategy to enhance the cyclic learning from multiple datasets, combined with the student-teacher learning paradigm, ensuring the model retains general knowledge for all tasks while preventing overfitting to any single task. To demonstrate the effectiveness of Foundation X, we trained a model using 11 chest X-ray datasets, covering annotations for classification, localization, and segmentation tasks. Our experimental results show that Foundation X achieves notable performance gains through extensive annotation utilization, excels in cross-dataset and cross-task learning, and further enhances performance in organ localization and segmentation tasks. All code and pretrained models are publicly accessible at https://github.com/jlianglab/Foundation_X.
中文: nFoundation X框架采用锁定-释放预训练策略和师生学习模式,整合多个医学数据集的专家标注,训练出能同时胜任分类、定位和分割任务的基础模型,有效避免过拟合并提升跨数据集学习能力。
English: The nFoundation X framework leverages diverse expert annotations from multiple medical datasets through a Lock-Release pretraining strategy and student-teacher learning to train a versatile foundation model that excels in classification, localization, and segmentation tasks while preventing overfitting.

Authors:Benjamin Towle, Xin Chen, Ke Zhou
Title: SeqSAM: Autoregressive Multiple Hypothesis Prediction for Medical Image Segmentation using SAM
Abstract:
Pre-trained segmentation models are a powerful and flexible tool for segmenting images. Recently, this trend has extended to medical imaging. Yet, often these methods only produce a single prediction for a given image, neglecting inherent uncertainty in medical images, due to unclear object boundaries and errors caused by the annotation tool. Multiple Choice Learning is a technique for generating multiple masks, through multiple learned prediction heads. However, this cannot readily be extended to producing more outputs than its initial pre-training hyperparameters, as the sparse, winner-takes-all loss function makes it easy for one prediction head to become overly dominant, thus not guaranteeing the clinical relevancy of each mask produced. We introduce SeqSAM, a sequential, RNN-inspired approach to generating multiple masks, which uses a bipartite matching loss for ensuring the clinical relevancy of each mask, and can produce an arbitrary number of masks. We show notable improvements in quality of each mask produced across two publicly available datasets. Our code is available at https://github.com/BenjaminTowle/SeqSAM.
中文: SeqSAM提出了一种基于循环神经网络启发的顺序方法,通过双向匹配损失生成多个临床相关的医学图像分割掩码,在两个公开数据集上显著提升了掩码质量。
English: SeqSAM introduces a sequential, RNN-inspired method that generates multiple clinically relevant masks for medical image segmentation using a bipartite matching loss, demonstrating improved mask quality across datasets.

Authors:William L. Tong, Cengiz Pehlevan
Title: Learning richness modulates equality reasoning in neural networks
Abstract:
Equality reasoning is ubiquitous and purely abstract: sameness or difference may be evaluated no matter the nature of the underlying objects. As a result, same-different (SD) tasks have been extensively studied as a starting point for understanding abstract reasoning in humans and across animal species. With the rise of neural networks that exhibit striking apparent proficiency for abstractions, equality reasoning in these models has also gained interest. Yet despite extensive study, conclusions about equality reasoning vary widely and with little consensus. To clarify the underlying principles in learning SD tasks, we develop a theory of equality reasoning in multi-layer perceptrons (MLP). Following observations in comparative psychology, we propose a spectrum of behavior that ranges from conceptual to perceptual outcomes. Conceptual behavior is characterized by task-specific representations, efficient learning, and insensitivity to spurious perceptual details. Perceptual behavior is characterized by strong sensitivity to spurious perceptual details, accompanied by the need for exhaustive training to learn the task. We develop a mathematical theory to show that an MLP's behavior is driven by learning richness. Rich-regime MLPs exhibit conceptual behavior, whereas lazy-regime MLPs exhibit perceptual behavior. We validate our theoretical findings in vision SD experiments, showing that rich feature learning promotes success by encouraging hallmarks of conceptual behavior. Overall, our work identifies feature learning richness as a key parameter modulating equality reasoning, and suggests that equality reasoning in humans and animals may similarly depend on learning richness in neural circuits.
中文: 本研究提出一个理论,表明多层感知机在相等性推理中的行为受特征学习丰富度调控,可分为概念性行为和感知性行为,富学习模式实现概念化推理而惰性模式停留于感知层面,并通过视觉实验验证了这一发现。
English: Our study develops a theory showing that multi-layer perceptrons' equality reasoning behavior spans from conceptual to perceptual outcomes, driven by feature learning richness, with rich-regime models achieving conceptual behavior and lazy-regime ones remaining perceptual, validated through vision experiments.

Authors:Arman Zharmagambetov, Chuan Guo, Ivan Evtimov, Maya Pavlova, Ruslan Salakhutdinov, Kamalika Chaudhuri
Title: AgentDAM: Privacy Leakage Evaluation for Autonomous Web Agents
Abstract:
Autonomous AI agents that can follow instructions and perform complex multi-step tasks have tremendous potential to boost human productivity. However, to perform many of these tasks, the agents need access to personal information from their users, raising the question of whether they are capable of using it appropriately. In this work, we introduce a new benchmark AgentDAM that measures if AI web-navigation agents follow the privacy principle of ``data minimization''. For the purposes of our benchmark, data minimization means that the agent uses a piece of potentially sensitive information only if it is ``necessary'' to complete a particular task. Our benchmark simulates realistic web interaction scenarios end-to-end and is adaptable to all existing web navigation agents. We use AgentDAM to evaluate how well AI agents built on top of GPT-4, Llama-3 and Claude can limit processing of potentially private information, and show that they are prone to inadvertent use of unnecessary sensitive information. We also propose a prompting-based defense that reduces information leakage, and demonstrate that our end-to-end benchmarking provides a more realistic measure than probing LLMs about privacy. Our results highlight that further research is needed to develop AI agents that can prioritize data minimization at inference time.
中文: 本文提出了AgentDAM基准,用于评估AI网页导航代理是否遵循数据最小化原则,仅在必要时使用个人信息,发现当前模型即使采用提示防御措施减少泄露,仍常无意中滥用敏感数据。
English: This paper introduces AgentDAM, a benchmark to assess whether AI web-navigation agents adhere to data minimization by using personal information only when necessary, revealing that current models often inadvertently misuse sensitive data despite a proposed prompting defense that reduces leakage.

Authors:Arman Zharmagambetov, Chuan Guo, Ivan Evtimov, Maya Pavlova, Ruslan Salakhutdinov, Kamalika Chaudhuri
Title: AgentDAM: Privacy Leakage Evaluation for Autonomous Web Agents
Abstract:
Autonomous AI agents that can follow instructions and perform complex multi-step tasks have tremendous potential to boost human productivity. However, to perform many of these tasks, the agents need access to personal information from their users, raising the question of whether they are capable of using it appropriately. In this work, we introduce a new benchmark AgentDAM that measures if AI web-navigation agents follow the privacy principle of ``data minimization''. For the purposes of our benchmark, data minimization means that the agent uses a piece of potentially sensitive information only if it is ``necessary'' to complete a particular task. Our benchmark simulates realistic web interaction scenarios end-to-end and is adaptable to all existing web navigation agents. We use AgentDAM to evaluate how well AI agents built on top of GPT-4, Llama-3 and Claude can limit processing of potentially private information, and show that they are prone to inadvertent use of unnecessary sensitive information. We also propose a prompting-based defense that reduces information leakage, and demonstrate that our end-to-end benchmarking provides a more realistic measure than probing LLMs about privacy. Our results highlight that further research is needed to develop AI agents that can prioritize data minimization at inference time.
中文: 本文提出了AgentDAM基准,用于评估AI网页导航代理是否遵循数据最小化原则,仅在必要时使用个人信息,发现当前模型即使采用提示防御措施减少泄露,仍常无意中滥用敏感数据。
English: This paper introduces AgentDAM, a benchmark to assess whether AI web-navigation agents adhere to data minimization by using personal information only when necessary, revealing that current models often inadvertently misuse sensitive data despite a proposed prompting defense that reduces leakage.

Authors:Tairan Xu, Leyang Xue, Zhan Lu, Adrian Jackson, Luo Mai
Title: MoE-Gen: High-Throughput MoE Inference on a Single GPU with Module-Based Batching
Abstract:
This paper presents MoE-Gen, a high-throughput MoE inference system optimized for single-GPU execution. Existing inference systems rely on model-based or continuous batching strategies, originally designed for interactive inference, which result in excessively small batches for MoE's key modules-attention and expert modules-leading to poor throughput. To address this, we introduce module-based batching, which accumulates tokens in host memory and dynamically launches large batches on GPUs to maximize utilization. Additionally, we optimize the choice of batch sizes for each module in an MoE to fully overlap GPU computation and communication, maximizing throughput. Evaluation demonstrates that MoE-Gen achieves 8-31x higher throughput compared to state-of-the-art systems employing model-based batching (FlexGen, MoE-Lightning, DeepSpeed), and offers even greater throughput improvements over continuous batching systems (e.g., vLLM and Ollama) on popular MoE models (DeepSeek and Mixtral) across offline inference tasks. MoE-Gen's source code is publicly available at https://github.com/EfficientMoE/MoE-Gen
中文: MoE-Gen提出模块化批处理系统,通过动态启动大批次操作优化GPU利用率,在MoE模型上实现比现有推理系统高8-31倍的吞吐量。
English: MoE-Gen introduces a module-based batching system that optimizes GPU utilization by dynamically launching large batches, achieving 8-31x higher throughput than existing inference systems on MoE models.

Authors:Hongyu Lin, Yuchen Li, Haoran Luo, Kaichun Yao, Libo Zhang, Mingjie Xing, Yanjun Wu
Title: BYOS: Knowledge-driven Large Language Models Bring Your Own Operating System More Excellent
Abstract:
Operating System (OS) kernel tuning involves systematically adjusting kernel configurations to optimize system performance. Despite recent advancements in large language models (LLMs), kernel tuning remains a critical challenge due to: (1) the semantic gap between abstract tuning objective and concrete config options, (2) insufficient environmental interaction induces LLM hallucinations, and (3) the rapid evolution of kernel versions. To address these challenges, we propose BYOS, a LLM-powered framework that automates kernel tuning through three key innovations: structured knowledge construction and mapping, knowledge-driven configuration generation, and continuous knowledge maintenance. Extensive experiments show that BYOS achieves 7.1%-155.4% performance improvements over default configurations across standard OS benchmarks and real-world applications, demonstrating structured knowledge representation can overcome key limitations of pure LLM solutions for system optimization. Our code is available at https://github.com/LHY-24/BYOS.
中文:BYOS是一个基于大语言模型的框架,通过结构化知识构建与映射、知识驱动的配置生成及持续维护,解决了内核调优中的语义鸿沟、幻觉问题和版本适配难题,在基准测试中实现了较默认配置7.1%-155.4%的性能提升。
English: BYOS is an LLM-powered framework that automates kernel tuning by addressing the semantic gap, reducing hallucinations, and adapting to kernel evolution through structured knowledge, achieving significant performance improvements over default configurations.

Authors:Shitong Shao, Zikai Zhou, Dian Xie, Yuetong Fang, Tian Ye, Lichen Bai, Zeke Xie
Title: CoRe^2: Collect, Reflect and Refine to Generate Better and Faster
Abstract:
Making text-to-image (T2I) generative model sample both fast and well represents a promising research direction. Previous studies have typically focused on either enhancing the visual quality of synthesized images at the expense of sampling efficiency or dramatically accelerating sampling without improving the base model's generative capacity. Moreover, nearly all inference methods have not been able to ensure stable performance simultaneously on both diffusion models (DMs) and visual autoregressive models (ARMs). In this paper, we introduce a novel plug-and-play inference paradigm, CoRe^2, which comprises three subprocesses: Collect, Reflect, and Refine. CoRe^2 first collects classifier-free guidance (CFG) trajectories, and then use collected data to train a weak model that reflects the easy-to-learn contents while reducing number of function evaluations during inference by half. Subsequently, CoRe^2 employs weak-to-strong guidance to refine the conditional output, thereby improving the model's capacity to generate high-frequency and realistic content, which is difficult for the base model to capture. To the best of our knowledge, CoRe^2 is the first to demonstrate both efficiency and effectiveness across a wide range of DMs, including SDXL, SD3.5, and FLUX, as well as ARMs like LlamaGen. It has exhibited significant performance improvements on HPD v2, Pick-of-Pic, Drawbench, GenEval, and T2I-Compbench. Furthermore, CoRe^2 can be seamlessly integrated with the state-of-the-art Z-Sampling, outperforming it by 0.3 and 0.16 on PickScore and AES, while achieving 5.64s time saving using SD3.5.Code is released at https://github.com/xie-lab-ml/CoRe/tree/main.
中文: 本文提出CoRe^2这一即插即用推理方法,通过减少函数评估次数并采用弱到强指导来提升文本到图像模型的输出质量,在多种扩散模型和自回归模型上均实现了高效与高性能的统一。
English: This paper introduces CoRe^2, a plug-and-play inference method that enhances text-to-image models by reducing function evaluations and improving output quality through weak-to-strong guidance, achieving efficiency and effectiveness across various diffusion and autoregressive models.

Authors:David P. Hofmeyr
Title: Bags of Projected Nearest Neighbours: Competitors to Random Forests?
Abstract:
In this paper we introduce a simple and intuitive adaptive k nearest neighbours classifier, and explore its utility within the context of bootstrap aggregating ("bagging"). The approach is based on finding discriminant subspaces which are computationally efficient to compute, and are motivated by enhancing the discrimination of classes through nearest neighbour classifiers. This adaptiveness promotes diversity of the individual classifiers fit across different bootstrap samples, and so further leverages the variance reducing effect of bagging. Extensive experimental results are presented documenting the strong performance of the proposed approach in comparison with Random Forest classifiers, as well as other nearest neighbours based ensembles from the literature, plus other relevant benchmarks. Code to implement the proposed approach is available in the form of an R package from https://github.com/DavidHofmeyr/BOPNN.
Chinese: 本文提出了一种简单直观的自适应k近邻分类器,通过计算高效的判别子空间增强类别区分度,在自助聚合中提升分类器多样性以超越随机森林及其他基准方法,并提供了可用的R软件包实现。
English: This paper presents a simple adaptive k-nearest neighbors classifier that enhances class discrimination through computationally efficient discriminant subspaces, promoting classifier diversity in bootstrap aggregation to outperform Random Forest and other benchmarks, with an available R package for implementation.

Authors:Xiangyu Peng, Zangwei Zheng, Chenhui Shen, Tom Young, Xinying Guo, Binluo Wang, Hang Xu, Hongxin Liu, Mingyan Jiang, Wenjun Li, Yuhui Wang, Anbang Ye, Gang Ren, Qianran Ma, Wanying Liang, Xiang Lian, Xiwen Wu, Yuting Zhong, Zhuangyan Li, Chaoyu Gong, Guojun Lei, Leijun Cheng, Limin Zhang, Minghao Li, Ruijie Zhang, Silan Hu, Shijie Huang, Xiaokang Wang, Yuanheng Zhao, Yuqi Wang, Ziang Wei, Yang You
Title: Open-Sora 2.0: Training a Commercial-Level Video Generation Model in $200k
Abstract:
Video generation models have achieved remarkable progress in the past year. The quality of AI video continues to improve, but at the cost of larger model size, increased data quantity, and greater demand for training compute. In this report, we present Open-Sora 2.0, a commercial-level video generation model trained for only $200k. With this model, we demonstrate that the cost of training a top-performing video generation model is highly controllable. We detail all techniques that contribute to this efficiency breakthrough, including data curation, model architecture, training strategy, and system optimization. According to human evaluation results and VBench scores, Open-Sora 2.0 is comparable to global leading video generation models including the open-source HunyuanVideo and the closed-source Runway Gen-3 Alpha. By making Open-Sora 2.0 fully open-source, we aim to democratize access to advanced video generation technology, fostering broader innovation and creativity in content creation. All resources are publicly available at: https://github.com/hpcaitech/Open-Sora.
中文:Open-Sora 2.0以仅20万美元的成本实现了商业级视频生成,通过全开源技术使先进视频创作技术平民化,其性能媲美全球领先模型。
English: Open-Sora 2.0 demonstrates that high-quality video generation can be achieved with controlled costs of just $200k, matching leading models through optimized techniques while being fully open-source to democratize advanced video creation technology.

Authors:Nannan Wu, Zhuo Kuang, Zengqiang Yan, Ping Wang, Li Yu
Title: Fair Federated Medical Image Classification Against Quality Shift via Inter-Client Progressive State Matching
Abstract:
Despite the potential of federated learning in medical applications, inconsistent imaging quality across institutions-stemming from lower-quality data from a minority of clients-biases federated models toward more common high-quality images. This raises significant fairness concerns. Existing fair federated learning methods have demonstrated some effectiveness in solving this problem by aligning a single 0th- or 1st-order state of convergence (e.g., training loss or sharpness). However, we argue in this work that fairness based on such a single state is still not an adequate surrogate for fairness during testing, as these single metrics fail to fully capture the convergence characteristics, making them suboptimal for guiding fair learning. To address this limitation, we develop a generalized framework. Specifically, we propose assessing convergence using multiple states, defined as sharpness or perturbed loss computed at varying search distances. Building on this comprehensive assessment, we propose promoting fairness for these states across clients to achieve our ultimate fairness objective. This is accomplished through the proposed method, FedISM+. In FedISM+, the search distance evolves over time, progressively focusing on different states. We then incorporate two components in local training and global aggregation to ensure cross-client fairness for each state. This gradually makes convergence equitable for all states, thereby improving fairness during testing. Our empirical evaluations, performed on the well-known RSNA ICH and ISIC 2019 datasets, demonstrate the superiority of FedISM+ over existing state-of-the-art methods for fair federated learning. The code is available at https://github.com/wnn2000/FFL4MIA.
中文: 提出的FedISM+框架通过评估多个收敛状态并动态调整搜索距离来解决联邦学习中的公平性问题,在医学影像数据集上相比现有方法展现出更优性能。
English: The proposed FedISM+ framework addresses fairness limitations in federated learning by evaluating multiple convergence states across clients and dynamically adjusting search distances during training, demonstrating superior performance on medical imaging datasets compared to existing methods.

Authors:Philippe Chlenski, Kaizhu Du, Dylan Satow, Raiyan R. Khan, Itsik Pe'er
Title: Manify: A Python Library for Learning Non-Euclidean Representations
Abstract:
We present Manify, an open-source Python library for non-Euclidean representation learning. Leveraging manifold learning techniques, Manify provides tools for learning embeddings in (products of) non-Euclidean spaces, performing classification and regression with data that lives in such spaces, estimating the curvature of a manifold, and more. Manify aims to advance research and applications in machine learning by offering a comprehensive suite of tools for manifold-based data analysis. Our source code, examples, and documentation are available at https://github.com/pchlenski/manify.
中文: Manify 是一个开源 Python 库,利用流形学习技术在非欧几里得空间中进行嵌入学习、分类、回归和曲率估计等任务,旨在推动机器学习的研究与应用。
English: Manify is an open-source Python library that uses manifold learning to create embeddings in non-Euclidean spaces and perform tasks like classification, regression, and curvature estimation, advancing machine learning research and applications.

Authors:Marianne Arriola, Aaron Gokaslan, Justin T. Chiu, Zhihan Yang, Zhixuan Qi, Jiaqi Han, Subham Sekhar Sahoo, Volodymyr Kuleshov
Title: Block Diffusion: Interpolating Between Autoregressive and Diffusion Language Models
Abstract:
Diffusion language models offer unique benefits over autoregressive models due to their potential for parallelized generation and controllability, yet they lag in likelihood modeling and are limited to fixed-length generation. In this work, we introduce a class of block diffusion language models that interpolate between discrete denoising diffusion and autoregressive models. Block diffusion overcomes key limitations of both approaches by supporting flexible-length generation and improving inference efficiency with KV caching and parallel token sampling. We propose a recipe for building effective block diffusion models that includes an efficient training algorithm, estimators of gradient variance, and data-driven noise schedules to minimize the variance. Block diffusion sets a new state-of-the-art performance among diffusion models on language modeling benchmarks and enables generation of arbitrary-length sequences. We provide the code, along with the model weights and blog post on the project page: https://m-arriola.com/bd3lms
中文: 块扩散语言模型通过支持灵活长度生成和提升推理效率,弥补了扩散与自回归模型间的不足,在语言建模基准测试中达到了最新最优性能。
English: Block diffusion language models bridge the gap between diffusion and autoregressive approaches by enabling flexible-length generation and enhanced inference efficiency, achieving state-of-the-art performance in language modeling.

Authors:Qiguang Chen, Libo Qin, Jinhao Liu, Dengyun Peng, Jiannan Guan, Peng Wang, Mengkang Hu, Yuhang Zhou, Te Gao, Wanxiang Che
Title: Towards Reasoning Era: A Survey of Long Chain-of-Thought for Reasoning Large Language Models
Abstract:
Recent advancements in reasoning with large language models (RLLMs), such as OpenAI-O1 and DeepSeek-R1, have demonstrated their impressive capabilities in complex domains like mathematics and coding. A central factor in their success lies in the application of long chain-of-thought (Long CoT) characteristics, which enhance reasoning abilities and enable the solution of intricate problems. However, despite these developments, a comprehensive survey on Long CoT is still lacking, limiting our understanding of its distinctions from traditional short chain-of-thought (Short CoT) and complicating ongoing debates on issues like "overthinking" and "inference-time scaling." This survey seeks to fill this gap by offering a unified perspective on Long CoT. (1) We first distinguish Long CoT from Short CoT and introduce a novel taxonomy to categorize current reasoning paradigms. (2) Next, we explore the key characteristics of Long CoT: deep reasoning, extensive exploration, and feasible reflection, which enable models to handle more complex tasks and produce more efficient, coherent outcomes compared to the shallower Short CoT. (3) We then investigate key phenomena such as the emergence of Long CoT with these characteristics, including overthinking, and inference-time scaling, offering insights into how these processes manifest in practice. (4) Finally, we identify significant research gaps and highlight promising future directions, including the integration of multi-modal reasoning, efficiency improvements, and enhanced knowledge frameworks. By providing a structured overview, this survey aims to inspire future research and further the development of logical reasoning in artificial intelligence.
中文: 近期大语言模型推理能力的进步依赖于长思维链解决复杂任务,本综述通过区分长短思维链、探讨其关键特性、过度思考等现象及未来方向,填补研究空白,推动人工智能推理发展。
English: Recent advances in reasoning with large language models (RLLMs) leverage long chain-of-thought (Long CoT) to solve complex tasks, and this survey addresses the lack of comprehensive research by distinguishing Long CoT from short CoT, exploring its key traits, phenomena like overthinking, and future directions to advance AI reasoning.

Authors:Zak Buzzard
Title: PairVDN - Pair-wise Decomposed Value Functions
Abstract:
Extending deep Q-learning to cooperative multi-agent settings is challenging due to the exponential growth of the joint action space, the non-stationary environment, and the credit assignment problem. Value decomposition allows deep Q-learning to be applied at the joint agent level, at the cost of reduced expressivity. Building on past work in this direction, our paper proposes PairVDN, a novel method for decomposing the value function into a collection of pair-wise, rather than per-agent, functions, improving expressivity at the cost of requiring a more complex (but still efficient) dynamic programming maximisation algorithm. Our method enables the representation of value functions which cannot be expressed as a monotonic combination of per-agent functions, unlike past approaches such as VDN and QMIX. We implement a novel many-agent cooperative environment, Box Jump, and demonstrate improved performance over these baselines in this setting. We open-source our code and environment at https://github.com/zzbuzzard/PairVDN.
中文摘要:本文提出PairVDN方法,通过成对价值函数分解改进多智能体深度Q学习,突破了传统单调组合方法的表达能力限制,并在新型合作环境中展现出优于基线模型的性能。
English Summary: The paper introduces PairVDN, a novel value decomposition method that enhances expressivity in multi-agent deep Q-learning by using pair-wise functions, overcoming limitations of prior approaches and demonstrating superior performance in a new cooperative environment.

Authors:Bowen Jin, Hansi Zeng, Zhenrui Yue, Jinsung Yoon, Sercan Arik, Dong Wang, Hamed Zamani, Jiawei Han
Title: Search-R1: Training LLMs to Reason and Leverage Search Engines with Reinforcement Learning
Abstract:
Efficiently acquiring external knowledge and up-to-date information is essential for effective reasoning and text generation in large language models (LLMs). Prompting advanced LLMs with reasoning capabilities to use search engines during inference is often suboptimal, as the LLM might not fully possess the capability on how to interact optimally with the search engine. This paper introduces Search-R1, an extension of reinforcement learning (RL) for reasoning frameworks where the LLM learns to autonomously generate (multiple) search queries during step-by-step reasoning with real-time retrieval. Search-R1 optimizes LLM reasoning trajectories with multi-turn search interactions, leveraging retrieved token masking for stable RL training and a simple outcome-based reward function. Experiments on seven question-answering datasets show that Search-R1 improves performance by 41% (Qwen2.5-7B) and 20% (Qwen2.5-3B) over various RAG baselines under the same setting. This paper further provides empirical insights into RL optimization methods, LLM choices, and response length dynamics in retrieval-augmented reasoning. The code and model checkpoints are available at https://github.com/PeterGriffinJin/Search-R1.
中文: 本文提出Search-R1强化学习框架,使大语言模型能在逐步推理过程中自主生成搜索查询,在多个数据集上相比检索增强生成基线实现了最高达41%的性能提升。
English: This paper presents Search-R1, a reinforcement learning framework that enables large language models to autonomously generate search queries during reasoning, achieving significant performance improvements of up to 41% over retrieval-augmented generation baselines across multiple datasets.

Authors:Ziyu Wan, Yunxiang Li, Xiaoyu Wen, Yan Song, Hanjing Wang, Linyi Yang, Mark Schmidt, Jun Wang, Weinan Zhang, Shuyue Hu, Ying Wen
Title: ReMA: Learning to Meta-think for LLMs with Multi-Agent Reinforcement Learning
Abstract:
Recent research on Reasoning of Large Language Models (LLMs) has sought to further enhance their performance by integrating meta-thinking -- enabling models to monitor, evaluate, and control their reasoning processes for more adaptive and effective problem-solving. However, current single-agent work lacks a specialized design for acquiring meta-thinking, resulting in low efficacy. To address this challenge, we introduce Reinforced Meta-thinking Agents (ReMA), a novel framework that leverages Multi-Agent Reinforcement Learning (MARL) to elicit meta-thinking behaviors, encouraging LLMs to think about thinking. ReMA decouples the reasoning process into two hierarchical agents: a high-level meta-thinking agent responsible for generating strategic oversight and plans, and a low-level reasoning agent for detailed executions. Through iterative reinforcement learning with aligned objectives, these agents explore and learn collaboration, leading to improved generalization and robustness. Empirical results from single-turn experiments demonstrate that ReMA outperforms single-agent RL baselines on complex reasoning tasks, including competitive-level mathematical benchmarks and LLM-as-a-Judge benchmarks. Additionally, we further extend ReMA to multi-turn interaction settings, leveraging turn-level ratio and parameter sharing to improve efficiency. Comprehensive ablation studies further illustrate the evolving dynamics of each distinct agent, providing valuable insights into how the meta-thinking reasoning process enhances the reasoning capabilities of LLMs. Our code can be found in https://github.com/ziyuwan/ReMA-public
中文: ReMA框架通过多智能体强化学习将元思考与推理执行分离,有效提升大语言模型在复杂任务中的表现,并通过分层协作机制增强泛化能力。
English: The ReMA framework introduces a multi-agent reinforcement learning approach to enhance large language models' reasoning by decoupling meta-thinking and execution, significantly improving performance on complex tasks.

Authors:Nazanin Moradinasab, Saurav Sengupta, Jiebei Liu, Sana Syed, Donald E. Brown
Title: Towards Robust Multimodal Representation: A Unified Approach with Adaptive Experts and Alignment
Abstract:
Healthcare relies on multiple types of data, such as medical images, genetic information, and clinical records, to improve diagnosis and treatment. However, missing data is a common challenge due to privacy restrictions, cost, and technical issues, making many existing multi-modal models unreliable. To address this, we propose a new multi-model model called Mixture of Experts, Symmetric Aligning, and Reconstruction (MoSARe), a deep learning framework that handles incomplete multimodal data while maintaining high accuracy. MoSARe integrates expert selection, cross-modal attention, and contrastive learning to improve feature representation and decision-making. Our results show that MoSARe outperforms existing models in situations when the data is complete. Furthermore, it provides reliable predictions even when some data are missing. This makes it especially useful in real-world healthcare settings, including resource-limited environments. Our code is publicly available at https://github.com/NazaninMn/MoSARe.
中文: 提出的MoSARe模型通过专家选择和跨模态学习,能有效处理不完整的多模态医疗数据,在数据缺失时仍保持高精度并优于现有方法。
English: The proposed MoSARe model effectively handles incomplete multimodal healthcare data through expert selection and cross-modal learning, maintaining high accuracy and outperforming existing methods even with missing information.

Authors:Xiangjian Jiang, Nikola Simidjievski, Mateja Jamnik
Title: How Well Does Your Tabular Generator Learn the Structure of Tabular Data?
Abstract:
Heterogeneous tabular data poses unique challenges in generative modelling due to its fundamentally different underlying data structure compared to homogeneous modalities, such as images and text. Although previous research has sought to adapt the successes of generative modelling in homogeneous modalities to the tabular domain, defining an effective generator for tabular data remains an open problem. One major reason is that the evaluation criteria inherited from other modalities often fail to adequately assess whether tabular generative models effectively capture or utilise the unique structural information encoded in tabular data. In this paper, we carefully examine the limitations of the prevailing evaluation framework and introduce $\textbf{TabStruct}$, a novel evaluation benchmark that positions structural fidelity as a core evaluation dimension. Specifically, TabStruct evaluates the alignment of causal structures in real and synthetic data, providing a direct measure of how effectively tabular generative models learn the structure of tabular data. Through extensive experiments using generators from eight categories on seven datasets with expert-validated causal graphical structures, we show that structural fidelity offers a task-independent, domain-agnostic evaluation dimension. Our findings highlight the importance of tabular data structure and offer practical guidance for developing more effective and robust tabular generative models. Code is available at https://github.com/SilenceX12138/TabStruct.
中文: 本文提出TabStruct这一新型评估基准,通过衡量真实数据与合成数据间因果结构的对齐程度来评估表格生成模型的结构保真度,解决了现有评估框架的局限性。
English: This paper introduces TabStruct, a novel evaluation benchmark that assesses structural fidelity in tabular generative models by measuring the alignment of causal structures between real and synthetic data, addressing the limitations of existing evaluation frameworks.

Authors:Zhihua Tian, Sirun Nan, Ming Xu, Shengfang Zhai, Wenjie Qu, Jian Liu, Ruoxi Jia, Jiaheng Zhang
Title: Sparse Autoencoder as a Zero-Shot Classifier for Concept Erasing in Text-to-Image Diffusion Models
Abstract:
Text-to-image (T2I) diffusion models have achieved remarkable progress in generating high-quality images but also raise people's concerns about generating harmful or misleading content. While extensive approaches have been proposed to erase unwanted concepts without requiring retraining from scratch, they inadvertently degrade performance on normal generation tasks. In this work, we propose Interpret then Deactivate (ItD), a novel framework to enable precise concept removal in T2I diffusion models while preserving overall performance. ItD first employs a sparse autoencoder (SAE) to interpret each concept as a combination of multiple features. By permanently deactivating the specific features associated with target concepts, we repurpose SAE as a zero-shot classifier that identifies whether the input prompt includes target concepts, allowing selective concept erasure in diffusion models. Moreover, we demonstrate that ItD can be easily extended to erase multiple concepts without requiring further training. Comprehensive experiments across celebrity identities, artistic styles, and explicit content demonstrate ItD's effectiveness in eliminating targeted concepts without interfering with normal concept generation. Additionally, ItD is also robust against adversarial prompts designed to circumvent content filters. Code is available at: https://github.com/NANSirun/Interpret-then-deactivate.
中文: 本文提出"解释后停用"(ItD)新框架,通过将概念解析为特征组合并停用目标特征,能在文本到图像生成模型中精确移除特定概念,同时保持整体生成质量并有效抵御对抗性提示攻击。
English: This paper introduces Interpret then Deactivate (ItD), a novel framework that precisely removes unwanted concepts from text-to-image diffusion models by interpreting them as feature combinations and deactivating specific features, effectively preserving overall generation quality and resisting adversarial prompts.

Authors:Krzysztof Adamkiewicz, Paweł W. Woźniak, Julia Dominiak, Andrzej Romanowski, Jakob Karolus, Stanislav Frolov
Title: PromptMap: An Alternative Interaction Style for AI-Based Image Generation
Abstract:
Recent technological advances popularized the use of image generation among the general public. Crafting effective prompts can, however, be difficult for novice users. To tackle this challenge, we developed PromptMap, a new interaction style for text-to-image AI that allows users to freely explore a vast collection of synthetic prompts through a map-like view with semantic zoom. PromptMap groups images visually by their semantic similarity, allowing users to discover relevant examples. We evaluated PromptMap in a between-subject online study ($n=60$) and a qualitative within-subject study ($n=12$). We found that PromptMap supported users in crafting prompts by providing them with examples. We also demonstrated the feasibility of using LLMs to create vast example collections. Our work contributes a new interaction style that supports users unfamiliar with prompting in achieving a satisfactory image output.
Chinese: 近期技术进步使图像生成技术普及化,但新手用户常难以编写有效提示,为此我们开发了PromptMap,这是一种新型交互方式,通过语义地图视图帮助用户自由探索大量合成提示,从而提升生成满意图像的能力。
English: Recent technological advances have made image generation accessible to the public, but novice users often struggle with crafting effective prompts, leading to the development of PromptMap, a new interaction style that helps users explore and create prompts through a semantic map view, improving their ability to generate satisfactory images.

Authors:Richard A. Dubniczky, Krisztofer Zoltán Horvát, Tamás Bisztray, Mohamed Amine Ferrag, Lucas C. Cordeiro, Norbert Tihanyi
Title: CASTLE: Benchmarking Dataset for Static Code Analyzers and LLMs towards CWE Detection
Abstract:
Identifying vulnerabilities in source code is crucial, especially in critical software components. Existing methods such as static analysis, dynamic analysis, formal verification, and recently Large Language Models are widely used to detect security flaws. This paper introduces CASTLE (CWE Automated Security Testing and Low-Level Evaluation), a benchmarking framework for evaluating the vulnerability detection capabilities of different methods. We assess 13 static analysis tools, 10 LLMs, and 2 formal verification tools using a hand-crafted dataset of 250 micro-benchmark programs covering 25 common CWEs. We propose the CASTLE Score, a novel evaluation metric to ensure fair comparison. Our results reveal key differences: ESBMC (a formal verification tool) minimizes false positives but struggles with vulnerabilities beyond model checking, such as weak cryptography or SQL injection. Static analyzers suffer from high false positives, increasing manual validation efforts for developers. LLMs perform exceptionally well in the CASTLE dataset when identifying vulnerabilities in small code snippets. However, their accuracy declines, and hallucinations increase as the code size grows. These results suggest that LLMs could play a pivotal role in future security solutions, particularly within code completion frameworks, where they can provide real-time guidance to prevent vulnerabilities. The dataset is accessible at https://github.com/CASTLE-Benchmark.
中文: 本文介绍了CASTLE基准测试框架,通过定制数据集评估多种漏洞检测方法,发现形式验证工具误报率最低但检测范围有限,静态分析工具误报率高,而大语言模型在小代码片段中表现优异但随代码规模增大效果下降,表明其在代码补全系统中实时安全防护方面具有潜力。
English: This paper presents CASTLE, a benchmarking framework that evaluates various vulnerability detection methods using a custom dataset and reveals that while formal verification tools minimize false positives, static analyzers produce many, and LLMs excel with small code snippets but struggle as code size increases, suggesting their potential in real-time security guidance within code completion systems.

Authors:Yifan Zhou, Zeqi Xiao, Shuai Yang, Xingang Pan
Title: Alias-Free Latent Diffusion Models: Improving Fractional Shift Equivariance of Diffusion Latent Space
Abstract:
Latent Diffusion Models (LDMs) are known to have an unstable generation process, where even small perturbations or shifts in the input noise can lead to significantly different outputs. This hinders their applicability in applications requiring consistent results. In this work, we redesign LDMs to enhance consistency by making them shift-equivariant. While introducing anti-aliasing operations can partially improve shift-equivariance, significant aliasing and inconsistency persist due to the unique challenges in LDMs, including 1) aliasing amplification during VAE training and multiple U-Net inferences, and 2) self-attention modules that inherently lack shift-equivariance. To address these issues, we redesign the attention modules to be shift-equivariant and propose an equivariance loss that effectively suppresses the frequency bandwidth of the features in the continuous domain. The resulting alias-free LDM (AF-LDM) achieves strong shift-equivariance and is also robust to irregular warping. Extensive experiments demonstrate that AF-LDM produces significantly more consistent results than vanilla LDM across various applications, including video editing and image-to-image translation.
中文摘要:本研究重新设计隐扩散模型以实现平移等变性,通过改进注意力模块和引入等变性损失解决不稳定性与混叠问题,提出的无混叠隐扩散模型在视频编辑等应用中能稳定生成一致结果。
English Summary: The study redesigns Latent Diffusion Models to achieve shift-equivariance, addressing instability and aliasing issues through modified attention modules and an equivariance loss, resulting in the alias-free LDM that ensures consistent outputs for applications like video editing.

Authors:Kevin Qinghong Lin, Mike Zheng Shou
Title: VLog: Video-Language Models by Generative Retrieval of Narration Vocabulary
Abstract:
Human daily activities can be concisely narrated as sequences of routine events (e.g., turning off an alarm) in video streams, forming an event vocabulary. Motivated by this, we introduce VLog, a novel video understanding framework that define video narrations as vocabulary, going beyond the typical subword vocabularies in existing generative video-language models. Built on the lightweight language model GPT-2, VLog feature three key innovations: (i) A generative retrieval model, marrying language model's complex reasoning capabilities with contrastive retrieval's flexible upgrading over narration vocabulary. (ii) A hierarchical vocabulary derived from large-scale video narrations using our narration pair encoding algorithm, enabling efficient indexing of specific events (e.g., cutting a tomato) by identifying broader scenarios (e.g., kitchen) with expressive postfixes (e.g., by the left hand). (iii) A vocabulary update strategy leveraging generative models to extend the vocabulary for novel events encountered during inference. To validate our approach, we introduce VidCap-Eval, a development set requiring concise narrations with reasoning relationships (e.g., before and after). Experiments on EgoSchema, COIN, and HiREST further demonstrate the effectiveness of VLog, highlighting its ability to generate concise, contextually accurate, and efficient narrations, offering a novel perspective on video understanding. Codes are released at https://github.com/showlab/VLog.
中文摘要:VLog是一种创新的视频理解框架,将视频叙述定义为词汇表,结合生成推理与对比检索,通过分层事件索引生成简洁准确的视频描述。
English Summary: VLog is a novel video understanding framework that treats video narrations as a vocabulary, integrating generative reasoning with contrastive retrieval and hierarchical event indexing to produce concise and accurate narrations.

Authors:Tobias Christian Nauen, Brian Moser, Federico Raue, Stanislav Frolov, Andreas Dengel
Title: ForAug: Recombining Foregrounds and Backgrounds to Improve Vision Transformer Training with Bias Mitigation
Abstract:
Transformers, particularly Vision Transformers (ViTs), have achieved state-of-the-art performance in large-scale image classification. However, they often require large amounts of data and can exhibit biases that limit their robustness and generalizability. This paper introduces ForAug, a novel data augmentation scheme that addresses these challenges and explicitly includes inductive biases, which commonly are part of the neural network architecture, into the training data. ForAug is constructed by using pretrained foundation models to separate and recombine foreground objects with different backgrounds, enabling fine-grained control over image composition during training. It thus increases the data diversity and effective number of training samples. We demonstrate that training on ForNet, the application of ForAug to ImageNet, significantly improves the accuracy of ViTs and other architectures by up to 4.5 percentage points (p.p.) on ImageNet and 7.3 p.p. on downstream tasks. Importantly, ForAug enables novel ways of analyzing model behavior and quantifying biases. Namely, we introduce metrics for background robustness, foreground focus, center bias, and size bias and show that training on ForNet substantially reduces these biases compared to training on ImageNet. In summary, ForAug provides a valuable tool for analyzing and mitigating biases, enabling the development of more robust and reliable computer vision models. Our code and dataset are publicly available at https://github.com/tobna/ForAug.
Chinese: ForAug是一种新颖的数据增强方法,通过分离和重组前景与背景来提升图像多样性并减少Vision Transformer的偏差,显著提高了其在ImageNet及下游任务中的准确性和鲁棒性。
English: ForAug is a novel data augmentation method that enhances image diversity and mitigates biases in Vision Transformers, significantly boosting their accuracy and robustness on tasks like ImageNet and downstream applications.

Authors:Masoud Jamshidiyan Tehrani, Jinhan Kim, Paolo Tonella
Title: PCLA: A Framework for Testing Autonomous Agents in the CARLA Simulator
Abstract:
Recent research on testing autonomous driving agents has grown significantly, especially in simulation environments. The CARLA simulator is often the preferred choice, and the autonomous agents from the CARLA Leaderboard challenge are regarded as the best-performing agents within this environment. However, researchers who test these agents, rather than training their own ones from scratch, often face challenges in utilizing them within customized test environments and scenarios. To address these challenges, we introduce PCLA (Pretrained CARLA Leaderboard Agents), an open-source Python testing framework that includes nine high-performing pre-trained autonomous agents from the Leaderboard challenges. PCLA is the first infrastructure specifically designed for testing various autonomous agents in arbitrary CARLA environments/scenarios. PCLA provides a simple way to deploy Leaderboard agents onto a vehicle without relying on the Leaderboard codebase, it allows researchers to easily switch between agents without requiring modifications to CARLA versions or programming environments, and it is fully compatible with the latest version of CARLA while remaining independent of the Leaderboard's specific CARLA version. PCLA is publicly accessible at https://github.com/MasoudJTehrani/PCLA.
中文: PCLA是一个开源Python测试框架,集成了九款CARLA排行榜预训练自动驾驶智能体,可在任意CARLA环境中便捷部署测试,且完全兼容最新版本模拟器。
English: PCLA is an open-source Python framework that simplifies testing of nine high-performing pre-trained autonomous driving agents from CARLA Leaderboard in custom scenarios, offering easy deployment and full compatibility with the latest CARLA simulator.

Authors:Jiani Huang, Shijie Wang, Liang-bo Ning, Wenqi Fan, Shuaiqiang Wang, Dawei Yin, Qing Li
Title: Towards Next-Generation Recommender Systems: A Benchmark for Personalized Recommendation Assistant with LLMs
Abstract:
Recommender systems (RecSys) are widely used across various modern digital platforms and have garnered significant attention. Traditional recommender systems usually focus only on fixed and simple recommendation scenarios, making it difficult to generalize to new and unseen recommendation tasks in an interactive paradigm. Recently, the advancement of large language models (LLMs) has revolutionized the foundational architecture of RecSys, driving their evolution into more intelligent and interactive personalized recommendation assistants. However, most existing studies rely on fixed task-specific prompt templates to generate recommendations and evaluate the performance of personalized assistants, which limits the comprehensive assessments of their capabilities. This is because commonly used datasets lack high-quality textual user queries that reflect real-world recommendation scenarios, making them unsuitable for evaluating LLM-based personalized recommendation assistants. To address this gap, we introduce RecBench+, a new dataset benchmark designed to access LLMs' ability to handle intricate user recommendation needs in the era of LLMs. RecBench+ encompasses a diverse set of queries that span both hard conditions and soft preferences, with varying difficulty levels. We evaluated commonly used LLMs on RecBench+ and uncovered below findings: 1) LLMs demonstrate preliminary abilities to act as recommendation assistants, 2) LLMs are better at handling queries with explicitly stated conditions, while facing challenges with queries that require reasoning or contain misleading information. Our dataset has been released at https://github.com/jiani-huang/RecBench.git.
中文摘要:大型语言模型正将推荐系统转变为智能助手,但现有数据集缺乏真实场景的高质量文本查询,因此我们推出RecBench+新基准来全面评估模型处理复杂用户需求的能力。
English Summary: Large language models are transforming recommender systems into intelligent assistants, but their evaluation is limited by the lack of realistic datasets, prompting the creation of RecBench+ to better assess their capabilities across diverse and complex user queries.

Authors:Rui Huang, Siyu Tang, Zhiqian Cai, Lin Zhao
Title: Robust Self-Reconfiguration for Fault-Tolerant Control of Modular Aerial Robot Systems
Abstract:
Modular Aerial Robotic Systems (MARS) consist of multiple drone units assembled into a single, integrated rigid flying platform. With inherent redundancy, MARS can self-reconfigure into different configurations to mitigate rotor or unit failures and maintain stable flight. However, existing works on MARS self-reconfiguration often overlook the practical controllability of intermediate structures formed during the reassembly process, which limits their applicability. In this paper, we address this gap by considering the control-constrained dynamic model of MARS and proposing a robust and efficient self-reconstruction algorithm that maximizes the controllability margin at each intermediate stage. Specifically, we develop algorithms to compute optimal, controllable disassembly and assembly sequences, enabling robust self-reconfiguration. Finally, we validate our method in several challenging fault-tolerant self-reconfiguration scenarios, demonstrating significant improvements in both controllability and trajectory tracking while reducing the number of assembly steps. The videos and source code of this work are available at https://github.com/RuiHuangNUS/MARS-Reconfig/
中文摘要:本文提出一种模块化空中机器人系统的鲁棒自重构算法,确保在中间组装阶段的可控性,并通过容错场景验证了其性能提升。
English Summary: This paper introduces a robust self-reconfiguration algorithm for Modular Aerial Robotic Systems that ensures controllability during intermediate assembly stages, validated through fault-tolerant scenarios with improved performance.

Authors:Nikolai Körber, Eduard Kromer, Andreas Siebert, Sascha Hauke, Daniel Mueller-Gritschneder, Björn Schuller
Title: PerCoV2: Improved Ultra-Low Bit-Rate Perceptual Image Compression with Implicit Hierarchical Masked Image Modeling
Abstract:
We introduce PerCoV2, a novel and open ultra-low bit-rate perceptual image compression system designed for bandwidth- and storage-constrained applications. Building upon prior work by Careil et al., PerCoV2 extends the original formulation to the Stable Diffusion 3 ecosystem and enhances entropy coding efficiency by explicitly modeling the discrete hyper-latent image distribution. To this end, we conduct a comprehensive comparison of recent autoregressive methods (VAR and MaskGIT) for entropy modeling and evaluate our approach on the large-scale MSCOCO-30k benchmark. Compared to previous work, PerCoV2 (i) achieves higher image fidelity at even lower bit-rates while maintaining competitive perceptual quality, (ii) features a hybrid generation mode for further bit-rate savings, and (iii) is built solely on public components. Code and trained models will be released at https://github.com/Nikolai10/PerCoV2.
Chinese: PerCoV2 是一种先进的超低码率图像压缩系统,它基于 Stable Diffusion 3 生态系统改进了熵编码,在保持感知质量的同时实现了更高的图像保真度和更低的码率,且完全采用公开组件构建。
English: PerCoV2 is an advanced open-source ultra-low bit-rate image compression system that improves upon previous models by integrating with Stable Diffusion 3 and enhancing entropy coding, achieving higher fidelity and lower bit-rates while using only public components.

Authors:Jiushen Cai, Weihang Zhang, Hanruo Liu, Ningli Wang, Huiqi Li
Title: RetSTA: An LLM-Based Approach for Standardizing Clinical Fundus Image Reports
Abstract:
Standardization of clinical reports is crucial for improving the quality of healthcare and facilitating data integration. The lack of unified standards, including format, terminology, and style, is a great challenge in clinical fundus diagnostic reports, which increases the difficulty for large language models (LLMs) to understand the data. To address this, we construct a bilingual standard terminology, containing fundus clinical terms and commonly used descriptions in clinical diagnosis. Then, we establish two models, RetSTA-7B-Zero and RetSTA-7B. RetSTA-7B-Zero, fine-tuned on an augmented dataset simulating clinical scenarios, demonstrates powerful standardization behaviors. However, it encounters a challenge of limitation to cover a wider range of diseases. To further enhance standardization performance, we build RetSTA-7B, which integrates a substantial amount of standardized data generated by RetSTA-7B-Zero along with corresponding English data, covering diverse complex clinical scenarios and achieving report-level standardization for the first time. Experimental results demonstrate that RetSTA-7B outperforms other compared LLMs in bilingual standardization task, which validates its superior performance and generalizability. The checkpoints are available at https://github.com/AB-Story/RetSTA-7B.
Chinese: 本研究针对临床眼底报告缺乏标准化的问题,开发了双语模型RetSTA-7B,该模型通过整合标准化数据,在报告级标准化任务中表现出优于其他大语言模型的性能。
English: This study addresses the lack of standardization in clinical fundus reports by developing RetSTA-7B, a bilingual model that integrates standardized data to achieve superior performance in report-level standardization compared to other large language models.

Authors:Rui Huang, Zhenyu Zhang, Siyu Tang, Zhiqian Cai, Lin Zhao
Title: MARS-FTCP: Robust Fault-Tolerant Control and Agile Trajectory Planning for Modular Aerial Robot Systems
Abstract:
Modular Aerial Robot Systems (MARS) consist of multiple drone units that can self-reconfigure to adapt to various mission requirements and fault conditions. However, existing fault-tolerant control methods exhibit significant oscillations during docking and separation, impacting system stability. To address this issue, we propose a novel fault-tolerant control reallocation method that adapts to an arbitrary number of modular robots and their assembly formations. The algorithm redistributes the expected collective force and torque required for MARS to individual units according to their moment arm relative to the center of MARS mass. Furthermore, we propose an agile trajectory planning method for MARS of arbitrary configurations, which is collision-avoiding and dynamically feasible. Our work represents the first comprehensive approach to enable fault-tolerant and collision avoidance flight for MARS. We validate our method through extensive simulations, demonstrating improved fault tolerance, enhanced trajectory tracking accuracy, and greater robustness in cluttered environments. The videos and source code of this work are available at https://github.com/RuiHuangNUS/MARS-FTCP/
中文摘要:本研究针对模块化空中机器人系统提出了一种新型容错控制重分配方法和灵活轨迹规划,有效提升了系统重构稳定性,并实现了复杂环境中的无碰撞飞行能力。
English Summary: This study introduces a novel fault-tolerant control reallocation method and agile trajectory planning for Modular Aerial Robot Systems (MARS), enhancing stability during reconfiguration and enabling collision-free flight in complex environments.

Authors:Yuzhi Lai, Shenghai Yuan, Youssef Nassar, Mingyu Fan, Thomas Weber, Matthias Rätsch
Title: NVP-HRI: Zero Shot Natural Voice and Posture-based Human-Robot Interaction via Large Language Model
Abstract:
Effective Human-Robot Interaction (HRI) is crucial for future service robots in aging societies. Existing solutions are biased toward only well-trained objects, creating a gap when dealing with new objects. Currently, HRI systems using predefined gestures or language tokens for pretrained objects pose challenges for all individuals, especially elderly ones. These challenges include difficulties in recalling commands, memorizing hand gestures, and learning new names. This paper introduces NVP-HRI, an intuitive multi-modal HRI paradigm that combines voice commands and deictic posture. NVP-HRI utilizes the Segment Anything Model (SAM) to analyze visual cues and depth data, enabling precise structural object representation. Through a pre-trained SAM network, NVP-HRI allows interaction with new objects via zero-shot prediction, even without prior knowledge. NVP-HRI also integrates with a large language model (LLM) for multimodal commands, coordinating them with object selection and scene distribution in real time for collision-free trajectory solutions. We also regulate the action sequence with the essential control syntax to reduce LLM hallucination risks. The evaluation of diverse real-world tasks using a Universal Robot showcased up to 59.2\% efficiency improvement over traditional gesture control, as illustrated in the video https://youtu.be/EbC7al2wiAc. Our code and design will be openly available at https://github.com/laiyuzhi/NVP-HRI.git.
中文: 本文提出NVP-HRI多模态人机交互系统,通过结合语音指令、指示姿态与Segment Anything模型及大语言模型,实现了对新物体的零样本预测交互,相比传统方法显著提升了交互效率。
English: This paper introduces NVP-HRI, a multimodal human-robot interaction system that combines voice commands and pointing gestures with the Segment Anything Model and large language models to enable intuitive interaction with new objects through zero-shot prediction, significantly improving efficiency over traditional methods.

Authors:Thomas De Min, Subhankar Roy, Stéphane Lathuilière, Elisa Ricci, Massimiliano Mancini
Title: Group-robust Machine Unlearning
Abstract:
Machine unlearning is an emerging paradigm to remove the influence of specific training data (i.e., the forget set) from a model while preserving its knowledge of the rest of the data (i.e., the retain set). Previous approaches assume the forget data to be uniformly distributed from all training datapoints. However, if the data to unlearn is dominant in one group, we empirically show that performance for this group degrades, leading to fairness issues. This work tackles the overlooked problem of non-uniformly distributed forget sets, which we call group-robust machine unlearning, by presenting a simple, effective strategy that mitigates the performance loss in dominant groups via sample distribution reweighting. Moreover, we present MIU (Mutual Information-aware Machine Unlearning), the first approach for group robustness in approximate machine unlearning. MIU minimizes the mutual information between model features and group information, achieving unlearning while reducing performance degradation in the dominant group of the forget set. Additionally, MIU exploits sample distribution reweighting and mutual information calibration with the original model to preserve group robustness. We conduct experiments on three datasets and show that MIU outperforms standard methods, achieving unlearning without compromising model robustness. Source code available at https://github.com/tdemin16/group-robust_machine_unlearning.
中文摘要:机器遗忘旨在消除特定训练数据对模型的影响同时保留整体知识,本研究提出MIU方法,通过互信息最小化和样本重加权来解决非均匀遗忘集导致的公平性问题。
English Summary: Machine unlearning addresses removing specific training data's influence while preserving overall model knowledge, with this work introducing MIU to mitigate fairness issues from non-uniform forget sets by using mutual information minimization and sample reweighting.

Authors:Gorjan Radevski, Teodora Popordanoska, Matthew B. Blaschko, Tinne Tuytelaars
Title: DAVE: Diagnostic benchmark for Audio Visual Evaluation
Abstract:
Audio-visual understanding is a rapidly evolving field that seeks to integrate and interpret information from both auditory and visual modalities. Despite recent advances in multi-modal learning, existing benchmarks often suffer from strong visual bias -- where answers can be inferred from visual data alone -- and provide only aggregate scores that conflate multiple sources of error. This makes it difficult to determine whether models struggle with visual understanding, audio interpretation, or audio-visual alignment. In this work, we introduce DAVE (Diagnostic Audio Visual Evaluation), a novel benchmark dataset designed to systematically evaluate audio-visual models across controlled challenges. DAVE alleviates existing limitations by (i) ensuring both modalities are necessary to answer correctly and (ii) decoupling evaluation into atomic subcategories. Our detailed analysis of state-of-the-art models reveals specific failure modes and provides targeted insights for improvement. By offering this standardized diagnostic framework, we aim to facilitate more robust development of audio-visual models. The dataset is released: https://github.com/gorjanradevski/dave
中文:DAVE基准数据集通过要求两种模态共同参与正确回答,并分解评估至原子子类别,解决了视听理解中的视觉偏见和综合评分问题,从而精准识别模型缺陷。
English: The DAVE benchmark dataset addresses visual bias and aggregate scoring issues in audio-visual understanding by requiring both modalities for correct answers and providing decoupled evaluation across subcategories to identify specific model weaknesses.

Authors:Di Zhao, Longhui Ma, Siwei Wang, Miao Wang, Zhao Lv
Title: COLA: A Scalable Multi-Agent Framework For Windows UI Task Automation
Abstract:
With the rapid advancements in Large Language Models (LLMs), an increasing number of studies have leveraged LLMs as the cognitive core of agents to address complex task decision-making challenges. Specially, recent research has demonstrated the potential of LLM-based agents on automating Windows GUI operations. However, existing methodologies exhibit two critical challenges: (1) static agent architectures fail to dynamically adapt to the heterogeneous requirements of OS-level tasks, leading to inadequate scenario generalization;(2) the agent workflows lack fault tolerance mechanism, necessitating complete process re-execution for UI agent decision error. To address these limitations, we introduce \textit{COLA}, a collaborative multi-agent framework for automating Windows UI operations. In this framework, a scenario-aware agent Task Scheduler decomposes task requirements into atomic capability units, dynamically selects the optimal agent from a decision agent pool, effectively responds to the capability requirements of diverse scenarios. The decision agent pool supports plug-and-play expansion for enhanced flexibility. In addition, we design a memory unit equipped to all agents for their self-evolution. Furthermore, we develop an interactive backtracking mechanism that enables human to intervene to trigger state rollbacks for non-destructive process repair. Our experimental results on the GAIA benchmark demonstrates that the \textit{COLA} framework achieves state-of-the-art performance with an average score of 31.89\%, significantly outperforming baseline approaches without web API integration. Ablation studies further validate the individual contributions of our dynamic scheduling. The code is available at https://github.com/Alokia/COLA-demo.
中文: COLA框架通过情景感知的任务调度和交互式回溯机制,动态适应多样化的Windows图形界面操作任务,在GAIA基准测试中实现了最优性能。
English: The COLA framework introduces a collaborative multi-agent system that dynamically adapts to diverse Windows GUI tasks through scenario-aware scheduling and interactive backtracking, achieving state-of-the-art performance on the GAIA benchmark.

Authors:Wei He, Shangzhi Zhang, Chun-Guang Li, Xianbiao Qi, Rong Xiao, Jun Guo
Title: Neural Normalized Cut: A Differential and Generalizable Approach for Spectral Clustering
Abstract:
Spectral clustering, as a popular tool for data clustering, requires an eigen-decomposition step on a given affinity to obtain the spectral embedding. Nevertheless, such a step suffers from the lack of generalizability and scalability. Moreover, the obtained spectral embeddings can hardly provide a good approximation to the ground-truth partition and thus a k-means step is adopted to quantize the embedding. In this paper, we propose a simple yet effective scalable and generalizable approach, called Neural Normalized Cut (NeuNcut), to learn the clustering membership for spectral clustering directly. In NeuNcut, we properly reparameterize the unknown cluster membership via a neural network, and train the neural network via stochastic gradient descent with a properly relaxed normalized cut loss. As a result, our NeuNcut enjoys a desired generalization ability to directly infer clustering membership for out-of-sample unseen data and hence brings us an efficient way to handle clustering task with ultra large-scale data. We conduct extensive experiments on both synthetic data and benchmark datasets and experimental results validate the effectiveness and the superiority of our approach. Our code is available at: https://github.com/hewei98/NeuNcut.
Chinese: 本文提出神经归一化割(NeuNcut),一种通过神经网络直接学习聚类成员的可扩展和泛化方法,无需特征分解和k均值步骤,能高效处理大规模数据。
English: The paper introduces Neural Normalized Cut (NeuNcut), a scalable and generalizable method that uses a neural network to directly learn clustering membership, eliminating the need for eigen-decomposition and k-means steps while efficiently handling large-scale data.

Authors:Xiuwen Fang, Mang Ye, Bo Du
Title: Robust Asymmetric Heterogeneous Federated Learning with Corrupted Clients
Abstract:
This paper studies a challenging robust federated learning task with model heterogeneous and data corrupted clients, where the clients have different local model structures. Data corruption is unavoidable due to factors such as random noise, compression artifacts, or environmental conditions in real-world deployment, drastically crippling the entire federated system. To address these issues, this paper introduces a novel Robust Asymmetric Heterogeneous Federated Learning (RAHFL) framework. We propose a Diversity-enhanced supervised Contrastive Learning technique to enhance the resilience and adaptability of local models on various data corruption patterns. Its basic idea is to utilize complex augmented samples obtained by the mixed-data augmentation strategy for supervised contrastive learning, thereby enhancing the ability of the model to learn robust and diverse feature representations. Furthermore, we design an Asymmetric Heterogeneous Federated Learning strategy to resist corrupt feedback from external clients. The strategy allows clients to perform selective one-way learning during collaborative learning phase, enabling clients to refrain from incorporating lower-quality information from less robust or underperforming collaborators. Extensive experimental results demonstrate the effectiveness and robustness of our approach in diverse, challenging federated learning environments. Our code and models are public available at https://github.com/FangXiuwen/RAHFL.
中文摘要:本文提出了一种新颖的鲁棒非对称异构联邦学习框架,通过多样性增强的对比学习和非对称协作策略,有效解决了联邦系统中模型异构和数据损坏的挑战。
English Summary: This paper introduces a novel Robust Asymmetric Heterogeneous Federated Learning (RAHFL) framework that employs diversity-enhanced contrastive learning and asymmetric collaboration strategies to address model heterogeneity and data corruption in federated systems.

Authors:Falko Helm, Nico Daheim, Iryna Gurevych
Title: Token Weighting for Long-Range Language Modeling
Abstract:
Many applications of large language models (LLMs) require long-context understanding, but models continue to struggle with such tasks. We hypothesize that conventional next-token prediction training could contribute to this, because each token is assigned equal weight. Yet, intuitively, the amount of context needed to predict the next token accurately varies greatly across different data. To reflect this, we propose various novel token-weighting schemes that assign different weights to each training token in the loss, thereby generalizing existing works. For this, we categorize token-weighting methods using a two-step framework which compares the confidences of a long-context and short-context model to score tokens. We evaluate all methods on multiple long-context understanding tasks and show that non-uniform loss weights are helpful to improve the long-context abilities of LLMs. Different short-context models can be used effectively for token scoring, including models that are much smaller than the long-context model that is trained. All in all, this work contributes to a better understanding of the trade-offs long-context language modeling faces and provides guidelines for model steering via loss-weighting based on empirical evidence. The code can be found on Github.
中文: 本研究提出新颖的令牌加权方案,通过在训练中分配不同的损失权重来增强大语言模型的长上下文理解能力,并通过实证验证在多任务中实现了性能提升。
English: This study proposes novel token-weighting schemes that assign varying loss weights during training to enhance large language models' long-context understanding, demonstrating improved performance across multiple tasks through empirical validation.

Authors:Zicheng Zhang, Haoning Wu, Ziheng Jia, Weisi Lin, Guangtao Zhai
Title: Teaching LMMs for Image Quality Scoring and Interpreting
Abstract:
Image quality scoring and interpreting are two fundamental components of Image Quality Assessment (IQA). The former quantifies image quality, while the latter enables descriptive question answering about image quality. Traditionally, these two tasks have been addressed independently. However, from the perspective of the Human Visual System (HVS) and the Perception-Decision Integration Model, they are inherently interconnected: interpreting serves as the foundation for scoring, while scoring provides an abstract summary of interpreting. Thus, unifying these capabilities within a single model is both intuitive and logically coherent. In this paper, we propose Q-SiT (Quality Scoring and Interpreting joint Teaching), a unified framework that enables large multimodal models (LMMs) to learn both image quality scoring and interpreting simultaneously. We achieve this by transforming conventional IQA datasets into learnable question-answering datasets and incorporating human-annotated quality interpreting data for training. Furthermore, we introduce an efficient scoring & interpreting balance strategy, which first determines the optimal data mix ratio on lightweight LMMs and then maps this ratio to primary LMMs for fine-tuning adjustment. This strategy not only mitigates task interference and enhances cross-task knowledge transfer but also significantly reduces computational costs compared to direct optimization on full-scale LMMs. With this joint learning framework and corresponding training strategy, we develop Q-SiT, the first model capable of simultaneously performing image quality scoring and interpreting tasks, along with its lightweight variant, Q-SiT-mini. Experimental results demonstrate that Q-SiT achieves strong performance in both tasks with superior generalization IQA abilities.Project page at https://github.com/Q-Future/Q-SiT.
中文摘要:本文提出Q-SiT统一框架,通过将图像质量评估数据集转化为问答格式并采用高效平衡策略,使大型多模态模型能够同时学习图像质量评分与解释,在保证双任务性能的同时显著降低计算成本。
English Summary: The paper introduces Q-SiT, a unified framework that enables large multimodal models to jointly learn image quality scoring and interpreting by transforming IQA datasets into question-answering formats and implementing an efficient balance strategy to optimize performance while reducing computational costs.

Authors:Yuanyang Zhang, Yijie Lin, Weiqing Yan, Li Yao, Xinhang Wan, Guangyuan Li, Chao Zhang, Guanzhou Ke, Jie Xu
Title: Incomplete Multi-view Clustering via Diffusion Contrastive Generation
Abstract:
Incomplete multi-view clustering (IMVC) has garnered increasing attention in recent years due to the common issue of missing data in multi-view datasets. The primary approach to address this challenge involves recovering the missing views before applying conventional multi-view clustering methods. Although imputation-based IMVC methods have achieved significant improvements, they still encounter notable limitations: 1) heavy reliance on paired data for training the data recovery module, which is impractical in real scenarios with high missing data rates; 2) the generated data often lacks diversity and discriminability, resulting in suboptimal clustering results. To address these shortcomings, we propose a novel IMVC method called Diffusion Contrastive Generation (DCG). Motivated by the consistency between the diffusion and clustering processes, DCG learns the distribution characteristics to enhance clustering by applying forward diffusion and reverse denoising processes to intra-view data. By performing contrastive learning on a limited set of paired multi-view samples, DCG can align the generated views with the real views, facilitating accurate recovery of views across arbitrary missing view scenarios. Additionally, DCG integrates instance-level and category-level interactive learning to exploit the consistent and complementary information available in multi-view data, achieving robust and end-to-end clustering. Extensive experiments demonstrate that our method outperforms state-of-the-art approaches. The code is available at https://github.com/zhangyuanyang21/2025-AAAI-DCG.
中文: 提出的扩散对比生成方法通过结合扩散过程和对比学习,有效恢复多视图数据中的缺失视图,并利用实例级和类别级交互学习实现稳健的端到端聚类,在任意缺失场景下均优于现有方法。
English: The proposed Diffusion Contrastive Generation (DCG) method addresses incomplete multi-view clustering by leveraging diffusion processes and contrastive learning to accurately recover missing views and integrate multi-level interactive learning, achieving superior clustering performance without heavy reliance on paired data.

Authors:Chengshu Zhao, Yunyang Ge, Xinhua Cheng, Bin Zhu, Yatian Pang, Bin Lin, Fan Yang, Feng Gao, Li Yuan
Title: SwapAnyone: Consistent and Realistic Video Synthesis for Swapping Any Person into Any Video
Abstract:
Video body-swapping aims to replace the body in an existing video with a new body from arbitrary sources, which has garnered more attention in recent years. Existing methods treat video body-swapping as a composite of multiple tasks instead of an independent task and typically rely on various models to achieve video body-swapping sequentially. However, these methods fail to achieve end-to-end optimization for the video body-swapping which causes issues such as variations in luminance among frames, disorganized occlusion relationships, and the noticeable separation between bodies and background. In this work, we define video body-swapping as an independent task and propose three critical consistencies: identity consistency, motion consistency, and environment consistency. We introduce an end-to-end model named SwapAnyone, treating video body-swapping as a video inpainting task with reference fidelity and motion control. To improve the ability to maintain environmental harmony, particularly luminance harmony in the resulting video, we introduce a novel EnvHarmony strategy for training our model progressively. Additionally, we provide a dataset named HumanAction-32K covering various videos about human actions. Extensive experiments demonstrate that our method achieves State-Of-The-Art (SOTA) performance among open-source methods while approaching or surpassing closed-source models across multiple dimensions. All code, model weights, and the HumanAction-32K dataset will be open-sourced at https://github.com/PKU-YuanGroup/SwapAnyone.
中文: 本文提出SwapAnyone模型,将视频换体重新定义为具有身份、动作和环境三个关键一致性的独立视频修复任务,通过创新的EnvHarmony策略和HumanAction-32K数据集实现了最先进的性能。
English: This paper introduces SwapAnyone, an end-to-end model that redefines video body-swapping as an independent video inpainting task with three key consistencies, achieving state-of-the-art performance through a novel EnvHarmony strategy and a comprehensive HumanAction-32K dataset.

Authors:Linli Yao, Haoning Wu, Kun Ouyang, Yuanxing Zhang, Caiming Xiong, Bei Chen, Xu Sun, Junnan Li
Title: Generative Frame Sampler for Long Video Understanding
Abstract:
Despite recent advances in Video Large Language Models (VideoLLMs), effectively understanding long-form videos remains a significant challenge. Perceiving lengthy videos containing thousands of frames poses substantial computational burden. To mitigate this issue, this paper introduces Generative Frame Sampler (GenS), a plug-and-play module integrated with VideoLLMs to facilitate efficient lengthy video perception. Built upon a lightweight VideoLLM, GenS leverages its inherent vision-language capabilities to identify question-relevant frames. To facilitate effective retrieval, we construct GenS-Video-150K, a large-scale video instruction dataset with dense frame relevance annotations. Extensive experiments demonstrate that GenS consistently boosts the performance of various VideoLLMs, including open-source models (Qwen2-VL-7B, Aria-25B, VILA-40B, LLaVA-Video-7B/72B) and proprietary assistants (GPT-4o, Gemini). When equipped with GenS, open-source VideoLLMs achieve impressive state-of-the-art results on long-form video benchmarks: LLaVA-Video-72B reaches 66.8 (+4.3) on LongVideoBench and 77.0 (+2.7) on MLVU, while Aria obtains 39.2 on HourVideo surpassing the Gemini-1.5-pro by 1.9 points. We will release all datasets and models at https://generative-sampler.github.io.
中文: 本文提出生成式帧采样器(GenS),作为一种即插即用模块,通过识别问题相关帧来提升视频大语言模型处理长视频的效率,并借助新构建的大规模数据集在多项基准测试中取得了领先性能。
English: This paper introduces Generative Frame Sampler (GenS), a plug-and-play module that enhances VideoLLMs' efficiency in processing long-form videos by identifying question-relevant frames, achieving state-of-the-art results on benchmarks through a newly constructed large-scale dataset.

Authors:David P. Hofmeyr
Title: Clustering by Nonparametric Smoothing
Abstract:
A novel formulation of the clustering problem is introduced in which the task is expressed as an estimation problem, where the object to be estimated is a function which maps a point to its distribution of cluster membership. Unlike existing approaches which implicitly estimate such a function, like Gaussian Mixture Models (GMMs), the proposed approach bypasses any explicit modelling assumptions and exploits the flexible estimation potential of nonparametric smoothing. An intuitive approach for selecting the tuning parameters governing estimation is provided, which allows the proposed method to automatically determine both an appropriate level of flexibility and also the number of clusters to extract from a given data set. Experiments on a large collection of publicly available data sets are used to document the strong performance of the proposed approach, in comparison with relevant benchmarks from the literature. R code to implement the proposed approach is available from https://github.com/DavidHofmeyr/ CNS
中文: 本文提出了一种新的聚类方法,通过非参数平滑估计聚类隶属度分布,能够自动确定模型灵活性和聚类数量,并在实验中展现出优于现有方法的性能。
English: This paper introduces a novel clustering formulation that estimates cluster membership distributions through nonparametric smoothing, automatically determining both model flexibility and cluster count while demonstrating superior performance in experiments.

Authors:Zhehui Wu, Yong Chen, Naoto Yokoya, Wei He
Title: MP-HSIR: A Multi-Prompt Framework for Universal Hyperspectral Image Restoration
Abstract:
Hyperspectral images (HSIs) often suffer from diverse and unknown degradations during imaging, leading to severe spectral and spatial distortions. Existing HSI restoration methods typically rely on specific degradation assumptions, limiting their effectiveness in complex scenarios. In this paper, we propose \textbf{MP-HSIR}, a novel multi-prompt framework that effectively integrates spectral, textual, and visual prompts to achieve universal HSI restoration across diverse degradation types and intensities. Specifically, we develop a prompt-guided spatial-spectral transformer, which incorporates spatial self-attention and a prompt-guided dual-branch spectral self-attention. Since degradations affect spectral features differently, we introduce spectral prompts in the local spectral branch to provide universal low-rank spectral patterns as prior knowledge for enhancing spectral reconstruction. Furthermore, the text-visual synergistic prompt fuses high-level semantic representations with fine-grained visual features to encode degradation information, thereby guiding the restoration process. Extensive experiments on 9 HSI restoration tasks, including all-in-one scenarios, generalization tests, and real-world cases, demonstrate that MP-HSIR not only consistently outperforms existing all-in-one methods but also surpasses state-of-the-art task-specific approaches across multiple tasks. The code and models are available at https://github.com/ZhehuiWu/MP-HSIR.
中文: 本文提出MP-HSIR多提示框架,通过融合光谱、文本和视觉提示,实现对多种退化类型高光谱图像的通用恢复,在广泛实验中性能超越现有方法。
English: This paper introduces MP-HSIR, a multi-prompt framework that integrates spectral, textual, and visual prompts to universally restore hyperspectral images across various degradation types, outperforming existing methods in extensive experiments.

Authors:Jin Li, Ziqiang He, Anwei Luo, Jian-Fang Hu, Z. Jane Wang, Xiangui Kang
Title: AdvAD: Exploring Non-Parametric Diffusion for Imperceptible Adversarial Attacks
Abstract:
Imperceptible adversarial attacks aim to fool DNNs by adding imperceptible perturbation to the input data. Previous methods typically improve the imperceptibility of attacks by integrating common attack paradigms with specifically designed perception-based losses or the capabilities of generative models. In this paper, we propose Adversarial Attacks in Diffusion (AdvAD), a novel modeling framework distinct from existing attack paradigms. AdvAD innovatively conceptualizes attacking as a non-parametric diffusion process by theoretically exploring basic modeling approach rather than using the denoising or generation abilities of regular diffusion models requiring neural networks. At each step, much subtler yet effective adversarial guidance is crafted using only the attacked model without any additional network, which gradually leads the end of diffusion process from the original image to a desired imperceptible adversarial example. Grounded in a solid theoretical foundation of the proposed non-parametric diffusion process, AdvAD achieves high attack efficacy and imperceptibility with intrinsically lower overall perturbation strength. Additionally, an enhanced version AdvAD-X is proposed to evaluate the extreme of our novel framework under an ideal scenario. Extensive experiments demonstrate the effectiveness of the proposed AdvAD and AdvAD-X. Compared with state-of-the-art imperceptible attacks, AdvAD achieves an average of 99.9$\%$ (+17.3$\%$) ASR with 1.34 (-0.97) $l_2$ distance, 49.74 (+4.76) PSNR and 0.9971 (+0.0043) SSIM against four prevalent DNNs with three different architectures on the ImageNet-compatible dataset. Code is available at https://github.com/XianguiKang/AdvAD.
Chinese: 本文提出AdvAD,一种新颖的非参数扩散框架,通过理论建模而非神经网络实现难以察觉的对抗攻击,在更低扰动强度下获得更高攻击成功率和视觉隐蔽性。
English: The paper introduces AdvAD, a novel non-parametric diffusion framework for imperceptible adversarial attacks that crafts subtle perturbations without neural networks, achieving superior attack success and imperceptibility with lower perturbation strength.

Authors:Yuechen Xie, Jie Song, Huiqiong Wang, Mingli Song
Title: Training Data Provenance Verification: Did Your Model Use Synthetic Data from My Generative Model for Training?
Abstract:
High-quality open-source text-to-image models have lowered the threshold for obtaining photorealistic images significantly, but also face potential risks of misuse. Specifically, suspects may use synthetic data generated by these generative models to train models for specific tasks without permission, when lacking real data resources especially. Protecting these generative models is crucial for the well-being of their owners. In this work, we propose the first method to this important yet unresolved issue, called Training data Provenance Verification (TrainProVe). The rationale behind TrainProVe is grounded in the principle of generalization error bound, which suggests that, for two models with the same task, if the distance between their training data distributions is smaller, their generalization ability will be closer. We validate the efficacy of TrainProVe across four text-to-image models (Stable Diffusion v1.4, latent consistency model, PixArt-$α$, and Stable Cascade). The results show that TrainProVe achieves a verification accuracy of over 99\% in determining the provenance of suspicious model training data, surpassing all previous methods. Code is available at https://github.com/xieyc99/TrainProVe.
中文摘要:本文提出TrainProVe方法,首次通过验证训练数据来源防止文本到图像模型的未经授权使用,在多个模型中实现了超过99%的验证准确率。
English Summary: This paper introduces TrainProVe, a novel method for verifying the provenance of training data in text-to-image models to prevent unauthorized use, achieving over 99% accuracy across multiple models.

Authors:Zihao Chen, Hisashi Handa, Miho Ohsaki, Kimiaki Shirahama
Title: Domain Adaptation for Japanese Sentence Embeddings with Contrastive Learning based on Synthetic Sentence Generation
Abstract:
Several backbone models pre-trained on general domain datasets can encode a sentence into a widely useful embedding. Such sentence embeddings can be further enhanced by domain adaptation that adapts a backbone model to a specific domain. However, domain adaptation for low-resource languages like Japanese is often difficult due to the scarcity of large-scale labeled datasets. To overcome this, this paper introduces SDJC (Self-supervised Domain adaptation for Japanese sentence embeddings with Contrastive learning) that utilizes a data generator to generate sentences, which have the same syntactic structure to a sentence in an unlabeled specific domain corpus but convey different semantic meanings. Generated sentences are then used to boost contrastive learning that adapts a backbone model to accurately discriminate sentences in the specific domain. In addition, the components of SDJC like a backbone model and a method to adapt it need to be carefully selected, but no benchmark dataset is available for Japanese. Thus, a comprehensive Japanese STS (Semantic Textual Similarity) benchmark dataset is constructed by combining datasets machine-translated from English with existing datasets. The experimental results validates the effectiveness of SDJC on two domain-specific downstream tasks as well as the usefulness of the constructed dataset. Datasets, codes and backbone models adapted by SDJC are available on our github repository https://github.com/ccilab-doshisha/SDJC.
中文摘要:本文提出SDJC方法,通过对比学习和合成数据生成实现日语领域句子嵌入的自监督适应,并构建了基准数据集来评估其有效性。
English Summary: This paper presents SDJC, a self-supervised method using contrastive learning and synthetic data generation to adapt sentence embedding models for Japanese domains, while also creating a benchmark dataset to evaluate these adaptations.

Authors:Zhaoling Chen, Xiangru Tang, Gangda Deng, Fang Wu, Jialong Wu, Zhiwei Jiang, Viktor Prasanna, Arman Cohan, Xingyao Wang
Title: LocAgent: Graph-Guided LLM Agents for Code Localization
Abstract:
Code localization--identifying precisely where in a codebase changes need to be made--is a fundamental yet challenging task in software maintenance. Existing approaches struggle to efficiently navigate complex codebases when identifying relevant code sections. The challenge lies in bridging natural language problem descriptions with the appropriate code elements, often requiring reasoning across hierarchical structures and multiple dependencies. We introduce LocAgent, a framework that addresses code localization through graph-based representation. By parsing codebases into directed heterogeneous graphs, LocAgent creates a lightweight representation that captures code structures (files, classes, functions) and their dependencies (imports, invocations, inheritance), enabling LLM agents to effectively search and locate relevant entities through powerful multi-hop reasoning. Experimental results on real-world benchmarks demonstrate that our approach significantly enhances accuracy in code localization. Notably, our method with the fine-tuned Qwen-2.5-Coder-Instruct-32B model achieves comparable results to SOTA proprietary models at greatly reduced cost (approximately 86% reduction), reaching up to 92.7% accuracy on file-level localization while improving downstream GitHub issue resolution success rates by 12% for multiple attempts (Pass@10). Our code is available at https://github.com/gersteinlab/LocAgent.
中文: LocAgent通过基于图的框架将自然语言问题描述与代码元素精准关联,利用多跳推理实现高效的代码定位,在显著降低成本的同时大幅提升了定位准确率。
English: LocAgent introduces a graph-based framework that efficiently maps natural language problem descriptions to relevant code sections through multi-hop reasoning, achieving high accuracy in code localization with significantly reduced costs.

Authors:Byeongchan Lee, Sehyun Lee
Title: Implicit Contrastive Representation Learning with Guided Stop-gradient
Abstract:
In self-supervised representation learning, Siamese networks are a natural architecture for learning transformation-invariance by bringing representations of positive pairs closer together. But it is prone to collapse into a degenerate solution. To address the issue, in contrastive learning, a contrastive loss is used to prevent collapse by moving representations of negative pairs away from each other. But it is known that algorithms with negative sampling are not robust to a reduction in the number of negative samples. So, on the other hand, there are algorithms that do not use negative pairs. Many positive-only algorithms adopt asymmetric network architecture consisting of source and target encoders as a key factor in coping with collapse. By exploiting the asymmetric architecture, we introduce a methodology to implicitly incorporate the idea of contrastive learning. As its implementation, we present a novel method guided stop-gradient. We apply our method to benchmark algorithms SimSiam and BYOL and show that our method stabilizes training and boosts performance. We also show that the algorithms with our method work well with small batch sizes and do not collapse even when there is no predictor. The code is available at https://github.com/bych-lee/gsg.
Chinese: 本文提出了一种引导式停止梯度方法,通过非对称网络架构隐式融入对比学习思想,无需负样本或大批量即可提升SimSiam和BYOL等自监督学习算法的训练稳定性与性能表现。
English: This paper introduces a guided stop-gradient method that enhances self-supervised learning by implicitly incorporating contrastive principles through asymmetric network architectures, improving training stability and performance in algorithms like SimSiam and BYOL without requiring negative samples or large batch sizes.

Authors:Rui Shi, Xiaodong Yu, Shengming Wang, Yijia Zhang, Lu Xu, Peng Pan, Chunlai Ma
Title: RFUAV: A Benchmark Dataset for Unmanned Aerial Vehicle Detection and Identification
Abstract:
In this paper, we propose RFUAV as a new benchmark dataset for radio-frequency based (RF-based) unmanned aerial vehicle (UAV) identification and address the following challenges: Firstly, many existing datasets feature a restricted variety of drone types and insufficient volumes of raw data, which fail to meet the demands of practical applications. Secondly, existing datasets often lack raw data covering a broad range of signal-to-noise ratios (SNR), or do not provide tools for transforming raw data to different SNR levels. This limitation undermines the validity of model training and evaluation. Lastly, many existing datasets do not offer open-access evaluation tools, leading to a lack of unified evaluation standards in current research within this field. RFUAV comprises approximately 1.3 TB of raw frequency data collected from 37 distinct UAVs using the Universal Software Radio Peripheral (USRP) device in real-world environments. Through in-depth analysis of the RF data in RFUAV, we define a drone feature sequence called RF drone fingerprint, which aids in distinguishing drone signals. In addition to the dataset, RFUAV provides a baseline preprocessing method and model evaluation tools. Rigorous experiments demonstrate that these preprocessing methods achieve state-of-the-art (SOTA) performance using the provided evaluation tools. The RFUAV dataset and baseline implementation are publicly available at https://github.com/kitoweeknd/RFUAV/.
中文: 本文提出了RFUAV这一新基准数据集,通过提供来自37种无人机的海量原始数据、用于信号区分的射频无人机指纹以及开源评估工具,解决了现有无人机识别数据集的不足,并实现了最先进的性能。
English: This paper introduces RFUAV, a new benchmark dataset addressing limitations in existing UAV identification datasets by offering extensive raw data from 37 drones, RF drone fingerprints for signal distinction, and open-access evaluation tools, achieving state-of-the-art performance.

Authors:Rongxin Liao, Feng Li, Yanyan Wei, Zenglin Shi, Le Zhang, Huihui Bai, Meng Wang
Title: Prompt to Restore, Restore to Prompt: Cyclic Prompting for Universal Adverse Weather Removal
Abstract:
Universal adverse weather removal (UAWR) seeks to address various weather degradations within a unified framework. Recent methods are inspired by prompt learning using pre-trained vision-language models (e.g., CLIP), leveraging degradation-aware prompts to facilitate weather-free image restoration, yielding significant improvements. In this work, we propose CyclicPrompt, an innovative cyclic prompt approach designed to enhance the effectiveness, adaptability, and generalizability of UAWR. CyclicPrompt Comprises two key components: 1) a composite context prompt that integrates weather-related information and context-aware representations into the network to guide restoration. This prompt differs from previous methods by marrying learnable input-conditional vectors with weather-specific knowledge, thereby improving adaptability across various degradations. 2) The erase-and-paste mechanism, after the initial guided restoration, substitutes weather-specific knowledge with constrained restoration priors, inducing high-quality weather-free concepts into the composite prompt to further fine-tune the restoration process. Therefore, we can form a cyclic "Prompt-Restore-Prompt" pipeline that adeptly harnesses weather-specific knowledge, textual contexts, and reliable textures. Extensive experiments on synthetic and real-world datasets validate the superior performance of CyclicPrompt. The code is available at: https://github.com/RongxinL/CyclicPrompt.
中文: 本文提出CyclicPrompt,一种创新的循环提示方法,通过组合上下文提示和擦除-粘贴机制,有效利用天气特定知识与可靠纹理,显著提升了通用恶劣天气去除的适应性和恢复质量。
English: This paper introduces CyclicPrompt, a novel cyclic prompt approach for universal adverse weather removal that integrates composite context prompts and an erase-and-paste mechanism to enhance restoration effectiveness and adaptability across various weather degradations.

Authors:Joao D. S. Marques, Arlindo L. Oliveira
Title: Are ECGs enough? Deep learning classification of pulmonary embolism using electrocardiograms
Abstract:
Pulmonary embolism is a leading cause of out of hospital cardiac arrest that requires fast diagnosis. While computed tomography pulmonary angiography is the standard diagnostic tool, it is not always accessible. Electrocardiography is an essential tool for diagnosing multiple cardiac anomalies, as it is affordable, fast and available in many settings. However, the availability of public ECG datasets, specially for PE, is limited and, in practice, these datasets tend to be small, making it essential to optimize learning strategies. In this study, we investigate the performance of multiple neural networks in order to assess the impact of various approaches. Moreover, we check whether these practices enhance model generalization when transfer learning is used to translate information learned in larger ECG datasets, such as PTB-XL, CPSC18 and MedalCare-XL, to a smaller, more challenging dataset for PE. By leveraging transfer learning, we analyze the extent to which we can improve learning efficiency and predictive performance on limited data. Code available at https://github.com/joaodsmarques/Are-ECGs-enough-Deep-Learning-Classifiers .
中文: 本研究通过从大型心电图数据集进行迁移学习,评估神经网络在有限数据条件下对肺栓塞的诊断性能,旨在提升学习效率与预测表现。
English: This study evaluates neural networks using transfer learning from large ECG datasets to improve pulmonary embolism diagnosis on limited data, aiming to enhance learning efficiency and predictive performance.

Authors:Hrishikesh Viswanath, Md Ashiqur Rahman, Chi Lin, Damon Conover, Aniket Bera
Title: HessianForge: Scalable LiDAR reconstruction with Physics-Informed Neural Representation and Smoothness Energy Constraints
Abstract:
Accurate and efficient 3D mapping of large-scale outdoor environments from LiDAR measurements is a fundamental challenge in robotics, particularly towards ensuring smooth and artifact-free surface reconstructions. Although the state-of-the-art methods focus on memory-efficient neural representations for high-fidelity surface generation, they often fail to produce artifact-free manifolds, with artifacts arising due to noisy and sparse inputs. To address this issue, we frame surface mapping as a physics-informed energy optimization problem, enforcing surface smoothness by optimizing an energy functional that penalizes sharp surface ridges. Specifically, we propose a deep learning based approach that learns the signed distance field (SDF) of the surface manifold from raw LiDAR point clouds using a physics-informed loss function that optimizes the $L_2$-Hessian energy of the surface. Our learning framework includes a hierarchical octree based input feature encoding and a multi-scale neural network to iteratively refine the signed distance field at different scales of resolution. Lastly, we introduce a test-time refinement strategy to correct topological inconsistencies and edge distortions that can arise in the generated mesh. We propose a \texttt{CUDA}-accelerated least-squares optimization that locally adjusts vertex positions to enforce feature-preserving smoothing. We evaluate our approach on large-scale outdoor datasets and demonstrate that our approach outperforms current state-of-the-art methods in terms of improved accuracy and smoothness. Our code is available at \href{https://github.com/HrishikeshVish/HessianForge/}{https://github.com/HrishikeshVish/HessianForge/}
Chinese: 本研究提出了一种基于物理信息的深度学习方法,通过优化L2-Hessian能量从LiDAR点云重建平滑无伪影的3D表面,在精度和网格质量上超越了现有技术。
English: This study introduces a physics-informed deep learning method that optimizes the L2-Hessian energy to reconstruct smooth, artifact-free 3D surfaces from LiDAR point clouds, outperforming existing techniques in accuracy and mesh quality.

Authors:Anand Menon, Samit S Miftah, Shamik Kundu, Souvik Kundu, Amisha Srivastava, Arnab Raha, Gabriel Theodor Sonnenschein, Suvadeep Banerjee, Deepak Mathaikutty, Kanad Basu
Title: Enhancing Large Language Models for Hardware Verification: A Novel SystemVerilog Assertion Dataset
Abstract:
Hardware verification is crucial in modern SoC design, consuming around 70% of development time. SystemVerilog assertions ensure correct functionality. However, existing industrial practices rely on manual efforts for assertion generation, which becomes increasingly untenable as hardware systems become complex. Recent research shows that Large Language Models (LLMs) can automate this process. However, proprietary SOTA models like GPT-4o often generate inaccurate assertions and require expensive licenses, while smaller open-source LLMs need fine-tuning to manage HDL code complexities. To address these issues, we introduce **VERT**, an open-source dataset designed to enhance SystemVerilog assertion generation using LLMs. VERT enables researchers in academia and industry to fine-tune open-source models, outperforming larger proprietary ones in both accuracy and efficiency while ensuring data privacy through local fine-tuning and eliminating costly licenses. The dataset is curated by systematically augmenting variables from open-source HDL repositories to generate synthetic code snippets paired with corresponding assertions. Experimental results demonstrate that fine-tuned models like Deepseek Coder 6.7B and Llama 3.1 8B outperform GPT-4o, achieving up to 96.88% improvement over base models and 24.14% over GPT-4o on platforms including OpenTitan, CVA6, OpenPiton and Pulpissimo. VERT is available at https://github.com/AnandMenon12/VERT.
中文: VERT是一个开源数据集,通过微调开源大语言模型来自动生成SystemVerilog断言,在精度和效率上显著超越GPT-4o等专有模型,同时无需授权费用。
English: VERT is an open-source dataset that enables fine-tuning of open-source LLMs to automate SystemVerilog assertion generation, significantly outperforming proprietary models like GPT-4o in accuracy and efficiency while eliminating licensing costs.

Authors:Matthieu Terris, Samuel Hurault, Maxime Song, Julian Tachella
Title: Reconstruct Anything Model: a lightweight foundation model for computational imaging
Abstract:
Most existing learning-based methods for solving imaging inverse problems can be roughly divided into two classes: iterative algorithms, such as plug-and-play and diffusion methods leveraging pretrained denoisers, and unrolled architectures that are trained end-to-end for specific imaging problems. Iterative methods in the first class are computationally costly and often yield suboptimal reconstruction performance, whereas unrolled architectures are generally problem-specific and require expensive training. In this work, we propose a novel non-iterative, lightweight architecture that incorporates knowledge about the forward operator (acquisition physics and noise parameters) without relying on unrolling. Our model is trained to solve a wide range of inverse problems, such as deblurring, magnetic resonance imaging, computed tomography, inpainting, and super-resolution, and handles arbitrary image sizes and channels, such as grayscale, complex, and color data. The proposed model can be easily adapted to unseen inverse problems or datasets with a few fine-tuning steps (up to a few images) in a self-supervised way, without ground-truth references. Throughout a series of experiments, we demonstrate state-of-the-art performance from medical imaging to low-photon imaging and microscopy. Our code is available at https://github.com/matthieutrs/ram.
中文: 本文提出了一种轻量级、非迭代的模型,该模型结合前向算子知识,能高效解决多种成像逆问题,并通过少量微调在各类应用中实现最先进的性能。
English: This paper introduces a lightweight, non-iterative model that incorporates forward operator knowledge to solve diverse imaging inverse problems efficiently, achieving state-of-the-art performance across various applications with minimal fine-tuning.

Authors:Xiwen Chen, Wenhui Zhu, Peijie Qiu, Hao Wang, Huayu Li, Haiyu Wu, Aristeidis Sotiras, Yalin Wang, Abolfazl Razi
Title: Prompt-OT: An Optimal Transport Regularization Paradigm for Knowledge Preservation in Vision-Language Model Adaptation
Abstract:
Vision-language models (VLMs) such as CLIP demonstrate strong performance but struggle when adapted to downstream tasks. Prompt learning has emerged as an efficient and effective strategy to adapt VLMs while preserving their pre-trained knowledge. However, existing methods still lead to overfitting and degrade zero-shot generalization. To address this challenge, we propose an optimal transport (OT)-guided prompt learning framework that mitigates forgetting by preserving the structural consistency of feature distributions between pre-trained and fine-tuned models. Unlike conventional point-wise constraints, OT naturally captures cross-instance relationships and expands the feasible parameter space for prompt tuning, allowing a better trade-off between adaptation and generalization. Our approach enforces joint constraints on both vision and text representations, ensuring a holistic feature alignment. Extensive experiments on benchmark datasets demonstrate that our simple yet effective method can outperform existing prompt learning strategies in base-to-novel generalization, cross-dataset evaluation, and domain generalization without additional augmentation or ensemble techniques. The code is available at https://github.com/ChongQingNoSubway/Prompt-OT
中文摘要:本文提出了一种基于最优传输的提示学习框架,通过保持特征分布一致性来增强视觉语言模型的适应能力,无需额外技术即可在多项任务中实现卓越的泛化性能。
English Summary: This paper introduces an optimal transport-guided prompt learning framework to enhance vision-language model adaptation by preserving feature distribution consistency, achieving superior generalization across various tasks without extra techniques.

Authors:Zhiwen You, Yue Guo
Title: PlainQAFact: Retrieval-augmented Factual Consistency Evaluation Metric for Biomedical Plain Language Summarization
Abstract:
Hallucinated outputs from large language models (LLMs) pose risks in the medical domain, especially for lay audiences making health-related decisions. Existing automatic factual consistency evaluation methods, such as entailment- and question-answering (QA) -based, struggle with plain language summarization (PLS) due to elaborative explanation phenomenon, which introduces external content (e.g., definitions, background, examples) absent from the scientific abstract to enhance comprehension. To address this, we introduce PlainQAFact, an automatic factual consistency evaluation metric trained on a fine-grained, human-annotated dataset PlainFact, for evaluating factual consistency of both source-simplified and elaborately explained sentences. PlainQAFact first classifies sentence type, then applies a retrieval-augmented QA scoring method. Empirical results show that existing evaluation metrics fail to evaluate the factual consistency in PLS, especially for elaborative explanations, whereas PlainQAFact consistently outperforms them across all evaluation settings. We further analyze PlainQAFact's effectiveness across external knowledge sources, answer extraction strategies, answer overlap measures, and document granularity levels, refining its overall factual consistency assessment. Taken together, our work presents the first evaluation metric designed for PLS factual consistency evaluation, providing the community with both a robust benchmark and a practical tool to advance reliable and safe plain language communication in the medical domain. PlainQAFact and PlainFact are available at: https://github.com/zhiwenyou103/PlainQAFact
Chinese: 针对大语言模型在医学领域产生幻觉内容对非专业受众的风险,我们提出了PlainQAFact评估指标,通过先分类句子类型再应用检索增强的问答评分方法,在评估医学通俗化摘要的事实一致性方面全面优于现有方法。
English: Large language models' medical hallucinations pose risks to lay audiences, prompting the development of PlainQAFact, a novel metric that outperforms existing methods in evaluating factual consistency in plain language medical summaries by first classifying sentence types and applying retrieval-augmented QA scoring.

Authors:Rajitha de Silva, Jonathan Cox, Marija Popovic, Cesar Cadena, Cyrill Stachniss, Riccardo Polvara
Title: Keypoint Semantic Integration for Improved Feature Matching in Outdoor Agricultural Environments
Abstract:
Robust robot navigation in outdoor environments requires accurate perception systems capable of handling visual challenges such as repetitive structures and changing appearances. Visual feature matching is crucial to vision-based pipelines but remains particularly challenging in natural outdoor settings due to perceptual aliasing. We address this issue in vineyards, where repetitive vine trunks and other natural elements generate ambiguous descriptors that hinder reliable feature matching. We hypothesise that semantic information tied to keypoint positions can alleviate perceptual aliasing by enhancing keypoint descriptor distinctiveness. To this end, we introduce a keypoint semantic integration technique that improves the descriptors in semantically meaningful regions within the image, enabling more accurate differentiation even among visually similar local features. We validate this approach in two vineyard perception tasks: (i) relative pose estimation and (ii) visual localisation. Across all tested keypoint types and descriptors, our method improves matching accuracy by 12.6%, demonstrating its effectiveness over multiple months in challenging vineyard conditions.
中文: 本文提出一种关键点语义集成技术,通过结合语义信息增强葡萄园环境中特征描述符的区分度,在姿态估计和视觉定位任务中将特征匹配准确率提升12.6%,有效应对户外复杂场景的视觉挑战。
English: This paper introduces a keypoint semantic integration technique that enhances descriptor distinctiveness in vineyards by incorporating semantic information, improving feature matching accuracy by 12.6% for tasks like pose estimation and visual localization under challenging outdoor conditions.

Authors:Nithin Parsan, David J. Yang, John J. Yang
Title: Towards Interpretable Protein Structure Prediction with Sparse Autoencoders
Abstract:
Protein language models have revolutionized structure prediction, but their nonlinear nature obscures how sequence representations inform structure prediction. While sparse autoencoders (SAEs) offer a path to interpretability here by learning linear representations in high-dimensional space, their application has been limited to smaller protein language models unable to perform structure prediction. In this work, we make two key advances: (1) we scale SAEs to ESM2-3B, the base model for ESMFold, enabling mechanistic interpretability of protein structure prediction for the first time, and (2) we adapt Matryoshka SAEs for protein language models, which learn hierarchically organized features by forcing nested groups of latents to reconstruct inputs independently. We demonstrate that our Matryoshka SAEs achieve comparable or better performance than standard architectures. Through comprehensive evaluations, we show that SAEs trained on ESM2-3B significantly outperform those trained on smaller models for both biological concept discovery and contact map prediction. Finally, we present an initial case study demonstrating how our approach enables targeted steering of ESMFold predictions, increasing structure solvent accessibility while fixing the input sequence. To facilitate further investigation by the broader community, we open-source our code, dataset, pretrained models https://github.com/johnyang101/reticular-sae , and visualizer https://sae.reticular.ai .
中文: 本研究将稀疏自编码器扩展至ESM2-3B模型,首次实现了蛋白质结构预测的可解释性分析,并通过改进的套娃式稀疏自编码器在性能上超越传统架构,同时展示了调控结构预测的实际应用能力。
English: This study scales sparse autoencoders to the ESM2-3B model, enabling interpretable protein structure prediction and introducing Matryoshka SAEs that outperform standard architectures while demonstrating practical applications in steering structural predictions.

Authors:Wenyi Wu, Hao Zhang, Zhisen Wei, Xiao-Yuan Jing, Qinghua Zhang, Songsong Wu
Title: Source-free domain adaptation based on label reliability for cross-domain bearing fault diagnosis
Abstract:
Source-free domain adaptation (SFDA) has been exploited for cross-domain bearing fault diagnosis without access to source data. Current methods select partial target samples with reliable pseudo-labels for model adaptation, which is sub-optimal due to the ignored target samples. We argue that every target sample can contribute to model adaptation, and accordingly propose in this paper a novel SFDA-based approach for bearing fault diagnosis that exploits both reliable and unreliable pseudo-labels. We develop a data-augmentation-based label voting strategy to divide the target samples into reliable and unreliable ones. We propose to explore the underlying relation between feature space and label space by using the reliable pseudo-labels as ground-truth labels, meanwhile, alleviating negative transfer by maximizing the entropy of the unreliable pseudo-labels. The proposed method achieves well-balance between discriminability and diversity by taking advantage of reliable and unreliable pseudo-labels. Extensive experiments are conducted on two bearing fault benchmarks, demonstrating that our approach achieves significant performance improvements against existing SFDA-based bearing fault diagnosis methods. Our code is available at https://github.com/BdLab405/SDALR.
中文: 本文提出了一种新的无源域自适应轴承故障诊断方法,通过标签投票策略和熵最大化同时利用可靠与不可靠伪标签,在基准测试中实现了优于现有方法的性能表现。
English: This paper introduces a novel source-free domain adaptation method for bearing fault diagnosis that leverages both reliable and unreliable pseudo-labels through a label voting strategy and entropy maximization, achieving superior performance on benchmark datasets.

Authors:Letian Zhang, Quan Cui, Bingchen Zhao, Cheng Yang
Title: Oasis: One Image is All You Need for Multimodal Instruction Data Synthesis
Abstract:
The success of multi-modal large language models (MLLMs) has been largely attributed to the large-scale training data. However, the training data of many MLLMs is unavailable due to privacy concerns. The expensive and labor-intensive process of collecting multi-modal data further exacerbates the problem. Is it possible to synthesize multi-modal training data automatically without compromising diversity and quality? In this paper, we propose a new method, Oasis, to synthesize high-quality multi-modal data with only images. Oasis breaks through traditional methods by prompting only images to the MLLMs, thus extending the data diversity by a large margin. Our method features a delicate quality control method which ensures the data quality. We collected over 500k data and conducted incremental experiments on LLaVA-NeXT. Extensive experiments demonstrate that our method can significantly improve the performance of MLLMs. The image-based synthesis also allows us to focus on the specific-domain ability of MLLMs. Code and dataset are publicly available at https://github.com/Letian2003/MM_INF.
中文:Oasis方法仅利用图像即可自动合成高质量多模态训练数据,在保证质量的同时大幅提升数据多样性,显著增强多模态大语言模型的性能。
English: The Oasis method enables automatic synthesis of high-quality multi-modal training data using only images, significantly enhancing data diversity and MLLM performance without compromising quality.

Authors:In Cho, Youngbeom Yoo, Subin Jeon, Seon Joo Kim
Title: Representing 3D Shapes With 64 Latent Vectors for 3D Diffusion Models
Abstract:
Constructing a compressed latent space through a variational autoencoder (VAE) is the key for efficient 3D diffusion models. This paper introduces COD-VAE that encodes 3D shapes into a COmpact set of 1D latent vectors without sacrificing quality. COD-VAE introduces a two-stage autoencoder scheme to improve compression and decoding efficiency. First, our encoder block progressively compresses point clouds into compact latent vectors via intermediate point patches. Second, our triplane-based decoder reconstructs dense triplanes from latent vectors instead of directly decoding neural fields, significantly reducing computational overhead of neural fields decoding. Finally, we propose uncertainty-guided token pruning, which allocates resources adaptively by skipping computations in simpler regions and improves the decoder efficiency. Experimental results demonstrate that COD-VAE achieves 16x compression compared to the baseline while maintaining quality. This enables 20.8x speedup in generation, highlighting that a large number of latent vectors is not a prerequisite for high-quality reconstruction and generation. The code is available at https://github.com/join16/COD-VAE.
中文: 本文提出的COD-VAE通过两阶段变分自编码器将3D形状压缩为紧凑的一维潜向量,在保持质量的同时实现了16倍压缩比和20.8倍生成加速,其核心在于渐进式编码、三平面解码和自适应令牌剪枝技术。
English: This paper introduces COD-VAE, a two-stage variational autoencoder that compresses 3D shapes into compact 1D latent vectors while maintaining quality, achieving 16x compression and 20.8x faster generation through progressive encoding, triplane-based decoding, and adaptive token pruning.

Authors:Raphi Kang, Yue Song, Georgia Gkioxari, Pietro Perona
Title: Is CLIP ideal? No. Can we fix it? Yes!
Abstract:
Contrastive Language-Image Pre-Training (CLIP) is a popular method for learning multimodal latent spaces with well-organized semantics. Despite its wide range of applications, CLIP's latent space is known to fail at handling complex visual-textual interactions. Recent works attempt to address its shortcomings with data-centric or algorithmic approaches. But what if the problem is more fundamental, and lies in the geometry of CLIP? Toward this end, we rigorously analyze CLIP's latent space properties, and prove that no CLIP-like joint embedding space exists which can correctly do any two of the following at the same time: 1. represent basic descriptions and image content, 2. represent attribute binding, 3. represent spatial location and relationships, 4. represent negation. Informed by this analysis, we propose Dense Cosine Similarity Maps (DCSMs) as a principled and interpretable scoring method for CLIP-like models, which solves the fundamental limitations of CLIP by retaining the semantic topology of the image patches and text tokens. This method improves upon the performance of classical CLIP-like joint encoder models on a wide array of benchmarks. We share our code and data here for reproducibility: https://github.com/Raphoo/DCSM_Ideal_CLIP
中文: 研究发现CLIP的潜在空间存在固有几何限制,无法同时处理多种语义任务,并提出密集余弦相似度映射作为可解释的解决方案,通过保持图像-文本拓扑结构提升了各项基准测试的性能。
English: The study reveals that CLIP's latent space has inherent geometric limitations preventing it from simultaneously handling multiple semantic tasks, and proposes Dense Cosine Similarity Maps as an interpretable solution that improves performance across benchmarks by preserving image-text topology.

Authors:Yan Hu, Ahmad Chaddad
Title: SHAP-Integrated Convolutional Diagnostic Networks for Feature-Selective Medical Analysis
Abstract:
This study introduces the SHAP-integrated convolutional diagnostic network (SICDN), an interpretable feature selection method designed for limited datasets, to address the challenge posed by data privacy regulations that restrict access to medical datasets. The SICDN model was tested on classification tasks using pneumonia and breast cancer datasets, demonstrating over 97% accuracy and surpassing four popular CNN models. We also integrated a historical weighted moving average technique to enhance feature selection. The SICDN shows potential in medical image prediction, with the code available on https://github.com/AIPMLab/SICDN.
中文摘要:SICDN模型作为一种可解释的特征选择方法,在有限医疗数据集上实现了超过97%的分类准确率,有效应对了数据隐私保护带来的挑战。
English Summary: The SICDN model is an interpretable feature selection method that achieves over 97% accuracy on medical datasets while addressing data privacy constraints through effective performance with limited data.

Authors:Yongdong Luo, Wang Chen, Xiawu Zheng, Weizhong Huang, Shukang Yin, Haojia Lin, Chaoyou Fu, Jinfa Huang, Jiayi Ji, Jiebo Luo, Rongrong Ji
Title: QuoTA: Query-oriented Token Assignment via CoT Query Decouple for Long Video Comprehension
Abstract:
Recent advances in long video understanding typically mitigate visual redundancy through visual token pruning based on attention distribution. However, while existing methods employ post-hoc low-response token pruning in decoder layers, they overlook the input-level semantic correlation between visual tokens and instructions (query). In this paper, we propose QuoTA, an ante-hoc training-free modular that extends existing large video-language models (LVLMs) for visual token assignment based on query-oriented frame-level importance assessment. The query-oriented token selection is crucial as it aligns visual processing with task-specific requirements, optimizing token budget utilization while preserving semantically relevant content. Specifically, (i) QuoTA strategically allocates frame-level importance scores based on query relevance, enabling one-time visual token assignment before cross-modal interactions in decoder layers, (ii) we decouple the query through Chain-of-Thoughts reasoning to facilitate more precise LVLM-based frame importance scoring, and (iii) QuoTA offers a plug-and-play functionality that extends to existing LVLMs. Extensive experimental results demonstrate that implementing QuoTA with LLaVA-Video-7B yields an average performance improvement of 3.2% across six benchmarks (including Video-MME and MLVU) while operating within an identical visual token budget as the baseline. Codes are open-sourced at https://github.com/MAC-AutoML/QuoTA.
中文摘要:本文提出QuoTA,一种无需训练即可基于查询相关性在跨模态交互前筛选视觉标记的模块,在相同标记预算下将六个基准测试的平均性能提升了3.2%。
English Summary: The paper introduces QuoTA, a training-free module that enhances long video understanding by selecting visual tokens based on query relevance before cross-modal interactions, improving performance by 3.2% on benchmarks within the same token budget.

Authors:Ariba Khan, Stephen Casper, Dylan Hadfield-Menell
Title: Randomness, Not Representation: The Unreliability of Evaluating Cultural Alignment in LLMs
Abstract:
Research on the 'cultural alignment' of Large Language Models (LLMs) has emerged in response to growing interest in understanding representation across diverse stakeholders. Current approaches to evaluating cultural alignment through survey-based assessments that borrow from social science methodologies often overlook systematic robustness checks. Here, we identify and test three assumptions behind current survey-based evaluation methods: (1) Stability: that cultural alignment is a property of LLMs rather than an artifact of evaluation design, (2) Extrapolability: that alignment with one culture on a narrow set of issues predicts alignment with that culture on others, and (3) Steerability: that LLMs can be reliably prompted to represent specific cultural perspectives. Through experiments examining both explicit and implicit preferences of leading LLMs, we find a high level of instability across presentation formats, incoherence between evaluated versus held-out cultural dimensions, and erratic behavior under prompt steering. We show that these inconsistencies can cause the results of an evaluation to be very sensitive to minor variations in methodology. Finally, we demonstrate in a case study on evaluation design that narrow experiments and a selective assessment of evidence can be used to paint an incomplete picture of LLMs' cultural alignment properties. Overall, these results highlight significant limitations of current survey-based approaches to evaluating the cultural alignment of LLMs and highlight a need for systematic robustness checks and red-teaming for evaluation results. Data and code are available at https://huggingface.co/datasets/akhan02/cultural-dimension-cover-letters and https://github.com/ariba-k/llm-cultural-alignment-evaluation, respectively.
中文: 当前基于调查的大语言模型文化对齐评估缺乏系统性稳健性检验,研究发现其存在不稳定性、不连贯性和不可控性,易导致结果失真,亟需更严谨的评估方法。
English: Current survey-based evaluations of LLMs' cultural alignment overlook systematic robustness checks, revealing instability, incoherence, and erratic behavior that can lead to misleading results, highlighting the need for more rigorous assessment methods.

Authors:Jialv Zou, Bencheng Liao, Qian Zhang, Wenyu Liu, Xinggang Wang
Title: OmniMamba: Efficient and Unified Multimodal Understanding and Generation via State Space Models
Abstract:
Recent advancements in unified multimodal understanding and visual generation (or multimodal generation) models have been hindered by their quadratic computational complexity and dependence on large-scale training data. We present OmniMamba, the first linear-architecture-based multimodal generation model that generates both text and images through a unified next-token prediction paradigm. The model fully leverages Mamba-2's high computational and memory efficiency, extending its capabilities from text generation to multimodal generation. To address the data inefficiency of existing unified models, we propose two key innovations: (1) decoupled vocabularies to guide modality-specific generation, and (2) task-specific LoRA for parameter-efficient adaptation. Furthermore, we introduce a decoupled two-stage training strategy to mitigate data imbalance between two tasks. Equipped with these techniques, OmniMamba achieves competitive performance with JanusFlow while surpassing Show-o across benchmarks, despite being trained on merely 2M image-text pairs, which is 1,000 times fewer than Show-o. Notably, OmniMamba stands out with outstanding inference efficiency, achieving up to a 119.2 times speedup and 63% GPU memory reduction for long-sequence generation compared to Transformer-based counterparts. Code and models are released at https://github.com/hustvl/OmniMamba
中文:OmniMamba是一种基于线性架构的多模态生成模型,通过统一的下一词预测范式高效生成文本和图像,在显著降低计算复杂度和训练数据需求的同时实现了优越性能。
English: OmniMamba is a linear-architecture multimodal generation model that efficiently generates text and images using next-token prediction, achieving competitive performance with significantly reduced computational complexity and training data requirements.

Authors:Haoyu Wang, Sunhao Dai, Haiyuan Zhao, Liang Pang, Xiao Zhang, Gang Wang, Zhenhua Dong, Jun Xu, Ji-Rong Wen
Title: Perplexity Trap: PLM-Based Retrievers Overrate Low Perplexity Documents
Abstract:
Previous studies have found that PLM-based retrieval models exhibit a preference for LLM-generated content, assigning higher relevance scores to these documents even when their semantic quality is comparable to human-written ones. This phenomenon, known as source bias, threatens the sustainable development of the information access ecosystem. However, the underlying causes of source bias remain unexplored. In this paper, we explain the process of information retrieval with a causal graph and discover that PLM-based retrievers learn perplexity features for relevance estimation, causing source bias by ranking the documents with low perplexity higher. Theoretical analysis further reveals that the phenomenon stems from the positive correlation between the gradients of the loss functions in language modeling task and retrieval task. Based on the analysis, a causal-inspired inference-time debiasing method is proposed, called Causal Diagnosis and Correction (CDC). CDC first diagnoses the bias effect of the perplexity and then separates the bias effect from the overall estimated relevance score. Experimental results across three domains demonstrate the superior debiasing effectiveness of CDC, emphasizing the validity of our proposed explanatory framework. Source codes are available at https://github.com/WhyDwelledOnAi/Perplexity-Trap.
Chinese: PLM检索模型因基于困惑度的排序而偏好LLM生成内容,为此提出的CDC方法通过诊断和修正偏差效应,在多个领域实现了有效的去偏处理。
English: PLM-based retrieval models exhibit source bias by favoring LLM-generated content due to perplexity-based ranking, prompting the development of the CDC method that effectively diagnoses and corrects this bias across multiple domains.

Authors:Changxing Liu, Genjia Liu, Zijun Wang, Jinchang Yang, Siheng Chen
Title: CoLMDriver: LLM-based Negotiation Benefits Cooperative Autonomous Driving
Abstract:
Vehicle-to-vehicle (V2V) cooperative autonomous driving holds great promise for improving safety by addressing the perception and prediction uncertainties inherent in single-agent systems. However, traditional cooperative methods are constrained by rigid collaboration protocols and limited generalization to unseen interactive scenarios. While LLM-based approaches offer generalized reasoning capabilities, their challenges in spatial planning and unstable inference latency hinder their direct application in cooperative driving. To address these limitations, we propose CoLMDriver, the first full-pipeline LLM-based cooperative driving system, enabling effective language-based negotiation and real-time driving control. CoLMDriver features a parallel driving pipeline with two key components: (i) an LLM-based negotiation module under an actor-critic paradigm, which continuously refines cooperation policies through feedback from previous decisions of all vehicles; and (ii) an intention-guided waypoint generator, which translates negotiation outcomes into executable waypoints. Additionally, we introduce InterDrive, a CARLA-based simulation benchmark comprising 10 challenging interactive driving scenarios for evaluating V2V cooperation. Experimental results demonstrate that CoLMDriver significantly outperforms existing approaches, achieving an 11% higher success rate across diverse highly interactive V2V driving scenarios. Code will be released on https://github.com/cxliu0314/CoLMDriver.
Chinese: CoLMDriver是首个基于大语言模型的全流程协同驾驶系统,通过基于语言的协商和实时控制提升车辆间安全性,在多样化交互场景中相比现有方法成功率提高了11%。
English: CoLMDriver is a pioneering full-pipeline LLM-based cooperative driving system that enhances vehicle-to-vehicle safety through language-based negotiation and real-time control, achieving an 11% higher success rate in diverse interactive scenarios compared to existing methods.

Authors:Viktor Moskvoretskii, Chris Biemann, Irina Nikishina
Title: Self-Taught Self-Correction for Small Language Models
Abstract:
Although large language models (LLMs) have achieved remarkable performance across various tasks, they remain prone to errors. A key challenge is enabling them to self-correct. While prior research has relied on external tools or large proprietary models, this work explores self-correction in small language models (SLMs) through iterative fine-tuning using solely self-generated data. We introduce the Self-Taught Self-Correction (STaSC) algorithm, which incorporates multiple algorithmic design choices. Experimental results on a question-answering task demonstrate that STaSC effectively learns self-correction, leading to significant performance improvements. Our analysis further provides insights into the mechanisms of self-correction and the impact of different design choices on learning dynamics and overall performance. To support future research, we release our user-friendly codebase and lightweight models.
Chinese: 本研究提出自我教学式修正(STaSC)算法,通过仅使用自生成数据进行迭代微调,使小型语言模型实现自我纠错,在问答任务上显著提升性能,并为自我修正机制及设计选择的影响提供了深入见解。
English: This study introduces the Self-Taught Self-Correction (STaSC) algorithm, enabling small language models to self-correct through iterative fine-tuning with self-generated data, significantly improving performance on question-answering tasks while providing insights into self-correction mechanisms.

Authors:Zekun Li, Shinda Huang, Jiangtian Wang, Nathan Zhang, Antonis Antoniades, Wenyue Hua, Kaijie Zhu, Sirui Zeng, Chi Wang, William Yang Wang, Xifeng Yan
Title: SOPBench: Evaluating Language Agents at Following Standard Operating Procedures and Constraints
Abstract:
As language agents increasingly automate critical tasks, their ability to follow domain-specific standard operating procedures (SOPs), policies, and constraints when taking actions and making tool calls becomes essential yet remains underexplored. To address this gap, we develop an automated evaluation pipeline SOPBench with: (1) executable environments containing 167 tools/functions across seven customer service domains with service-specific SOPs and rule-based verifiers, (2) an automated test generation framework producing over 900 verified test cases, and (3) an automated evaluation framework to rigorously assess agent adherence from multiple dimensions. Our approach transforms each service-specific SOP code program into a directed graph of executable functions and requires agents to call these functions based on natural language SOP descriptions. The original code serves as oracle rule-based verifiers to assess compliance, reducing reliance on manual annotations and LLM-based evaluations. We evaluate 18 leading models, and results show the task is challenging even for top-tier models (like GPT-4o, Claude-3.7-Sonnet), with variances across domains. Reasoning models like o4-mini-high show superiority while other powerful models perform less effectively (pass rates of 30%-50%), and small models (7B, 8B) perform significantly worse. Additionally, language agents can be easily jailbroken to overlook SOPs and constraints. Code, data, and over 24k agent trajectories are released at https://github.com/Leezekun/SOPBench.
中文: SOPBench开发了一套自动化评估体系,通过七个客服领域的167种工具和900多个测试用例系统检验语言智能体对标准流程的遵循能力,研究发现即使顶尖模型也存在明显性能差异且易被突破规则限制。
English: SOPBench introduces an automated evaluation pipeline to rigorously test language agents' adherence to domain-specific procedures across seven service domains, revealing significant performance gaps even among top models while exposing vulnerabilities to constraint violations.

Authors:Shuaiting Li, Juncan Deng, Chenxuan Wang, Kedong Xu, Rongtao Deng, Hong Gu, Haibin Shen, Kejie Huang
Title: SSVQ: Unleashing the Potential of Vector Quantization with Sign-Splitting
Abstract:
Vector Quantization (VQ) has emerged as a prominent weight compression technique, showcasing substantially lower quantization errors than uniform quantization across diverse models, particularly in extreme compression scenarios. However, its efficacy during fine-tuning is limited by the constraint of the compression format, where weight vectors assigned to the same codeword are restricted to updates in the same direction. Consequently, many quantized weights are compelled to move in directions contrary to their local gradient information. To mitigate this issue, we introduce a novel VQ paradigm, Sign-Splitting VQ (SSVQ), which decouples the sign bit of weights from the codebook. Our approach involves extracting the sign bits of uncompressed weights and performing clustering and compression on all-positive weights. We then introduce latent variables for the sign bit and jointly optimize both the signs and the codebook. Additionally, we implement a progressive freezing strategy for the learnable sign to ensure training stability. Extensive experiments on various modern models and tasks demonstrate that SSVQ achieves a significantly superior compression-accuracy trade-off compared to conventional VQ. Furthermore, we validate our algorithm on a hardware accelerator, showing that SSVQ achieves a 3$\times$ speedup over the 8-bit compressed model by reducing memory access. Our code is available at https://github.com/list0830/SSVQ.
中文: 本文提出符号分离向量量化(SSVQ)方法,通过将权重符号位与码本解耦来改进传统向量量化在微调中的局限性,在保持更高压缩精度的同时实现了硬件加速器上3倍的运算速度提升。
English: This paper introduces Sign-Splitting Vector Quantization (SSVQ), a novel weight compression method that decouples sign bits from codebooks to overcome conventional VQ's fine-tuning limitations, achieving superior compression-accuracy trade-offs and 3× speedup on hardware accelerators.

Authors:Yuhan Wang, Fangzhou Hong, Shuai Yang, Liming Jiang, Wayne Wu, Chen Change Loy
Title: MEAT: Multiview Diffusion Model for Human Generation on Megapixels with Mesh Attention
Abstract:
Multiview diffusion models have shown considerable success in image-to-3D generation for general objects. However, when applied to human data, existing methods have yet to deliver promising results, largely due to the challenges of scaling multiview attention to higher resolutions. In this paper, we explore human multiview diffusion models at the megapixel level and introduce a solution called mesh attention to enable training at 1024x1024 resolution. Using a clothed human mesh as a central coarse geometric representation, the proposed mesh attention leverages rasterization and projection to establish direct cross-view coordinate correspondences. This approach significantly reduces the complexity of multiview attention while maintaining cross-view consistency. Building on this foundation, we devise a mesh attention block and combine it with keypoint conditioning to create our human-specific multiview diffusion model, MEAT. In addition, we present valuable insights into applying multiview human motion videos for diffusion training, addressing the longstanding issue of data scarcity. Extensive experiments show that MEAT effectively generates dense, consistent multiview human images at the megapixel level, outperforming existing multiview diffusion methods.
中文: 本文提出MEAT这一针对人体的多视角扩散模型,通过网格注意力机制利用着装人体网格建立跨视角坐标对应,实现了1024x1024分辨率的高清训练,在保证视角一致性的同时有效解决了数据稀缺问题,其性能优于现有多视角扩散方法。
English: This paper introduces MEAT, a human-specific multiview diffusion model that employs mesh attention to enable high-resolution (1024x1024) training by establishing direct cross-view correspondences through clothed human mesh representations, effectively addressing consistency and data scarcity issues while outperforming existing methods.

Authors:Ruibin Yuan, Hanfeng Lin, Shuyue Guo, Ge Zhang, Jiahao Pan, Yongyi Zang, Haohe Liu, Yiming Liang, Wenye Ma, Xingjian Du, Xinrun Du, Zhen Ye, Tianyu Zheng, Zhengxuan Jiang, Yinghao Ma, Minghao Liu, Zeyue Tian, Ziya Zhou, Liumeng Xue, Xingwei Qu, Yizhi Li, Shangda Wu, Tianhao Shen, Ziyang Ma, Jun Zhan, Chunhui Wang, Yatian Wang, Xiaowei Chi, Xinyue Zhang, Zhenzhu Yang, Xiangzhou Wang, Shansong Liu, Lingrui Mei, Peng Li, Junjie Wang, Jianwei Yu, Guojian Pang, Xu Li, Zihao Wang, Xiaohuan Zhou, Lijun Yu, Emmanouil Benetos, Yong Chen, Chenghua Lin, Xie Chen, Gus Xia, Zhaoxiang Zhang, Chao Zhang, Wenhu Chen, Xinyu Zhou, Xipeng Qiu, Roger Dannenberg, Jiaheng Liu, Jian Yang, Wenhao Huang, Wei Xue, Xu Tan, Yike Guo
Title: YuE: Scaling Open Foundation Models for Long-Form Music Generation
Abstract:
We tackle the task of long-form music generation--particularly the challenging \textbf{lyrics-to-song} problem--by introducing YuE, a family of open foundation models based on the LLaMA2 architecture. Specifically, YuE scales to trillions of tokens and generates up to five minutes of music while maintaining lyrical alignment, coherent musical structure, and engaging vocal melodies with appropriate accompaniment. It achieves this through (1) track-decoupled next-token prediction to overcome dense mixture signals, (2) structural progressive conditioning for long-context lyrical alignment, and (3) a multitask, multiphase pre-training recipe to converge and generalize. In addition, we redesign the in-context learning technique for music generation, enabling versatile style transfer (e.g., converting Japanese city pop into an English rap while preserving the original accompaniment) and bidirectional generation. Through extensive evaluation, we demonstrate that YuE matches or even surpasses some of the proprietary systems in musicality and vocal agility. In addition, fine-tuning YuE enables additional controls and enhanced support for tail languages. Furthermore, beyond generation, we show that YuE's learned representations can perform well on music understanding tasks, where the results of YuE match or exceed state-of-the-art methods on the MARBLE benchmark. Keywords: lyrics2song, song generation, long-form, foundation model, music generation
中文: YuE模型系列基于LLaMA2架构,通过创新的训练方法生成长达五分钟、歌词对齐且结构连贯的音乐,并在生成与理解任务中展现出与顶尖系统相媲美的性能。
English: The YuE model family, built on LLaMA2, advances long-form music generation by producing up to five minutes of lyrically aligned, structurally coherent music through innovative training techniques and achieves competitive performance in both generation and understanding tasks.

Authors:Muzhi Zhu, Yuzhuo Tian, Hao Chen, Chunluan Zhou, Qingpei Guo, Yang Liu, Ming Yang, Chunhua Shen
Title: SegAgent: Exploring Pixel Understanding Capabilities in MLLMs by Imitating Human Annotator Trajectories
Abstract:
While MLLMs have demonstrated adequate image understanding capabilities, they still struggle with pixel-level comprehension, limiting their practical applications. Current evaluation tasks like VQA and visual grounding remain too coarse to assess fine-grained pixel comprehension accurately. Though segmentation is foundational for pixel-level understanding, existing methods often require MLLMs to generate implicit tokens, decoded through external pixel decoders. This approach disrupts the MLLM's text output space, potentially compromising language capabilities and reducing flexibility and extensibility, while failing to reflect the model's intrinsic pixel-level understanding. Thus, we introduce the Human-Like Mask Annotation Task (HLMAT), a new paradigm where MLLMs mimic human annotators using interactive segmentation tools. Modeling segmentation as a multi-step Markov Decision Process, HLMAT enables MLLMs to iteratively generate text-based click points, achieving high-quality masks without architectural changes or implicit tokens. Through this setup, we develop SegAgent, a model fine-tuned on human-like annotation trajectories, which achieves performance comparable to state-of-the-art (SOTA) methods and supports additional tasks like mask refinement and annotation filtering. HLMAT provides a protocol for assessing fine-grained pixel understanding in MLLMs and introduces a vision-centric, multi-step decision-making task that facilitates exploration of MLLMs' visual reasoning abilities. Our adaptations of policy improvement method StaR and PRM-guided tree search further enhance model robustness in complex segmentation tasks, laying a foundation for future advancements in fine-grained visual perception and multi-step decision-making for MLLMs.
中文: 摘要提出了人类式掩码标注任务(HLMAT),该方法通过多步迭代生成基于文本的点击点,使多模态大语言模型能够进行像素级分割,无需改变架构即可获得高质量结果,并提升细粒度视觉理解能力。
English: The abstract introduces the Human-Like Mask Annotation Task (HLMAT), a method enabling multimodal large language models to perform pixel-level segmentation by generating text-based click points iteratively, achieving high-quality results without architectural changes and enhancing fine-grained visual understanding.

Authors:Xianfeng Wu, Yajing Bai, Haoze Zheng, Harold Haodong Chen, Yexin Liu, Zihao Wang, Xuran Ma, Wen-Jie Shu, Xianzu Wu, Harry Yang, Ser-Nam Lim
Title: LightGen: Efficient Image Generation through Knowledge Distillation and Direct Preference Optimization
Abstract:
Recent advances in text-to-image generation have primarily relied on extensive datasets and parameter-heavy architectures. These requirements severely limit accessibility for researchers and practitioners who lack substantial computational resources. In this paper, we introduce \model, an efficient training paradigm for image generation models that uses knowledge distillation (KD) and Direct Preference Optimization (DPO). Drawing inspiration from the success of data KD techniques widely adopted in Multi-Modal Large Language Models (MLLMs), LightGen distills knowledge from state-of-the-art (SOTA) text-to-image models into a compact Masked Autoregressive (MAR) architecture with only $0.7B$ parameters. Using a compact synthetic dataset of just $2M$ high-quality images generated from varied captions, we demonstrate that data diversity significantly outweighs data volume in determining model performance. This strategy dramatically reduces computational demands and reduces pre-training time from potentially thousands of GPU-days to merely 88 GPU-days. Furthermore, to address the inherent shortcomings of synthetic data, particularly poor high-frequency details and spatial inaccuracies, we integrate the DPO technique that refines image fidelity and positional accuracy. Comprehensive experiments confirm that LightGen achieves image generation quality comparable to SOTA models while significantly reducing computational resources and expanding accessibility for resource-constrained environments. Code is available at https://github.com/XianfengWu01/LightGen
中文摘要:LightGen通过知识蒸馏和直接偏好优化技术,仅用7亿参数和少量计算资源就实现了与顶尖模型相当的图像生成质量。
English Summary: LightGen introduces an efficient training paradigm using knowledge distillation and Direct Preference Optimization to achieve state-of-the-art image generation quality with only 0.7B parameters and minimal computational resources.

Authors:Feiran Wang, Jiachen Tao, Junyi Wu, Haoxuan Wang, Bin Duan, Kai Wang, Zongxin Yang, Yan Yan
Title: X-Field: A Physically Grounded Representation for 3D X-ray Reconstruction
Abstract:
X-ray imaging is indispensable in medical diagnostics, yet its use is tightly regulated due to potential health risks. To mitigate radiation exposure, recent research focuses on generating novel views from sparse inputs and reconstructing Computed Tomography (CT) volumes, borrowing representations from the 3D reconstruction area. However, these representations originally target visible light imaging that emphasizes reflection and scattering effects, while neglecting penetration and attenuation properties of X-ray imaging. In this paper, we introduce X-Field, the first 3D representation specifically designed for X-ray imaging, rooted in the energy absorption rates across different materials. To accurately model diverse materials within internal structures, we employ 3D ellipsoids with distinct attenuation coefficients. To estimate each material's energy absorption of X-rays, we devise an efficient path partitioning algorithm accounting for complex ellipsoid intersections. We further propose hybrid progressive initialization to refine the geometric accuracy of X-Filed and incorporate material-based optimization to enhance model fitting along material boundaries. Experiments show that X-Field achieves superior visual fidelity on both real-world human organ and synthetic object datasets, outperforming state-of-the-art methods in X-ray Novel View Synthesis and CT Reconstruction.
中文摘要:X-Field首次提出专为X射线成像设计的3D表征方法,通过模拟材料特异性能量吸收,结合创新的几何优化和材质边界处理技术,在X射线新视角合成和CT重建任务中实现了超越现有方法的视觉保真度。
English Summary: X-Field introduces the first 3D representation tailored for X-ray imaging by modeling material-specific energy absorption, achieving superior performance in novel view synthesis and CT reconstruction through innovative geometric and material optimization techniques.

Authors:Justus Karlsson, Yonghao Xu, Amanda Berg, Leif Haglund
Title: Comparing Next-Day Wildfire Predictability of MODIS and VIIRS Satellite Data
Abstract:
Multiple studies have performed next-day fire prediction using satellite imagery. Two main satellites are used to detect wildfires: MODIS and VIIRS. Both satellites provide fire mask products, called MOD14 and VNP14, respectively. Studies have used one or the other, but there has been no comparison between them to determine which might be more suitable for next-day fire prediction. In this paper, we first evaluate how well VIIRS and MODIS data can be used to forecast wildfire spread one day ahead. We find that the model using VIIRS as input and VNP14 as target achieves the best results. Interestingly, the model using MODIS as input and VNP14 as target performs significantly better than using VNP14 as input and MOD14 as target. Next, we discuss why MOD14 might be harder to use for predicting next-day fires. We find that the MOD14 fire mask is highly stochastic and does not correlate with reasonable fire spread patterns. This is detrimental for machine learning tasks, as the model learns irrational patterns. Therefore, we conclude that MOD14 is unsuitable for next-day fire prediction and that VNP14 is a much better option. However, using MODIS input and VNP14 as target, we achieve a significant improvement in predictability. This indicates that an improved fire detection model is possible for MODIS. The full code and dataset is available online: https://github.com/justuskarlsson/wildfire-mod14-vnp14
中文摘要:本研究比较了MODIS和VIIRS卫星数据在次日野火预测中的表现,发现VIIRS(VNP14)因MOD14的随机性而更优,但将MODIS输入与VNP14目标结合可显著提升预测效果。
English Summary: This study compares MODIS and VIIRS satellite data for next-day wildfire prediction, finding that VIIRS (VNP14) outperforms MODIS (MOD14) due to MOD14's unpredictable patterns, though combining MODIS input with VNP14 targets yields better results.

Authors:Soham Deshmukh, Satvik Dixit, Rita Singh, Bhiksha Raj
Title: Mellow: a small audio language model for reasoning
Abstract:
Multimodal Audio-Language Models (ALMs) can understand and reason over both audio and text. Typically, reasoning performance correlates with model size, with the best results achieved by models exceeding 8 billion parameters. However, no prior work has explored enabling small audio-language models to perform reasoning tasks, despite the potential applications for edge devices. To address this gap, we introduce Mellow, a small Audio-Language Model specifically designed for reasoning. Mellow achieves state-of-the-art performance among existing small audio-language models and surpasses several larger models in reasoning capabilities. For instance, Mellow scores 52.11 on MMAU, comparable to SoTA Qwen2 Audio (which scores 52.5) while using 50 times fewer parameters and being trained on 60 times less data (audio hrs). To train Mellow, we introduce ReasonAQA, a dataset designed to enhance audio-grounded reasoning in models. It consists of a mixture of existing datasets (30% of the data) and synthetically generated data (70%). The synthetic dataset is derived from audio captioning datasets, where Large Language Models (LLMs) generate detailed and multiple-choice questions focusing on audio events, objects, acoustic scenes, signal properties, semantics, and listener emotions. To evaluate Mellow's reasoning ability, we benchmark it on a diverse set of tasks, assessing on both in-distribution and out-of-distribution data, including audio understanding, deductive reasoning, and comparative reasoning. Finally, we conduct extensive ablation studies to explore the impact of projection layer choices, synthetic data generation methods, and language model pretraining on reasoning performance. Our training dataset, findings, and baseline pave the way for developing small ALMs capable of reasoning.
Chinese: Mellow是一种专为推理任务设计的小型音频语言模型,它在显著减少参数和训练数据的情况下实现了最先进的性能,并引入了一个新颖的数据集来增强基于音频的推理能力。
English: Mellow is a small Audio-Language Model designed for reasoning tasks, achieving state-of-the-art performance with significantly fewer parameters and less training data than larger models, while introducing a novel dataset to enhance audio-grounded reasoning.

Authors:Yihang Chen, Mengyao Li, Qianyi Wu, Weiyao Lin, Mehrtash Harandi, Jianfei Cai
Title: PCGS: Progressive Compression of 3D Gaussian Splatting
Abstract:
3D Gaussian Splatting (3DGS) achieves impressive rendering fidelity and speed for novel view synthesis. However, its substantial data size poses a significant challenge for practical applications. While many compression techniques have been proposed, they fail to efficiently utilize existing bitstreams in on-demand applications due to their lack of progressivity, leading to a waste of resource. To address this issue, we propose PCGS (Progressive Compression of 3D Gaussian Splatting), which adaptively controls both the quantity and quality of Gaussians (or anchors) to enable effective progressivity for on-demand applications. Specifically, for quantity, we introduce a progressive masking strategy that incrementally incorporates new anchors while refining existing ones to enhance fidelity. For quality, we propose a progressive quantization approach that gradually reduces quantization step sizes to achieve finer modeling of Gaussian attributes. Furthermore, to compact the incremental bitstreams, we leverage existing quantization results to refine probability prediction, improving entropy coding efficiency across progressive levels. Overall, PCGS achieves progressivity while maintaining compression performance comparable to SoTA non-progressive methods. Code available at: github.com/YihangChen-ee/PCGS.
中文: 3D高斯泼溅技术虽渲染效果出色但数据量庞大,PCGS通过渐进式掩码策略和量化方法动态调控高斯点数量与质量,在保持压缩性能的同时实现了适用于按需应用的可扩展传输。
English: 3D Gaussian Splatting delivers high-quality rendering but faces data size challenges, so PCGS introduces progressive compression through adaptive control of Gaussian quantity and quality to enable efficient on-demand streaming while maintaining competitive compression rates.

Authors:Da-Wei Zhou, Kai-Wen Li, Jingyi Ning, Han-Jia Ye, Lijun Zhang, De-Chuan Zhan
Title: External Knowledge Injection for CLIP-Based Class-Incremental Learning
Abstract:
Class-Incremental Learning (CIL) enables learning systems to continuously adapt to evolving data streams. With the advancement of pre-training, leveraging pre-trained vision-language models (e.g., CLIP) offers a promising starting point for CIL. However, CLIP makes decisions by matching visual embeddings to class names, overlooking the rich contextual information conveyed through language. For instance, the concept of ``cat'' can be decomposed into features like tail, fur, and face for recognition. Besides, since the model is continually updated, these detailed features are overwritten in CIL, requiring external knowledge for compensation. In this paper, we introduce ExterNal knowledGe INjEction (ENGINE) for CLIP-based CIL. To enhance knowledge transfer from outside the dataset, we propose a dual-branch injection tuning framework that encodes informative knowledge from both visual and textual modalities. The visual branch is enhanced with data augmentation to enrich the visual features, while the textual branch leverages GPT-4 to rewrite discriminative descriptors. In addition to this on-the-fly knowledge injection, we also implement post-tuning knowledge by re-ranking the prediction results during inference. With the injected knowledge, the model can better capture informative features for downstream tasks as data evolves. Extensive experiments demonstrate the state-of-the-art performance of ENGINE. Code is available at: https://github.com/LAMDA-CL/ICCV25-ENGINE
Chinese: 本文提出ENGINE方法,通过双分支调优和重排序机制将外部知识注入CLIP模型,在类增量学习中实现了最先进的性能,使模型能够更好地捕捉演化数据中的上下文特征。
English: This paper introduces ENGINE, a method that enhances Class-Incremental Learning (CIL) by injecting external knowledge into CLIP through dual-branch tuning and re-ranking, achieving state-of-the-art performance by capturing richer contextual features for evolving data streams.

Authors:Qing Jiang, Lin Wu, Zhaoyang Zeng, Tianhe Ren, Yuda Xiong, Yihao Chen, Qin Liu, Lei Zhang
Title: Referring to Any Person
Abstract:
Humans are undoubtedly the most important participants in computer vision, and the ability to detect any individual given a natural language description, a task we define as referring to any person, holds substantial practical value. However, we find that existing models generally fail to achieve real-world usability, and current benchmarks are limited by their focus on one-to-one referring, that hinder progress in this area. In this work, we revisit this task from three critical perspectives: task definition, dataset design, and model architecture. We first identify five aspects of referable entities and three distinctive characteristics of this task. Next, we introduce HumanRef, a novel dataset designed to tackle these challenges and better reflect real-world applications. From a model design perspective, we integrate a multimodal large language model with an object detection framework, constructing a robust referring model named RexSeek. Experimental results reveal that state-of-the-art models, which perform well on commonly used benchmarks like RefCOCO/+/g, struggle with HumanRef due to their inability to detect multiple individuals. In contrast, RexSeek not only excels in human referring but also generalizes effectively to common object referring, making it broadly applicable across various perception tasks. Code is available at https://github.com/IDEA-Research/RexSeek
Chinese: 本研究提出了HumanRef数据集,以解决现有模型在通过自然语言描述检测多个人物方面的不足,并开发了RexSeek模型,该模型在人物指代任务中表现出色,并能有效泛化至物体检测应用。
English: This study introduces HumanRef, a new dataset addressing the limitations of existing models in detecting multiple individuals through natural language descriptions, and proposes RexSeek, a robust model that excels in human referring tasks and generalizes effectively to object detection.

Authors:Fan Wu, Sijun Dong, Xiaoliang Meng
Title: CFNet: Optimizing Remote Sensing Change Detection through Content-Aware Enhancement
Abstract:
Change detection is a crucial and widely applied task in remote sensing, aimed at identifying and analyzing changes occurring in the same geographical area over time. Due to variability in acquisition conditions, bi-temporal remote sensing images often exhibit significant differences in image style. Even with the powerful generalization capabilities of DNNs, these unpredictable style variations between bi-temporal images inevitably affect model's ability to accurately detect changed areas. To address issue above, we propose the Content Focuser Network (CFNet), which takes content-aware strategy as a key insight. CFNet employs EfficientNet-B5 as the backbone for feature extraction. To enhance the model's focus on the content features of images while mitigating the misleading effects of style features, we develop a constraint strategy that prioritizes the content features of bi-temporal images, termed Content-Aware. Furthermore, to enable the model to flexibly focus on changed and unchanged areas according to the requirements of different stages, we design a reweighting module based on the cosine distance between bi-temporal image features, termed Focuser. CFNet achieve outstanding performance across three well-known change detection datasets: CLCD (F1: 81.41%, IoU: 68.65%), LEVIR-CD (F1: 92.18%, IoU: 85.49%), and SYSU-CD (F1: 82.89%, IoU: 70.78%). The code and pretrained models of CFNet are publicly released at https://github.com/wifiBlack/CFNet.
中文摘要:提出的内容聚焦网络(CFNet)通过优先处理图像内容特征而非误导性风格差异,有效提升了遥感变化检测的准确性,并在三个权威数据集上取得了领先性能。
English Summary: The Content Focuser Network (CFNet) is proposed to enhance change detection in remote sensing by prioritizing content features over misleading style variations, achieving state-of-the-art performance on three benchmark datasets.

Authors:Yuncheng Guo, Xiaodong Gu
Title: MMRL: Multi-Modal Representation Learning for Vision-Language Models
Abstract:
Large-scale pre-trained Vision-Language Models (VLMs) have become essential for transfer learning across diverse tasks. However, adapting these models with limited few-shot data often leads to overfitting, diminishing their performance on new tasks. To tackle this issue, we propose a novel Multi-Modal Representation Learning (MMRL) framework that introduces a shared, learnable, and modality-agnostic representation space. MMRL projects the space tokens to text and image representation tokens, facilitating more effective multi-modal interactions. Unlike previous approaches that solely optimize class token features, MMRL integrates representation tokens at higher layers of the encoders--where dataset-specific features are more prominent--while preserving generalized knowledge in the lower layers. During training, both representation and class features are optimized, with trainable projection layer applied to the representation tokens, whereas the class token projection layer remains frozen to retain pre-trained knowledge. Furthermore, a regularization term is introduced to align the class features and text features with the zero-shot features from the frozen VLM, thereby safeguarding the model's generalization capacity. For inference, a decoupling strategy is employed, wherein both representation and class features are utilized for base classes, while only the class features, which retain more generalized knowledge, are used for new tasks. Extensive experiments across 15 datasets demonstrate that MMRL outperforms state-of-the-art methods, achieving a balanced trade-off between task-specific adaptation and generalization. Code is available at https://github.com/yunncheng/MMRL.
中文摘要:提出的多模态表征学习(MMRL)框架通过构建共享表征空间,在优化表征和类别特征的同时与冻结预训练知识对齐,有效解决了小样本视觉语言任务中的过拟合问题,实现了任务适应性与泛化能力的平衡。
English Summary: The proposed Multi-Modal Representation Learning (MMRL) framework addresses overfitting in few-shot vision-language tasks by creating a shared representation space that optimizes both representation and class features while preserving generalization through alignment with frozen pre-trained knowledge.

Authors:Fengyi Zhang, Huitong Yang, Zheng Zhang, Zi Huang, Yadan Luo
Title: TT-Occ: Test-Time Compute for Self-Supervised Occupancy via Spatio-Temporal Gaussian Splatting
Abstract:
Self-supervised 3D occupancy prediction offers a promising solution for understanding complex driving scenes without requiring costly 3D annotations. However, training dense occupancy decoders to capture fine-grained geometry and semantics can demand hundreds of GPU hours, and once trained, such models struggle to adapt to varying voxel resolutions or novel object categories without extensive retraining. To overcome these limitations, we propose a practical and flexible test-time occupancy prediction framework termed TT-Occ. Our method incrementally constructs, optimizes and voxelizes time-aware 3D Gaussians from raw sensor streams by integrating vision foundation models (VLMs) at runtime. The flexible nature of 3D Gaussians allows voxelization at arbitrary user-specified resolutions, while the generalization ability of VLMs enables accurate perception and open-vocabulary recognition, without any network training or fine-tuning. Specifically, TT-Occ operates in a lift-track-voxelize symphony: We first lift the geometry and semantics of surrounding-view extracted from VLMs to instantiate Gaussians at 3D space; Next, we track dynamic Gaussians while accumulating static ones to complete the scene and enforce temporal consistency; Finally, we voxelize the optimized Gaussians to generate occupancy prediction. Optionally, inherent noise in VLM predictions and tracking is mitigated by periodically smoothing neighboring Gaussians during optimization. To validate the generality and effectiveness of our framework, we offer two variants: one LiDAR-based and one vision-centric, and conduct extensive experiments on Occ3D and nuCraft benchmarks with varying voxel resolutions. Code will be available at https://github.com/Xian-Bei/TT-Occ.
自监督3D占据预测无需昂贵标注即可理解复杂驾驶场景,但存在训练效率低和适应性差的问题;我们提出的TT-Occ框架通过融合视觉基础模型与3D高斯方法,实现了无需训练、支持任意分辨率且具备开放词汇识别能力的实时占据预测。
Self-supervised 3D occupancy prediction enables complex scene understanding without expensive annotations, but faces challenges in training efficiency and adaptability, which our proposed TT-Occ framework overcomes by integrating vision foundation models with 3D Gaussians for flexible, training-free real-time occupancy prediction at arbitrary resolutions.

Authors:Han-Wei Kung, Tuomas Varanka, Terence Sim, Nicu Sebe
Title: NullFace: Training-Free Localized Face Anonymization
Abstract:
Privacy concerns around ever increasing number of cameras are increasing in today's digital age. Although existing anonymization methods are able to obscure identity information, they often struggle to preserve the utility of the images. In this work, we introduce a training-free method for face anonymization that preserves key non-identity-related attributes. Our approach utilizes a pre-trained text-to-image diffusion model without requiring optimization or training. It begins by inverting the input image to recover its initial noise. The noise is then denoised through an identity-conditioned diffusion process, where modified identity embeddings ensure the anonymized face is distinct from the original identity. Our approach also supports localized anonymization, giving users control over which facial regions are anonymized or kept intact. Comprehensive evaluations against state-of-the-art methods show our approach excels in anonymization, attribute preservation, and image quality. Its flexibility, robustness, and practicality make it well-suited for real-world applications. Code and data can be found at https://github.com/hanweikung/nullface .
中文: 本文提出了一种无需训练的人脸匿名化方法,利用预训练的扩散模型在保护非身份属性和图像质量的同时有效隐藏身份,并支持局部匿名化,适用于实际应用。
English: This paper introduces a training-free face anonymization method using a pre-trained diffusion model that effectively obscures identity while preserving non-identity attributes and image quality, supporting localized anonymization for practical use.

Authors:Zhuoguang Chen, Kenan Li, Xiuyu Yang, Tao Jiang, Yiming Li, Hang Zhao
Title: TrackOcc: Camera-based 4D Panoptic Occupancy Tracking
Abstract:
Comprehensive and consistent dynamic scene understanding from camera input is essential for advanced autonomous systems. Traditional camera-based perception tasks like 3D object tracking and semantic occupancy prediction lack either spatial comprehensiveness or temporal consistency. In this work, we introduce a brand-new task, Camera-based 4D Panoptic Occupancy Tracking, which simultaneously addresses panoptic occupancy segmentation and object tracking from camera-only input. Furthermore, we propose TrackOcc, a cutting-edge approach that processes image inputs in a streaming, end-to-end manner with 4D panoptic queries to address the proposed task. Leveraging the localization-aware loss, TrackOcc enhances the accuracy of 4D panoptic occupancy tracking without bells and whistles. Experimental results demonstrate that our method achieves state-of-the-art performance on the Waymo dataset. The source code will be released at https://github.com/Tsinghua-MARS-Lab/TrackOcc.
中文: 本文提出了基于摄像头的4D全景占用跟踪这一新任务,将全景占用分割与目标跟踪相结合,并开发了TrackOcc这一端到端方法,在Waymo数据集上实现了最先进的性能。
English: This paper introduces Camera-based 4D Panoptic Occupancy Tracking, a novel task combining panoptic occupancy segmentation and object tracking from camera-only input, and proposes TrackOcc, an end-to-end method that achieves state-of-the-art performance on the Waymo dataset.

Authors:Hsin-Ling Hsu, Ping-Sheng Lin, Jing-Di Lin, Jengnan Tzeng
Title: KAP: MLLM-assisted OCR Text Enhancement for Hybrid Retrieval in Chinese Non-Narrative Documents
Abstract:
Hybrid Retrieval systems, combining Sparse and Dense Retrieval methods, struggle with Traditional Chinese non-narrative documents due to their complex formatting, rich vocabulary, and the insufficient understanding of Chinese synonyms by common embedding models. Previous approaches inadequately address the dual needs of these systems, focusing mainly on general text quality improvement rather than optimizing for retrieval. We propose Knowledge-Aware Preprocessing (KAP), a novel framework that transforms noisy OCR outputs into retrieval-optimized text. KAP adopts a two-stage approach: it first extracts text using OCR, then employs Multimodal Large Language Models to refine the output by integrating visual information from the original documents. This design reduces OCR noise, reconstructs structural elements, and formats the text to satisfy the distinct requirements of sparse and dense retrieval. Empirical results demonstrate that KAP consistently and significantly outperforms conventional preprocessing approaches. Our code is available at https://github.com/JustinHsu1019/KAP.
Chinese: 提出的知识感知预处理(KAP)框架通过多模态大语言模型优化OCR输出,有效解决了繁体中文非叙事文档的检索难题,显著提升了混合检索系统的性能。
English: The proposed Knowledge-Aware Preprocessing (KAP) framework effectively addresses the challenges of Traditional Chinese non-narrative documents by using multimodal LLMs to refine OCR outputs, significantly enhancing hybrid retrieval performance.

Authors:Chen Liao, Yan Shen, Dan Li, Zhongli Wang
Title: Using Powerful Prior Knowledge of Diffusion Model in Deep Unfolding Networks for Image Compressive Sensing
Abstract:
Recently, Deep Unfolding Networks (DUNs) have achieved impressive reconstruction quality in the field of image Compressive Sensing (CS) by unfolding iterative optimization algorithms into neural networks. The reconstruction quality of DUNs depends on the learned prior knowledge, so introducing stronger prior knowledge can further improve reconstruction quality. On the other hand, pre-trained diffusion models contain powerful prior knowledge and have a solid theoretical foundation and strong scalability, but it requires a large number of iterative steps to achieve reconstruction. In this paper, we propose to use the powerful prior knowledge of pre-trained diffusion model in DUNs to achieve high-quality reconstruction with less steps for image CS. Specifically, we first design an iterative optimization algorithm named Diffusion Message Passing (DMP), which embeds a pre-trained diffusion model into each iteration process of DMP. Then, we deeply unfold the DMP algorithm into a neural network named DMP-DUN. The proposed DMP-DUN can use lightweight neural networks to achieve mapping from measurement data to the intermediate steps of the reverse diffusion process and directly approximate the divergence of the diffusion model, thereby further improving reconstruction efficiency. Extensive experiments show that our proposed DMP-DUN achieves state-of-the-art performance and requires at least only 2 steps to reconstruct the image. Codes are available at https://github.com/FengodChen/DMP-DUN-CVPR2025.
Chinese: 深度展开网络(DUNs)利用预训练扩散模型的强大先验知识,在图像压缩感知中以更少的步骤实现高质量重建,所提出的DMP-DUN方法仅需两步即可达到最先进的性能。
English: Deep Unfolding Networks (DUNs) leverage the powerful prior knowledge of pre-trained diffusion models to achieve high-quality image reconstruction in compressive sensing with significantly fewer steps, as demonstrated by the proposed DMP-DUN method requiring only two steps for state-of-the-art results.

Authors:Lianting Wang, Marcelo Ponce
Title: Integrating Captive Portal Technology into Computer Science Education: A Modular, Hands-On Approach to Infrastructure
Abstract:
In this paper, we present an educational project aimed to introduce students to the technology behind Captive Portals infrastructures. For doing this, we developed a series of modules to emphasize each of the different aspects and features of this technology. The project is based on an open source implementation which is widely used in many computer network courses, making it well-suited and very appealing for instructors and practitioners in this field.
中文: 本文介绍了一个基于开源实现的教育项目,通过系列模块向学生讲解Captive Portals技术,非常适合计算机网络课程的师生和实践者使用。
English: This paper introduces an educational project using open-source modules to teach students about Captive Portals technology, designed specifically for computer network courses to benefit instructors and practitioners.

Authors:Qiming Xia, Wenkai Lin, Haoen Xiang, Xun Huang, Siheng Chen, Zhen Dong, Cheng Wang, Chenglu Wen
Title: Learning to Detect Objects from Multi-Agent LiDAR Scans without Manual Labels
Abstract:
Unsupervised 3D object detection serves as an important solution for offline 3D object annotation. However, due to the data sparsity and limited views, the clustering-based label fitting in unsupervised object detection often generates low-quality pseudo-labels. Multi-agent collaborative dataset, which involves the sharing of complementary observations among agents, holds the potential to break through this bottleneck. In this paper, we introduce a novel unsupervised method that learns to Detect Objects from Multi-Agent LiDAR scans, termed DOtA, without using labels from external. DOtA first uses the internally shared ego-pose and ego-shape of collaborative agents to initialize the detector, leveraging the generalization performance of neural networks to infer preliminary labels. Subsequently,DOtA uses the complementary observations between agents to perform multi-scale encoding on preliminary labels, then decodes high-quality and low-quality labels. These labels are further used as prompts to guide a correct feature learning process, thereby enhancing the performance of the unsupervised object detection task. Extensive experiments on the V2V4Real and OPV2V datasets show that our DOtA outperforms state-of-the-art unsupervised 3D object detection methods. Additionally, we also validate the effectiveness of the DOtA labels under various collaborative perception frameworks.The code is available at https://github.com/xmuqimingxia/DOtA.
Chinese: DOtA是一种无监督3D物体检测方法,通过多智能体激光雷达数据,利用多尺度编码和特征学习生成高质量伪标签,在无需外部标注的情况下实现了最先进的性能。
English: DOtA is an unsupervised 3D object detection method that leverages multi-agent LiDAR data to generate high-quality pseudo-labels through multi-scale encoding and feature learning, achieving state-of-the-art performance without external annotations.

Authors:Zhanjie Zhang, Quanwei Zhang, Guangyuan Li, Junsheng Luan, Mengyuan Yang, Yun Wang, Lei Zhao
Title: DyArtbank: Diverse Artistic Style Transfer via Pre-trained Stable Diffusion and Dynamic Style Prompt Artbank
Abstract:
Artistic style transfer aims to transfer the learned style onto an arbitrary content image. However, most existing style transfer methods can only render consistent artistic stylized images, making it difficult for users to get enough stylized images to enjoy. To solve this issue, we propose a novel artistic style transfer framework called DyArtbank, which can generate diverse and highly realistic artistic stylized images. Specifically, we introduce a Dynamic Style Prompt ArtBank (DSPA), a set of learnable parameters. It can learn and store the style information from the collection of artworks, dynamically guiding pre-trained stable diffusion to generate diverse and highly realistic artistic stylized images. DSPA can also generate random artistic image samples with the learned style information, providing a new idea for data augmentation. Besides, a Key Content Feature Prompt (KCFP) module is proposed to provide sufficient content prompts for pre-trained stable diffusion to preserve the detailed structure of the input content image. Extensive qualitative and quantitative experiments verify the effectiveness of our proposed method. Code is available: https://github.com/Jamie-Cheung/DyArtbank
中文:DyArtbank框架通过动态风格提示艺术库和关键内容特征提示模块,能够生成多样化且高度逼真的艺术风格化图像,同时保持内容结构,解决了传统风格迁移方法的局限性。
English: The DyArtbank framework introduces a Dynamic Style Prompt ArtBank and Key Content Feature Prompt module to generate diverse, highly realistic artistic stylized images while preserving content structure, addressing limitations of conventional style transfer methods.

Authors:Zhengyao Fang, Pengyuan Lyu, Jingjing Wu, Chengquan Zhang, Jun Yu, Guangming Lu, Wenjie Pei
Title: Recognition-Synergistic Scene Text Editing
Abstract:
Scene text editing aims to modify text content within scene images while maintaining style consistency. Traditional methods achieve this by explicitly disentangling style and content from the source image and then fusing the style with the target content, while ensuring content consistency using a pre-trained recognition model. Despite notable progress, these methods suffer from complex pipelines, leading to suboptimal performance in complex scenarios. In this work, we introduce Recognition-Synergistic Scene Text Editing (RS-STE), a novel approach that fully exploits the intrinsic synergy of text recognition for editing. Our model seamlessly integrates text recognition with text editing within a unified framework, and leverages the recognition model's ability to implicitly disentangle style and content while ensuring content consistency. Specifically, our approach employs a multi-modal parallel decoder based on transformer architecture, which predicts both text content and stylized images in parallel. Additionally, our cyclic self-supervised fine-tuning strategy enables effective training on unpaired real-world data without ground truth, enhancing style and content consistency through a twice-cyclic generation process. Built on a relatively simple architecture, RS-STE achieves state-of-the-art performance on both synthetic and real-world benchmarks, and further demonstrates the effectiveness of leveraging the generated hard cases to boost the performance of downstream recognition tasks. Code is available at https://github.com/ZhengyaoFang/RS-STE.
中文: RS-STE是一种创新的场景文本编辑方法,它将文本识别与编辑集成在统一框架中,利用识别模型隐式分离样式与内容,在基准测试中达到最优性能,并能提升下游识别任务的效果。
English: RS-STE is a novel scene text editing method that integrates text recognition and editing in a unified framework, leveraging the recognition model to implicitly disentangle style and content, achieving state-of-the-art performance on benchmarks and enhancing downstream recognition tasks.

Authors:Susu Sun, Dominique van Midden, Geert Litjens, Christian F. Baumgartner
Title: Prototype-Based Multiple Instance Learning for Gigapixel Whole Slide Image Classification
Abstract:
Multiple Instance Learning (MIL) methods have succeeded remarkably in histopathology whole slide image (WSI) analysis. However, most MIL models only offer attention-based explanations that do not faithfully capture the model's decision mechanism and do not allow human-model interaction. To address these limitations, we introduce ProtoMIL, an inherently interpretable MIL model for WSI analysis that offers user-friendly explanations and supports human intervention. Our approach employs a sparse autoencoder to discover human-interpretable concepts from the image feature space, which are then used to train ProtoMIL. The model represents predictions as linear combinations of concepts, making the decision process transparent. Furthermore, ProtoMIL allows users to perform model interventions by altering the input concepts. Experiments on two widely used pathology datasets demonstrate that ProtoMIL achieves a classification performance comparable to state-of-the-art MIL models while offering intuitively understandable explanations. Moreover, we demonstrate that our method can eliminate reliance on diagnostically irrelevant information via human intervention, guiding the model toward being right for the right reason. Code will be publicly available at https://github.com/ss-sun/ProtoMIL.
中文: ProtoMIL提出了一种可解释的多示例学习模型,通过人类可理解的概念进行透明预测并支持用户干预,在保持与先进方法相当性能的同时提供直观解释,用于组织病理学全切片图像分析。
English: ProtoMIL introduces an interpretable multiple instance learning model for histopathology whole slide image analysis that uses human-understandable concepts for transparent predictions and allows user intervention, achieving performance comparable to state-of-the-art methods while providing intuitive explanations.

Authors:Qingsong Xie, Zhao Zhang, Zhe Huang, Yanhao Zhang, Haonan Lu, Zhenyu Yang
Title: Layton: Latent Consistency Tokenizer for 1024-pixel Image Reconstruction and Generation by 256 Tokens
Abstract:
Image tokenization has significantly advanced visual generation and multimodal modeling, particularly when paired with autoregressive models. However, current methods face challenges in balancing efficiency and fidelity: high-resolution image reconstruction either requires an excessive number of tokens or compromises critical details through token reduction. To resolve this, we propose Latent Consistency Tokenizer (Layton) that bridges discrete visual tokens with the compact latent space of pre-trained Latent Diffusion Models (LDMs), enabling efficient representation of 1024x1024 images using only 256 tokens-a 16 times compression over VQGAN. Layton integrates a transformer encoder, a quantized codebook, and a latent consistency decoder. Direct application of LDM as the decoder results in color and brightness discrepancies. Thus, we convert it to latent consistency decoder, reducing multi-step sampling to 1-2 steps for direct pixel-level supervision. Experiments demonstrate Layton's superiority in high-fidelity reconstruction, with 10.8 reconstruction Frechet Inception Distance on MSCOCO-2017 5K benchmark for 1024x1024 image reconstruction. We also extend Layton to a text-to-image generation model, LaytonGen, working in autoregression. It achieves 0.73 score on GenEval benchmark, surpassing current state-of-the-art methods. Project homepage: https://github.com/OPPO-Mente-Lab/Layton
中文摘要:Latent Consistency Tokenizer (Layton) 通过集成变换器编码器、量化码本和潜在一致性解码器,仅用256个标记高效表示高分辨率图像,在图像重建保真度和文本到图像生成方面均超越现有最优方法。
English Summary: The Latent Consistency Tokenizer (Layton) efficiently represents high-resolution images using only 256 tokens by integrating a transformer encoder, quantized codebook, and latent consistency decoder, achieving superior reconstruction fidelity and outperforming state-of-the-art methods in text-to-image generation.

Authors:Fabian Isensee, Maximilian Rokuss, Lars Krämer, Stefan Dinkelacker, Ashis Ravindran, Florian Stritzke, Benjamin Hamm, Tassilo Wald, Moritz Langenberg, Constantin Ulrich, Jonathan Deissler, Ralf Floca, Klaus Maier-Hein
Title: nnInteractive: Redefining 3D Promptable Segmentation
Abstract:
Accurate and efficient 3D segmentation is essential for both clinical and research applications. While foundation models like SAM have revolutionized interactive segmentation, their 2D design and domain shift limitations make them ill-suited for 3D medical images. Current adaptations address some of these challenges but remain limited, either lacking volumetric awareness, offering restricted interactivity, or supporting only a small set of structures and modalities. Usability also remains a challenge, as current tools are rarely integrated into established imaging platforms and often rely on cumbersome web-based interfaces with restricted functionality. We introduce nnInteractive, the first comprehensive 3D interactive open-set segmentation method. It supports diverse prompts-including points, scribbles, boxes, and a novel lasso prompt-while leveraging intuitive 2D interactions to generate full 3D segmentations. Trained on 120+ diverse volumetric 3D datasets (CT, MRI, PET, 3D Microscopy, etc.), nnInteractive sets a new state-of-the-art in accuracy, adaptability, and usability. Crucially, it is the first method integrated into widely used image viewers (e.g., Napari, MITK), ensuring broad accessibility for real-world clinical and research applications. Extensive benchmarking demonstrates that nnInteractive far surpasses existing methods, setting a new standard for AI-driven interactive 3D segmentation. nnInteractive is publicly available: https://github.com/MIC-DKFZ/napari-nninteractive (Napari plugin), https://www.mitk.org/MITK-nnInteractive (MITK integration), https://github.com/MIC-DKFZ/nnInteractive (Python backend).
Chinese: nnInteractive是首个全面的3D交互式开放集分割方法,支持多种提示并利用直观的2D交互生成完整3D分割,在精度和平台集成度上达到新标杆,已整合至主流医学影像平台。
English: nnInteractive is the first comprehensive 3D interactive open-set segmentation method that supports diverse prompts and leverages 2D interactions to generate full 3D segmentations, achieving state-of-the-art accuracy and broad integration into popular imaging platforms.

Authors:Kai Qiu, Xiang Li, Jason Kuen, Hao Chen, Xiaohao Xu, Jiuxiang Gu, Yinyi Luo, Bhiksha Raj, Zhe Lin, Marios Savvides
Title: Robust Latent Matters: Boosting Image Generation with Sampling Error Synthesis
Abstract:
Recent image generation schemes typically capture image distribution in a pre-constructed latent space relying on a frozen image tokenizer. Though the performance of tokenizer plays an essential role to the successful generation, its current evaluation metrics (e.g. rFID) fail to precisely assess the tokenizer and correlate its performance to the generation quality (e.g. gFID). In this paper, we comprehensively analyze the reason for the discrepancy of reconstruction and generation qualities in a discrete latent space, and, from which, we propose a novel plug-and-play tokenizer training scheme to facilitate latent space construction. Specifically, a latent perturbation approach is proposed to simulate sampling noises, i.e., the unexpected tokens sampled, from the generative process. With the latent perturbation, we further propose (1) a novel tokenizer evaluation metric, i.e., pFID, which successfully correlates the tokenizer performance to generation quality and (2) a plug-and-play tokenizer training scheme, which significantly enhances the robustness of tokenizer thus boosting the generation quality and convergence speed. Extensive benchmarking are conducted with 11 advanced discrete image tokenizers with 2 autoregressive generation models to validate our approach. The tokenizer trained with our proposed latent perturbation achieve a notable 1.60 gFID with classifier-free guidance (CFG) and 3.45 gFID without CFG with a $\sim$400M generator. Code: https://github.com/lxa9867/ImageFolder.
中文: 本文提出一种潜在扰动方法模拟生成过程中的采样噪声,开发了新的分词器评估指标pFID和即插即用训练方案,显著提升了分词器鲁棒性和图像生成质量。
English: This paper introduces a latent perturbation method to simulate generative sampling noise, proposing a new tokenizer evaluation metric (pFID) and a plug-and-play training scheme that significantly improves tokenizer robustness and generation quality.

Authors:Bin Huang, Binzhong He, Yanhan Chen, Zhili Liu, Xinyue Wang, Binxuan Li, Qiegen Liu
Title: Diffusion Transformer Meets Random Masks: An Advanced PET Reconstruction Framework
Abstract:
Deep learning has significantly advanced PET image re-construction, achieving remarkable improvements in image quality through direct training on sinogram or image data. Traditional methods often utilize masks for inpainting tasks, but their incorporation into PET reconstruction frameworks introduces transformative potential. In this study, we pro-pose an advanced PET reconstruction framework called Diffusion tRansformer mEets rAndom Masks (DREAM). To the best of our knowledge, this is the first work to integrate mask mechanisms into both the sinogram domain and the latent space, pioneering their role in PET reconstruction and demonstrating their ability to enhance reconstruction fidelity and efficiency. The framework employs a high-dimensional stacking approach, transforming masked data from two to three dimensions to expand the solution space and enable the model to capture richer spatial rela-tionships. Additionally, a mask-driven latent space is de-signed to accelerate the diffusion process by leveraging sinogram-driven and mask-driven compact priors, which reduce computational complexity while preserving essen-tial data characteristics. A hierarchical masking strategy is also introduced, guiding the model from focusing on fi-ne-grained local details in the early stages to capturing broader global patterns over time. This progressive ap-proach ensures a balance between detailed feature preservation and comprehensive context understanding. Experimental results demonstrate that DREAM not only improves the overall quality of reconstructed PET images but also preserves critical clinical details, highlighting its potential to advance PET imaging technology. By inte-grating compact priors and hierarchical masking, DREAM offers a promising and efficient avenue for future research and application in PET imaging. The open-source code is available at: https://github.com/yqx7150/DREAM.
中文摘要:DREAM框架通过在正弦图和潜在空间中引入掩码机制,结合分层掩码策略和紧凑先验,显著提升了PET图像重建质量与效率,同时保留了关键临床细节。
English Summary: The DREAM framework introduces a novel PET image reconstruction method by integrating mask mechanisms in both sinogram and latent domains, employing hierarchical masking and compact priors to enhance image quality while preserving clinical details efficiently.

Authors:Runwei Guan, Jianan Liu, Ningwei Ouyang, Shaofeng Liang, Daizong Liu, Xiaolou Sun, Lianqing Zheng, Ming Xu, Yutao Yue, Guoqiang Mao, Hui Xiong
Title: Talk2PC: Enhancing 3D Visual Grounding through LiDAR and Radar Point Clouds Fusion for Autonomous Driving
Abstract:
Embodied outdoor scene understanding forms the foundation for autonomous agents to perceive, analyze, and react to dynamic driving environments. However, existing 3D understanding is predominantly based on 2D Vision-Language Models (VLMs), which collect and process limited scene-aware contexts. In contrast, compared to the 2D planar visual information, point cloud sensors such as LiDAR provide rich depth and fine-grained 3D representations of objects. Even better the emerging 4D millimeter-wave radar detects the motion trend, velocity, and reflection intensity of each object. The integration of these two modalities provides more flexible querying conditions for natural language, thereby supporting more accurate 3D visual grounding. To this end, we propose a novel method called TPCNet, the first outdoor 3D visual grounding model upon the paradigm of prompt-guided point cloud sensor combination, including both LiDAR and radar sensors. To optimally combine the features of these two sensors required by the prompt, we design a multi-fusion paradigm called Two-Stage Heterogeneous Modal Adaptive Fusion. Specifically, this paradigm initially employs Bidirectional Agent Cross-Attention (BACA), which feeds both-sensor features, characterized by global receptive fields, to the text features for querying. Moreover, we design a Dynamic Gated Graph Fusion (DGGF) module to locate the regions of interest identified by the queries. To further enhance accuracy, we devise an C3D-RECHead, based on the nearest object edge to the ego-vehicle. Experimental results demonstrate that our TPCNet, along with its individual modules, achieves the state-of-the-art performance on both the Talk2Radar and Talk2Car datasets. We release the code at https://github.com/GuanRunwei/TPCNet.
中文: 具身户外场景理解对自主智能体至关重要,提出的TPCNet模型通过融合激光雷达与雷达传感器,采用创新的多模态融合范式,实现了最先进的3D视觉定位性能。
English: Embodied outdoor scene understanding is crucial for autonomous agents, and the proposed TPCNet model integrates LiDAR and radar sensors with a novel fusion paradigm to achieve state-of-the-art 3D visual grounding accuracy.

Authors:Liang Yu, Lai Tu, Xiang Bai
Title: MFRS: A Multi-Frequency Reference Series Approach to Scalable and Accurate Time-Series Forecasting
Abstract:
Multivariate time-series forecasting holds immense value across diverse applications, requiring methods to effectively capture complex temporal and inter-variable dynamics. A key challenge lies in uncovering the intrinsic patterns that govern predictability, beyond conventional designs, focusing on network architectures to explore latent relationships or temporal dependencies. Inspired by signal decomposition, this paper posits that time series predictability is derived from periodic characteristics at different frequencies. Consequently, we propose a novel time series forecasting method based on multi-frequency reference series correlation analysis. Through spectral analysis on long-term training data, we identify dominant spectral components and their harmonics to design base-pattern reference series. Unlike signal decomposition, which represents the original series as a linear combination of basis signals, our method uses a transformer model to compute cross-attention between the original series and reference series, capturing essential features for forecasting. Experiments on major open and synthetic datasets show state-of-the-art performance. Furthermore, by focusing on attention with a small number of reference series rather than pairwise variable attention, our method ensures scalability and broad applicability. The source code is available at: https://github.com/yuliang555/MFRS
中文摘要:本文提出了一种新颖的时间序列预测方法,通过多频参考序列和基于Transformer的交叉注意力机制来捕捉关键时序特征,在实现优异预测性能的同时保证了方法的可扩展性。
English Summary: This paper introduces a novel time series forecasting method that uses multi-frequency reference series and transformer-based cross-attention to capture essential temporal patterns, achieving state-of-the-art performance with enhanced scalability.

Authors:Pol G. Recasens, Ferran Agullo, Yue Zhu, Chen Wang, Eun Kyung Lee, Olivier Tardieu, Jordi Torres, Josep Ll. Berral
Title: Mind the Memory Gap: Unveiling GPU Bottlenecks in Large-Batch LLM Inference
Abstract:
Large language models have been widely adopted across different tasks, but their auto-regressive generation nature often leads to inefficient resource utilization during inference. While batching is commonly used to increase throughput, performance gains plateau beyond a certain batch size, especially with smaller models, a phenomenon that existing literature typically explains as a shift to the compute-bound regime. In this paper, through an in-depth GPU-level analysis, we reveal that large-batch inference remains memory-bound, with most GPU compute capabilities underutilized due to DRAM bandwidth saturation as the primary bottleneck. To address this, we propose a Batching Configuration Advisor (BCA) that optimizes memory allocation, reducing GPU memory requirements with minimal impact on throughput. The freed memory and underutilized GPU compute capabilities can then be leveraged by concurrent workloads. Specifically, we use model replication to improve serving throughput and GPU utilization. Our findings challenge conventional assumptions about LLM inference, offering new insights and practical strategies for improving resource utilization, particularly for smaller language models. The code is publicly available at https://github.com/FerranAgulloLopez/vLLMBatchingMemoryGap.
中文摘要:本研究揭示了大批量语言模型推理因DRAM带宽饱和仍受内存限制,提出了批处理配置顾问来优化内存使用,并通过模型复制实现并发工作负载以提升资源利用率。
English Summary: This study reveals that large-batch inference for language models remains memory-bound due to DRAM bandwidth saturation, and proposes a Batching Configuration Advisor to optimize memory usage while enabling concurrent workloads through model replication.

Authors:Alex Ergasti, Giuseppe Gabriele Tarollo, Filippo Botti, Tomaso Fontanini, Claudio Ferrari, Massimo Bertozzi, Andrea Prati
Title: $^R$FLAV: Rolling Flow matching for infinite Audio Video generation
Abstract:
Joint audio-video (AV) generation is still a significant challenge in generative AI, primarily due to three critical requirements: quality of the generated samples, seamless multimodal synchronization and temporal coherence, with audio tracks that match the visual data and vice versa, and limitless video duration. In this paper, we present $^R$-FLAV, a novel transformer-based architecture that addresses all the key challenges of AV generation. We explore three distinct cross modality interaction modules, with our lightweight temporal fusion module emerging as the most effective and computationally efficient approach for aligning audio and visual modalities. Our experimental results demonstrate that $^R$-FLAV outperforms existing state-of-the-art models in multimodal AV generation tasks. Our code and checkpoints are available at https://github.com/ErgastiAlex/R-FLAV.
中文: 本文提出$^R$-FLAV模型,采用基于Transformer的架构,通过高效的跨模态融合模块解决了音视频生成中的质量、同步性和时序一致性难题,在性能上超越了现有最优方法。
English: This paper introduces $^R$-FLAV, a transformer-based model that overcomes key challenges in joint audio-video generation by ensuring high-quality output, seamless synchronization, and temporal coherence through an efficient cross-modality fusion module, outperforming existing state-of-the-art methods.

Authors:Yu Tang Liu, Afonso Vale, Aamir Ahmad, Rodrigo Ventura, Meysam Basiri
Title: Multitask Reinforcement Learning for Quadcopter Attitude Stabilization and Tracking using Graph Policy
Abstract:
Quadcopter attitude control involves two tasks: smooth attitude tracking and aggressive stabilization from arbitrary states. Although both can be formulated as tracking problems, their distinct state spaces and control strategies complicate a unified reward function. We propose a multitask deep reinforcement learning framework that leverages parallel simulation with IsaacGym and a Graph Convolutional Network (GCN) policy to address both tasks effectively. Our multitask Soft Actor-Critic (SAC) approach achieves faster, more reliable learning and higher sample efficiency than single-task methods. We validate its real-world applicability by deploying the learned policy - a compact two-layer network with 24 neurons per layer - on a Pixhawk flight controller, achieving 400 Hz control without extra computational resources. We provide our code at https://github.com/robot-perception-group/GraphMTSAC\_UAV/.
中文: 本研究提出一种多任务深度强化学习框架,结合SAC算法和图卷积网络,有效解决了四旋翼飞行器的平稳姿态跟踪与强扰动恢复控制问题,并在实际飞行控制器上实现了高效实时部署。
English: The study introduces a multitask deep reinforcement learning framework using SAC and GCN to efficiently handle both smooth tracking and aggressive stabilization in quadcopter control, demonstrating superior performance and real-time deployment on hardware.

Authors:Saad Sohail, Muhammad Usama, Usman Ghous, Manuel Mazzara, Salvatore Distefano, Muhammad Ahmad
Title: EnergyFormer: Energy Attention with Fourier Embedding for Hyperspectral Image Classification
Abstract:
Hyperspectral imaging (HSI) provides rich spectral-spatial information across hundreds of contiguous bands, enabling precise material discrimination in applications such as environmental monitoring, agriculture, and urban analysis. However, the high dimensionality and spectral variability of HSI data pose significant challenges for feature extraction and classification. This paper presents EnergyFormer, a transformer-based framework designed to address these challenges through three key innovations: (1) Multi-Head Energy Attention (MHEA), which optimizes an energy function to selectively enhance critical spectral-spatial features, improving feature discrimination; (2) Fourier Position Embedding (FoPE), which adaptively encodes spectral and spatial dependencies to reinforce long-range interactions; and (3) Enhanced Convolutional Block Attention Module (ECBAM), which selectively amplifies informative wavelength bands and spatial structures, enhancing representation learning. Extensive experiments on the WHU-Hi-HanChuan, Salinas, and Pavia University datasets demonstrate that EnergyFormer achieves exceptional overall accuracies of 99.28\%, 98.63\%, and 98.72\%, respectively, outperforming state-of-the-art CNN, transformer, and Mamba-based models. The source code will be made available at https://github.com/mahmad000.
中文:EnergyFormer是一种基于Transformer的框架,通过多头部能量注意力、傅里叶位置编码和增强卷积注意力模块三大创新,有效应对高光谱影像的高维度和光谱变异性挑战,在多个基准数据集上实现了卓越的分类精度。
English: EnergyFormer is a transformer-based framework that introduces three innovations—Multi-Head Energy Attention, Fourier Position Embedding, and Enhanced Convolutional Block Attention Module—to effectively handle the high dimensionality and spectral variability of hyperspectral imaging data, achieving superior classification accuracy on benchmark datasets.

Authors:Ao Li, Zongfang Liu, Xinhua Li, Jinghui Zhang, Pengwei Wang, Hu Wang
Title: Modeling Variants of Prompts for Vision-Language Models
Abstract:
Large pre-trained vision-language models (VLMs) offer a promising approach to leveraging human language for enhancing downstream tasks. However, VLMs such as CLIP face significant limitation: its performance is highly sensitive to prompt template design. Although prompt learning methods can address the sensitivity issue by replacing natural language prompts with learnable ones, they are incomprehensible to humans. Ensuring consistent performance across various prompt templates enables models to adapt seamlessly to diverse phrasings, enhancing their ability to handle downstream tasks without requiring extensive prompt engineering. In this work, we introduce the RobustPrompt Benchmark, a systematic benchmark to evaluate robustness to different prompt templates for VLMs. It includes a dataset with hundreds of carefully designed prompt templates, divided into six types, covering a wide variety of commonly used templates. Beside the benchmark, we propose Modeling Variants of Prompts (MVP), a simple yet effective method that mitigates sensitivity by modeling variants of prompt structures. The innovation of MVP lies in decoupling prompts into templates and class names, and using Variational Autoencoders (VAE) to model the distribution of diverse prompt structures. Experiments across 11 datasets demonstrate that MVP can greatly enhance model robustness to variations in input prompts without a drop in performance. The code is available at https://github.com/liaolea/MVP.
中文摘要:大型视觉语言模型对提示模板设计敏感,而提出的MVP方法通过变分自编码器建模提示结构变体,在保持性能的同时显著提升了模型对不同提示模板的鲁棒性。
English Summary: Large vision-language models are sensitive to prompt design, but the proposed MVP method enhances robustness by modeling prompt variations with variational autoencoders, maintaining performance across diverse templates.

Authors:Junbin Xiao, Nanxin Huang, Hao Qiu, Zhulin Tao, Xun Yang, Richang Hong, Meng Wang, Angela Yao
Title: EgoBlind: Towards Egocentric Visual Assistance for the Blind
Abstract:
We present EgoBlind, the first egocentric VideoQA dataset collected from blind individuals to evaluate the assistive capabilities of contemporary multimodal large language models (MLLMs). EgoBlind comprises 1,392 videos that record the daily lives of real blind users from a first-person perspective. It also features 5,311 questions directly posed or generated and verified by blind individuals to reflect their in-situation needs for visual assistance under various scenarios. We provide each question with an average of 3 reference answers to alleviate subjective evaluation. Using EgoBlind, we comprehensively evaluate 16 advanced MLLMs and find that all models struggle, with the best performers achieving accuracy near 60\%, far behind human performance of 87.4\%. To guide future advancements, we identify and summarize major limitations of existing MLLMs in egocentric visual assistance for the blind and explore heuristic solutions for improvement. With these efforts, we hope EgoBlind can serve as a valuable foundation for developing more effective AI assistants to enhance the independence of the blind individuals' lives. Data and evaluation code are available at https://github.com/doc-doc/EgoBlind.
中文: EgoBlind是首个基于盲人视角的视觉问答数据集,研究表明现有MLLMs在视觉辅助任务中表现远逊于人类,最优模型准确率仅达60%,而人类表现高达87.4%。
English: EgoBlind is the first egocentric VideoQA dataset from blind individuals, revealing that current MLLMs perform significantly below human levels in visual assistance tasks, with the best model achieving only 60% accuracy compared to humans' 87.4%.

Authors:Jack Langerman, Denys Rozumnyi, Yuzhong Huang, Dmytro Mishkin
Title: Explaining Human Preferences via Metrics for Structured 3D Reconstruction
Abstract:
"What cannot be measured cannot be improved" while likely never uttered by Lord Kelvin, summarizes effectively the driving force behind this work. This paper presents a detailed discussion of automated metrics for evaluating structured 3D reconstructions. Pitfalls of each metric are discussed, and an analysis through the lens of expert 3D modelers' preferences is presented. A set of systematic "unit tests" are proposed to empirically verify desirable properties, and context aware recommendations regarding which metric to use depending on application are provided. Finally, a learned metric distilled from human expert judgments is proposed and analyzed. The source code is available at https://github.com/s23dr/wireframe-metrics-iccv2025
中文: 本文提出了用于评估三维重建的自动化指标,通过专家偏好分析其局限性,并提出了经验验证的单元测试、情境感知建议以及基于人类判断的学习指标。
English: This paper introduces automated metrics for 3D reconstruction evaluation, analyzing their limitations through expert preferences and proposing validated unit tests, context-aware recommendations, and a learned metric based on human judgment.

Authors:Wei Shi, Sihang Li, Tao Liang, Mingyang Wan, Guojun Ma, Xiang Wang, Xiangnan He
Title: Route Sparse Autoencoder to Interpret Large Language Models
Abstract:
Mechanistic interpretability of large language models (LLMs) aims to uncover the internal processes of information propagation and reasoning. Sparse autoencoders (SAEs) have demonstrated promise in this domain by extracting interpretable and monosemantic features. However, prior works primarily focus on feature extraction from a single layer, failing to effectively capture activations that span multiple layers. In this paper, we introduce Route Sparse Autoencoder (RouteSAE), a new framework that integrates a routing mechanism with a shared SAE to efficiently extract features from multiple layers. It dynamically assigns weights to activations from different layers, incurring minimal parameter overhead while achieving high interpretability and flexibility for targeted feature manipulation. We evaluate RouteSAE through extensive experiments on Llama-3.2-1B-Instruct. Specifically, under the same sparsity constraint of 64, RouteSAE extracts 22.5% more features than baseline SAEs while achieving a 22.3% higher interpretability score. These results underscore the potential of RouteSAE as a scalable and effective method for LLM interpretability, with applications in feature discovery and model intervention. Our codes are available at https://github.com/swei2001/RouteSAEs.
中文: RouteSAE通过结合路由机制和共享稀疏自编码器,有效提取大语言模型中跨多层的特征,在相同稀疏度下比基线方法提取更多特征且具有更高的可解释性。
English: RouteSAE introduces a routing mechanism with a shared sparse autoencoder to efficiently extract and interpret features across multiple layers in large language models, achieving higher feature count and interpretability scores than baseline methods.

Authors:Rui Xu, MingYu Wang, XinTao Wang, Dakuan Lu, Xiaoyu Tan, Wei Chu, Yinghui Xu
Title: Guess What I am Thinking: A Benchmark for Inner Thought Reasoning of Role-Playing Language Agents
Abstract:
Recent advances in LLM-based role-playing language agents (RPLAs) have attracted broad attention in various applications. While chain-of-thought reasoning has shown importance in many tasks for LLMs, the internal thinking processes of RPLAs remain unexplored. Understanding characters' inner thoughts is crucial for developing advanced RPLAs. In this paper, we introduce ROLETHINK, a novel benchmark constructed from literature for evaluating character thought generation. We propose the task of inner thought reasoning, which includes two sets: the gold set that compares generated thoughts with original character monologues, and the silver set that uses expert synthesized character analyses as references. To address this challenge, we propose MIRROR, a chain-of-thought approach that generates character thoughts by retrieving memories, predicting character reactions, and synthesizing motivations. Through extensive experiments, we demonstrate the importance of inner thought reasoning for RPLAs, and MIRROR consistently outperforms existing methods. Resources are available at https://github.com/airaer1998/RPA_Thought.
中文摘要:本文提出ROLETHINK基准用于评估角色扮演语言代理的角色思维生成,并开发了MIRROR思维链方法,该方法通过检索记忆和综合角色动机,在性能上持续优于现有方法。
English Summary: This paper introduces ROLETHINK, a benchmark for evaluating character thought generation in role-playing language agents, and proposes MIRROR, a chain-of-thought method that outperforms existing approaches by retrieving memories and synthesizing character motivations.

Authors:Zitong Shi, Guancheng Wan, Wenke Huang, Guibin Zhang, Jiawei Shao, Mang Ye, Carl Yang
Title: Privacy-Enhancing Paradigms within Federated Multi-Agent Systems
Abstract:
LLM-based Multi-Agent Systems (MAS) have proven highly effective in solving complex problems by integrating multiple agents, each performing different roles. However, in sensitive domains, they face emerging privacy protection challenges. In this paper, we introduce the concept of Federated MAS, highlighting the fundamental differences between Federated MAS and traditional FL. We then identify key challenges in developing Federated MAS, including: 1) heterogeneous privacy protocols among agents, 2) structural differences in multi-party conversations, and 3) dynamic conversational network structures. To address these challenges, we propose Embedded Privacy-Enhancing Agents (EPEAgent), an innovative solution that integrates seamlessly into the Retrieval-Augmented Generation (RAG) phase and the context retrieval stage. This solution minimizes data flows, ensuring that only task-relevant, agent-specific information is shared. Additionally, we design and generate a comprehensive dataset to evaluate the proposed paradigm. Extensive experiments demonstrate that EPEAgent effectively enhances privacy protection while maintaining strong system performance. The code will be availiable at https://github.com/ZitongShi/EPEAgent
中文: 本文针对敏感领域中的隐私保护挑战,提出了联邦多智能体系统概念,并设计EPEAgent解决方案,通过嵌入RAG阶段最小化数据共享,实验证明其在保持系统性能的同时有效增强了隐私保护。
English: This paper introduces Federated Multi-Agent Systems (MAS) to address privacy challenges in sensitive domains by proposing EPEAgent, a solution that integrates into the RAG phase to minimize data sharing while maintaining performance, with experimental validation confirming its effectiveness.

Authors:Yuan Tian, Kaiyuan Ji, Rongzhao Zhang, Yankai Jiang, Chunyi Li, Xiaosong Wang, Guangtao Zhai
Title: Towards All-in-One Medical Image Re-Identification
Abstract:
Medical image re-identification (MedReID) is under-explored so far, despite its critical applications in personalized healthcare and privacy protection. In this paper, we introduce a thorough benchmark and a unified model for this problem. First, to handle various medical modalities, we propose a novel Continuous Modality-based Parameter Adapter (ComPA). ComPA condenses medical content into a continuous modality representation and dynamically adjusts the modality-agnostic model with modality-specific parameters at runtime. This allows a single model to adaptively learn and process diverse modality data. Furthermore, we integrate medical priors into our model by aligning it with a bag of pre-trained medical foundation models, in terms of the differential features. Compared to single-image feature, modeling the inter-image difference better fits the re-identification problem, which involves discriminating multiple images. We evaluate the proposed model against 25 foundation models and 8 large multi-modal language models across 11 image datasets, demonstrating consistently superior performance. Additionally, we deploy the proposed MedReID technique to two real-world applications, i.e., history-augmented personalized diagnosis and medical privacy protection. Codes and model is available at \href{https://github.com/tianyuan168326/All-in-One-MedReID-Pytorch}{https://github.com/tianyuan168326/All-in-One-MedReID-Pytorch}.
中文摘要:本文针对医学图像重识别提出了一个全面基准和统一模型,通过新型连续模态参数适配器实现跨医学模态的自适应学习,并利用预训练医学基础模型的差异特征对齐整合医学先验知识,在多个数据集上展现出优越性能。
English Summary: This paper introduces a comprehensive benchmark and a unified model for medical image re-identification, featuring a novel Continuous Modality-based Parameter Adapter that enables adaptive learning across diverse medical modalities and integrates medical priors through differential feature alignment with pre-trained foundation models.

Authors:Chengzhi Ma, Kunqian Li, Shuaixin Liu, Han Mei
Title: Depth-Assisted Network for Indiscernible Marine Object Counting with Adaptive Motion-Differentiated Feature Encoding
Abstract:
Indiscernible marine object counting encounters numerous challenges, including limited visibility in underwater scenes, mutual occlusion and overlap among objects, and the dynamic similarity in appearance, color, and texture between the background and foreground. These factors significantly complicate the counting process. To address the scarcity of video-based indiscernible object counting datasets, we have developed a novel dataset comprising 50 videos, from which approximately 800 frames have been extracted and annotated with around 40,800 point-wise object labels. This dataset accurately represents real underwater environments where indiscernible marine objects are intricately integrated with their surroundings, thereby comprehensively illustrating the aforementioned challenges in object counting. To address these challenges, we propose a depth-assisted network with adaptive motion-differentiated feature encoding. The network consists of a backbone encoding module and three branches: a depth-assisting branch, a density estimation branch, and a motion weight generation branch. Depth-aware features extracted by the depth-assisting branch are enhanced via a depth-enhanced encoder to improve object representation. Meanwhile, weights from the motion weight generation branch refine multi-scale perception features in the adaptive flow estimation module. Experimental results demonstrate that our method not only achieves state-of-the-art performance on the proposed dataset but also yields competitive results on three additional video-based crowd counting datasets. The pre-trained model, code, and dataset are publicly available at https://github.com/OUCVisionGroup/VIMOC-Net.
Chinese: 本文通过提出新型视频数据集和具备自适应运动特征编码的深度辅助网络,解决了难以辨识海洋物体计数的挑战,在自建及现有数据集上均实现了最优性能。
English: This paper addresses the challenges of indiscernible marine object counting by introducing a novel video dataset and a depth-assisted network with adaptive motion-differentiated feature encoding, achieving state-of-the-art performance on both the proposed and existing datasets.

Authors:Huy Nguyen, Kien Nguyen, Akila Pemasiri, Feng Liu, Sridha Sridharan, Clinton Fookes
Title: AG-VPReID: A Challenging Large-Scale Benchmark for Aerial-Ground Video-based Person Re-Identification
Abstract:
We introduce AG-VPReID, a new large-scale dataset for aerial-ground video-based person re-identification (ReID) that comprises 6,632 subjects, 32,321 tracklets and over 9.6 million frames captured by drones (altitudes ranging from 15-120m), CCTV, and wearable cameras. This dataset offers a real-world benchmark for evaluating the robustness to significant viewpoint changes, scale variations, and resolution differences in cross-platform aerial-ground settings. In addition, to address these challenges, we propose AG-VPReID-Net, an end-to-end framework composed of three complementary streams: (1) an Adapted Temporal-Spatial Stream addressing motion pattern inconsistencies and facilitating temporal feature learning, (2) a Normalized Appearance Stream leveraging physics-informed techniques to tackle resolution and appearance changes, and (3) a Multi-Scale Attention Stream handling scale variations across drone altitudes. We integrate visual-semantic cues from all streams to form a robust, viewpoint-invariant whole-body representation. Extensive experiments demonstrate that AG-VPReID-Net outperforms state-of-the-art approaches on both our new dataset and existing video-based ReID benchmarks, showcasing its effectiveness and generalizability. Nevertheless, the performance gap observed on AG-VPReID across all methods underscores the dataset's challenging nature. The dataset, code and trained models are available at https://github.com/agvpreid25/AG-VPReID-Net.
中文: 我们提出了AG-VPReID大规模空地视频行人重识别数据集,并开发了AG-VPReID-Net端到端框架,该框架通过三个互补流集成视觉语义线索,在应对视角和尺度变化方面展现出优越性能。
English: We introduce AG-VPReID, a large-scale aerial-ground video dataset for person re-identification, and propose AG-VPReID-Net, an end-to-end framework with three complementary streams that integrates visual-semantic cues to achieve robust performance across challenging viewpoint and scale variations.

Authors:Ruipeng Wang, Junfeng Fang, Jiaqi Li, Hao Chen, Jie Shi, Kun Wang, Xiang Wang
Title: ACE: Concept Editing in Diffusion Models without Performance Degradation
Abstract:
Diffusion-based text-to-image models have demonstrated remarkable capabilities in generating realistic images, but they raise societal and ethical concerns, such as the creation of unsafe content. While concept editing is proposed to address these issues, they often struggle to balance the removal of unsafe concept with maintaining the model's general genera-tive capabilities. In this work, we propose ACE, a new editing method that enhances concept editing in diffusion models. ACE introduces a novel cross null-space projection approach to precisely erase unsafe concept while maintaining the model's ability to generate high-quality, semantically consistent images. Extensive experiments demonstrate that ACE significantly outperforms the advancing baselines,improving semantic consistency by 24.56% and image generation quality by 34.82% on average with only 1% of the time cost. These results highlight the practical utility of concept editing by mitigating its potential risks, paving the way for broader applications in the field. Code is avaliable at https://github.com/littlelittlenine/ACE-zero.git
中文: ACE提出了一种交叉零空间投影方法,能在扩散模型中精确消除不安全概念的同时保持图像生成质量与语义一致性,以极低的时间成本显著提升了模型性能。
English: ACE introduces a cross null-space projection method to effectively remove unsafe concepts from diffusion models while preserving image quality and semantic consistency, achieving significant improvements in performance with minimal time cost.

Authors:Jiale Wei, Xiang Ying, Tao Gao, Fangyi Bao, Felix Tao, Jingbo Shang
Title: AI-native Memory 2.0: Second Me
Abstract:
Human interaction with the external world fundamentally involves the exchange of personal memory, whether with other individuals, websites, applications, or, in the future, AI agents. A significant portion of this interaction is redundant, requiring users to repeatedly provide the same information across different contexts. Existing solutions, such as browser-stored credentials, autofill mechanisms, and unified authentication systems, have aimed to mitigate this redundancy by serving as intermediaries that store and retrieve commonly used user data. The advent of large language models (LLMs) presents an opportunity to redefine memory management through an AI-native paradigm: SECOND ME. SECOND ME acts as an intelligent, persistent memory offload system that retains, organizes, and dynamically utilizes user-specific knowledge. By serving as an intermediary in user interactions, it can autonomously generate context-aware responses, prefill required information, and facilitate seamless communication with external systems, significantly reducing cognitive load and interaction friction. Unlike traditional memory storage solutions, SECOND ME extends beyond static data retention by leveraging LLM-based memory parameterization. This enables structured organization, contextual reasoning, and adaptive knowledge retrieval, facilitating a more systematic and intelligent approach to memory management. As AI-driven personal agents like SECOND ME become increasingly integrated into digital ecosystems, SECOND ME further represents a critical step toward augmenting human-world interaction with persistent, contextually aware, and self-optimizing memory systems. We have open-sourced the fully localizable deployment system at GitHub: https://github.com/Mindverse/Second-Me.
中文摘要:SECOND ME系统利用大语言模型实现AI原生的记忆管理,通过智能存储、组织和动态运用个人信息,有效减少人类与外部世界交互中的重复操作,提升交互效率。
English Summary: The SECOND ME system leverages large language models to create an AI-native memory management solution that reduces redundancy in human-world interactions by intelligently storing, organizing, and dynamically utilizing personal information across various contexts.

Authors:Lizhen Xu, Xiuxiu Bai, Xiaojun Jia, Jianwu Fang, Shanmin Pang
Title: Accelerate 3D Object Detection Models via Zero-Shot Attention Key Pruning
Abstract:
Query-based methods with dense features have demonstrated remarkable success in 3D object detection tasks. However, the computational demands of these models, particularly with large image sizes and multiple transformer layers, pose significant challenges for efficient running on edge devices. Existing pruning and distillation methods either need retraining or are designed for ViT models, which are hard to migrate to 3D detectors. To address this issue, we propose a zero-shot runtime pruning method for transformer decoders in 3D object detection models. The method, termed tgGBC (trim keys gradually Guided By Classification scores), systematically trims keys in transformer modules based on their importance. We expand the classification score to multiply it with the attention map to get the importance score of each key and then prune certain keys after each transformer layer according to their importance scores. Our method achieves a 1.99x speedup in the transformer decoder of the latest ToC3D model, with only a minimal performance loss of less than 1%. Interestingly, for certain models, our method even enhances their performance. Moreover, we deploy 3D detectors with tgGBC on an edge device, further validating the effectiveness of our method. The code can be found at https://github.com/iseri27/tg_gbc.
中文: 提出的零样本剪枝方法tgGBC通过基于分类得分的注意力机制逐步修剪transformer模块中的关键键值,在3D目标检测中实现了近两倍的解码器加速,性能损失小于1%且部分模型性能反而提升,有效验证了边缘设备部署的可行性。
English: The proposed zero-shot pruning method, tgGBC, efficiently accelerates transformer decoders in 3D object detection by trimming keys based on classification-guided importance scores, achieving near 2x speedup with minimal performance loss and even enhancing some models when deployed on edge devices.

Authors:Kai Deng, Yigong Zhang, Jian Yang, Jin Xie
Title: GigaSLAM: Large-Scale Monocular SLAM with Hierarchical Gaussian Splats
Abstract:
Tracking and mapping in large-scale, unbounded outdoor environments using only monocular RGB input presents substantial challenges for existing SLAM systems. Traditional Neural Radiance Fields (NeRF) and 3D Gaussian Splatting (3DGS) SLAM methods are typically limited to small, bounded indoor settings. To overcome these challenges, we introduce GigaSLAM, the first RGB NeRF / 3DGS-based SLAM framework for kilometer-scale outdoor environments, as demonstrated on the KITTI, KITTI 360, 4 Seasons and A2D2 datasets. Our approach employs a hierarchical sparse voxel map representation, where Gaussians are decoded by neural networks at multiple levels of detail. This design enables efficient, scalable mapping and high-fidelity viewpoint rendering across expansive, unbounded scenes. For front-end tracking, GigaSLAM utilizes a metric depth model combined with epipolar geometry and PnP algorithms to accurately estimate poses, while incorporating a Bag-of-Words-based loop closure mechanism to maintain robust alignment over long trajectories. Consequently, GigaSLAM delivers high-precision tracking and visually faithful rendering on urban outdoor benchmarks, establishing a robust SLAM solution for large-scale, long-term scenarios, and significantly extending the applicability of Gaussian Splatting SLAM systems to unbounded outdoor environments. GitHub: https://github.com/DengKaiCQ/GigaSLAM.
中文: GigaSLAM是首个基于神经辐射场和3D高斯泼溅的RGB SLAM框架,通过分层地图表示和鲁棒的姿态估计,实现了公里级户外环境的高精度追踪与逼真渲染。
English: GigaSLAM is the first RGB-based SLAM framework using Neural Radiance Fields and 3D Gaussian Splatting that enables high-precision tracking and realistic rendering for kilometer-scale outdoor environments through hierarchical mapping and robust pose estimation.

Authors:Ali Veisi, Hamidreza Amirzadeh, Amir Mansourian
Title: Context-aware Biases for Length Extrapolation
Abstract:
Transformers often struggle to generalize to longer sequences than those seen during training, a limitation known as length extrapolation. Most existing Relative Positional Encoding (RPE) methods attempt to address this by introducing either fixed linear biases or globally learned biases, which lack the capacity to adapt to different input contexts. In this work, we propose an additive RPE, Context-Aware Biases for Length Extrapolation (CABLE), a method that learns token-specific, context-aware biases for each attention head in transformers. By dynamically adjusting positional biases based on the input sequence, CABLE overcomes the rigidity of fixed RPEs. When evaluated on sequences longer than originally trained with, GPT-2 Medium (334M parameters) with CABLE achieves lower perplexity than counterparts using other widely adopted positional encoding methods. Additionally, by applying CABLE to the BERT base model we improved performance in long-context retrieval tasks. Our method significantly enhances the extrapolation performance of existing RPE methods tested on the FineWeb-Edu-10B and WikiText-103 datasets. Our code is available at: https://github.com/AlgonetLabs/Cable.
中文: 本文提出CABLE方法,一种上下文感知的相对位置编码技术,能根据输入序列动态调整位置偏置,显著提升了Transformer模型在训练未见过的长序列上的泛化性能。
English: The paper introduces CABLE, a context-aware relative positional encoding method that dynamically adjusts biases based on input sequences, significantly improving transformer models' performance on longer sequences than seen during training.

Authors:Nadarasar Bahavan, Sachith Seneviratne, Saman Halgamuge
Title: SphOR: A Representation Learning Perspective on Open-set Recognition for Identifying Unknown Classes in Deep Learning Models
Abstract:
The widespread use of deep learning classifiers necessitates Open-set recognition (OSR), which enables the identification of input data not only from classes known during training but also from unknown classes that might be present in test data. Many existing OSR methods are computationally expensive due to the reliance on complex generative models or suffer from high training costs. We investigate OSR from a representation-learning perspective, specifically through spherical embeddings. We introduce SphOR, a computationally efficient representation learning method that models the feature space as a mixture of von Mises-Fisher distributions. This approach enables the use of semantically ambiguous samples during training, to improve the detection of samples from unknown classes. We further explore the relationship between OSR performance and key representation learning properties which influence how well features are structured in high-dimensional space. Extensive experiments on multiple OSR benchmarks demonstrate the effectiveness of our method, producing state-of-the-art results, with improvements up-to 6% that validate its performance. Code at https://github.com/nadarasarbahavan/SpHOR
中文摘要:SphOR是一种基于球形嵌入和冯·米塞斯-费希尔分布的高效开放集识别方法,在多个基准测试中实现了最高达6%的性能提升,达到领先水平。
English Summary: SphOR is an efficient open-set recognition method using spherical embeddings with von Mises-Fisher distributions, achieving state-of-the-art performance improvements up to 6% on benchmarks.

Authors:Sanghyuk Chun, Sangdoo Yun
Title: LongProLIP: A Probabilistic Vision-Language Model with Long Context Text
Abstract:
Recently, Probabilistic Language-Image Pre-Training (ProLIP) has been proposed to tackle the multiplicity issue of vision-language (VL) tasks. Despite their success in probabilistic representation learning at a scale, the ProLIP models cannot handle long context texts longer than 64 context length, which limits their ability to capture rich contextual information from longer text sequences. To address this issue, this paper proposes a fine-tuning strategy for ProLIP to accept longer texts, e.g., 256 text tokens. Experimental results on Urban-1k and the DataComp evaluation suite show that the proposed LongProLIP recipe can improve understanding of long contexts while minimizing the negative effect of fine-tuning.We also observe a trade-off between the long context understanding (measured by Urban-1k) and general zero-shot capability (measured by evaluation datasets by DataComp). Code is available at https://github.com/naver-ai/prolip
中文: 本文提出了一种针对概率语言-图像预训练(ProLIP)的微调策略,使其能处理长达256个标记的文本序列,在Urban-1k和DataComp基准测试中验证了该方法在提升长文本理解能力的同时,尽可能保持了模型的通用零样本能力。
English: This paper introduces a fine-tuning strategy for Probabilistic Language-Image Pre-Training (ProLIP) to handle longer text sequences up to 256 tokens, improving long-context understanding while maintaining general zero-shot capabilities, as validated on Urban-1k and DataComp benchmarks.

Authors:Xuan Lu, Sifan Liu, Bochao Yin, Yongqi Li, Xinghao Chen, Hui Su, Yaohui Jin, Wenjun Zeng, Xiaoyu Shen
Title: MultiConIR: Towards multi-condition Information Retrieval
Abstract:
Multi-condition information retrieval (IR) presents a significant, yet underexplored challenge for existing systems. This paper introduces MultiConIR, a benchmark specifically designed to evaluate retrieval and reranking models under nuanced multi-condition query scenarios across five diverse domains. We systematically assess model capabilities through three critical tasks: complexity robustness, relevance monotonicity, and query format sensitivity. Our extensive experiments on 15 models reveal a critical vulnerability: most retrievers and rerankers exhibit severe performance degradation as query complexity increases. Key deficiencies include widespread failure to maintain relevance monotonicity, and high sensitivity to query style and condition placement. The superior performance of GPT-4o reveals the performance gap between IR systems and advanced LLM for handling sophisticated natural language queries. Furthermore, this work delves into the factors contributing to reranker performance deterioration and examines how condition positioning within queries affects similarity assessment, providing crucial insights for advancing IR systems towards complex search scenarios. The code and datasets are available at https://github.com/EIT-NLP/MultiConIR
中文: MultiConIR基准测试表明,大多数检索和重排序模型在查询复杂度增加时性能显著下降,而GPT-4o在处理复杂多条件查询方面展现出卓越能力。
English: MultiConIR benchmark reveals that most retrieval and reranking models suffer significant performance degradation with increasing query complexity, while GPT-4o demonstrates superior capability in handling sophisticated multi-condition queries.

Authors:Ying Fu Lim, Jiawen Zhu, Guansong Pang
Title: Adapting Large Language Models for Parameter-Efficient Log Anomaly Detection
Abstract:
Log Anomaly Detection (LAD) seeks to identify atypical patterns in log data that are crucial to assessing the security and condition of systems. Although Large Language Models (LLMs) have shown tremendous success in various fields, the use of LLMs in enabling the detection of log anomalies is largely unexplored. This work aims to fill this gap. Due to the prohibitive costs involved in fully fine-tuning LLMs, we explore the use of parameter-efficient fine-tuning techniques (PEFTs) for adapting LLMs to LAD. To have an in-depth exploration of the potential of LLM-driven LAD, we present a comprehensive investigation of leveraging two of the most popular PEFTs -- Low-Rank Adaptation (LoRA) and Representation Fine-tuning (ReFT) -- to tap into three prominent LLMs of varying size, including RoBERTa, GPT-2, and Llama-3, for parameter-efficient LAD. Comprehensive experiments on four public log datasets are performed to reveal important insights into effective LLM-driven LAD in several key perspectives, including the efficacy of these PEFT-based LLM-driven LAD methods, their stability, sample efficiency, robustness w.r.t. unstable logs, and cross-dataset generalization. Code is available at https://github.com/mala-lab/LogADReft.
中文: 本研究探索使用LoRA和ReFT等参数高效微调技术,将大语言模型适配于日志异常检测任务,并在多个数据集上从效能、稳定性等关键维度评估其性能。
English: This study explores parameter-efficient fine-tuning techniques like LoRA and ReFT to adapt large language models for log anomaly detection, evaluating their effectiveness across multiple datasets and key performance aspects.

Authors:Jiequan Cui, Beier Zhu, Qingshan Xu, Zhuotao Tian, Xiaojuan Qi, Bei Yu, Hanwang Zhang, Richang Hong
Title: Generalized Kullback-Leibler Divergence Loss
Abstract:
In this paper, we delve deeper into the Kullback-Leibler (KL) Divergence loss and mathematically prove that it is equivalent to the Decoupled Kullback-Leibler (DKL) Divergence loss that consists of (1) a weighted Mean Square Error (wMSE) loss and (2) a Cross-Entropy loss incorporating soft labels. Thanks to the decoupled structure of DKL loss, we have identified two areas for improvement. Firstly, we address the limitation of KL loss in scenarios like knowledge distillation by breaking its asymmetric optimization property along with a smoother weight function. This modification effectively alleviates convergence challenges in optimization, particularly for classes with high predicted scores in soft labels. Secondly, we introduce class-wise global information into KL/DKL to reduce bias arising from individual samples. With these two enhancements, we derive the Generalized Kullback-Leibler (GKL) Divergence loss and evaluate its effectiveness by conducting experiments on CIFAR-10/100, ImageNet, and vision-language datasets, focusing on adversarial training, and knowledge distillation tasks. Specifically, we achieve new state-of-the-art adversarial robustness on the public leaderboard -- RobustBench and competitive knowledge distillation performance across CIFAR/ImageNet models and CLIP models, demonstrating the substantial practical merits. Our code is available at https://github.com/jiequancui/DKL.
中文: 本文从数学上证明了KL散度损失与解耦KL(DKL)损失的等价性,提出了一种广义KL(GKL)损失,通过改进非对称优化特性和引入类间全局信息,在多个数据集上实现了最先进的对抗鲁棒性和有竞争力的知识蒸馏性能。
English: This paper mathematically demonstrates the equivalence between KL Divergence loss and Decoupled KL (DKL) loss, proposing a Generalized KL (GKL) loss with two enhancements—addressing asymmetric optimization and incorporating class-wise global information—which achieves state-of-the-art adversarial robustness and competitive knowledge distillation performance across multiple datasets.

Authors:S M A Sharif, Rizwan Ali Naqvi, Mithun Biswas, Woong-Kee Loh
Title: Deep Perceptual Enhancement for Medical Image Analysis
Abstract:
Due to numerous hardware shortcomings, medical image acquisition devices are susceptible to producing low-quality (i.e., low contrast, inappropriate brightness, noisy, etc.) images. Regrettably, perceptually degraded images directly impact the diagnosis process and make the decision-making manoeuvre of medical practitioners notably complicated. This study proposes to enhance such low-quality images by incorporating end-to-end learning strategies for accelerating medical image analysis tasks. To the best concern, this is the first work in medical imaging which comprehensively tackles perceptual enhancement, including contrast correction, luminance correction, denoising, etc., with a fully convolutional deep network. The proposed network leverages residual blocks and a residual gating mechanism for diminishing visual artefacts and is guided by a multi-term objective function to perceive the perceptually plausible enhanced images. The practicability of the deep medical image enhancement method has been extensively investigated with sophisticated experiments. The experimental outcomes illustrate that the proposed method could outperform the existing enhancement methods for different medical image modalities by 5.00 to 7.00 dB in peak signal-to-noise ratio (PSNR) metrics and 4.00 to 6.00 in DeltaE metrics. Additionally, the proposed method can drastically improve the medical image analysis tasks' performance and reveal the potentiality of such an enhancement method in real-world applications. Code Available: https://github.com/sharif-apu/DPE_JBHI
本研究提出了一种开创性的全卷积深度网络,通过端到端学习全面提升低质量医学图像的感知质量,在图像增强指标和诊断任务性能上均显著优于现有方法。
This study introduces a pioneering fully convolutional deep network that comprehensively enhances low-quality medical images through end-to-end learning, significantly outperforming existing methods in both perceptual quality and diagnostic task performance.

Authors:Sudarshan Regmi
Title: AdaSCALE: Adaptive Scaling for OOD Detection
Abstract:
The ability of the deep learning model to recognize when a sample falls outside its learned distribution is critical for safe and reliable deployment. Recent state-of-the-art out-of-distribution (OOD) detection methods leverage activation shaping to improve the separation between in-distribution (ID) and OOD inputs. These approaches resort to sample-specific scaling but apply a static percentile threshold across all samples regardless of their nature, resulting in suboptimal ID-OOD separability. In this work, we propose \textbf{AdaSCALE}, an adaptive scaling procedure that dynamically adjusts the percentile threshold based on a sample's estimated OOD likelihood. This estimation leverages our key observation: OOD samples exhibit significantly more pronounced activation shifts at high-magnitude activations under minor perturbation compared to ID samples. AdaSCALE enables stronger scaling for likely ID samples and weaker scaling for likely OOD samples, yielding highly separable energy scores. Our approach achieves state-of-the-art OOD detection performance, outperforming the latest rival OptFS by 14.94 in near-OOD and 21.67 in far-OOD datasets in average FPR@95 metric on the ImageNet-1k benchmark across eight diverse architectures. The code is available at: https://github.com/sudarshanregmi/AdaSCALE/
中文: AdaSCALE提出了一种自适应缩放方法,根据样本的分布外似然度动态调整百分位阈值,通过扰动下激活偏移增强分布内外样本的可分性,实现了最先进的异常检测性能。
English: AdaSCALE introduces an adaptive scaling method that dynamically adjusts percentile thresholds based on estimated out-of-distribution likelihood, achieving state-of-the-art OOD detection performance by enhancing ID-OOD separability through activation shifts under perturbation.

Authors:Bozhi Luan, Wengang Zhou, Hao Feng, Zhe Wang, Xiaosong Li, Houqiang Li
Title: Multi-Cue Adaptive Visual Token Pruning for Large Vision-Language Models
Abstract:
As the computational needs of Large Vision-Language Models (LVLMs) increase, visual token pruning has proven effective in improving inference speed and memory efficiency. Traditional pruning methods in LVLMs predominantly focus on attention scores to determine token relevance, overlooking critical aspects such as spatial position and token similarity. To this end, we introduce AdaptPrune, a novel plug-and-play training-free pruning method that builds on conventional attention-based pruning by integrating spatial distance and token similarity with an adaptive NMS approach. Our method is based on several observed phenomena in large models: the positional bias in the model's image attention and the redundancy of token information ignored by previous approaches. By integrating attention, spatial, and similarity information, our approach ensures a comprehensive evaluation of token importance and substantially refines the pruning decisions. Our method has been extensively tested across various LVLMs and benchmarks, confirming its robustness and adaptability. The results demonstrate that AdaptPrune consistently outperforms existing methods across various pruning ratios. Code is available at https://github.com/bzluan/AdaptPrune.
Chinese: AdaptPrune是一种无需训练的可视化令牌剪枝方法,通过结合注意力分数、空间距离和令牌相似性与自适应NMS,有效提升大型视觉语言模型的效率,并在多种剪枝比例下持续优于现有方法。
English: AdaptPrune is a training-free visual token pruning method that enhances LVLM efficiency by integrating attention scores, spatial distance, and token similarity with adaptive NMS, consistently outperforming existing methods across various pruning ratios.

Authors:Meghna Roy Chowdhury, Wei Xuan, Shreyas Sen, Yixue Zhao, Yi Ding
Title: Predicting and Understanding College Student Mental Health with Interpretable Machine Learning
Abstract:
Mental health issues among college students have reached critical levels, significantly impacting academic performance and overall wellbeing. Predicting and understanding mental health status among college students is challenging due to three main factors: the necessity for large-scale longitudinal datasets, the prevalence of black-box machine learning models lacking transparency, and the tendency of existing approaches to provide aggregated insights at the population level rather than individualized understanding. To tackle these challenges, this paper presents I-HOPE, the first Interpretable Hierarchical mOdel for Personalized mEntal health prediction. I-HOPE is a two-stage hierarchical model that connects raw behavioral features to mental health status through five defined behavioral categories as interaction labels. We evaluate I-HOPE on the College Experience Study, the longest longitudinal mobile sensing dataset. This dataset spans five years and captures data from both pre-pandemic periods and the COVID-19 pandemic. I-HOPE achieves a prediction accuracy of 91%, significantly surpassing the 60-70% accuracy of baseline methods. In addition, I-HOPE distills complex patterns into interpretable and individualized insights, enabling the future development of tailored interventions and improving mental health support. The code is available at https://github.com/roycmeghna/I-HOPE.
中文: 本文提出I-HOPE可解释分层模型,通过行为分析以91%的预测准确率解决大学生心理健康预测难题,并提供个性化解读。
English: This paper introduces I-HOPE, an interpretable hierarchical model that addresses challenges in predicting college students' mental health by achieving 91% accuracy and providing individualized insights through behavioral analysis.

Authors:Sanghyun Jo, Ziseok Lee, Wooyeol Lee, Kyungsu Kim
Title: DiffEGG: Diffusion-Driven Edge Generation as a Pixel-Annotation-Free Alternative for Instance Annotation
Abstract:
Achieving precise panoptic segmentation relies on pixel-wise instance annotations, but obtaining such datasets is costly. Unsupervised instance segmentation (UIS) eliminates annotation requirements but struggles with adjacent instance merging and single-instance fragmentation, largely due to the limitations of DINO-based backbones which lack strong instance separation cues. Weakly-supervised panoptic segmentation (WPS) reduces annotation costs using sparse labels (e.g., points, boxes), yet these annotations remain expensive and introduce human bias and boundary errors. To address these challenges, we propose DiffEGG (Diffusion-Driven EdGe Generation), a fully annotation-free method that extracts instance-aware features from pretrained diffusion models to generate precise instance edge maps. Unlike DINO-based UIS methods, diffusion models inherently capture fine-grained, instance-aware features, enabling more precise boundary delineation. For WPS, DiffEGG eliminates annotation costs and human bias by operating without any form of manual supervision, addressing the key limitations of prior best methods. Additionally, we introduce RIP, a post-processing technique that fuses DiffEGG's edge maps with segmentation masks in a task-agnostic manner. RIP allows DiffEGG to be seamlessly integrated into various segmentation frameworks. When applied to UIS, DiffEGG and RIP achieve an average $+4.4\text{ AP}$ improvement over prior best UIS methods. When combined with weakly-supervised semantic segmentation (WSS), DiffEGG enables WPS without instance annotations, outperforming prior best point-supervised WPS methods by $+1.7\text{ PQ}$. These results demonstrate that DiffEGG's edge maps serve as a cost-effective, annotation-free alternative to instance annotations, significantly improving segmentation without human intervention. Code is available at https://github.com/shjo-april/DiffEGG.
Chinese: DiffEGG提出了一种无需标注的方法,利用预训练扩散模型生成精确的实例边缘图,克服了现有无监督和弱监督分割方法的局限性,并实现了显著的性能提升。
English: DiffEGG introduces an annotation-free method using pretrained diffusion models to generate precise instance edge maps, overcoming limitations of existing unsupervised and weakly-supervised segmentation methods while achieving significant performance improvements.

Authors:Zhao Yang, Bing Su, Chuan Cao, Ji-Rong Wen
Title: Regulatory DNA sequence Design with Reinforcement Learning
Abstract:
Cis-regulatory elements (CREs), such as promoters and enhancers, are relatively short DNA sequences that directly regulate gene expression. The fitness of CREs, measured by their ability to modulate gene expression, highly depends on the nucleotide sequences, especially specific motifs known as transcription factor binding sites (TFBSs). Designing high-fitness CREs is crucial for therapeutic and bioengineering applications. Current CRE design methods are limited by two major drawbacks: (1) they typically rely on iterative optimization strategies that modify existing sequences and are prone to local optima, and (2) they lack the guidance of biological prior knowledge in sequence optimization. In this paper, we address these limitations by proposing a generative approach that leverages reinforcement learning (RL) to fine-tune a pre-trained autoregressive (AR) model. Our method incorporates data-driven biological priors by deriving computational inference-based rewards that simulate the addition of activator TFBSs and removal of repressor TFBSs, which are then integrated into the RL process. We evaluate our method on promoter design tasks in two yeast media conditions and enhancer design tasks for three human cell types, demonstrating its ability to generate high-fitness CREs while maintaining sequence diversity. The code is available at https://github.com/yangzhao1230/TACO.
中文: 本文提出了一种结合强化学习的生成方法,通过微调自回归模型并整合模拟转录因子结合位点修饰的生物先验知识,成功设计了适用于酵母和人类细胞的高适应性顺式调控元件。
English: This paper introduces a generative approach using reinforcement learning to fine-tune an autoregressive model for designing high-fitness cis-regulatory elements by incorporating biological priors that simulate transcription factor binding site modifications, demonstrating effectiveness across yeast and human cell applications.

Authors:Jiahao Xu, Zikai Zhang, Rui Hu
Title: Detecting Backdoor Attacks in Federated Learning via Direction Alignment Inspection
Abstract:
The distributed nature of training makes Federated Learning (FL) vulnerable to backdoor attacks, where malicious model updates aim to compromise the global model's performance on specific tasks. Existing defense methods show limited efficacy as they overlook the inconsistency between benign and malicious model updates regarding both general and fine-grained directions. To fill this gap, we introduce AlignIns, a novel defense method designed to safeguard FL systems against backdoor attacks. AlignIns looks into the direction of each model update through a direction alignment inspection process. Specifically, it examines the alignment of model updates with the overall update direction and analyzes the distribution of the signs of their significant parameters, comparing them with the principle sign across all model updates. Model updates that exhibit an unusual degree of alignment are considered malicious and thus be filtered out. We provide the theoretical analysis of the robustness of AlignIns and its propagation error in FL. Our empirical results on both independent and identically distributed (IID) and non-IID datasets demonstrate that AlignIns achieves higher robustness compared to the state-of-the-art defense methods. The code is available at https://github.com/JiiahaoXU/AlignIns.
Chinese: AlignIns是一种新颖的防御方法,通过检查模型更新的方向一致性,根据其与整体更新方向和参数符号分布的偏差来过滤恶意更新,从而保护联邦学习免受后门攻击。
English: AlignIns is a novel defense method that protects Federated Learning from backdoor attacks by inspecting the direction alignment of model updates and filtering out malicious ones based on their deviation from the overall update direction and parameter sign distribution.

Authors:Chen Liu, Feng Qiu, Wei Zhang, Lincheng Li, Dadong Wang, Xin Yu
Title: 7ABAW-Compound Expression Recognition via Curriculum Learning
Abstract:
With the advent of deep learning, expression recognition has made significant advancements. However, due to the limited availability of annotated compound expression datasets and the subtle variations of compound expressions, Compound Emotion Recognition (CE) still holds considerable potential for exploration. To advance this task, the 7th Affective Behavior Analysis in-the-wild (ABAW) competition introduces the Compound Expression Challenge based on C-EXPR-DB, a limited dataset without labels. In this paper, we present a curriculum learning-based framework that initially trains the model on single-expression tasks and subsequently incorporates multi-expression data. This design ensures that our model first masters the fundamental features of basic expressions before being exposed to the complexities of compound emotions. Specifically, our designs can be summarized as follows: 1) Single-Expression Pre-training: The model is first trained on datasets containing single expressions to learn the foundational facial features associated with basic emotions. 2) Dynamic Compound Expression Generation: Given the scarcity of annotated compound expression datasets, we employ CutMix and Mixup techniques on the original single-expression images to create hybrid images exhibiting characteristics of multiple basic emotions. 3) Incremental Multi-Expression Integration: After performing well on single-expression tasks, the model is progressively exposed to multi-expression data, allowing the model to adapt to the complexity and variability of compound expressions. The official results indicate that our method achieves the \textbf{best} performance in this competition track with an F-score of 0.6063. Our code is released at https://github.com/YenanLiu/ABAW7th.
中文: 本文提出一种课程学习框架,先在单一表情数据上预训练模型,再逐步引入复合表情数据,最终在ABAW7竞赛中以0.6063的F值取得了最佳性能。
English: This paper introduces a curriculum learning framework that first trains on single-expression data and then incrementally incorporates multi-expression data, achieving the best performance in the ABAW7 competition with an F-score of 0.6063.

Authors:Andrew Gao, Jun Liu
Title: STEAD: Spatio-Temporal Efficient Anomaly Detection for Time and Compute Sensitive Applications
Abstract:
This paper presents a new method for anomaly detection in automated systems with time and compute sensitive requirements, such as autonomous driving, with unparalleled efficiency. As systems like autonomous driving become increasingly popular, ensuring their safety has become more important than ever. Therefore, this paper focuses on how to quickly and effectively detect various anomalies in the aforementioned systems, with the goal of making them safer and more effective. Many detection systems have been developed with great success under spatial contexts; however, there is still significant room for improvement when it comes to temporal context. While there is substantial work regarding this task, there is minimal work done regarding the efficiency of models and their ability to be applied to scenarios that require real-time inference, i.e., autonomous driving where anomalies need to be detected the moment they are within view. To address this gap, we propose STEAD (Spatio-Temporal Efficient Anomaly Detection), whose backbone is developed using (2+1)D Convolutions and Performer Linear Attention, which ensures computational efficiency without sacrificing performance. When tested on the UCF-Crime benchmark, our base model achieves an AUC of 91.34%, outperforming the previous state-of-the-art, and our fast version achieves an AUC of 88.87%, while having 99.70% less parameters and outperforming the previous state-of-the-art as well. The code and pretrained models are made publicly available at https://github.com/agao8/STEAD
本文提出STEAD方法,通过(2+1)D卷积和Performer线性注意力实现高效异常检测,在UCF-Crime基准测试中以极少参数量达到最优性能。
This paper introduces STEAD, a computationally efficient anomaly detection method using (2+1)D Convolutions and Performer Linear Attention, achieving state-of-the-art performance on the UCF-Crime benchmark with significantly reduced parameters.

Authors:Minkyun Seo, Hyungtae Lim, Kanghee Lee, Luca Carlone, Jaesik Park
Title: BUFFER-X: Towards Zero-Shot Point Cloud Registration in Diverse Scenes
Abstract:
Recent advances in deep learning-based point cloud registration have improved generalization, yet most methods still require retraining or manual parameter tuning for each new environment. In this paper, we identify three key factors limiting generalization: (a) reliance on environment-specific voxel size and search radius, (b) poor out-of-domain robustness of learning-based keypoint detectors, and (c) raw coordinate usage, which exacerbates scale discrepancies. To address these issues, we present a zero-shot registration pipeline called BUFFER-X by (a) adaptively determining voxel size/search radii, (b) using farthest point sampling to bypass learned detectors, and (c) leveraging patch-wise scale normalization for consistent coordinate bounds. In particular, we present a multi-scale patch-based descriptor generation and a hierarchical inlier search across scales to improve robustness in diverse scenes. We also propose a novel generalizability benchmark using 11 datasets that cover various indoor/outdoor scenarios and sensor modalities, demonstrating that BUFFER-X achieves substantial generalization without prior information or manual parameter tuning for the test datasets. Our code is available at https://github.com/MIT-SPARK/BUFFER-X.
Chinese: 当前基于深度学习的点云配准方法仍存在泛化性问题,需针对新环境重新训练或手动调参,但提出的BUFFER-X流程通过自适应参数设置、最远点采样和局部尺度归一化,在多种数据集上实现了无需先验知识或手动调参的零样本鲁棒性能。
English: Recent deep learning point cloud registration methods still face generalization issues, requiring retraining or manual tuning for new environments, but the proposed BUFFER-X pipeline overcomes these by adaptively setting parameters, using farthest point sampling, and applying patch-wise normalization to achieve robust zero-shot performance across diverse datasets.

Authors:Samuel Cahyawijaya, Holy Lovenia, Joel Ruben Antony Moniz, Tack Hwa Wong, Mohammad Rifqi Farhansyah, Thant Thiri Maung, Frederikus Hudi, David Anugraha, Muhammad Ravi Shulthan Habibi, Muhammad Reza Qorib, Amit Agarwal, Joseph Marvin Imperial, Hitesh Laxmichand Patel, Vicky Feliren, Bahrul Ilmi Nasution, Manuel Antonio Rufino, Genta Indra Winata, Rian Adam Rajagede, Carlos Rafael Catalan, Mohamed Fazli Imam, Priyaranjan Pattnayak, Salsabila Zahirah Pranida, Kevin Pratama, Yeshil Bangera, Adisai Na-Thalang, Patricia Nicole Monderin, Yueqi Song, Christian Simon, Lynnette Hui Xian Ng, Richardy Lobo' Sapan, Taki Hasan Rafi, Bin Wang, Supryadi, Kanyakorn Veerakanjana, Piyalitt Ittichaiwong, Matthew Theodore Roque, Karissa Vincentio, Takdanai Kreangphet, Phakphum Artkaew, Kadek Hendrawan Palgunadi, Yanzhi Yu, Rochana Prih Hastuti, William Nixon, Mithil Bangera, Adrian Xuan Wei Lim, Aye Hninn Khine, Hanif Muhammad Zhafran, Teddy Ferdinan, Audra Aurora Izzani, Ayushman Singh, Evan, Jauza Akbar Krito, Michael Anugraha, Fenal Ashokbhai Ilasariya, Haochen Li, John Amadeo Daniswara, Filbert Aurelian Tjiaranata, Eryawan Presma Yulianrifat, Can Udomcharoenchaikit, Fadil Risdian Ansori, Mahardika Krisna Ihsani, Giang Nguyen, Anab Maulana Barik, Dan John Velasco, Rifo Ahmad Genadi, Saptarshi Saha, Chengwei Wei, Isaiah Flores, Kenneth Ko Han Chen, Anjela Gail Santos, Wan Shen Lim, Kaung Si Phyo, Tim Santos, Meisyarah Dwiastuti, Jiayun Luo, Jan Christian Blaise Cruz, Ming Shan Hee, Ikhlasul Akmal Hanif, M. Alif Al Hakim, Muhammad Rizky Sya'ban, Kun Kerdthaisong, Lester James V. Miranda, Fajri Koto, Tirana Noor Fatyanosa, Alham Fikri Aji, Jostin Jerico Rosal, Jun Kevin, Robert Wijaya, Onno P. Kampman, Ruochen Zhang, Börje F. Karlsson, Peerat Limkonchotiwat
Title: Crowdsource, Crawl, or Generate? Creating SEA-VL, a Multicultural Vision-Language Dataset for Southeast Asia
Abstract:
Southeast Asia (SEA) is a region of extraordinary linguistic and cultural diversity, yet it remains significantly underrepresented in vision-language (VL) research. This often results in artificial intelligence (AI) models that fail to capture SEA cultural nuances. To fill this gap, we present SEA-VL, an open-source initiative dedicated to developing high-quality, culturally relevant data for SEA languages. By involving contributors from SEA countries, SEA-VL aims to ensure better cultural relevance and diversity, fostering greater inclusivity of underrepresented languages in VL research. Beyond crowdsourcing, our initiative goes one step further in the exploration of the automatic collection of culturally relevant images through crawling and image generation. First, we find that image crawling achieves approximately ~85% cultural relevance while being more cost- and time-efficient than crowdsourcing. Second, despite the substantial progress in generative vision models, synthetic images remain unreliable in accurately reflecting SEA cultures. The generated images often fail to reflect the nuanced traditions and cultural contexts of the region. Collectively, we gather 1.28M SEA culturally-relevant images, more than 50 times larger than other existing datasets. Through SEA-VL, we aim to bridge the representation gap in SEA, fostering the development of more inclusive AI systems that authentically represent diverse cultures across SEA.
中文摘要:SEA-VL是一个开源项目,旨在通过众包和自动化方法收集东南亚文化相关数据,以弥补该地区在视觉语言研究中的代表性不足,最终汇集了128万张图像,推动构建更具包容性的人工智能系统。
English Summary: SEA-VL is an open-source initiative addressing the underrepresentation of Southeast Asian cultures in vision-language research by creating culturally relevant datasets through crowdsourcing and automated methods, ultimately gathering 1.28M images to foster more inclusive AI systems.

Authors:Yuru Jia, Valerio Marsocci, Ziyang Gong, Xue Yang, Maarten Vergauwen, Andrea Nascetti
Title: Can Generative Geospatial Diffusion Models Excel as Discriminative Geospatial Foundation Models?
Abstract:
Self-supervised learning (SSL) has revolutionized representation learning in Remote Sensing (RS), advancing Geospatial Foundation Models (GFMs) to leverage vast unlabeled satellite imagery for diverse downstream tasks. Currently, GFMs primarily employ objectives like contrastive learning or masked image modeling, owing to their proven success in learning transferable representations. However, generative diffusion models, which demonstrate the potential to capture multi-grained semantics essential for RS tasks during image generation, remain underexplored for discriminative applications. This prompts the question: can generative diffusion models also excel and serve as GFMs with sufficient discriminative power? In this work, we answer this question with SatDiFuser, a framework that transforms a diffusion-based generative geospatial foundation model into a powerful pretraining tool for discriminative RS. By systematically analyzing multi-stage, noise-dependent diffusion features, we develop three fusion strategies to effectively leverage these diverse representations. Extensive experiments on remote sensing benchmarks show that SatDiFuser outperforms state-of-the-art GFMs, achieving gains of up to +5.7% mIoU in semantic segmentation and +7.9% F1-score in classification, demonstrating the capacity of diffusion-based generative foundation models to rival or exceed discriminative GFMs. The source code is available at: https://github.com/yurujaja/SatDiFuser.
中文: SatDiFuser将生成式扩散模型转化为强大的判别式遥感预训练工具,通过融合多阶段扩散特征,在语义分割和分类任务中超越现有最优地理空间基础模型。
English: SatDiFuser transforms generative diffusion models into effective discriminative tools for remote sensing, achieving state-of-the-art performance in tasks like semantic segmentation and classification through innovative fusion of multi-stage diffusion features.

Authors:Anh-Kiet Duong
Title: Elderly Activity Recognition in the Wild: Results from the EAR Challenge
Abstract:
This paper presents our solution for the Elderly Action Recognition (EAR) Challenge, part of the Computer Vision for Smalls Workshop at WACV 2025. The competition focuses on recognizing Activities of Daily Living (ADLs) performed by the elderly, covering six action categories with a diverse dataset. Our approach builds upon a state-of-the-art action recognition model, fine-tuned through transfer learning on elderly-specific datasets to enhance adaptability. To improve generalization and mitigate dataset bias, we carefully curated training data from multiple publicly available sources and applied targeted pre-processing techniques. Our solution currently achieves 0.81455 accuracy on the public leaderboard, highlighting its effectiveness in classifying elderly activities. Source codes are publicly available at https://github.com/ffyyytt/EAR-WACV25-DAKiet-TSM.
Chinese: 本文介绍了老年人行为识别挑战的解决方案,通过优化先进模型并整合多源数据,在老年日常活动分类中取得了0.81455的准确率。
English: This paper details a solution for the Elderly Action Recognition Challenge, utilizing a fine-tuned state-of-the-art model and curated multi-source data to achieve 0.81455 accuracy in classifying daily activities of the elderly.

Authors:Yongqiang Yao, Jingru Tan, Kaihuan Liang, Feizhao Zhang, Jiahao Hu, Shuo Wu, Yazhe Niu, Ruihao Gong, Dahua Lin, Ningyi Xu
Title: Hierarchical Balance Packing: Towards Efficient Supervised Fine-tuning for Long-Context LLM
Abstract:
Training Long-Context Large Language Models (LLMs) is challenging, as hybrid training with long-context and short-context data often leads to workload imbalances. Existing works mainly use data packing to alleviate this issue, but fail to consider imbalanced attention computation and wasted communication overhead. This paper proposes Hierarchical Balance Packing (HBP), which designs a novel batch-construction method and training recipe to address those inefficiencies. In particular, the HBP constructs multi-level data packing groups, each optimized with a distinct packing length. It assigns training samples to their optimal groups and configures each group with the most effective settings, including sequential parallelism degree and gradient checkpointing configuration. To effectively utilize multi-level groups of data, we design a dynamic training pipeline specifically tailored to HBP, including curriculum learning, adaptive sequential parallelism, and stable loss. Our extensive experiments demonstrate that our method significantly reduces training time over multiple datasets and open-source models while maintaining strong performance. For the largest DeepSeek-V2 (236B) MoE model, our method speeds up the training by 2.4$\times$ with competitive performance. Codes will be released at https://github.com/ModelTC/HBP.
中文: 本文提出的分层平衡打包方法通过优化数据分组和训练配置,有效解决了长上下文大语言模型训练中的负载不均问题,在保持性能的同时实现了最高2.4倍的训练加速。
English: This paper introduces Hierarchical Balance Packing (HBP), a novel method that optimizes data grouping and training settings to resolve workload imbalances and inefficiencies in long-context LLM training, achieving up to 2.4× speedup without compromising performance.

Authors:Wei Dai, Peilin Chen, Malinda Lu, Daniel Li, Haowen Wei, Hejie Cui, Paul Pu Liang
Title: CLIMB: Data Foundations for Large Scale Multimodal Clinical Foundation Models
Abstract:
Recent advances in clinical AI have enabled remarkable progress across many clinical domains. However, existing benchmarks and models are primarily limited to a small set of modalities and tasks, which hinders the development of large-scale multimodal methods that can make holistic assessments of patient health and well-being. To bridge this gap, we introduce Clinical Large-Scale Integrative Multimodal Benchmark (CLIMB), a comprehensive clinical benchmark unifying diverse clinical data across imaging, language, temporal, and graph modalities. CLIMB comprises 4.51 million patient samples totaling 19.01 terabytes distributed across 2D imaging, 3D video, time series, graphs, and multimodal data. Through extensive empirical evaluation, we demonstrate that multitask pretraining significantly improves performance on understudied domains, achieving up to 29% improvement in ultrasound and 23% in ECG analysis over single-task learning. Pretraining on CLIMB also effectively improves models' generalization capability to new tasks, and strong unimodal encoder performance translates well to multimodal performance when paired with task-appropriate fusion strategies. Our findings provide a foundation for new architecture designs and pretraining strategies to advance clinical AI research. Code is released at https://github.com/DDVD233/climb.
中文:临床大规模综合多模态基准(CLIMB)通过整合多样化的临床数据,为全面评估患者健康提供了基础,研究表明多任务预训练能显著提升超声和心电图等薄弱领域的性能及模型泛化能力。
English: The Clinical Large-Scale Integrative Multimodal Benchmark (CLIMB) introduces a comprehensive dataset integrating diverse clinical data to advance holistic patient assessments, demonstrating that multitask pretraining significantly enhances performance and generalization in understudied domains like ultrasound and ECG analysis.

Authors:Weixing Chen, Yang Liu, Binglin Chen, Jiandong Su, Yongsen Zheng, Liang Lin
Title: Cross-modal Causal Relation Alignment for Video Question Grounding
Abstract:
Video question grounding (VideoQG) requires models to answer the questions and simultaneously infer the relevant video segments to support the answers. However, existing VideoQG methods usually suffer from spurious cross-modal correlations, leading to a failure to identify the dominant visual scenes that align with the intended question. Moreover, vision-language models exhibit unfaithful generalization performance and lack robustness on challenging downstream tasks such as VideoQG. In this work, we propose a novel VideoQG framework named Cross-modal Causal Relation Alignment (CRA), to eliminate spurious correlations and improve the causal consistency between question-answering and video temporal grounding. Our CRA involves three essential components: i) Gaussian Smoothing Grounding (GSG) module for estimating the time interval via cross-modal attention, which is de-noised by an adaptive Gaussian filter, ii) Cross-Modal Alignment (CMA) enhances the performance of weakly supervised VideoQG by leveraging bidirectional contrastive learning between estimated video segments and QA features, iii) Explicit Causal Intervention (ECI) module for multimodal deconfounding, which involves front-door intervention for vision and back-door intervention for language. Extensive experiments on two VideoQG datasets demonstrate the superiority of our CRA in discovering visually grounded content and achieving robust question reasoning. Codes are available at https://github.com/WissingChen/CRA-GQA.
中文摘要:本文提出的跨模态因果关联对齐(CRA)框架通过高斯平滑定位、跨模态对齐和显式因果干预三大模块,有效消除视频问答任务中的伪相关性,显著提升了时序定位精度与问答推理的鲁棒性。
English Summary: The proposed Cross-modal Causal Relation Alignment (CRA) framework addresses spurious correlations in Video Question Grounding by integrating Gaussian smoothing, cross-modal alignment, and explicit causal intervention to improve temporal localization and reasoning robustness.

Authors:Bo Jiang, Shaoyu Chen, Qian Zhang, Wenyu Liu, Xinggang Wang
Title: AlphaDrive: Unleashing the Power of VLMs in Autonomous Driving via Reinforcement Learning and Reasoning
Abstract:
OpenAI o1 and DeepSeek R1 achieve or even surpass human expert-level performance in complex domains like mathematics and science, with reinforcement learning (RL) and reasoning playing a crucial role. In autonomous driving, recent end-to-end models have greatly improved planning performance but still struggle with long-tailed problems due to limited common sense and reasoning abilities. Some studies integrate vision-language models (VLMs) into autonomous driving, but they typically rely on pre-trained models with simple supervised fine-tuning (SFT) on driving data, without further exploration of training strategies or optimizations specifically tailored for planning. In this paper, we propose AlphaDrive, a RL and reasoning framework for VLMs in autonomous driving. AlphaDrive introduces four GRPO-based RL rewards tailored for planning and employs a two-stage planning reasoning training strategy that combines SFT with RL. As a result, AlphaDrive significantly improves both planning performance and training efficiency compared to using only SFT or without reasoning. Moreover, we are also excited to discover that, following RL training, AlphaDrive exhibits some emergent multimodal planning capabilities, which is critical for improving driving safety and efficiency. To the best of our knowledge, AlphaDrive is the first to integrate GRPO-based RL with planning reasoning into autonomous driving. Code will be released to facilitate future research.
中文:AlphaDrive为自动驾驶中的视觉语言模型提出了强化学习与推理框架,通过定制化强化学习奖励和双阶段训练策略,显著提升规划性能与效率,并展现出新兴多模态规划能力。
English: AlphaDrive introduces a reinforcement learning and reasoning framework for vision-language models in autonomous driving, employing tailored RL rewards and a two-stage training strategy to significantly enhance planning performance and efficiency while developing emergent multimodal capabilities.

Authors:Ying Xu, Marius Pedersen, Kiran Raja
Title: VoD: Learning Volume of Differences for Video-Based Deepfake Detection
Abstract:
The rapid development of deep learning and generative AI technologies has profoundly transformed the digital contact landscape, creating realistic Deepfake that poses substantial challenges to public trust and digital media integrity. This paper introduces a novel Deepfake detention framework, Volume of Differences (VoD), designed to enhance detection accuracy by exploiting temporal and spatial inconsistencies between consecutive video frames. VoD employs a progressive learning approach that captures differences across multiple axes through the use of consecutive frame differences (CFD) and a network with stepwise expansions. We evaluate our approach with intra-dataset and cross-dataset testing scenarios on various well-known Deepfake datasets. Our findings demonstrate that VoD excels with the data it has been trained on and shows strong adaptability to novel, unseen data. Additionally, comprehensive ablation studies examine various configurations of segment length, sampling steps, and intervals, offering valuable insights for optimizing the framework. The code for our VoD framework is available at https://github.com/xuyingzhongguo/VoD.
中文: 本文提出差异量(VoD)框架,通过分析视频帧间的时空不一致性来提升深度伪造检测的准确性,并在多个数据集中展现出优异的性能和适应性。
English: This paper presents the Volume of Differences (VoD) framework, which improves Deepfake detection by analyzing temporal and spatial inconsistencies in video frames and demonstrates high accuracy and adaptability across datasets.

Authors:Yuxin Jiang, Liming Jiang, Shuai Yang, Jia-Wei Liu, Ivor Tsang, Mike Zheng Shou
Title: Balanced Image Stylization with Style Matching Score
Abstract:
We present Style Matching Score (SMS), a novel optimization method for image stylization with diffusion models. Balancing effective style transfer with content preservation is a long-standing challenge. Unlike existing efforts, our method reframes image stylization as a style distribution matching problem. The target style distribution is estimated from off-the-shelf style-dependent LoRAs via carefully designed score functions. To preserve content information adaptively, we propose Progressive Spectrum Regularization, which operates in the frequency domain to guide stylization progressively from low-frequency layouts to high-frequency details. In addition, we devise a Semantic-Aware Gradient Refinement technique that leverages relevance maps derived from diffusion semantic priors to selectively stylize semantically important regions. The proposed optimization formulation extends stylization from pixel space to parameter space, readily applicable to lightweight feedforward generators for efficient one-step stylization. SMS effectively balances style alignment and content preservation, outperforming state-of-the-art approaches, verified by extensive experiments.
中文:风格匹配评分(SMS)方法将图像风格化重新定义为风格分布匹配问题,通过渐进频谱正则化和语义感知梯度优化,有效平衡风格转换与内容保留,性能优于现有技术。
English: The Style Matching Score (SMS) method reframes image stylization as a style distribution matching problem, utilizing Progressive Spectrum Regularization and Semantic-Aware Gradient Refinement to effectively balance style transfer with content preservation, outperforming existing approaches.

Authors:Junwei Luo, Yingying Zhang, Xue Yang, Kang Wu, Qi Zhu, Lei Liang, Jingdong Chen, Yansheng Li
Title: When Large Vision-Language Model Meets Large Remote Sensing Imagery: Coarse-to-Fine Text-Guided Token Pruning
Abstract:
Efficient vision-language understanding of large Remote Sensing Images (RSIs) is meaningful but challenging. Current Large Vision-Language Models (LVLMs) typically employ limited pre-defined grids to process images, leading to information loss when handling gigapixel RSIs. Conversely, using unlimited grids significantly increases computational costs. To preserve image details while reducing computational complexity, we propose a text-guided token pruning method with Dynamic Image Pyramid (DIP) integration. Our method introduces: (i) a Region Focus Module (RFM) that leverages text-aware region localization capability to identify critical vision tokens, and (ii) a coarse-to-fine image tile selection and vision token pruning strategy based on DIP, which is guided by RFM outputs and avoids directly processing the entire large imagery. Additionally, existing benchmarks for evaluating LVLMs' perception ability on large RSI suffer from limited question diversity and constrained image sizes. We construct a new benchmark named LRS-VQA, which contains 7,333 QA pairs across 8 categories, with image length up to 27,328 pixels. Our method outperforms existing high-resolution strategies on four datasets using the same data. Moreover, compared to existing token reduction methods, our approach demonstrates higher efficiency under high-resolution settings. Dataset and code are in https://github.com/VisionXLab/LRS-VQA.
中文摘要:本文提出一种结合动态图像金字塔的文本引导令牌剪枝方法,通过区域聚焦模块和由粗到细的图像块选择策略,在保持遥感图像细节的同时降低计算复杂度,并创建了包含7,333个问答对的新基准LRS-VQA用于评估。
English Summary: This paper introduces a text-guided token pruning method with Dynamic Image Pyramid integration to efficiently process large remote sensing images by preserving details while reducing computational costs, and establishes a new benchmark LRS-VQA for comprehensive evaluation.

Authors:Jen-tse Huang, Jiantong Qin, Jianping Zhang, Youliang Yuan, Wenxuan Wang, Jieyu Zhao
Title: VisBias: Measuring Explicit and Implicit Social Biases in Vision Language Models
Abstract:
This research investigates both explicit and implicit social biases exhibited by Vision-Language Models (VLMs). The key distinction between these bias types lies in the level of awareness: explicit bias refers to conscious, intentional biases, while implicit bias operates subconsciously. To analyze explicit bias, we directly pose questions to VLMs related to gender and racial differences: (1) Multiple-choice questions based on a given image (e.g., "What is the education level of the person in the image?") (2) Yes-No comparisons using two images (e.g., "Is the person in the first image more educated than the person in the second image?") For implicit bias, we design tasks where VLMs assist users but reveal biases through their responses: (1) Image description tasks: Models are asked to describe individuals in images, and we analyze disparities in textual cues across demographic groups. (2) Form completion tasks: Models draft a personal information collection form with 20 attributes, and we examine correlations among selected attributes for potential biases. We evaluate Gemini-1.5, GPT-4V, GPT-4o, LLaMA-3.2-Vision and LLaVA-v1.6. Our code and data are publicly available at https://github.com/uscnlp-lime/VisBias.
本研究通过直接提问和间接任务测试视觉语言模型在性别和种族方面的显性与隐性社会偏见,评估了包括Gemini-1.5和GPT-4V在内的多个模型。
This study examines explicit and implicit social biases in Vision-Language Models by testing them with direct questions and indirect tasks related to gender and race, evaluating models like Gemini-1.5 and GPT-4V.

Authors:Samuel Ferino, Rashina Hoda, John Grundy, Christoph Treude
Title: Novice Developers' Perspectives on Adopting LLMs for Software Development: A Systematic Literature Review
Abstract:
Following the rise of large language models (LLMs), many studies have emerged in recent years focusing on exploring the adoption of LLM-based tools for software development by novice developers: computer science/software engineering students and early-career industry developers with two years or less of professional experience. These studies have sought to understand the perspectives of novice developers on using these tools, a critical aspect of the successful adoption of LLMs in software engineering. To systematically collect and summarise these studies, we conducted a systematic literature review (SLR) following the guidelines by Kitchenham et al. on 80 primary studies published between April 2022 and June 2025 to answer four research questions (RQs). In answering RQ1, we categorised the study motivations and methodological approaches. In RQ2, we identified the software development tasks for which novice developers use LLMs. In RQ3, we categorised the advantages, challenges, and recommendations discussed in the studies. Finally, we discuss the study limitations and future research needs suggested in the primary studies in answering RQ4. Throughout the paper, we also indicate directions for future work and implications for software engineering researchers, educators, and developers. Our research artifacts are publicly available at https://github.com/Samuellucas97/SupplementaryInfoPackage-SLR.
中文摘要:本文通过系统文献综述分析了2022至2025年间80项研究,系统梳理了新手开发者使用大语言模型工具的开发场景、优势挑战及改进建议,并为软件工程研究教育提供了方向指引。
English Summary: This systematic literature review analyzes 80 studies from 2022-2025 to examine how novice developers utilize LLM-based tools in software development, categorizing their motivations, usage scenarios, benefits, challenges, and future research needs.

Authors:Clément Chadebec, Onur Tasar, Sanjeev Sreetharan, Benjamin Aubin
Title: LBM: Latent Bridge Matching for Fast Image-to-Image Translation
Abstract:
In this paper, we introduce Latent Bridge Matching (LBM), a new, versatile and scalable method that relies on Bridge Matching in a latent space to achieve fast image-to-image translation. We show that the method can reach state-of-the-art results for various image-to-image tasks using only a single inference step. In addition to its efficiency, we also demonstrate the versatility of the method across different image translation tasks such as object removal, normal and depth estimation, and object relighting. We also derive a conditional framework of LBM and demonstrate its effectiveness by tackling the tasks of controllable image relighting and shadow generation. We provide an implementation at https://github.com/gojasper/LBM.
中文: 本文提出潜在桥匹配(LBM)方法,通过在潜在空间进行桥匹配实现高效图像转换,仅需单步推理即可在物体移除、重照明等多项任务中达到顶尖水平。
English: This paper presents Latent Bridge Matching (LBM), an efficient and versatile method for image-to-image translation that achieves state-of-the-art results in a single inference step across tasks like object removal and relighting.

Authors:Zhangquan Chen, Xufang Luo, Dongsheng Li
Title: VisRL: Intention-Driven Visual Perception via Reinforced Reasoning
Abstract:
Visual understanding is inherently intention-driven - humans selectively focus on different regions of a scene based on their goals. Recent advances in large multimodal models (LMMs) enable flexible expression of such intentions through natural language, allowing queries to guide visual reasoning processes. Frameworks like Visual Chain-of-Thought have demonstrated the benefit of incorporating explicit reasoning steps, where the model predicts a focus region before answering a query. However, existing approaches rely heavily on supervised training with annotated intermediate bounding boxes, which severely limits scalability due to the combinatorial explosion of intention-region pairs. To overcome this limitation, we propose VisRL, the first framework that applies reinforcement learning (RL) to the problem of intention-driven visual perception. VisRL optimizes the entire visual reasoning process using only reward signals. By treating intermediate focus selection as an internal decision optimized through trial-and-error, our method eliminates the need for costly region annotations while aligning more closely with how humans learn to perceive the world. Extensive experiments across multiple benchmarks show that VisRL consistently outperforms strong baselines, demonstrating both its effectiveness and its strong generalization across different LMMs. Our code is available at https://github.com/zhangquanchen/VisRL.
中文摘要:VisRL是首个将强化学习应用于意图驱动视觉感知的框架,通过仅使用奖励信号优化整个视觉推理过程,无需昂贵区域标注即可超越现有基线方法。
English Summary: VisRL is a novel reinforcement learning framework that eliminates the need for supervised region annotations by optimizing visual reasoning through trial-and-error, achieving superior performance across multiple benchmarks.

Authors:Yash Akhauri, Ahmed F AbouElhamayed, Yifei Gao, Chi-Chih Chang, Nilesh Jain, Mohamed S. Abdelfattah
Title: TokenButler: Token Importance is Predictable
Abstract:
Large Language Models (LLMs) rely on the Key-Value (KV) Cache to store token history, enabling efficient decoding of tokens. As the KV-Cache grows, it becomes a major memory and computation bottleneck, however, there is an opportunity to alleviate this bottleneck, especially because prior research has shown that only a small subset of tokens contribute meaningfully to each decoding step. A key challenge in finding these critical tokens is that they are dynamic, and heavily input query-dependent. Existing methods either risk quality by evicting tokens permanently, or retain the full KV-Cache but rely on retrieving chunks (pages) of tokens at generation, failing at dense, context-rich tasks. Additionally, many existing KV-Cache sparsity methods rely on inaccurate proxies for token importance. To address these limitations, we introduce TokenButler, a high-granularity, query-aware predictor that learns to identify these critical tokens. By training a light-weight predictor with less than 1.2% parameter overhead, TokenButler prioritizes tokens based on their contextual, predicted importance. This improves perplexity & downstream accuracy by over 8% relative to SoTA methods for estimating token importance. We evaluate TokenButler on a novel synthetic small-context co-referential retrieval task, demonstrating near-oracle accuracy. Code, models and benchmarks: https://github.com/abdelfattah-lab/TokenButler
中文: TokenButler 提出了一种轻量级、查询感知的预测器,能动态识别KV缓存中的关键令牌,相比现有方法,在效率和准确率上提升了超过8%。
English: TokenButler introduces a lightweight, query-aware predictor that dynamically identifies critical tokens in the KV-Cache, significantly improving efficiency and accuracy over existing methods by over 8%.

Authors:Takeru Inoue, Ryusuke Miyamoto
Title: FastInstShadow: A Simple Query-Based Model for Instance Shadow Detection
Abstract:
Instance shadow detection is the task of detecting pairs of shadows and objects, where existing methods first detect shadows and objects independently, then associate them. This paper introduces FastInstShadow, a method that enhances detection accuracy through a query-based architecture featuring an association transformer decoder with two dual-path transformer decoders to assess relationships between shadows and objects during detection. Experimental results using the SOBA dataset showed that the proposed method outperforms all existing methods across all criteria. This method makes real-time processing feasible for moderate-resolution images with better accuracy than SSISv2, the most accurate existing method. Our code is available at https://github.com/wlotkr/FastInstShadow.
中文摘要:FastInstShadow是一种基于查询的实例阴影检测方法,通过关联变换器解码器和双路径解码器提升检测精度,在SOBA数据集上全面超越现有方法,并能实现中等分辨率图像的实时处理。
English Summary: FastInstShadow is a query-based method that improves instance shadow detection accuracy using an association transformer decoder and dual-path decoders, outperforming existing techniques on the SOBA dataset while enabling real-time processing.

Authors:Jie Hu, Shizun Wang, Xinchao Wang
Title: PE3R: Perception-Efficient 3D Reconstruction
Abstract:
Recent advancements in 2D-to-3D perception have significantly improved the understanding of 3D scenes from 2D images. However, existing methods face critical challenges, including limited generalization across scenes, suboptimal perception accuracy, and slow reconstruction speeds. To address these limitations, we propose Perception-Efficient 3D Reconstruction (PE3R), a novel framework designed to enhance both accuracy and efficiency. PE3R employs a feed-forward architecture to enable rapid 3D semantic field reconstruction. The framework demonstrates robust zero-shot generalization across diverse scenes and objects while significantly improving reconstruction speed. Extensive experiments on 2D-to-3D open-vocabulary segmentation and 3D reconstruction validate the effectiveness and versatility of PE3R. The framework achieves a minimum 9-fold speedup in 3D semantic field reconstruction, along with substantial gains in perception accuracy and reconstruction precision, setting new benchmarks in the field. The code is publicly available at: https://github.com/hujiecpp/PE3R.
Chinese: PE3R框架通过前馈架构实现快速3D语义场重建,在零样本泛化能力上表现优异,不仅将重建速度提升至少9倍,还在感知精度和重建质量上取得显著突破。
English: The proposed PE3R framework enables rapid 3D semantic field reconstruction with robust zero-shot generalization, achieving at least a 9-fold speedup and significant improvements in accuracy and precision across diverse scenes.

Authors:Calvin Yeung, Tomohiro Suzuki, Ryota Tanaka, Zhuoer Yin, Keisuke Fujii
Title: AthletePose3D: A Benchmark Dataset for 3D Human Pose Estimation and Kinematic Validation in Athletic Movements
Abstract:
Human pose estimation is a critical task in computer vision and sports biomechanics, with applications spanning sports science, rehabilitation, and biomechanical research. While significant progress has been made in monocular 3D pose estimation, current datasets often fail to capture the complex, high-acceleration movements typical of competitive sports. In this work, we introduce AthletePose3D, a novel dataset designed to address this gap. AthletePose3D includes 12 types of sports motions across various disciplines, with approximately 1.3 million frames and 165 thousand individual postures, specifically capturing high-speed, high-acceleration athletic movements. We evaluate state-of-the-art (SOTA) monocular 2D and 3D pose estimation models on the dataset, revealing that models trained on conventional datasets perform poorly on athletic motions. However, fine-tuning these models on AthletePose3D notably reduces the SOTA model mean per joint position error (MPJPE) from 214mm to 65mm-a reduction of over 69%. We also validate the kinematic accuracy of monocular pose estimations through waveform analysis, highlighting strong correlations in joint angle estimations but limitations in velocity estimation. Our work provides a comprehensive evaluation of monocular pose estimation models in the context of sports, contributing valuable insights for advancing monocular pose estimation techniques in high-performance sports environments. The dataset, code, and model checkpoints are available at: https://github.com/calvinyeungck/AthletePose3D
中文: 本研究提出了AthletePose3D数据集,旨在解决现有数据集难以捕捉高速运动姿态的局限,实验表明基于该数据集微调模型可将姿态估计误差降低69%以上,并通过波形分析验证了运动学参数的准确性。
English: The study introduces AthletePose3D, a dataset addressing the limitations of existing datasets in capturing high-acceleration sports movements, and demonstrates that fine-tuning models on it reduces pose estimation errors by over 69% while validating kinematic accuracy through waveform analysis.

Authors:Guiwei Zhang, Tianyu Zhang, Mohan Zhou, Yalong Bai, Biye Li
Title: V2Flow: Unifying Visual Tokenization and Large Language Model Vocabularies for Autoregressive Image Generation
Abstract:
We propose V2Flow, a novel tokenizer that produces discrete visual tokens capable of high-fidelity reconstruction, while ensuring structural and latent distribution alignment with the vocabulary space of large language models (LLMs). Leveraging this tight visual-vocabulary coupling, V2Flow enables autoregressive visual generation on top of existing LLMs. Our approach formulates visual tokenization as a flow-matching problem, aiming to learn a mapping from a standard normal prior to the continuous image distribution, conditioned on token sequences embedded within the LLMs vocabulary space. The effectiveness of V2Flow stems from two core designs. First, we propose a Visual Vocabulary resampler, which compresses visual data into compact token sequences, with each represented as a soft categorical distribution over LLM's vocabulary. This allows seamless integration of visual tokens into existing LLMs for autoregressive visual generation. Second, we present a masked autoregressive Rectified-Flow decoder, employing a masked transformer encoder-decoder to refine visual tokens into contextually enriched embeddings. These embeddings then condition a dedicated velocity field for precise reconstruction. Additionally, an autoregressive rectified-flow sampling strategy is incorporated, ensuring flexible sequence lengths while preserving competitive reconstruction quality. Extensive experiments show that V2Flow outperforms mainstream VQ-based tokenizers and facilitates autoregressive visual generation on top of existing. https://github.com/zhangguiwei610/V2Flow
中文: V2Flow是一种创新的视觉分词器,通过流匹配方法将图像映射至大语言模型词表空间,结合视觉词汇重采样器和掩码自回归整流流解码器,实现高质量重建与自回归视觉生成。
English: V2Flow is a novel visual tokenizer that maps images into discrete tokens aligned with LLM vocabulary, enabling high-fidelity reconstruction and autoregressive visual generation through flow-matching and specialized decoder designs.

Authors:Zongzheng Zhang, Xinrun Li, Sizhe Zou, Guoxuan Chi, Siqi Li, Xuchong Qiu, Guoliang Wang, Guantian Zheng, Leichen Wang, Hang Zhao, Hao Zhao
Title: Chameleon: Fast-slow Neuro-symbolic Lane Topology Extraction
Abstract:
Lane topology extraction involves detecting lanes and traffic elements and determining their relationships, a key perception task for mapless autonomous driving. This task requires complex reasoning, such as determining whether it is possible to turn left into a specific lane. To address this challenge, we introduce neuro-symbolic methods powered by vision-language foundation models (VLMs). Existing approaches have notable limitations: (1) Dense visual prompting with VLMs can achieve strong performance but is costly in terms of both financial resources and carbon footprint, making it impractical for robotics applications. (2) Neuro-symbolic reasoning methods for 3D scene understanding fail to integrate visual inputs when synthesizing programs, making them ineffective in handling complex corner cases. To this end, we propose a fast-slow neuro-symbolic lane topology extraction algorithm, named Chameleon, which alternates between a fast system that directly reasons over detected instances using synthesized programs and a slow system that utilizes a VLM with a chain-of-thought design to handle corner cases. Chameleon leverages the strengths of both approaches, providing an affordable solution while maintaining high performance. We evaluate the method on the OpenLane-V2 dataset, showing consistent improvements across various baseline detectors. Our code, data, and models are publicly available at https://github.com/XR-Lee/neural-symbolic
Chinese: 本文提出Chameleon算法,通过快慢神经符号系统交替处理常规与复杂场景,结合视觉语言模型的链式推理,实现了高效低成本的车道拓扑提取,并在OpenLane-V2数据集上验证了其优越性能。
English: This paper introduces Chameleon, a fast-slow neuro-symbolic algorithm that combines efficient instance reasoning with vision-language model analysis to affordably extract lane topology for autonomous driving, demonstrating improved performance on the OpenLane-V2 dataset.

Authors:Jiacheng Ruan, Wenzhen Yuan, Xian Gao, Ye Guo, Daoxin Zhang, Zhe Xu, Yao Hu, Ting Liu, Yuzhuo Fu
Title: VLRMBench: A Comprehensive and Challenging Benchmark for Vision-Language Reward Models
Abstract:
Although large visual-language models (LVLMs) have demonstrated strong performance in multimodal tasks, errors may occasionally arise due to biases during the reasoning process. Recently, reward models (RMs) have become increasingly pivotal in the reasoning process. Specifically, process RMs evaluate each reasoning step, outcome RMs focus on the assessment of reasoning results, and critique RMs perform error analysis on the entire reasoning process, followed by corrections. However, existing benchmarks for vision-language RMs (VLRMs) typically assess only a single aspect of their capabilities (e.g., distinguishing between two answers), thus limiting the all-round evaluation and restricting the development of RMs in the visual-language domain. To address this gap, we propose a comprehensive and challenging benchmark, dubbed as VLRMBench, encompassing 12,634 questions. VLRMBench is constructed based on three distinct types of datasets, covering mathematical reasoning, hallucination understanding, and multi-image understanding. We design 12 tasks across three major categories, focusing on evaluating VLRMs in the aspects of process understanding, outcome judgment, and critique generation. Extensive experiments are conducted on 21 open-source models and 5 advanced closed-source models, highlighting the challenges posed by VLRMBench. For instance, in the `Forecasting Future', a binary classification task, the advanced GPT-4o achieves only a 76.0% accuracy. Additionally, we perform comprehensive analytical studies, offering valuable insights for the future development of VLRMs. We anticipate that VLRMBench will serve as a pivotal benchmark in advancing VLRMs. Code and datasets will be available at https://github.com/JCruan519/VLRMBench.
Chinese: 本文提出了VLRMBench基准,旨在全面评估视觉语言奖励模型在过程理解、结果判断和批判生成方面的能力,通过大量实验揭示了现有模型的显著挑战,弥补了当前评估体系的不足。
English: This paper introduces VLRMBench, a comprehensive benchmark designed to evaluate vision-language reward models across process understanding, outcome judgment, and critique generation, addressing limitations in existing assessments and demonstrating significant challenges through extensive experiments on multiple models.

Authors:Ao Wang, Lihao Liu, Hui Chen, Zijia Lin, Jungong Han, Guiguang Ding
Title: YOLOE: Real-Time Seeing Anything
Abstract:
Object detection and segmentation are widely employed in computer vision applications, yet conventional models like YOLO series, while efficient and accurate, are limited by predefined categories, hindering adaptability in open scenarios. Recent open-set methods leverage text prompts, visual cues, or prompt-free paradigm to overcome this, but often compromise between performance and efficiency due to high computational demands or deployment complexity. In this work, we introduce YOLOE, which integrates detection and segmentation across diverse open prompt mechanisms within a single highly efficient model, achieving real-time seeing anything. For text prompts, we propose Re-parameterizable Region-Text Alignment (RepRTA) strategy. It refines pretrained textual embeddings via a re-parameterizable lightweight auxiliary network and enhances visual-textual alignment with zero inference and transferring overhead. For visual prompts, we present Semantic-Activated Visual Prompt Encoder (SAVPE). It employs decoupled semantic and activation branches to bring improved visual embedding and accuracy with minimal complexity. For prompt-free scenario, we introduce Lazy Region-Prompt Contrast (LRPC) strategy. It utilizes a built-in large vocabulary and specialized embedding to identify all objects, avoiding costly language model dependency. Extensive experiments show YOLOE's exceptional zero-shot performance and transferability with high inference efficiency and low training cost. Notably, on LVIS, with 3$\times$ less training cost and 1.4$\times$ inference speedup, YOLOE-v8-S surpasses YOLO-Worldv2-S by 3.5 AP. When transferring to COCO, YOLOE-v8-L achieves 0.6 AP$^b$ and 0.4 AP$^m$ gains over closed-set YOLOv8-L with nearly 4$\times$ less training time. Code and models are available at https://github.com/THU-MIG/yoloe.
中文: YOLOE是一种高效的开集检测与分割模型,通过创新的文本、视觉和无提示机制,以极低的训练成本和计算开销实现了实时性能。
English: YOLOE is an efficient open-set model integrating detection and segmentation that achieves real-time performance with minimal training cost and computational overhead through innovative text, visual, and prompt-free mechanisms.

Authors:Jimmy Gammell, Anand Raghunathan, Abolfazl Hashemi, Kaushik Roy
Title: Learning to Localize Leakage of Cryptographic Sensitive Variables
Abstract:
While cryptographic algorithms such as the ubiquitous Advanced Encryption Standard (AES) are secure, *physical implementations* of these algorithms in hardware inevitably 'leak' sensitive data such as cryptographic keys. A particularly insidious form of leakage arises from the fact that hardware consumes power and emits radiation in a manner that is statistically associated with the data it processes and the instructions it executes. Supervised deep learning has emerged as a state-of-the-art tool for carrying out *side-channel attacks*, which exploit this leakage by learning to map power/radiation measurements throughout encryption to the sensitive data operated on during that encryption. In this work we develop a principled deep learning framework for determining the relative leakage due to measurements recorded at different points in time, in order to inform *defense* against such attacks. This information is invaluable to cryptographic hardware designers for understanding *why* their hardware leaks and how they can mitigate it (e.g. by indicating the particular sections of code or electronic components which are responsible). Our framework is based on an adversarial game between a family of classifiers trained to estimate the conditional distributions of sensitive data given subsets of measurements, and a budget-constrained noise distribution which probabilistically erases individual measurements to maximize the loss of these classifiers. We demonstrate our method's efficacy and ability to overcome limitations of prior work through extensive experimental comparison with 8 baseline methods using 3 evaluation metrics and 6 publicly-available power/EM trace datasets from AES, ECC and RSA implementations. We provide an open-source PyTorch implementation of these experiments.
中文: 本研究开发了一种深度学习框架,用于识别加密硬件何时通过功耗和电磁辐射泄露敏感数据,从而帮助设计者精确定位漏洞并加强针对侧信道攻击的防护能力。
English: This research develops a deep learning framework to identify when cryptographic hardware leaks sensitive data through power consumption and electromagnetic radiation, enabling designers to pinpoint vulnerabilities and strengthen defenses against side-channel attacks.

Authors:Xiangru Tang, Daniel Shao, Jiwoong Sohn, Jiapeng Chen, Jiayi Zhang, Jinyu Xiang, Fang Wu, Yilun Zhao, Chenglin Wu, Wenqi Shi, Arman Cohan, Mark Gerstein
Title: MedAgentsBench: Benchmarking Thinking Models and Agent Frameworks for Complex Medical Reasoning
Abstract:
Large Language Models (LLMs) have shown impressive performance on existing medical question-answering benchmarks. This high performance makes it increasingly difficult to meaningfully evaluate and differentiate advanced methods. We present MedAgentsBench, a benchmark that focuses on challenging medical questions requiring multi-step clinical reasoning, diagnosis formulation, and treatment planning-scenarios where current models still struggle despite their strong performance on standard tests. Drawing from seven established medical datasets, our benchmark addresses three key limitations in existing evaluations: (1) the prevalence of straightforward questions where even base models achieve high performance, (2) inconsistent sampling and evaluation protocols across studies, and (3) lack of systematic analysis of the interplay between performance, cost, and inference time. Through experiments with various base models and reasoning methods, we demonstrate that the latest thinking models, DeepSeek R1 and OpenAI o3, exhibit exceptional performance in complex medical reasoning tasks. Additionally, advanced search-based agent methods offer promising performance-to-cost ratios compared to traditional approaches. Our analysis reveals substantial performance gaps between model families on complex questions and identifies optimal model selections for different computational constraints. Our benchmark and evaluation framework are publicly available at https://github.com/gersteinlab/medagents-benchmark.
中文: MedAgentsBench是一个针对复杂医学推理任务的新基准,旨在解决现有评估的局限性,并揭示先进模型在需要多步骤临床推理的难题上的性能差异。
English: MedAgentsBench is a new benchmark that evaluates LLMs on complex medical reasoning tasks where current models still struggle, addressing limitations in existing evaluations and revealing performance gaps among advanced models.

Authors:Feiran You, Hongyang Du, Xiangwang Hou, Yong Ren, Kaibin Huang
Title: DRESS: Diffusion Reasoning-based Reward Shaping Scheme For Intelligent Networks
Abstract:
Network optimization remains fundamental in wireless communications, with Artificial Intelligence (AI)-based solutions gaining widespread adoption. As Sixth-Generation (6G) communication networks pursue full-scenario coverage, optimization in complex extreme environments presents unprecedented challenges. The dynamic nature of these environments, combined with physical constraints, makes it difficult for AI solutions such as Deep Reinforcement Learning (DRL) to obtain effective reward feedback for the training process. However, many existing DRL-based network optimization studies overlook this challenge through idealized environment settings. Inspired by the powerful capabilities of Generative AI (GenAI), especially diffusion models, in capturing complex latent distributions, we introduce a novel Diffusion Reasoning-based Reward Shaping Scheme (DRESS) to achieve robust network optimization. By conditioning on observed environmental states and executed actions, DRESS leverages diffusion models' multi-step denoising process as a form of deep reasoning, progressively refining latent representations to generate meaningful auxiliary reward signals that capture patterns of network systems. Moreover, DRESS is designed for seamless integration with any DRL framework, allowing DRESS-aided DRL (DRESSed-DRL) to enable stable and efficient DRL training even under extreme network environments. Experimental results demonstrate that DRESSed-DRL achieves about 1.5x times faster convergence than its original version in sparse-reward wireless environments and significant performance improvements in multiple general DRL benchmark environments compared to baseline methods. The code of DRESS is available at https://github.com/NICE-HKU/DRESS.
中文摘要:本文提出DRESS方案,利用扩散模型通过多步去噪推理生成辅助奖励信号,使深度强化学习在极端无线网络环境中实现稳定高效的优化,实验证明其收敛速度提升约1.5倍且性能显著优于基线方法。
English Summary: The paper introduces DRESS, a diffusion model-based reward shaping scheme that enhances Deep Reinforcement Learning by generating auxiliary rewards for stable and efficient network optimization in extreme wireless environments, achieving faster convergence and significant performance gains.

Authors:Yan Tai, Luhao Zhu, Zhiqiang Chen, Ynan Ding, Yiying Dong, Xiaohong Liu, Guodong Guo
Title: REF-VLM: Triplet-Based Referring Paradigm for Unified Visual Decoding
Abstract:
Multimodal Large Language Models (MLLMs) demonstrate robust zero-shot capabilities across diverse vision-language tasks after training on mega-scale datasets. However, dense prediction tasks, such as semantic segmentation and keypoint detection, pose significant challenges for MLLMs when represented solely as text outputs. Simultaneously, current MLLMs utilizing latent embeddings for visual task decoding generally demonstrate limited adaptability to both multi-task learning and multi-granularity scenarios. In this work, we present REF-VLM, an end-to-end framework for unified training of various visual decoding tasks. To address complex visual decoding scenarios, we introduce the Triplet-Based Referring Paradigm (TRP), which explicitly decouples three critical dimensions in visual decoding tasks through a triplet structure: concepts, decoding types, and targets. TRP employs symbolic delimiters to enforce structured representation learning, enhancing the parsability and interpretability of model outputs. Additionally, we construct Visual-Task Instruction Following Dataset (VTInstruct), a large-scale multi-task dataset containing over 100 million multimodal dialogue samples across 25 task types. Beyond text inputs and outputs, VT-Instruct incorporates various visual prompts such as point, box, scribble, and mask, and generates outputs composed of text and visual units like box, keypoint, depth and mask. The combination of different visual prompts and visual units generates a wide variety of task types, expanding the applicability of REF-VLM significantly. Both qualitative and quantitative experiments demonstrate that our REF-VLM outperforms other MLLMs across a variety of standard benchmarks. The code, dataset, and demo available at https://github.com/MacavityT/REF-VLM.
Chinese: REF-VLM提出了一种端到端的统一框架,采用基于三元组的指代范式来提高多模态大语言模型在复杂视觉解码任务中的适应性,并借助大规模数据集在多个基准测试中优于现有模型。
English: REF-VLM introduces a unified end-to-end framework with a triplet-based referring paradigm to enhance multimodal large language models' adaptability in complex visual decoding tasks, supported by a large-scale dataset and outperforming existing models on benchmarks.

Authors:Ouxiang Li, Yuan Wang, Xinting Hu, Houcheng Jiang, Tao Liang, Yanbin Hao, Guojun Ma, Fuli Feng
Title: SPEED: Scalable, Precise, and Efficient Concept Erasure for Diffusion Models
Abstract:
Erasing concepts from large-scale text-to-image (T2I) diffusion models has become increasingly crucial due to the growing concerns over copyright infringement, offensive content, and privacy violations. However, existing methods either require costly fine-tuning or degrade image quality for non-target concepts (i.e., prior) due to inherent optimization limitations. In this paper, we introduce SPEED, a model editing-based concept erasure approach that leverages null-space constraints for scalable, precise, and efficient erasure. Specifically, SPEED incorporates Influence-based Prior Filtering (IPF) to retain the most affected non-target concepts during erasing, Directed Prior Augmentation (DPA) to expand prior coverage while maintaining semantic consistency, and Invariant Equality Constraints (IEC) to regularize model editing by explicitly preserving key invariants during the T2I generation process. Extensive evaluations across multiple concept erasure tasks demonstrate that SPEED consistently outperforms existing methods in prior preservation while achieving efficient and high-fidelity concept erasure, successfully removing 100 concepts within just 5 seconds. Our code and models are available at: https://github.com/Ouxiang-Li/SPEED.
中文摘要:SPEED是一种通过空空间优化直接编辑模型参数的高效概念消除方法,能够在快速移除目标概念的同时保持非目标内容的生成质量。
English Summary: SPEED is an efficient concept erasure method that directly edits diffusion model parameters through null space optimization, enabling rapid removal of target concepts while preserving non-target content quality.

Authors:Ouxiang Li, Yuan Wang, Xinting Hu, Houcheng Jiang, Tao Liang, Yanbin Hao, Guojun Ma, Fuli Feng
Title: SPEED: Scalable, Precise, and Efficient Concept Erasure for Diffusion Models
Abstract:
Erasing concepts from large-scale text-to-image (T2I) diffusion models has become increasingly crucial due to the growing concerns over copyright infringement, offensive content, and privacy violations. In scalable applications, fine-tuning-based methods are time-consuming to precisely erase multiple target concepts, while real-time editing-based methods often degrade the generation quality of non-target concepts due to conflicting optimization objectives. To address this dilemma, we introduce SPEED, an efficient concept erasure approach that directly edits model parameters. SPEED searches for a null space, a model editing space where parameter updates do not affect non-target concepts, to achieve scalable and precise erasure. To facilitate accurate null space optimization, we incorporate three complementary strategies: Influence-based Prior Filtering (IPF) to selectively retain the most affected non-target concepts, Directed Prior Augmentation (DPA) to enrich the filtered retain set with semantically consistent variations, and Invariant Equality Constraints (IEC) to preserve key invariants during the T2I generation process. Extensive evaluations across multiple concept erasure tasks demonstrate that SPEED consistently outperforms existing methods in non-target preservation while achieving efficient and high-fidelity concept erasure, successfully erasing 100 concepts within only 5 seconds. Our code and models are available at: https://github.com/Ouxiang-Li/SPEED.
中文摘要:SPEED是一种通过空空间优化直接编辑模型参数的高效概念消除方法,能够在快速移除目标概念的同时保持非目标内容的生成质量。
English Summary: SPEED is an efficient concept erasure method that directly edits diffusion model parameters through null space optimization, enabling rapid removal of target concepts while preserving non-target content quality.

Authors:Ruidong Chen, Honglin Guo, Lanjun Wang, Chenyu Zhang, Weizhi Nie, An-An Liu
Title: TRCE: Towards Reliable Malicious Concept Erasure in Text-to-Image Diffusion Models
Abstract:
Recent advances in text-to-image diffusion models enable photorealistic image generation, but they also risk producing malicious content, such as NSFW images. To mitigate risk, concept erasure methods are studied to facilitate the model to unlearn specific concepts. However, current studies struggle to fully erase malicious concepts implicitly embedded in prompts (e.g., metaphorical expressions or adversarial prompts) while preserving the model's normal generation capability. To address this challenge, our study proposes TRCE, using a two-stage concept erasure strategy to achieve an effective trade-off between reliable erasure and knowledge preservation. Firstly, TRCE starts by erasing the malicious semantics implicitly embedded in textual prompts. By identifying a critical mapping objective(i.e., the [EoT] embedding), we optimize the cross-attention layers to map malicious prompts to contextually similar prompts but with safe concepts. This step prevents the model from being overly influenced by malicious semantics during the denoising process. Following this, considering the deterministic properties of the sampling trajectory of the diffusion model, TRCE further steers the early denoising prediction toward the safe direction and away from the unsafe one through contrastive learning, thus further avoiding the generation of malicious content. Finally, we conduct comprehensive evaluations of TRCE on multiple malicious concept erasure benchmarks, and the results demonstrate its effectiveness in erasing malicious concepts while better preserving the model's original generation ability. The code is available at: http://github.com/ddgoodgood/TRCE. CAUTION: This paper includes model-generated content that may contain offensive material.
中文摘要:本研究提出TRCE方法,通过两阶段概念消除策略:先消除文本提示中的隐含恶意语义,再通过对比学习引导去噪过程朝向安全方向,从而在有效清除恶意内容的同时更好地保持模型的正常生成能力。
English Summary: The study introduces TRCE, a two-stage concept erasure method that effectively removes malicious content from text-to-image models by first neutralizing harmful semantics in prompts and then guiding the denoising process toward safe outputs, achieving a balance between erasure and preservation of generation quality.

Authors:Ruidong Chen, Honglin Guo, Lanjun Wang, Chenyu Zhang, Weizhi Nie, An-An Liu
Title: TRCE: Towards Reliable Malicious Concept Erasure in Text-to-Image Diffusion Models
Abstract:
Recent advances in text-to-image diffusion models enable photorealistic image generation, but they also risk producing malicious content, such as NSFW images. To mitigate risk, concept erasure methods are studied to facilitate the model to unlearn specific concepts. However, current studies struggle to fully erase malicious concepts implicitly embedded in prompts (e.g., metaphorical expressions or adversarial prompts) while preserving the model's normal generation capability. To address this challenge, our study proposes TRCE, using a two-stage concept erasure strategy to achieve an effective trade-off between reliable erasure and knowledge preservation. Firstly, TRCE starts by erasing the malicious semantics implicitly embedded in textual prompts. By identifying a critical mapping objective(i.e., the [EoT] embedding), we optimize the cross-attention layers to map malicious prompts to contextually similar prompts but with safe concepts. This step prevents the model from being overly influenced by malicious semantics during the denoising process. Following this, considering the deterministic properties of the sampling trajectory of the diffusion model, TRCE further steers the early denoising prediction toward the safe direction and away from the unsafe one through contrastive learning, thus further avoiding the generation of malicious content. Finally, we conduct comprehensive evaluations of TRCE on multiple malicious concept erasure benchmarks, and the results demonstrate its effectiveness in erasing malicious concepts while better preserving the model's original generation ability. The code is available at: http://github.com/ddgoodgood/TRCE. CAUTION: This paper includes model-generated content that may contain offensive material.
中文摘要:本研究提出TRCE方法,通过两阶段概念消除策略:先消除文本提示中的隐含恶意语义,再通过对比学习引导去噪过程朝向安全方向,从而在有效清除恶意内容的同时更好地保持模型的正常生成能力。
English Summary: The study introduces TRCE, a two-stage concept erasure method that effectively removes malicious content from text-to-image models by first neutralizing harmful semantics in prompts and then guiding the denoising process toward safe outputs, achieving a balance between erasure and preservation of generation quality.

Authors:Chongming Gao, Mengyao Gao, Chenxiao Fan, Shuai Yuan, Wentao Shi, Xiangnan He
Title: Process-Supervised LLM Recommenders via Flow-guided Tuning
Abstract:
While large language models (LLMs) are increasingly adapted for recommendation systems via supervised fine-tuning (SFT), this approach amplifies popularity bias due to its likelihood maximization objective, compromising recommendation diversity and fairness. To address this, we present Flow-guided fine-tuning recommender (Flower), which replaces SFT with a Generative Flow Network (GFlowNet) framework that enacts process supervision through token-level reward propagation. Flower's key innovation lies in decomposing item-level rewards into constituent token rewards, enabling direct alignment between token generation probabilities and their reward signals. This mechanism achieves three critical advancements: (1) popularity bias mitigation and fairness enhancement through empirical distribution matching, (2) preservation of diversity through GFlowNet's proportional sampling, and (3) flexible integration of personalized preferences via adaptable token rewards. Experiments demonstrate Flower's superior distribution-fitting capability and its significant advantages over traditional SFT in terms of accuracy, fairness, and diversity, highlighting its potential to improve LLM-based recommendation systems. The implementation is available via https://github.com/MrPeach0301/Flower
中文摘要:Flower采用生成流网络框架替代监督微调,通过将项目奖励分解为词元级奖励来缓解流行度偏差,从而提升推荐系统的公平性、多样性和准确性。
English Summary: Flower introduces a GFlowNet framework to replace supervised fine-tuning in LLM-based recommendation systems, mitigating popularity bias by decomposing item rewards into token-level rewards to enhance fairness, diversity, and accuracy.

Authors:Fanqing Meng, Lingxiao Du, Zongkai Liu, Zhixiang Zhou, Quanfeng Lu, Daocheng Fu, Tiancheng Han, Botian Shi, Wenhai Wang, Junjun He, Kaipeng Zhang, Ping Luo, Yu Qiao, Qiaosheng Zhang, Wenqi Shao
Title: MM-Eureka: Exploring the Frontiers of Multimodal Reasoning with Rule-based Reinforcement Learning
Abstract:
DeepSeek R1, and o1 have demonstrated powerful reasoning capabilities in the text domain through stable large-scale reinforcement learning. To enable broader applications, some works have attempted to transfer these capabilities to multimodal reasoning. However, these efforts have been limited by the limited difficulty of selected tasks and relatively small training scales, making it challenging to demonstrate strong multimodal reasoning abilities. To address this gap, we introduce the MMK12 dataset and MM-EUREKA with 7B and 32B parameters. The former is a high-quality multimodal mathematics reasoning dataset featuring diverse knowledge domains with human-verified answers and solution processes. The latter is a multimodal model employing rule-based reinforcement learning on MMK12, utilizing online filtering and two-stage training strategy to enhance training stability. MM-EUREKA demonstrates remarkable performance gains in multimodal mathematical reasoning, outperforming previous powerful models like InternVL2.5-78B or InternVL2.5-38B-MPO. In particular, MM-EUREKA achieves competitive or superior performance compared to both open-source and closed-source models, and trails slightly behind o1 in multidisciplinary reasoning tasks. We open-source our complete pipeline to foster further research in this area. We release all our codes, models, data, etc. at https://github.com/ModalMinds/MM-EUREKA
中文: MM-EUREKA模型基于高质量多模态数学推理数据集MMK12,采用规则强化学习训练,在多模态数学推理中表现卓越,超越先前强大模型并与顶尖系统性能相当,同时全面开源促进后续研究。
English: The MM-EUREKA model, trained on the high-quality MMK12 dataset using rule-based reinforcement learning, achieves significant advancements in multimodal mathematical reasoning, outperforming previous models and closely competing with leading systems while being fully open-sourced.

Authors:Johan Edstedt, Georg Bökman, Mårten Wadenbäck, Michael Felsberg
Title: DaD: Distilled Reinforcement Learning for Diverse Keypoint Detection
Abstract:
Keypoints are what enable Structure-from-Motion (SfM) systems to scale to thousands of images. However, designing a keypoint detection objective is a non-trivial task, as SfM is non-differentiable. Typically, an auxiliary objective involving a descriptor is optimized. This however induces a dependency on the descriptor, which is undesirable. In this paper we propose a fully self-supervised and descriptor-free objective for keypoint detection, through reinforcement learning. To ensure training does not degenerate, we leverage a balanced top-K sampling strategy. While this already produces competitive models, we find that two qualitatively different types of detectors emerge, which are only able to detect light and dark keypoints respectively. To remedy this, we train a third detector, DaD, that optimizes the Kullback-Leibler divergence of the pointwise maximum of both light and dark detectors. Our approach significantly improve upon SotA across a range of benchmarks. Code and model weights are publicly available at https://github.com/parskatt/dad
中文摘要:本文提出了一种完全自监督且无需描述符的关键点检测方法,通过强化学习和平衡采样策略,在多个基准测试中显著超越了现有最优技术。
English Summary: This paper introduces a fully self-supervised, descriptor-free keypoint detection method using reinforcement learning and a balanced sampling strategy, which significantly outperforms state-of-the-art approaches across multiple benchmarks.

Authors:Xing Xie, Jiawei Liu, Ziyue Lin, Huijie Fan, Zhi Han, Yandong Tang, Liangqiong Qu
Title: Unleashing the Potential of Large Language Models for Text-to-Image Generation through Autoregressive Representation Alignment
Abstract:
We present Autoregressive Representation Alignment (ARRA), a new training framework that unlocks global-coherent text-to-image generation in autoregressive LLMs without architectural modifications. Different from prior works that require complex architectural redesigns, ARRA aligns LLM's hidden states with visual representations from external visual foundational models via a global visual alignment loss and a hybrid token, [object Object]. This token enforces dual constraints: local next-token prediction and global semantic distillation, enabling LLMs to implicitly learn spatial and contextual coherence while retaining their original autoregressive paradigm. Extensive experiments validate ARRA's plug-and-play versatility. When training T2I LLMs from scratch, ARRA reduces FID by 16.6% (ImageNet), 12.0% (LAION-COCO) for autoregressive LLMs like LlamaGen, without modifying original architecture and inference mechanism. For training from text-generation-only LLMs, ARRA reduces FID by 25.5% (MIMIC-CXR), 8.8% (DeepEyeNet) for advanced LLMs like Chameleon. For domain adaptation, ARRA aligns general-purpose LLMs with specialized models (e.g., BioMedCLIP), achieving an 18.6% FID reduction over direct fine-tuning on medical imaging (MIMIC-CXR). These results demonstrate that training objective redesign, rather than architectural modifications, can resolve cross-modal global coherence challenges. ARRA offers a complementary paradigm for advancing autoregressive models. The code is available at https://github.com/xiexing0916/ARRA.
Chinese: 我们提出自回归表征对齐(ARRA)训练框架,通过混合令牌和双重约束将LLM隐藏状态与外部视觉表征对齐,无需修改架构即可实现全局连贯的文生图生成,在多个数据集和领域显著提升性能。
English: We introduce Autoregressive Representation Alignment (ARRA), a training framework that enables autoregressive LLMs to generate globally coherent images by aligning hidden states with visual representations through a hybrid token and dual constraints, without architectural changes, significantly improving performance across various datasets and domains.

Authors:Rui Qiao, Zhaoxuan Wu, Jingtan Wang, Pang Wei Koh, Bryan Kian Hsiang Low
Title: Group-robust Sample Reweighting for Subpopulation Shifts via Influence Functions
Abstract:
Machine learning models often have uneven performance among subpopulations (a.k.a., groups) in the data distributions. This poses a significant challenge for the models to generalize when the proportions of the groups shift during deployment. To improve robustness to such shifts, existing approaches have developed strategies that train models or perform hyperparameter tuning using the group-labeled data to minimize the worst-case loss over groups. However, a non-trivial amount of high-quality labels is often required to obtain noticeable improvements. Given the costliness of the labels, we propose to adopt a different paradigm to enhance group label efficiency: utilizing the group-labeled data as a target set to optimize the weights of other group-unlabeled data. We introduce Group-robust Sample Reweighting (GSR), a two-stage approach that first learns the representations from group-unlabeled data, and then tinkers the model by iteratively retraining its last layer on the reweighted data using influence functions. Our GSR is theoretically sound, practically lightweight, and effective in improving the robustness to subpopulation shifts. In particular, GSR outperforms the previous state-of-the-art approaches that require the same amount or even more group labels.
中文摘要:机器学习模型常在不同数据子群体间表现不均,本研究提出的群体鲁棒样本重加权(GSR)方法,通过两阶段策略利用有限标注数据优化未标注样本权重,以更高效的标签使用显著提升模型对群体分布变化的鲁棒性,性能优于现有最佳方法。
English summary: Machine learning models often struggle with performance disparities across subpopulations, and this study introduces Group-robust Sample Reweighting (GSR), a lightweight two-stage method that enhances robustness to group shifts by efficiently utilizing limited labeled data to reweight unlabeled samples, outperforming prior approaches with equal or fewer labels.

Authors:Weijia Wu, Zeyu Zhu, Mike Zheng Shou
Title: Automated Movie Generation via Multi-Agent CoT Planning
Abstract:
Existing long-form video generation frameworks lack automated planning, requiring manual input for storylines, scenes, cinematography, and character interactions, resulting in high costs and inefficiencies. To address these challenges, we present MovieAgent, an automated movie generation via multi-agent Chain of Thought (CoT) planning. MovieAgent offers two key advantages: 1) We firstly explore and define the paradigm of automated movie/long-video generation. Given a script and character bank, our MovieAgent can generates multi-scene, multi-shot long-form videos with a coherent narrative, while ensuring character consistency, synchronized subtitles, and stable audio throughout the film. 2) MovieAgent introduces a hierarchical CoT-based reasoning process to automatically structure scenes, camera settings, and cinematography, significantly reducing human effort. By employing multiple LLM agents to simulate the roles of a director, screenwriter, storyboard artist, and location manager, MovieAgent streamlines the production pipeline. Experiments demonstrate that MovieAgent achieves new state-of-the-art results in script faithfulness, character consistency, and narrative coherence. Our hierarchical framework takes a step forward and provides new insights into fully automated movie generation. The code and project website are available at: https://github.com/showlab/MovieAgent and https://weijiawu.github.io/MovieAgent.
中文摘要:MovieAgent通过多智能体思维链规划实现了自动化长视频生成,能够根据剧本自动构建场景和摄影设置,在保证角色一致性和叙事连贯性的同时大幅降低了人工成本。
English Summary: MovieAgent introduces an automated long-form video generation system using multi-agent Chain of Thought planning to create coherent multi-scene videos from scripts while ensuring character consistency and synchronized elements, significantly reducing manual effort.

Authors:Won-Sang You, Tae-Gwan Ha, Seo-Young Lee, Kyung-Joong Kim
Title: Automatic Curriculum Design for Zero-Shot Human-AI Coordination
Abstract:
Zero-shot human-AI coordination is the training of an ego-agent to coordinate with humans without human data. Most studies on zero-shot human-AI coordination have focused on enhancing the ego-agent's coordination ability in a given environment without considering the issue of generalization to unseen environments. Real-world applications of zero-shot human-AI coordination should consider unpredictable environmental changes and the varying coordination ability of co-players depending on the environment. Previously, the multi-agent UED (Unsupervised Environment Design) approach has investigated these challenges by jointly considering environmental changes and co-player policy in competitive two-player AI-AI scenarios. In this paper, our study extends a multi-agent UED approach to zero-shot human-AI coordination. We propose a utility function and co-player sampling for a zero-shot human-AI coordination setting that helps train the ego-agent to coordinate with humans more effectively than a previous multi-agent UED approach. The zero-shot human-AI coordination performance was evaluated in the Overcooked-AI environment, using human proxy agents and real humans. Our method outperforms other baseline models and achieves high performance in human-AI coordination tasks in unseen environments. The source code is available at https://github.com/Uwonsang/ACD_Human-AI
中文: 本研究将多智能体无监督环境设计扩展到零样本人机协作,提出了一种效用函数和协作者采样方法,在未见环境中优于先前方法。
English: This study extends multi-agent Unsupervised Environment Design to zero-shot human-AI coordination, proposing a utility function and co-player sampling method that outperforms previous approaches in unseen environments.

Authors:Yuwei Niu, Munan Ning, Mengren Zheng, Weiyang Jin, Bin Lin, Peng Jin, Jiaqi Liao, Chaoran Feng, Kunpeng Ning, Bin Zhu, Li Yuan
Title: WISE: A World Knowledge-Informed Semantic Evaluation for Text-to-Image Generation
Abstract:
Text-to-Image (T2I) models are capable of generating high-quality artistic creations and visual content. However, existing research and evaluation standards predominantly focus on image realism and shallow text-image alignment, lacking a comprehensive assessment of complex semantic understanding and world knowledge integration in text to image generation. To address this challenge, we propose $\textbf{WISE}$, the first benchmark specifically designed for $\textbf{W}$orld Knowledge-$\textbf{I}$nformed $\textbf{S}$emantic $\textbf{E}$valuation. WISE moves beyond simple word-pixel mapping by challenging models with 1000 meticulously crafted prompts across 25 sub-domains in cultural common sense, spatio-temporal reasoning, and natural science. To overcome the limitations of traditional CLIP metric, we introduce $\textbf{WiScore}$, a novel quantitative metric for assessing knowledge-image alignment. Through comprehensive testing of 20 models (10 dedicated T2I models and 10 unified multimodal models) using 1,000 structured prompts spanning 25 subdomains, our findings reveal significant limitations in their ability to effectively integrate and apply world knowledge during image generation, highlighting critical pathways for enhancing knowledge incorporation and application in next-generation T2I models. Code and data are available at https://github.com/PKU-YuanGroup/WISE.
中文: WISE基准通过引入涵盖25个领域的结构化提示和创新的WiScore评估指标,解决了文本到图像模型在复杂语义理解与世界知识整合方面的评估空白,揭示了现有模型在知识应用上的显著不足。
English: The WISE benchmark addresses the gap in evaluating complex semantic understanding and world knowledge integration in text-to-image models by introducing structured prompts across 25 domains and a novel WiScore metric, revealing significant limitations in current models' knowledge application.

Authors:Baiyu Chen, Wilson Wongso, Zechen Li, Yonchanok Khaokaew, Hao Xue, Flora Salim
Title: COMODO: Cross-Modal Video-to-IMU Distillation for Efficient Egocentric Human Activity Recognition
Abstract:
Egocentric video-based models capture rich semantic information and have demonstrated strong performance in human activity recognition (HAR). However, their high power consumption, privacy concerns, and dependence on lighting conditions limit their feasibility for continuous on-device recognition. In contrast, inertial measurement unit (IMU) sensors offer an energy-efficient and privacy-preserving alternative, yet they suffer from limited large-scale annotated datasets, leading to weaker generalization in downstream tasks. To bridge this gap, we propose COMODO, a cross-modal self-supervised distillation framework that transfers rich semantic knowledge from the video modality to the IMU modality without requiring labeled annotations. COMODO leverages a pretrained and frozen video encoder to construct a dynamic instance queue, aligning the feature distributions of video and IMU embeddings. By distilling knowledge from video representations, our approach enables the IMU encoder to inherit rich semantic information from video while preserving its efficiency for real-world applications. Experiments on multiple egocentric HAR datasets demonstrate that COMODO consistently improves downstream classification performance, achieving results comparable to or exceeding fully supervised fine-tuned models. Moreover, COMODO exhibits strong cross-dataset generalization. Benefiting from its simplicity, our method is also generally applicable to various video and time-series pre-trained models, offering the potential to leverage more powerful teacher and student foundation models in future research. The code is available at https://github.com/Breezelled/COMODO .
中文摘要:COMODO是一种跨模态自监督蒸馏框架,通过将视频语义知识迁移至IMU传感器,无需标注数据即可实现高效且保护隐私的人类活动识别,并展现出卓越性能。
English Summary: COMODO is a self-supervised framework that transfers semantic knowledge from video to IMU sensors, enabling efficient and privacy-preserving human activity recognition without labeled data while achieving superior performance.

Authors:Xinyu Nan, Meng He, Zifan Chen, Bin Dong, Lei Tang, Li Zhang
Title: AI-Driven Automated Tool for Abdominal CT Body Composition Analysis in Gastrointestinal Cancer Management
Abstract:
The incidence of gastrointestinal cancers remains significantly high, particularly in China, emphasizing the importance of accurate prognostic assessments and effective treatment strategies. Research shows a strong correlation between abdominal muscle and fat tissue composition and patient outcomes. However, existing manual methods for analyzing abdominal tissue composition are time-consuming and costly, limiting clinical research scalability. To address these challenges, we developed an AI-driven tool for automated analysis of abdominal CT scans to effectively identify and segment muscle, subcutaneous fat, and visceral fat. Our tool integrates a multi-view localization model and a high-precision 2D nnUNet-based segmentation model, demonstrating a localization accuracy of 90% and a Dice Score Coefficient of 0.967 for segmentation. Furthermore, it features an interactive interface that allows clinicians to refine the segmentation results, ensuring high-quality outcomes effectively. Our tool offers a standardized method for effectively extracting critical abdominal tissues, potentially enhancing the management and treatment for gastrointestinal cancers. The code is available at https://github.com/NanXinyu/AI-Tool4Abdominal-Seg.git}{https://github.com/NanXinyu/AI-Tool4Abdominal-Seg.git.
Chinese Summary: 开发了一款基于人工智能的工具,用于自动分析腹部CT扫描,精确分割肌肉和脂肪组织,提供标准化方法以改善胃肠道癌症的治疗管理。
English Summary: An AI-driven tool was developed to automate the analysis of abdominal CT scans, accurately segmenting muscle and fat tissues with high precision, offering a standardized method to enhance gastrointestinal cancer management.

Authors:Fareed Qararyah, Mohammad Ali Maleki, Pedro Trancoso
Title: An Analytical Cost Model for Fast Evaluation of Multiple Compute-Engine CNN Accelerators
Abstract:
Convolutional Neural Networks (CNNs) serve various applications with diverse performance and resource requirements. Model-aware CNN accelerators best address these diverse requirements. These accelerators usually combine multiple dedicated Compute Engines (CEs). The flexibility of Field-Programmable Gate Arrays (FPGAs) enables the design of such multiple Compute-Engine (multiple-CE) accelerators. However, existing multiple-CE accelerators differ in how they arrange their CEs and distribute the FPGA resources and CNN operators among the CEs. The design space of multiple-CE accelerators comprises numerous such arrangements, which makes a systematic identification of the best ones an open challenge. This paper proposes a multiple-CE accelerator analytical Cost Model (MCCM) and an evaluation methodology built around MCCM. The model and methodology streamline the expression of any multiple-CE accelerator and provide a fast evaluation of its performance and efficiency. MCCM is in the order of 100000x faster than traditional synthesis-based evaluation and has an average accuracy of > 90%. The paper presents three use cases of MCCM. The first describes an end-to-end evaluation of state-of-the-art multiple-CE accelerators considering various metrics, CNN models, and resource budgets. The second describes fine-grained evaluation that helps identify performance bottlenecks of multiple-CE accelerators. The third demonstrates that MCCM fast evaluation enables exploring the vast design space of multiple-CE accelerators. These use cases show that no unique CE arrangement achieves the best results given different metrics, CNN models, and resource budgets. They also show that fast evaluation enables design space exploration, resulting in accelerator designs that outperform state-of-the-art ones. MCCM is available at https://github.com/fqararyah/MCCM.
中文: 本文提出MCCM这一快速精确的分析成本模型,能够高效探索多计算引擎CNN加速器的设计空间,并通过三个应用案例证明其可超越现有方案的优越性能。
English: This paper introduces MCCM, a fast and accurate analytical cost model that enables efficient design space exploration of multiple-CE CNN accelerators, demonstrating its effectiveness through three use cases that outperform existing approaches.

Authors:Haowen Bai, Jiangshe Zhang, Zixiang Zhao, Lilun Deng, Yukun Cui, Shuang Xu
Title: Retinex-MEF: Retinex-based Glare Effects Aware Unsupervised Multi-Exposure Image Fusion
Abstract:
Multi-exposure image fusion (MEF) synthesizes multiple, differently exposed images of the same scene into a single, well-exposed composite. Retinex theory, which separates image illumination from scene reflectance, provides a natural framework to ensure consistent scene representation and effective information fusion across varied exposure levels. However, the conventional pixel-wise multiplication of illumination and reflectance inadequately models the glare effect induced by overexposure. To address this limitation, we introduce an unsupervised and controllable method termed Retinex-MEF. Specifically, our method decomposes multi-exposure images into separate illumination components with a shared reflectance component, and effectively models the glare induced by overexposure. The shared reflectance is learned via a bidirectional loss, which enables our approach to effectively mitigate the glare effect. Furthermore, we introduce a controllable exposure fusion criterion, enabling global exposure adjustments while preserving contrast, thus overcoming the constraints of a fixed exposure level. Extensive experiments on diverse datasets, including underexposure-overexposure fusion, exposure controlled fusion, and homogeneous extreme exposure fusion, demonstrate the effective decomposition and flexible fusion capability of our model. The code is available at https://github.com/HaowenBai/Retinex-MEF
中文:Retinex-MEF是一种无监督可控方法,通过将多曝光图像分解为光照和共享反射分量,有效抑制过曝光眩光,并能在保持对比度的同时实现灵活的全局曝光调节。
English: Retinex-MEF is an unsupervised and controllable method that decomposes multi-exposure images into illumination and shared reflectance components, effectively mitigating glare effects and enabling flexible global exposure adjustments while preserving contrast.

Authors:Zebin You, Jingyang Ou, Xiaolu Zhang, Jun Hu, Jun Zhou, Chongxuan Li
Title: Effective and Efficient Masked Image Generation Models
Abstract:
Although masked image generation models and masked diffusion models are designed with different motivations and objectives, we observe that they can be unified within a single framework. Building upon this insight, we carefully explore the design space of training and sampling, identifying key factors that contribute to both performance and efficiency. Based on the improvements observed during this exploration, we develop our model, referred to as eMIGM. Empirically, eMIGM demonstrates strong performance on ImageNet generation, as measured by Fréchet Inception Distance (FID). In particular, on ImageNet 256x256, with similar number of function evaluations (NFEs) and model parameters, eMIGM outperforms the seminal VAR. Moreover, as NFE and model parameters increase, eMIGM achieves performance comparable to the state-of-the-art continuous diffusion models while requiring less than 40% of the NFE. Additionally, on ImageNet 512x512, with only about 60% of the NFE, eMIGM outperforms the state-of-the-art continuous diffusion models. Code is available at https://github.com/ML-GSAI/eMIGM.
中文: eMIGM模型将掩码图像生成与扩散模型统一起来,在ImageNet基准测试中以更少的函数评估次数实现了优于现有方法的效率与性能表现。
English: The eMIGM model unifies masked image generation and diffusion approaches, demonstrating superior efficiency and performance on ImageNet benchmarks by requiring significantly fewer function evaluations than existing methods.

Authors:Yan Ren, Shilin Lu, Adams Wai-Kin Kong
Title: All That Glitters Is Not Gold: Key-Secured 3D Secrets within 3D Gaussian Splatting
Abstract:
Recent advances in 3D Gaussian Splatting (3DGS) have revolutionized scene reconstruction, opening new possibilities for 3D steganography by hiding 3D secrets within 3D covers. The key challenge in steganography is ensuring imperceptibility while maintaining high-fidelity reconstruction. However, existing methods often suffer from detectability risks and utilize only suboptimal 3DGS features, limiting their full potential. We propose a novel end-to-end key-secured 3D steganography framework (KeySS) that jointly optimizes a 3DGS model and a key-secured decoder for secret reconstruction. Our approach reveals that Gaussian features contribute unequally to secret hiding. The framework incorporates a key-controllable mechanism enabling multi-secret hiding and unauthorized access prevention, while systematically exploring optimal feature update to balance fidelity and security. To rigorously evaluate steganographic imperceptibility beyond conventional 2D metrics, we introduce 3D-Sinkhorn distance analysis, which quantifies distributional differences between original and steganographic Gaussian parameters in the representation space. Extensive experiments demonstrate that our method achieves state-of-the-art performance in both cover and secret reconstruction while maintaining high security levels, advancing the field of 3D steganography. Code is available at https://github.com/RY-Paper/KeySS
中文:我们提出的KeySS框架开创了一种新型端到端3D隐写方法,通过联合优化高斯溅射模型与密钥保护解码器,借助优化特征选择和3D-Sinkhorn距离评估,实现了卓越的安全性和重建保真度。
English: Our proposed KeySS framework introduces a novel end-to-end 3D steganography method that jointly optimizes Gaussian Splatting models with key-secured decoders, achieving superior security and reconstruction fidelity through optimized feature selection and 3D-Sinkhorn distance evaluation.

Authors:Kazuya Nishimura, Ryoma Bise, Yasuhiro Kojima
Title: Towards Spatial Transcriptomics-guided Pathological Image Recognition with Batch-Agnostic Encoder
Abstract:
Spatial transcriptomics (ST) is a novel technique that simultaneously captures pathological images and gene expression profiling with spatial coordinates. Since ST is closely related to pathological features such as disease subtypes, it may be valuable to augment image representation with pathological information. However, there are no attempts to leverage ST for image recognition ({\it i.e,} patch-level classification of subtypes of pathological image.). One of the big challenges is significant batch effects in spatial transcriptomics that make it difficult to extract pathological features of images from ST. In this paper, we propose a batch-agnostic contrastive learning framework that can extract consistent signals from gene expression of ST in multiple patients. To extract consistent signals from ST, we utilize the batch-agnostic gene encoder that is trained in a variational inference manner. Experiments demonstrated the effectiveness of our framework on a publicly available dataset. Code is publicly available at https://github.com/naivete5656/TPIRBAE
中文: 空间转录组学是一种结合病理图像与基因表达数据的新技术,本文提出了一种批次无关的对比学习框架,能从中提取一致性信号,并在公开数据集上验证了其有效性。
English: Spatial transcriptomics is a cutting-edge method that combines pathological images with gene expression data, and this paper introduces a batch-agnostic contrastive learning framework to extract consistent signals from it, validated on a public dataset.

Authors:Zhen Zou, Feng Zhao
Title: FEB-Cache: Frequency-Guided Exposure Bias Reduction for Enhancing Diffusion Transformer Caching
Abstract:
Diffusion Transformer (DiT) has exhibited impressive generation capabilities but faces great challenges due to its high computational complexity. To address this issue, various methods, notably feature caching, have been introduced. However, these approaches focus on aligning non-cache diffusion without analyzing why caching damage the generation processes. In this paper, we first confirm that the cache greatly amplifies the exposure bias, resulting in a decline in the generation quality. However, directly applying noise scaling is challenging for this issue due to the non-smoothness of exposure bias. We found that this phenomenon stems from the mismatch between its frequency response characteristics and the simple cache of Attention and MLP. Since these two components exhibit unique preferences for frequency signals, which provides us with a caching strategy to separate Attention and MLP to achieve an enhanced fit of exposure bias and reduce it. Based on this, we introduced FEB-Cache, a joint caching strategy that aligns with the non-exposed bias diffusion process (which gives us a higher performance cap) of caching Attention and MLP based on the frequency-guided cache table. Our approach combines a comprehensive understanding of the caching mechanism and offers a new perspective on leveraging caching to accelerate the diffusion process. Empirical results indicate that FEB-Cache optimizes model performance while concurrently facilitating acceleration. Code is available at https://github.com/aSleepyTree/EB-Cache.
中文: 扩散变换器(DiT)存在高计算复杂度问题,特征缓存方法虽被采用,但会加剧曝光偏差,导致生成质量下降;提出的FEB-Cache策略基于频率信号分离注意力和MLP缓存,有效减少偏差并提升性能和加速效果。
English: The Diffusion Transformer (DiT) faces computational challenges, and while feature caching is used to address this, it exacerbates exposure bias, leading to reduced generation quality; the proposed FEB-Cache strategy separates Attention and MLP caching based on frequency signals to mitigate this bias and enhance performance and acceleration.

Authors:Hanqing Guo, Xiuxiu Lin, Shiyu Zhao
Title: YOLOMG: Vision-based Drone-to-Drone Detection with Appearance and Pixel-Level Motion Fusion
Abstract:
Vision-based drone-to-drone detection has attracted increasing attention due to its importance in numerous tasks such as vision-based swarming, aerial see-and-avoid, and malicious drone detection. However, existing methods often encounter failures when the background is complex or the target is tiny. This paper proposes a novel end-to-end framework that accurately identifies small drones in complex environments using motion guidance. It starts by creating a motion difference map to capture the motion characteristics of tiny drones. Next, this motion difference map is combined with an RGB image using a bimodal fusion module, allowing for adaptive feature learning of the drone. Finally, the fused feature map is processed through an enhanced backbone and detection head based on the YOLOv5 framework to achieve accurate detection results. To validate our method, we propose a new dataset, named ARD100, which comprises 100 videos (202,467 frames) covering various challenging conditions and has the smallest average object size compared with the existing drone detection datasets. Extensive experiments on the ARD100 and NPS-Drones datasets show that our proposed detector performs exceptionally well under challenging conditions and surpasses state-of-the-art algorithms across various metrics. We publicly release the codes and ARD100 dataset at https://github.com/Irisky123/YOLOMG.
中文: 本文提出了一种新颖的端到端框架,通过融合运动引导与RGB图像来提升复杂环境下小型无人机的检测精度,并在多个数据集上验证了其优越性能。
English: This paper introduces an end-to-end framework that enhances small drone detection in complex environments by integrating motion guidance with RGB imagery and outperforms existing methods on challenging datasets.

Authors:Shuhe Wang, Xiaoya Li, Jiwei Li, Guoyin Wang, Xiaofei Sun, Bob Zhu, Han Qiu, Mo Yu, Shengjie Shen, Tianwei Zhang, Eduard Hovy
Title: FaceID-6M: A Large-Scale, Open-Source FaceID Customization Dataset
Abstract:
Due to the data-driven nature of current face identity (FaceID) customization methods, all state-of-the-art models rely on large-scale datasets containing millions of high-quality text-image pairs for training. However, none of these datasets are publicly available, which restricts transparency and hinders further advancements in the field. To address this issue, in this paper, we collect and release FaceID-6M, the first large-scale, open-source FaceID dataset containing 6 million high-quality text-image pairs. Filtered from LAION-5B \cite{schuhmann2022laion}, FaceID-6M undergoes a rigorous image and text filtering steps to ensure dataset quality, including resolution filtering to maintain high-quality images and faces, face filtering to remove images that lack human faces, and keyword-based strategy to retain descriptions containing human-related terms (e.g., nationality, professions and names). Through these cleaning processes, FaceID-6M provides a high-quality dataset optimized for training powerful FaceID customization models, facilitating advancements in the field by offering an open resource for research and development. We conduct extensive experiments to show the effectiveness of our FaceID-6M, demonstrating that models trained on our FaceID-6M dataset achieve performance that is comparable to, and slightly better than currently available industrial models. Additionally, to support and advance research in the FaceID customization community, we make our code, datasets, and models fully publicly available. Our codes, models, and datasets are available at: https://github.com/ShuheSH/FaceID-6M.
中文: 现有FaceID模型依赖非公开数据集,因此本文推出高质量开源数据集FaceID-6M,包含600万图文对,既能训练出有竞争力的模型,又通过公开资源推动该领域研究发展。
English: Current FaceID models rely on non-public datasets, so this paper introduces FaceID-6M, a high-quality open-source dataset with 6 million text-image pairs that enables training competitive models and promotes transparency in research.

Authors:Spyros Kondylatos, Nikolaos Ioannis Bountos, Dimitrios Michail, Xiao Xiang Zhu, Gustau Camps-Valls, Ioannis Papoutsis
Title: On the Generalization of Representation Uncertainty in Earth Observation
Abstract:
Recent advances in Computer Vision have introduced the concept of pretrained representation uncertainty, enabling zero-shot uncertainty estimation. This holds significant potential for Earth Observation (EO), where trustworthiness is critical, yet the complexity of EO data poses challenges to uncertainty-aware methods. In this work, we investigate the generalization of representation uncertainty in EO, considering the domain's unique semantic characteristics. We pretrain uncertainties on large EO datasets and propose an evaluation framework to assess their zero-shot performance in multi-label classification and segmentation EO tasks. Our findings reveal that, unlike uncertainties pretrained on natural images, EO-pretraining exhibits strong generalization across unseen EO domains, geographic locations, and target granularities, while maintaining sensitivity to variations in ground sampling distance. We demonstrate the practical utility of pretrained uncertainties showcasing their alignment with task-specific uncertainties in downstream tasks, their sensitivity to real-world EO image noise, and their ability to generate spatial uncertainty estimates out-of-the-box. Initiating the discussion on representation uncertainty in EO, our study provides insights into its strengths and limitations, paving the way for future research in the field. Code and weights are available at: https://github.com/Orion-AI-Lab/EOUncertaintyGeneralization.
中文: 计算机视觉的最新进展通过预训练表示不确定性实现了零样本不确定性估计,该技术在地球观测任务中展现出跨领域和地理位置的强大泛化能力,同时与任务特定不确定性及实际噪声敏感性保持一致。
English: Recent advances in computer vision enable zero-shot uncertainty estimation through pretrained representation uncertainty, which shows strong generalization in Earth Observation tasks across domains and geographic locations while aligning with task-specific uncertainties and real-world noise sensitivity.

Authors:Xiaotian Han, Tianlong Chen, Kaixiong Zhou, Zhimeng Jiang, Zhangyang Wang, Xia Hu
Title: You Only Debias Once: Towards Flexible Accuracy-Fairness Trade-offs at Inference Time
Abstract:
Deep neural networks are prone to various bias issues, jeopardizing their applications for high-stake decision-making. Existing fairness methods typically offer a fixed accuracy-fairness trade-off, since the weight of the well-trained model is a fixed point (fairness-optimum) in the weight space. Nevertheless, more flexible accuracy-fairness trade-offs at inference time are practically desired since: 1) stakes of the same downstream task can vary for different individuals, and 2) different regions have diverse laws or regularization for fairness. If using the previous fairness methods, we have to train multiple models, each offering a specific level of accuracy-fairness trade-off. This is often computationally expensive, time-consuming, and difficult to deploy, making it less practical for real-world applications. To address this problem, we propose You Only Debias Once (YODO) to achieve in-situ flexible accuracy-fairness trade-offs at inference time, using a single model that trained only once. Instead of pursuing one individual fixed point (fairness-optimum) in the weight space, we aim to find a "line" in the weight space that connects the accuracy-optimum and fairness-optimum points using a single model. Points (models) on this line implement varying levels of accuracy-fairness trade-offs. At inference time, by manually selecting the specific position of the learned "line", our proposed method can achieve arbitrary accuracy-fairness trade-offs for different end-users and scenarios. Experimental results on tabular and image datasets show that YODO achieves flexible trade-offs between model accuracy and fairness, at ultra-low overheads. For example, if we need $100$ levels of trade-off on the \acse dataset, YODO takes $3.53$ seconds while training $100$ fixed models consumes $425$ seconds. The code is available at https://github.com/ahxt/yodo.
Chinese: 提出的“一次去偏”方法通过单一模型在推理时实现灵活的准确性与公平性权衡,无需训练多个模型,通过在权重空间中找到连接准确性和公平性最优点的直线来实现这一目标。
English: The proposed You Only Debias Once (YODO) method enables flexible accuracy-fairness trade-offs during inference using a single model, eliminating the need for multiple models by finding an optimal line in the weight space that connects accuracy and fairness optima.

Authors:Sheng Luo, Yi Zhou, Tao Zhou
Title: Universal Incremental Learning: Mitigating Confusion from Inter- and Intra-task Distribution Randomness
Abstract:
Incremental learning (IL) aims to overcome catastrophic forgetting of previous tasks while learning new ones. Existing IL methods make strong assumptions that the incoming task type will either only increases new classes or domains (i.e. Class IL, Domain IL), or increase by a static scale in a class- and domain-agnostic manner (i.e. Versatile IL (VIL)), which greatly limit their applicability in the unpredictable and dynamic wild. In this work, we investigate $\textbf{Universal Incremental Learning (UIL)}$, where a model neither knows which new classes or domains will increase along sequential tasks, nor the scale of the increments within each task. This uncertainty prevents the model from confidently learning knowledge from all task distributions and symmetrically focusing on the diverse knowledge within each task distribution. Consequently, UIL presents a more general and realistic IL scenario, making the model face confusion arising from inter-task and intra-task distribution randomness. To $\textbf{Mi}$tigate both $\textbf{Co}$nfusion, we propose a simple yet effective framework for UIL, named $\textbf{MiCo}$. At the inter-task distribution level, we employ a multi-objective learning scheme to enforce accurate and deterministic predictions, and its effectiveness is further enhanced by a direction recalibration module that reduces conflicting gradients. Moreover, at the intra-task distribution level, we introduce a magnitude recalibration module to alleviate asymmetrical optimization towards imbalanced class distribution. Extensive experiments on three benchmarks demonstrate the effectiveness of our method, outperforming existing state-of-the-art methods in both the UIL scenario and the VIL scenario. Our code will be available at $\href{https://github.com/rolsheng/UIL}{here}$.
中文: 本文提出通用增量学习(UIL)这一更现实的场景,模型需面对类别和领域不可预测的双重增量挑战,并通过多目标学习和校准模块的MiCo框架有效缓解任务间与任务内的混淆,在多个基准测试中显著超越现有方法。
English: This paper introduces Universal Incremental Learning (UIL), a more realistic scenario where models face unpredictable task increments in both classes and domains, and proposes MiCo, a framework that mitigates inter-task and intra-task confusion through multi-objective learning and recalibration modules, demonstrating superior performance over existing methods.

Authors:Dong-Hee Paek, Seung-Hyun Kong
Title: Availability-aware Sensor Fusion via Unified Canonical Space for 4D Radar, LiDAR, and Camera
Abstract:
Sensor fusion of camera, LiDAR, and 4-dimensional (4D) Radar has brought a significant performance improvement in autonomous driving (AD). However, there still exist fundamental challenges: deeply coupled fusion methods assume continuous sensor availability, making them vulnerable to sensor degradation and failure, whereas sensor-wise cross-attention fusion methods struggle with computational cost and unified feature representation. This paper presents availability-aware sensor fusion (ASF), a novel method that employs unified canonical projection (UCP) to enable consistency in all sensor features for fusion and cross-attention across sensors along patches (CASAP) to enhance robustness of sensor fusion against sensor degradation and failure. As a result, the proposed ASF shows a superior object detection performance to the existing state-of-the-art fusion methods under various weather and sensor degradation (or failure) conditions; Extensive experiments on the K-Radar dataset demonstrate that ASF achieves improvements of 9.7% in AP BEV (87.2%) and 20.1% in AP 3D (73.6%) in object detection at IoU=0.5, while requiring a low computational cost. The code will be available at https://github.com/kaist-avelab/K-Radar.
中文: 本文提出可用性感知传感器融合(ASF)方法,通过统一规范投影和跨传感器补丁交叉注意力机制,在恶劣天气和传感器故障条件下显著提升自动驾驶物体检测性能,同时保持较低计算成本。
English: This paper introduces availability-aware sensor fusion (ASF), a novel method that enhances autonomous driving robustness by employing unified canonical projection and cross-attention across sensors, achieving superior object detection performance under adverse conditions with low computational cost.

Authors:Mohammed Mahfoud, Ghait Boukachab, Michał Koziarski, Alex Hernandez-Garcia, Stefan Bauer, Yoshua Bengio, Nikolay Malkin
Title: Learning Decision Trees as Amortized Structure Inference
Abstract:
Building predictive models for tabular data presents fundamental challenges, notably in scaling consistently, i.e., more resources translating to better performance, and generalizing systematically beyond the training data distribution. Designing decision tree models remains especially challenging given the intractably large search space, and most existing methods rely on greedy heuristics, while deep learning inductive biases expect a temporal or spatial structure not naturally present in tabular data. We propose a hybrid amortized structure inference approach to learn predictive decision tree ensembles given data, formulating decision tree construction as a sequential planning problem. We train a deep reinforcement learning (GFlowNet) policy to solve this problem, yielding a generative model that samples decision trees from the Bayesian posterior. We show that our approach, DT-GFN, outperforms state-of-the-art decision tree and deep learning methods on standard classification benchmarks derived from real-world data, robustness to distribution shifts, and anomaly detection, all while yielding interpretable models with shorter description lengths. Samples from the trained DT-GFN model can be ensembled to construct a random forest, and we further show that the performance of scales consistently in ensemble size, yielding ensembles of predictors that continue to generalize systematically.
中文摘要:DT-GFN采用深度强化学习方法将决策树构建转化为序列规划问题,在分类性能、鲁棒性和可解释性上超越现有方法,并能通过集成学习实现持续的性能提升。
English Summary: DT-GFN introduces a deep reinforcement learning approach to construct decision trees as sequential planning problems, outperforming existing methods in classification, robustness, and interpretability while scaling effectively with ensemble size.

Authors:Jiahao Wang, Xiangyu Cao, Jiaru Zhong, Yuner Zhang, Haibao Yu, Lei He, Shaobing Xu
Title: Griffin: Aerial-Ground Cooperative Detection and Tracking Dataset and Benchmark
Abstract:
Despite significant advancements, autonomous driving systems continue to struggle with occluded objects and long-range detection due to the inherent limitations of single-perspective sensing. Aerial-ground cooperation offers a promising solution by integrating UAVs' aerial views with ground vehicles' local observations. However, progress in this emerging field has been hindered by the absence of public datasets and standardized evaluation benchmarks. To address this gap, this paper presents a comprehensive solution for aerial-ground cooperative 3D perception through three key contributions: (1) Griffin, a large-scale multi-modal dataset featuring over 200 dynamic scenes (30k+ frames) with varied UAV altitudes (20-60m), diverse weather conditions, and occlusion-aware 3D annotations, enhanced by CARLA-AirSim co-simulation for realistic UAV dynamics; (2) A unified benchmarking framework for aerial-ground cooperative detection and tracking tasks, including protocols for evaluating communication efficiency, latency tolerance, and altitude adaptability; (3) AGILE, an instance-level intermediate fusion baseline that dynamically aligns cross-view features through query-based interaction, achieving an advantageous balance between communication overhead and perception accuracy. Extensive experiments prove the effectiveness of aerial-ground cooperative perception and demonstrate the direction of further research. The dataset and codes are available at https://github.com/wang-jh18-SVM/Griffin.
中文: 自动驾驶系统因单视角感知的局限难以处理遮挡和远距离检测,而空地协同通过提供数据集、基准框架和融合方法,有效提升了感知精度与效率。
English: Autonomous driving faces challenges with occlusions and long-range detection, which aerial-ground cooperation addresses through a new dataset, benchmarking framework, and fusion method that enhance perception accuracy and efficiency.

Authors:Shining Wang, Yunlong Wang, Ruiqi Wu, Bingliang Jiao, Wenxuan Wang, Peng Wang
Title: SeCap: Self-Calibrating and Adaptive Prompts for Cross-view Person Re-Identification in Aerial-Ground Networks
Abstract:
When discussing the Aerial-Ground Person Re-identification (AGPReID) task, we face the main challenge of the significant appearance variations caused by different viewpoints, making identity matching difficult. To address this issue, previous methods attempt to reduce the differences between viewpoints by critical attributes and decoupling the viewpoints. While these methods can mitigate viewpoint differences to some extent, they still face two main issues: (1) difficulty in handling viewpoint diversity and (2) neglect of the contribution of local features. To effectively address these challenges, we design and implement the Self-Calibrating and Adaptive Prompt (SeCap) method for the AGPReID task. The core of this framework relies on the Prompt Re-calibration Module (PRM), which adaptively re-calibrates prompts based on the input. Combined with the Local Feature Refinement Module (LFRM), SeCap can extract view-invariant features from local features for AGPReID. Meanwhile, given the current scarcity of datasets in the AGPReID field, we further contribute two real-world Large-scale Aerial-Ground Person Re-Identification datasets, LAGPeR and G2APS-ReID. The former is collected and annotated by us independently, covering $4,231$ unique identities and containing $63,841$ high-quality images; the latter is reconstructed from the person search dataset G2APS. Through extensive experiments on AGPReID datasets, we demonstrate that SeCap is a feasible and effective solution for the AGPReID task. The datasets and source code available on https://github.com/wangshining681/SeCap-AGPReID.
中文:SeCap方法通过自适应重新校准提示和优化局部特征来解决空地行人重识别难题,同时引入两个新数据集以推动该领域研究。
English: The SeCap method addresses aerial-ground person re-identification challenges by adaptively recalibrating prompts and refining local features, while introducing two new datasets to advance research in this field.

Authors:Zenghao Guan, Yucan Zhou, Xiaoyan Gu
Title: Capture Global Feature Statistics for One-Shot Federated Learning
Abstract:
Traditional Federated Learning (FL) necessitates numerous rounds of communication between the server and clients, posing significant challenges including high communication costs, connection drop risks and susceptibility to privacy attacks. One-shot FL has become a compelling learning paradigm to overcome above drawbacks by enabling the training of a global server model via a single communication round. However, existing one-shot FL methods suffer from expensive computation cost on the server or clients and cannot deal with non-IID (Independent and Identically Distributed) data stably and effectively. To address these challenges, this paper proposes FedCGS, a novel Federated learning algorithm that Capture Global feature Statistics leveraging pre-trained models. With global feature statistics, we achieve training-free and heterogeneity-resistant one-shot FL. Furthermore, we extend its application to personalization scenario, where clients only need execute one extra communication round with server to download global statistics. Extensive experimental results demonstrate the effectiveness of our methods across diverse data heterogeneity settings. Code is available at https://github.com/Yuqin-G/FedCGS.
中文: 传统联邦学习存在高通信成本和隐私风险,单轮联邦学习虽能克服这些缺点,但面临计算开销大和非独立同分布数据处理的挑战;本文提出FedCGS算法,利用预训练模型捕获全局特征统计量,实现无需训练、抗异构的单轮联邦学习,并可扩展至个性化场景,仅需额外一轮通信。
English: Traditional Federated Learning faces high communication costs and privacy risks, which one-shot FL addresses but struggles with computational expenses and non-IID data; this paper introduces FedCGS, a method that uses global feature statistics from pre-trained models for efficient, heterogeneity-resistant one-shot FL and personalization with minimal extra communication.

Authors:Xin Wen, Bingchen Zhao, Yilun Chen, Jiangmiao Pang, Xiaojuan Qi
Title: A Data-Centric Revisit of Pre-Trained Vision Models for Robot Learning
Abstract:
Pre-trained vision models (PVMs) are fundamental to modern robotics, yet their optimal configuration remains unclear. Through systematic evaluation, we find that while DINO and iBOT outperform MAE across visuomotor control and perception tasks, they struggle when trained on non-(single-)object-centric (NOC) data--a limitation strongly correlated with their diminished ability to learn object-centric representations. This investigation indicates that the ability to form object-centric representations from the non-object-centric robotics dataset is the key to success for PVMs. Motivated by this discovery, we designed SlotMIM, a method that induces object-centric representations by introducing a semantic bottleneck to reduce the number of prototypes to encourage the emergence of objectness as well as cross-view consistency regularization for encouraging multiview invariance. Our experiments encompass pre-training on object-centric, scene-centric, web-crawled, and ego-centric data. Across all settings, our approach learns transferrable representations and achieves significant improvements over prior work in image recognition, scene understanding, and robot learning evaluations. When scaled up with million-scale datasets, our method also demonstrates superior data efficiency and scalability. Our code and models are publicly available at https://github.com/CVMI-Lab/SlotMIM.
中文摘要:研究表明,预训练视觉模型在机器人技术中的成功关键在于其从非物体中心数据中形成物体中心表征的能力,据此开发的SlotMIM方法在多项任务和数据集上均实现了卓越性能提升。
English Summary: This study reveals that pre-trained vision models' success in robotics hinges on their ability to form object-centric representations from non-object-centric data, leading to the development of SlotMIM, which achieves superior performance across various tasks and datasets.

Authors:Shrutika Vishal Thengane, Marcel Bartholomeus Prasetyo, Yu Xiang Tan, Malika Meghjani
Title: MERLION: Marine ExploRation with Language guIded Online iNformative Visual Sampling and Enhancement
Abstract:
Autonomous and targeted underwater visual monitoring and exploration using Autonomous Underwater Vehicles (AUVs) can be a challenging task due to both online and offline constraints. The online constraints comprise limited onboard storage capacity and communication bandwidth to the surface, whereas the offline constraints entail the time and effort required for the selection of desired key frames from the video data. An example use case of targeted underwater visual monitoring is finding the most interesting visual frames of fish in a long sequence of an AUV's visual experience. This challenge of targeted informative sampling is further aggravated in murky waters with poor visibility. In this paper, we present MERLION, a novel framework that provides semantically aligned and visually enhanced summaries for murky underwater marine environment monitoring and exploration. Specifically, our framework integrates (a) an image-text model for semantically aligning the visual samples to the users' needs, (b) an image enhancement model for murky water visual data and (c) an informative sampler for summarizing the monitoring experience. We validate our proposed MERLION framework on real-world data with user studies and present qualitative and quantitative results using our evaluation metric and show improved results compared to the state-of-the-art approaches. We have open-sourced the code for MERLION at the following link https://github.com/MARVL-Lab/MERLION.git.
Chinese: MERLION框架通过整合语义对齐、图像增强和信息采样技术,为浑浊水下环境生成简洁的视觉摘要,有效解决了水下监测的挑战。
English: The MERLION framework addresses underwater monitoring challenges by integrating semantic alignment, image enhancement, and informative sampling to generate concise visual summaries for murky marine environments.

Authors:Junyan Lin, Feng Gap, Lin Qi, Junyu Dong, Qian Du, Xinbo Gao
Title: Dynamic Cross-Modal Feature Interaction Network for Hyperspectral and LiDAR Data Classification
Abstract:
Hyperspectral image (HSI) and LiDAR data joint classification is a challenging task. Existing multi-source remote sensing data classification methods often rely on human-designed frameworks for feature extraction, which heavily depend on expert knowledge. To address these limitations, we propose a novel Dynamic Cross-Modal Feature Interaction Network (DCMNet), the first framework leveraging a dynamic routing mechanism for HSI and LiDAR classification. Specifically, our approach introduces three feature interaction blocks: Bilinear Spatial Attention Block (BSAB), Bilinear Channel Attention Block (BCAB), and Integration Convolutional Block (ICB). These blocks are designed to effectively enhance spatial, spectral, and discriminative feature interactions. A multi-layer routing space with routing gates is designed to determine optimal computational paths, enabling data-dependent feature fusion. Additionally, bilinear attention mechanisms are employed to enhance feature interactions in spatial and channel representations. Extensive experiments on three public HSI and LiDAR datasets demonstrate the superiority of DCMNet over state-of-the-art methods. Our code will be available at https://github.com/oucailab/DCMNet.
中文: 本文提出DCMNet动态跨模态网络,通过路由机制和双线性注意力模块增强高光谱与激光雷达数据的特征交互,在公开数据集上实现了优于现有方法的分类性能。
English: The paper introduces DCMNet, a dynamic cross-modal network using routing mechanisms and bilinear attention blocks to enhance feature fusion for hyperspectral and LiDAR image classification, outperforming existing methods on public datasets.

Authors:Jiacheng Liu, Chang Zou, Yuanhuiyi Lyu, Junjie Chen, Linfeng Zhang
Title: From Reusing to Forecasting: Accelerating Diffusion Models with TaylorSeers
Abstract:
Diffusion Transformers (DiT) have revolutionized high-fidelity image and video synthesis, yet their computational demands remain prohibitive for real-time applications. To solve this problem, feature caching has been proposed to accelerate diffusion models by caching the features in the previous timesteps and then reusing them in the following timesteps. However, at timesteps with significant intervals, the feature similarity in diffusion models decreases substantially, leading to a pronounced increase in errors introduced by feature caching, significantly harming the generation quality. To solve this problem, we propose TaylorSeer, which firstly shows that features of diffusion models at future timesteps can be predicted based on their values at previous timesteps. Based on the fact that features change slowly and continuously across timesteps, TaylorSeer employs a differential method to approximate the higher-order derivatives of features and predict features in future timesteps with Taylor series expansion. Extensive experiments demonstrate its significant effectiveness in both image and video synthesis, especially in high acceleration ratios. For instance, it achieves an almost lossless acceleration of 4.99$\times$ on FLUX and 5.00$\times$ on HunyuanVideo without additional training. On DiT, it achieves $3.41$ lower FID compared with previous SOTA at $4.53$$\times$ acceleration. %Our code is provided in the supplementary materials and will be made publicly available on GitHub. Our codes have been released in Github:https://github.com/Shenyi-Z/TaylorSeer
Chinese: TaylorSeer通过泰勒级数展开预测未来时间步的特征,无需额外训练即可在图像和视频合成中实现近乎无损的高速加速效果。
English: TaylorSeer accelerates diffusion models by predicting future timestep features using Taylor series expansion, achieving near-lossless high-speed performance in image and video synthesis without extra training.

Authors:Chengzhi Lin, Chuyuan Wang, Annan Xie, Wuhong Wang, Ziye Zhang, Canguang Ruan, Yuancai Huang, Yongqi Liu
Title: AlignPxtr: Aligning Predicted Behavior Distributions for Bias-Free Video Recommendations
Abstract:
In video recommendation systems, user behaviors such as watch time, likes, and follows are commonly used to infer user interest. However, these behaviors are influenced by various biases, including duration bias, demographic biases, and content category biases, which obscure true user preferences. In this paper, we hypothesize that biases and user interest are independent of each other. Based on this assumption, we propose a novel method that aligns predicted behavior distributions across different bias conditions using quantile mapping, theoretically guaranteeing zero mutual information between bias variables and the true user interest. By explicitly modeling the conditional distributions of user behaviors under different biases and mapping these behaviors to quantiles, we effectively decouple user interest from the confounding effects of various biases. Our approach uniquely handles both continuous signals (e.g., watch time) and discrete signals (e.g., likes, comments), while simultaneously addressing multiple bias dimensions. Additionally, we introduce a computationally efficient mean alignment alternative technique for practical real-time inference in large-scale systems. We validate our method through online A/B testing on two major video platforms: Kuaishou Lite and Kuaishou. The results demonstrate significant improvements in user engagement and retention, with \textbf{cumulative lifts of 0.267\% and 0.115\% in active days, and 1.102\% and 0.131\% in average app usage time}, respectively. The results demonstrate that our approach consistently achieves significant improvements in long-term user retention and substantial gains in average app usage time across different platforms. Our core code will be publised at https://github.com/justopit/CQE.
中文: 本文提出了一种通过分位数映射对齐行为分布的新方法,有效分离视频推荐系统中的用户兴趣与各种偏见,在多个平台上显著提升了用户参与度和留存率。
English: This paper introduces a novel method that decouples user interest from biases in video recommendation systems by aligning behavior distributions via quantile mapping, achieving significant improvements in user engagement and retention across platforms.

Authors:Chikai Shang, Mengke Li, Yiqun Zhang, Zhen Chen, Jinlin Wu, Fangqing Gu, Yang Lu, Yiu-ming Cheung
Title: Iterative Prompt Relocation for Distribution-Adaptive Visual Prompt Tuning
Abstract:
Visual prompt tuning (VPT) provides an efficient and effective solution for adapting pre-trained models to various downstream tasks by incorporating learnable prompts. However, most prior art indiscriminately applies a fixed prompt distribution across different tasks, neglecting the importance of each block differing depending on the task. In this paper, we investigate adaptive distribution optimization (ADO) by addressing two key questions: (1) How to appropriately and formally define ADO, and (2) How to design an adaptive distribution strategy guided by this definition? Through in-depth analysis, we provide an affirmative answer that properly adjusting the distribution significantly improves VPT performance, and further uncover a key insight that a nested relationship exists between ADO and VPT. Based on these findings, we propose a new VPT framework, termed PRO-VPT (iterative Prompt RelOcation-based VPT), which adaptively adjusts the distribution building upon a nested optimization formulation. Specifically, we develop a prompt relocation strategy for ADO derived from this formulation, comprising two optimization steps: identifying and pruning idle prompts, followed by determining the optimal blocks for their relocation. By iteratively performing prompt relocation and VPT, our proposal adaptively learns the optimal prompt distribution, thereby unlocking the full potential of VPT. Extensive experiments demonstrate that our proposal significantly outperforms state-of-the-art VPT methods, e.g., PRO-VPT surpasses VPT by 1.6% average accuracy, leading prompt-based methods to state-of-the-art performance on the VTAB-1k benchmark. The code is available at https://github.com/ckshang/PRO-VPT.
中文摘要:本文提出PRO-VPT框架,通过迭代式提示重定位策略自适应优化提示分布,显著提升视觉提示微调性能,在多个基准测试中达到最先进水平。
English Summary: This paper introduces PRO-VPT, an adaptive prompt distribution optimization framework that iteratively relocates prompts between model blocks to enhance visual prompt tuning performance, achieving state-of-the-art results on multiple benchmarks.

Authors:Chikai Shang, Mengke Li, Yiqun Zhang, Zhen Chen, Jinlin Wu, Fangqing Gu, Yang Lu, Yiu-ming Cheung
Title: PRO-VPT: Distribution-Adaptive Visual Prompt Tuning via Prompt Relocation
Abstract:
Visual prompt tuning (VPT), i.e., fine-tuning some lightweight prompt tokens, provides an efficient and effective approach for adapting pre-trained models to various downstream tasks. However, most prior art indiscriminately uses a fixed prompt distribution across different tasks, neglecting the importance of each block varying depending on the task. In this paper, we introduce adaptive distribution optimization (ADO) by tackling two key questions: (1) How to appropriately and formally define ADO, and (2) How to design an adaptive distribution strategy guided by this definition? Through empirical analysis, we first confirm that properly adjusting the distribution significantly improves VPT performance, and further uncover a key insight that a nested relationship exists between ADO and VPT. Based on these findings, we propose a new VPT framework, termed PRO-VPT (iterative Prompt RelOcation-based VPT), which adaptively adjusts the distribution built upon a nested optimization formulation. Specifically, we develop a prompt relocation strategy derived from this formulation, comprising two steps: pruning idle prompts from prompt-saturated blocks, followed by allocating these prompts to the most prompt-needed blocks. By iteratively performing prompt relocation and VPT, our proposal can adaptively learn the optimal prompt distribution in a nested optimization-based manner, thereby unlocking the full potential of VPT. Extensive experiments demonstrate that our proposal significantly outperforms advanced VPT methods, e.g., PRO-VPT surpasses VPT by 1.6 pp and 2.0 pp average accuracy, leading prompt-based methods to state-of-the-art performance on VTAB-1k and FGVC benchmarks. The code is available at https://github.com/ckshang/PRO-VPT.
中文摘要:本文提出PRO-VPT框架,通过迭代式提示重定位策略自适应优化提示分布,显著提升视觉提示微调性能,在多个基准测试中达到最先进水平。
English Summary: This paper introduces PRO-VPT, an adaptive prompt distribution optimization framework that iteratively relocates prompts between model blocks to enhance visual prompt tuning performance, achieving state-of-the-art results on multiple benchmarks.

Authors:Xupeng Xie, Ruoyu Geng, Jun Ma, Boyu Zhou
Title: AKF-LIO: LiDAR-Inertial Odometry with Gaussian Map by Adaptive Kalman Filter
Abstract:
Existing LiDAR-Inertial Odometry (LIO) systems typically use sensor-specific or environment-dependent measurement covariances during state estimation, leading to laborious parameter tuning and suboptimal performance in challenging conditions (e.g., sensor degeneracy and noisy observations). Therefore, we propose an Adaptive Kalman Filter (AKF) framework that dynamically estimates time-varying noise covariances of LiDAR and Inertial Measurement Unit (IMU) measurements, enabling context-aware confidence weighting between sensors. During LiDAR degeneracy, the system prioritizes IMU data while suppressing contributions from unreliable inputs like moving objects or noisy point clouds. Furthermore, a compact Gaussian-based map representation is introduced to model environmental planarity and spatial noise. A correlated registration strategy ensures accurate plane normal estimation via pseudo-merge, even in unstructured environments like forests. Extensive experiments validate the robustness of the proposed system across diverse environments, including dynamic scenes and geometrically degraded scenarios. Our method achieves reliable localization results across all MARS-LVIG sequences and ranks 8th on the KITTI Odometry Benchmark. The code will be released at https://github.com/xpxie/AKF-LIO.git.
中文摘要:该研究提出的自适应卡尔曼滤波框架能动态估计激光雷达和惯性测量单元的噪声协方差,通过环境感知的传感器置信度加权和紧凑地图表示,在各类复杂场景中实现了鲁棒的定位性能。
English Summary: The proposed Adaptive Kalman Filter framework dynamically adjusts LiDAR and IMU noise covariances to enable robust sensor fusion, achieving superior performance in challenging conditions through adaptive confidence weighting and environmental modeling.

Authors:Mengting Ai, Tianxin Wei, Yifan Chen, Zhichen Zeng, Ritchie Zhao, Girish Varatkar, Bita Darvish Rouhani, Xianfeng Tang, Hanghang Tong, Jingrui He
Title: ResMoE: Space-efficient Compression of Mixture of Experts LLMs via Residual Restoration
Abstract:
Mixture-of-Experts (MoE) Transformer, the backbone architecture of multiple phenomenal language models, leverages sparsity by activating only a fraction of model parameters for each input token. The sparse structure, while allowing constant time costs, results in space inefficiency: we still need to load all the model parameters during inference. We introduce ResMoE, an innovative MoE approximation framework that utilizes Wasserstein barycenter to extract a common expert (barycenter expert) and approximate the residuals between this barycenter expert and the original ones. ResMoE enhances the space efficiency for inference of large-scale MoE Transformers in a one-shot and data-agnostic manner without retraining while maintaining minimal accuracy loss, thereby paving the way for broader accessibility to large language models. We demonstrate the effectiveness of ResMoE through extensive experiments on Switch Transformer, Mixtral, and DeepSeekMoE models. The results show that ResMoE can reduce the number of parameters in an expert by up to 75% while maintaining comparable performance. The code is available at https://github.com/iDEA-iSAIL-Lab-UIUC/ResMoE.
中文: ResMoE是一种创新框架,通过利用Wasserstein质心减少专家参数高达75%,无需重新训练即可提升混合专家Transformer的空间效率,同时保持性能。
English: ResMoE is an innovative framework that improves space efficiency in Mixture-of-Experts Transformers by using Wasserstein barycenter to reduce parameters by up to 75% while maintaining performance, without requiring retraining.

Authors:Ta Duc Huy, Sen Kim Tran, Phan Nguyen, Nguyen Hoang Tran, Tran Bao Sam, Anton van den Hengel, Zhibin Liao, Johan W. Verjans, Minh-Son To, Vu Minh Hieu Phan
Title: Interactive Medical Image Analysis with Concept-based Similarity Reasoning
Abstract:
The ability to interpret and intervene model decisions is important for the adoption of computer-aided diagnosis methods in clinical workflows. Recent concept-based methods link the model predictions with interpretable concepts and modify their activation scores to interact with the model. However, these concepts are at the image level, which hinders the model from pinpointing the exact patches the concepts are activated. Alternatively, prototype-based methods learn representations from training image patches and compare these with test image patches, using the similarity scores for final class prediction. However, interpreting the underlying concepts of these patches can be challenging and often necessitates post-hoc guesswork. To address this issue, this paper introduces the novel Concept-based Similarity Reasoning network (CSR), which offers (i) patch-level prototype with intrinsic concept interpretation, and (ii) spatial interactivity. First, the proposed CSR provides localized explanation by grounding prototypes of each concept on image regions. Second, our model introduces novel spatial-level interaction, allowing doctors to engage directly with specific image areas, making it an intuitive and transparent tool for medical imaging. CSR improves upon prior state-of-the-art interpretable methods by up to 4.5\% across three biomedical datasets. Our code is released at https://github.com/tadeephuy/InteractCSR.
Chinese: 本文提出了基于概念的相似性推理网络(CSR),通过提供具有内在概念解释的补丁级原型和空间交互性,增强了医学影像的可解释性,并在三个生物医学数据集上比现有最优方法性能提升高达4.5%。
English: The paper introduces the Concept-based Similarity Reasoning network (CSR), which provides patch-level prototypes with intrinsic concept interpretation and spatial interactivity, improving interpretability in medical imaging and outperforming prior methods by up to 4.5% on biomedical datasets.

Authors:Junhao Zhang, Richong Zhang, Fanshuang Kong, Ziyang Miao, Yanhan Ye, Yaowei Zheng
Title: Lost-in-the-Middle in Long-Text Generation: Synthetic Dataset, Evaluation Framework, and Mitigation
Abstract:
Existing long-text generation methods primarily concentrate on producing lengthy texts from short inputs, neglecting the long-input and long-output tasks. Such tasks have numerous practical applications while lacking available benchmarks. Moreover, as the input grows in length, existing methods inevitably encounter the "lost-in-the-middle" phenomenon. In this paper, we first introduce a Long Input and Output Benchmark (LongInOutBench), including a synthetic dataset and a comprehensive evaluation framework, addressing the challenge of the missing benchmark. We then develop the Retrieval-Augmented Long-Text Writer (RAL-Writer), which retrieves and restates important yet overlooked content, mitigating the "lost-in-the-middle" issue by constructing explicit prompts. We finally employ the proposed LongInOutBench to evaluate our RAL-Writer against comparable baselines, and the results demonstrate the effectiveness of our approach. Our code has been released at https://github.com/OnlyAR/RAL-Writer.
中文: 本文提出了长输入输出基准LongInOutBench,并开发了RAL-Writer方法,通过检索和重述被忽略内容来解决"中间丢失"问题,评估结果验证了该方法的有效性。
English: This paper introduces LongInOutBench, a benchmark for long-input and long-output text generation tasks, and proposes RAL-Writer, a method that retrieves and restates overlooked content to address the "lost-in-the-middle" problem, demonstrating its effectiveness through evaluation.

Authors:Wanjing Huang, Tongjie Pan, Yalan Ye
Title: Graphormer-Guided Task Planning: Beyond Static Rules with LLM Safety Perception
Abstract:
Recent advancements in large language models (LLMs) have expanded their role in robotic task planning. However, while LLMs have been explored for generating feasible task sequences, their ability to ensure safe task execution remains underdeveloped. Existing methods struggle with structured risk perception, making them inadequate for safety-critical applications where low-latency hazard adaptation is required. To address this limitation, we propose a Graphormer-enhanced risk-aware task planning framework that combines LLM-based decision-making with structured safety modeling. Our approach constructs a dynamic spatio-semantic safety graph, capturing spatial and contextual risk factors to enable online hazard detection and adaptive task refinement. Unlike existing methods that rely on predefined safety constraints, our framework introduces a context-aware risk perception module that continuously refines safety predictions based on real-time task execution. This enables a more flexible and scalable approach to robotic planning, allowing for adaptive safety compliance beyond static rules. To validate our framework, we conduct experiments in the AI2-THOR environment. The experiments results validates improvements in risk detection accuracy, rising safety notice, and task adaptability of our framework in continuous environments compared to static rule-based and LLM-only baselines. Our project is available at https://github.com/hwj20/GGTP
中文摘要: 本文提出了一种基于Graphormer的风险感知任务规划框架,通过将大语言模型决策与动态安全建模相结合,实现了实时危险检测和自适应任务优化,从而显著提升了机器人在连续环境中的安全性能。
English Summary: This paper introduces a Graphormer-enhanced risk-aware task planning framework that integrates LLM-based decision-making with dynamic safety modeling to improve robotic safety through real-time hazard detection and adaptive task refinement.

Authors:Sungsik Kim, Janghyun Baek, Jinkyu Kim, Jaekoo Lee
Title: GUIDE-CoT: Goal-driven and User-Informed Dynamic Estimation for Pedestrian Trajectory using Chain-of-Thought
Abstract:
While Large Language Models (LLMs) have recently shown impressive results in reasoning tasks, their application to pedestrian trajectory prediction remains challenging due to two key limitations: insufficient use of visual information and the difficulty of predicting entire trajectories. To address these challenges, we propose Goal-driven and User-Informed Dynamic Estimation for pedestrian trajectory using Chain-of-Thought (GUIDE-CoT). Our approach integrates two innovative modules: (1) a goal-oriented visual prompt, which enhances goal prediction accuracy combining visual prompts with a pretrained visual encoder, and (2) a chain-of-thought (CoT) LLM for trajectory generation, which generates realistic trajectories toward the predicted goal. Moreover, our method introduces controllable trajectory generation, allowing for flexible and user-guided modifications to the predicted paths. Through extensive experiments on the ETH/UCY benchmark datasets, our method achieves state-of-the-art performance, delivering both high accuracy and greater adaptability in pedestrian trajectory prediction. Our code is publicly available at https://github.com/ai-kmu/GUIDE-CoT.
Chinese Summary: 提出的GUIDE-CoT方法通过整合目标导向的视觉提示和思维链推理,克服了大语言模型在行人轨迹预测中的视觉信息利用不足和轨迹生成困难等局限,实现了可控路径生成和最先进的预测性能。
English Summary: The proposed GUIDE-CoT method overcomes LLMs' limitations in pedestrian trajectory prediction by integrating goal-oriented visual prompts and chain-of-thought reasoning, achieving state-of-the-art performance with controllable path generation.

Authors:Siyu Li, Yihong Cao, Hao Shi, Yongsheng Zang, Xuan He, Kailun Yang, Zhiyong Li
Title: HierDAMap: Towards Universal Domain Adaptive BEV Mapping via Hierarchical Perspective Priors
Abstract:
The exploration of Bird's-Eye View (BEV) mapping technology has driven significant innovation in visual perception technology for autonomous driving. BEV mapping models need to be applied to the unlabeled real world, making the study of unsupervised domain adaptation models an essential path. However, research on unsupervised domain adaptation for BEV mapping remains limited and cannot perfectly accommodate all BEV mapping tasks. To address this gap, this paper proposes HierDAMap, a universal and holistic BEV domain adaptation framework with hierarchical perspective priors. Unlike existing research that solely focuses on image-level learning using prior knowledge, this paper explores the guiding role of perspective prior knowledge across three distinct levels: global, sparse, and instance levels. With these priors, HierDA consists of three essential components, including Semantic-Guided Pseudo Supervision (SGPS), Dynamic-Aware Coherence Learning (DACL), and Cross-Domain Frustum Mixing (CDFM). SGPS constrains the cross-domain consistency of perspective feature distribution through pseudo labels generated by vision foundation models in 2D space. To mitigate feature distribution discrepancies caused by spatial variations, DACL employs uncertainty-aware predicted depth as an intermediary to derive dynamic BEV labels from perspective pseudo-labels, thereby constraining the coarse BEV features derived from corresponding perspective features. CDFM, on the other hand, leverages perspective masks of view frustum to mix multi-view perspective images from both domains, which guides cross-domain view transformation and encoding learning through mixed BEV labels. The proposed method is verified on multiple BEV mapping tasks, such as BEV semantic segmentation, high-definition semantic, and vectorized mapping. The source code will be made publicly available at https://github.com/lynn-yu/HierDAMap.
中文摘要:本文提出HierDAMap框架,通过全局、稀疏和实例三个层次整合视角先验知识,解决鸟瞰图映射在无标签真实场景中的跨域适应问题,显著提升自动驾驶视觉感知的泛化能力。
English Summary: This paper introduces HierDAMap, a hierarchical domain adaptation framework that enhances BEV mapping for autonomous driving by incorporating perspective priors at global, sparse, and instance levels to address cross-domain challenges in unlabeled environments.

Authors:Wenxuan Huang, Bohan Jia, Zijie Zhai, Shaosheng Cao, Zheyu Ye, Fei Zhao, Zhe Xu, Yao Hu, Shaohui Lin
Title: Vision-R1: Incentivizing Reasoning Capability in Multimodal Large Language Models
Abstract:
DeepSeek-R1-Zero has successfully demonstrated the emergence of reasoning capabilities in LLMs purely through Reinforcement Learning (RL). Inspired by this breakthrough, we explore how RL can be utilized to enhance the reasoning capability of MLLMs. However, direct training with RL struggles to activate complex reasoning capabilities such as questioning and reflection in MLLMs, due to the absence of substantial high-quality multimodal reasoning data. To address this issue, we propose the reasoning MLLM, Vision-R1, to improve multimodal reasoning capability. Specifically, we first construct a high-quality multimodal CoT dataset without human annotations by leveraging an existing MLLM and DeepSeek-R1 through modality bridging and data filtering to obtain a 200K multimodal CoT dataset, Vision-R1-cold dataset. It serves as cold-start initialization data for Vision-R1. To mitigate the optimization challenges caused by overthinking after cold start, we propose Progressive Thinking Suppression Training (PTST) strategy and employ Group Relative Policy Optimization (GRPO) with the hard formatting result reward function to gradually refine the model's ability to learn correct and complex reasoning processes on a 10K multimodal math dataset. Comprehensive experiments show our model achieves an average improvement of $\sim$6% across various multimodal math reasoning benchmarks. Vision-R1-7B achieves a 73.5% accuracy on the widely used MathVista benchmark, which is only 0.4% lower than the leading reasoning model, OpenAI O1. The datasets and code will be released in: https://github.com/Osilly/Vision-R1 .
中文: Vision-R1通过构建无需人工标注的20万规模多模态思维链数据集,并采用渐进式训练策略,有效提升了多模态数学推理能力,在多个基准测试中表现优异。
English: Vision-R1 is a multimodal reasoning model that enhances reasoning capabilities by creating a high-quality 200K multimodal CoT dataset without human annotation and employing progressive training strategies, achieving significant improvements on math reasoning benchmarks.

Authors:Hantao Zhang, Yuhe Liu, Jiancheng Yang, Weidong Guo, Xinyuan Wang, Pascal Fua
Title: DiffAtlas: GenAI-fying Atlas Segmentation via Image-Mask Diffusion
Abstract:
Accurate medical image segmentation is crucial for precise anatomical delineation. Deep learning models like U-Net have shown great success but depend heavily on large datasets and struggle with domain shifts, complex structures, and limited training samples. Recent studies have explored diffusion models for segmentation by iteratively refining masks. However, these methods still retain the conventional image-to-mask mapping, making them highly sensitive to input data, which hampers stability and generalization. In contrast, we introduce DiffAtlas, a novel generative framework that models both images and masks through diffusion during training, effectively ``GenAI-fying'' atlas-based segmentation. During testing, the model is guided to generate a specific target image-mask pair, from which the corresponding mask is obtained. DiffAtlas retains the robustness of the atlas paradigm while overcoming its scalability and domain-specific limitations. Extensive experiments on CT and MRI across same-domain, cross-modality, varying-domain, and different data-scale settings using the MMWHS and TotalSegmentator datasets demonstrate that our approach outperforms existing methods, particularly in limited-data and zero-shot modality segmentation. Code is available at https://github.com/M3DV/DiffAtlas.
中文: DiffAtlas提出了一种新颖的生成框架,在训练中通过扩散过程同时建模图像和掩码,显著提升了医学图像分割的鲁棒性和泛化能力,尤其在数据有限和跨模态场景下优于现有方法。
English: DiffAtlas introduces a generative framework that models both images and masks using diffusion during training, enhancing robustness and generalization in medical image segmentation, particularly outperforming existing methods in limited-data and cross-modality scenarios.

Authors:Ming Zhang, Yuhui Wang, Yujiong Shen, Tingyi Yang, Changhao Jiang, Yilong Wu, Shihan Dou, Qinhao Chen, Zhiheng Xi, Zhihao Zhang, Yi Dong, Zhen Wang, Zhihui Fei, Mingyang Wan, Tao Liang, Guojun Ma, Qi Zhang, Tao Gui, Xuanjing Huang
Title: PFDial: A Structured Dialogue Instruction Fine-tuning Method Based on UML Flowcharts
Abstract:
Process-driven dialogue systems, which operate under strict predefined process constraints, are essential in customer service and equipment maintenance scenarios. Although Large Language Models (LLMs) have shown remarkable progress in dialogue and reasoning, they still struggle to solve these strictly constrained dialogue tasks. To address this challenge, we construct Process Flow Dialogue (PFDial) dataset, which contains 12,705 high-quality Chinese dialogue instructions derived from 440 flowcharts containing 5,055 process nodes. Based on PlantUML specification, each UML flowchart is converted into atomic dialogue units i.e., structured five-tuples. Experimental results demonstrate that a 7B model trained with merely 800 samples, and a 0.5B model trained on total data both can surpass 90% accuracy. Additionally, the 8B model can surpass GPT-4o up to 43.88% with an average of 11.00%. We further evaluate models' performance on challenging backward transitions in process flows and conduct an in-depth analysis of various dataset formats to reveal their impact on model performance in handling decision and sequential branches. The data is released in https://github.com/KongLongGeFDU/PFDial.
中文: PFDial数据集基于440个UML流程图构建了12,705条中文对话指令,实验表明仅用800样本训练的7B模型和全量训练的0.5B模型均能突破90%准确率,在流程驱动对话任务中最高可超越GPT-4o达43.88%。
English: The PFDial dataset, comprising 12,705 Chinese dialogue instructions derived from 440 UML flowcharts, enables small models like 7B and 0.5B to achieve over 90% accuracy in process-driven dialogue tasks, even surpassing GPT-4o by up to 43.88%.

Authors:Yuchen Yan, Yongliang Shen, Yang Liu, Jin Jiang, Mengdi Zhang, Jian Shao, Yueting Zhuang
Title: InftyThink: Breaking the Length Limits of Long-Context Reasoning in Large Language Models
Abstract:
Advanced reasoning in large language models has achieved remarkable performance on challenging tasks, but the prevailing long-context reasoning paradigm faces critical limitations: quadratic computational scaling with sequence length, reasoning constrained by maximum context boundaries, and performance degradation beyond pre-training context windows. Existing approaches primarily compress reasoning chains without addressing the fundamental scaling problem. To overcome these challenges, we introduce InftyThink, a paradigm that transforms monolithic reasoning into an iterative process with intermediate summarization. By interleaving short reasoning segments with concise progress summaries, our approach enables unbounded reasoning depth while maintaining bounded computational costs. This creates a characteristic sawtooth memory pattern that significantly reduces computational complexity compared to traditional approaches. Furthermore, we develop a methodology for reconstructing long-context reasoning datasets into our iterative format, transforming OpenR1-Math into 333K training instances. Experiments across multiple model architectures demonstrate that our approach reduces computational costs while improving performance, with Qwen2.5-Math-7B showing 3-13% improvements across MATH500, AIME24, and GPQA_diamond benchmarks. Our work challenges the assumed trade-off between reasoning depth and computational efficiency, providing a more scalable approach to complex reasoning without architectural modifications.
中文: InftyThink通过中间摘要的迭代推理范式,在保持有限计算成本的同时实现无限推理深度,无需架构修改即可在多个基准测试中提升性能表现。
English: InftyThink introduces an iterative reasoning paradigm with intermediate summarization that enables unbounded reasoning depth while maintaining bounded computational costs, achieving performance improvements across multiple benchmarks without architectural modifications.

Authors:Yuchen Yan, Yongliang Shen, Yang Liu, Jin Jiang, Mengdi Zhang, Jian Shao, Yueting Zhuang
Title: InftyThink: Breaking the Length Limits of Long-Context Reasoning in Large Language Models
Abstract:
Advanced reasoning in large language models has achieved remarkable performance on challenging tasks, but the prevailing long-context reasoning paradigm faces critical limitations: quadratic computational scaling with sequence length, reasoning constrained by maximum context boundaries, and performance degradation beyond pre-training context windows. Existing approaches primarily compress reasoning chains without addressing the fundamental scaling problem. To overcome these challenges, we introduce InftyThink, a paradigm that transforms monolithic reasoning into an iterative process with intermediate summarization. By interleaving short reasoning segments with concise progress summaries, our approach enables unbounded reasoning depth while maintaining bounded computational costs. This creates a characteristic sawtooth memory pattern that significantly reduces computational complexity compared to traditional approaches. Furthermore, we develop a methodology for reconstructing long-context reasoning datasets into our iterative format, transforming OpenR1-Math into 333K training instances. Experiments across multiple model architectures demonstrate that our approach reduces computational costs while improving performance, with Qwen2.5-Math-7B showing 3-13% improvements across MATH500, AIME24, and GPQA_diamond benchmarks. Our work challenges the assumed trade-off between reasoning depth and computational efficiency, providing a more scalable approach to complex reasoning without architectural modifications.
中文: InftyThink通过中间摘要的迭代推理范式,在保持有限计算成本的同时实现无限推理深度,无需架构修改即可在多个基准测试中提升性能表现。
English: InftyThink introduces an iterative reasoning paradigm with intermediate summarization that enables unbounded reasoning depth while maintaining bounded computational costs, achieving performance improvements across multiple benchmarks without architectural modifications.

Authors:Hantao Zhou, Rui Yang, Longxiang Tang, Guanyi Qin, Runze Hu, Xiu Li
Title: Gamma: Toward Generic Image Assessment with Mixture of Assessment Experts
Abstract:
Image assessment aims to evaluate the quality and aesthetics of images and has been applied across various scenarios, such as natural and AIGC scenes. Existing methods mostly address these sub-tasks or scenes individually. While some works attempt to develop unified image assessment models, they have struggled to achieve satisfactory performance or cover a broad spectrum of assessment scenarios. In this paper, we present \textbf{Gamma}, a \textbf{G}eneric im\textbf{A}ge assess\textbf{M}ent model using \textbf{M}ixture of \textbf{A}ssessment Experts, which can effectively assess images from diverse scenes through mixed-dataset training. Achieving unified training in image assessment presents significant challenges due to annotation biases across different datasets. To address this issue, we first propose a Mixture of Assessment Experts (MoAE) module, which employs shared and adaptive experts to dynamically learn common and specific knowledge for different datasets, respectively. In addition, we introduce a Scene-based Differential Prompt (SDP) strategy, which uses scene-specific prompts to provide prior knowledge and guidance during the learning process, further boosting adaptation for various scenes. Our Gamma model is trained and evaluated on 12 datasets spanning 6 image assessment scenarios. Extensive experiments show that our unified Gamma outperforms other state-of-the-art mixed-training methods by significant margins while covering more scenes. Codes are available at https://github.com/zht8506/Gamma.
中文摘要:本文提出Gamma模型,通过混合评估专家模块和场景差异提示策略,实现了跨多种场景的统一图像质量评估,在覆盖更广场景的同时显著优于现有混合训练方法。
English Summary: The paper introduces Gamma, a unified image assessment model that employs a Mixture of Assessment Experts and Scene-based Differential Prompts to effectively evaluate images across diverse scenarios, outperforming existing methods while covering more scenes.

Authors:AgiBot-World-Contributors, Qingwen Bu, Jisong Cai, Li Chen, Xiuqi Cui, Yan Ding, Siyuan Feng, Shenyuan Gao, Xindong He, Xuan Hu, Xu Huang, Shu Jiang, Yuxin Jiang, Cheng Jing, Hongyang Li, Jialu Li, Chiming Liu, Yi Liu, Yuxiang Lu, Jianlan Luo, Ping Luo, Yao Mu, Yuehan Niu, Yixuan Pan, Jiangmiao Pang, Yu Qiao, Guanghui Ren, Cheng Ruan, Jiaqi Shan, Yongjian Shen, Chengshi Shi, Mingkang Shi, Modi Shi, Chonghao Sima, Jianheng Song, Huijie Wang, Wenhao Wang, Dafeng Wei, Chengen Xie, Guo Xu, Junchi Yan, Cunbiao Yang, Lei Yang, Shukai Yang, Maoqing Yao, Jia Zeng, Chi Zhang, Qinglin Zhang, Bin Zhao, Chengyue Zhao, Jiaqi Zhao, Jianchao Zhu
Title: AgiBot World Colosseo: A Large-scale Manipulation Platform for Scalable and Intelligent Embodied Systems
Abstract:
We explore how scalable robot data can address real-world challenges for generalized robotic manipulation. Introducing AgiBot World, a large-scale platform comprising over 1 million trajectories across 217 tasks in five deployment scenarios, we achieve an order-of-magnitude increase in data scale compared to existing datasets. Accelerated by a standardized collection pipeline with human-in-the-loop verification, AgiBot World guarantees high-quality and diverse data distribution. It is extensible from grippers to dexterous hands and visuo-tactile sensors for fine-grained skill acquisition. Building on top of data, we introduce Genie Operator-1 (GO-1), a novel generalist policy that leverages latent action representations to maximize data utilization, demonstrating predictable performance scaling with increased data volume. Policies pre-trained on our dataset achieve an average performance improvement of 30% over those trained on Open X-Embodiment, both in in-domain and out-of-distribution scenarios. GO-1 exhibits exceptional capability in real-world dexterous and long-horizon tasks, achieving over 60% success rate on complex tasks and outperforming prior RDT approach by 32%. By open-sourcing the dataset, tools, and models, we aim to democratize access to large-scale, high-quality robot data, advancing the pursuit of scalable and general-purpose intelligence.
中文: 本研究推出了AgiBot World大规模机器人数据集平台和GO-1通用策略,通过利用海量高质量数据显著提升了机器人操作性能,在复杂任务中表现优异。
English: This research introduces AgiBot World, a large-scale platform with over 1 million robot trajectories, and GO-1, a generalist policy that significantly improves robotic manipulation performance by leveraging extensive, high-quality data.

Authors:Wenxin Ma, Xu Zhang, Qingsong Yao, Fenghe Tang, Chenxu Wu, Yingtai Li, Rui Yan, Zihang Jiang, S. Kevin Zhou
Title: AA-CLIP: Enhancing Zero-shot Anomaly Detection via Anomaly-Aware CLIP
Abstract:
Anomaly detection (AD) identifies outliers for applications like defect and lesion detection. While CLIP shows promise for zero-shot AD tasks due to its strong generalization capabilities, its inherent Anomaly-Unawareness leads to limited discrimination between normal and abnormal features. To address this problem, we propose Anomaly-Aware CLIP (AA-CLIP), which enhances CLIP's anomaly discrimination ability in both text and visual spaces while preserving its generalization capability. AA-CLIP is achieved through a straightforward yet effective two-stage approach: it first creates anomaly-aware text anchors to differentiate normal and abnormal semantics clearly, then aligns patch-level visual features with these anchors for precise anomaly localization. This two-stage strategy, with the help of residual adapters, gradually adapts CLIP in a controlled manner, achieving effective AD while maintaining CLIP's class knowledge. Extensive experiments validate AA-CLIP as a resource-efficient solution for zero-shot AD tasks, achieving state-of-the-art results in industrial and medical applications. The code is available at https://github.com/Mwxinnn/AA-CLIP.
中文: 提出的异常感知CLIP(AA-CLIP)通过文本锚点和视觉特征对齐的两阶段方法增强CLIP的异常识别能力,在工业和医疗应用的零样本异常检测中取得了最先进的结果。
English: The proposed Anomaly-Aware CLIP (AA-CLIP) enhances CLIP's anomaly discrimination through a two-stage approach using text anchors and visual feature alignment, achieving state-of-the-art zero-shot anomaly detection results in industrial and medical applications.

Authors:Yu Zhou, Bingyan Liu
Title: BTFL: A Bayesian-based Test-Time Generalization Method for Internal and External Data Distributions in Federated learning
Abstract:
Federated Learning (FL) enables multiple clients to collaboratively develop a global model while maintaining data privacy. However, online FL deployment faces challenges due to distribution shifts and evolving test samples. Personalized Federated Learning (PFL) tailors the global model to individual client distributions, but struggles with Out-Of-Distribution (OOD) samples during testing, leading to performance degradation. In real-world scenarios, balancing personalization and generalization during online testing is crucial and existing methods primarily focus on training-phase generalization. To address the test-time trade-off, we introduce a new scenario: Test-time Generalization for Internal and External Distributions in Federated Learning (TGFL), which evaluates adaptability under Internal Distribution (IND) and External Distribution (EXD). We propose BTFL, a Bayesian-based test-time generalization method for TGFL, which balances generalization and personalization at the sample level during testing. BTFL employs a two-head architecture to store local and global knowledge, interpolating predictions via a dual-Bayesian framework that considers both historical test data and current sample characteristics with theoretical guarantee and faster speed. Our experiments demonstrate that BTFL achieves improved performance across various datasets and models with less time cost. The source codes are made publicly available at https://github.com/ZhouYuCS/BTFL .
中文摘要:联邦学习面临分布偏移和测试时适应性挑战,BTFL方法通过贝叶斯框架在测试阶段平衡个性化与泛化能力,在不同数据集上实现了更好性能。
English Summary: Federated Learning faces challenges with distribution shifts and test-time adaptability, which the proposed BTFL method addresses by balancing personalization and generalization using a Bayesian framework for improved performance across datasets.

Authors:Jinmyeong An, Sangwon Ryu, Heejin Do, Yunsu Kim, Jungseul Ok, Gary Geunbae Lee
Title: Revisiting Early Detection of Sexual Predators via Turn-level Optimization
Abstract:
Online grooming is a severe social threat where sexual predators gradually entrap child victims with subtle and gradual manipulation. Therefore, timely intervention for online grooming is critical for proactive protection. However, previous methods fail to determine the optimal intervention points (i.e., jump to conclusions) as they rely on chat-level risk labels by causing weak supervision of risky utterances. For timely detection, we propose speed control reinforcement learning (SCoRL) (The code and supplementary materials are available at https://github.com/jinmyeongAN/SCoRL), incorporating a practical strategy derived from luring communication theory (LCT). To capture the predator's turn-level entrapment, we use a turn-level risk label based on the LCT. Then, we design a novel speed control reward function that balances the trade-off between speed and accuracy based on turn-level risk label; thus, SCoRL can identify the optimal intervention moment. In addition, we introduce a turn-level metric for precise evaluation, identifying limitations in previously used chat-level metrics. Experimental results show that SCoRL effectively preempted online grooming, offering a more proactive and timely solution. Further analysis reveals that our method enhances performance while intuitively identifying optimal early intervention points.
中文摘要:本研究提出的SCoRL方法通过引入话轮级风险评估和速度控制奖励机制,能主动识别网络诱骗的最佳干预时机,有效解决了以往基于聊天级标签方法导致的监管滞后问题。
English Summary: The proposed SCoRL method uses reinforcement learning with turn-level risk assessment to proactively identify optimal intervention moments in online grooming, overcoming previous methods' reliance on chat-level labels that caused delayed detection.

Authors:Hasan Abed Al Kader Hammoud, Bernard Ghanem
Title: DiffCLIP: Differential Attention Meets CLIP
Abstract:
We propose DiffCLIP, a novel vision-language model that extends the differential attention mechanism to CLIP architectures. Differential attention was originally developed for large language models to amplify relevant context while canceling out noisy information. In this work, we integrate this mechanism into CLIP's dual encoder (image and text) framework. With minimal additional parameters, DiffCLIP achieves superior performance on image-text understanding tasks. Across zero-shot classification, retrieval, and robustness benchmarks, DiffCLIP consistently outperforms baseline CLIP models. Notably, these gains come with negligible computational overhead, demonstrating that differential attention can significantly enhance multi-modal representations without sacrificing efficiency. Code can be found at https://github.com/hammoudhasan/DiffCLIP.
中文摘要:DiffCLIP通过在CLIP模型中引入差分注意力机制,以极少的计算开销显著提升了图像-文本理解任务的性能,同时保持了高效性。
English Summary: DiffCLIP enhances the CLIP model by integrating a differential attention mechanism, which improves performance on image-text tasks with minimal computational cost and no efficiency loss.

Authors:Chaocan Xue, Bineng Zhong, Qihua Liang, Yaozong Zheng, Ning Li, Yuanliang Xue, Shuxiang Song
Title: Similarity-Guided Layer-Adaptive Vision Transformer for UAV Tracking
Abstract:
Vision transformers (ViTs) have emerged as a popular backbone for visual tracking. However, complete ViT architectures are too cumbersome to deploy for unmanned aerial vehicle (UAV) tracking which extremely emphasizes efficiency. In this study, we discover that many layers within lightweight ViT-based trackers tend to learn relatively redundant and repetitive target representations. Based on this observation, we propose a similarity-guided layer adaptation approach to optimize the structure of ViTs. Our approach dynamically disables a large number of representation-similar layers and selectively retains only a single optimal layer among them, aiming to achieve a better accuracy-speed trade-off. By incorporating this approach into existing ViTs, we tailor previously complete ViT architectures into an efficient similarity-guided layer-adaptive framework, namely SGLATrack, for real-time UAV tracking. Extensive experiments on six tracking benchmarks verify the effectiveness of the proposed approach, and show that our SGLATrack achieves a state-of-the-art real-time speed while maintaining competitive tracking precision. Codes and models are available at https://github.com/GXNU-ZhongLab/SGLATrack.
中文: 针对无人机跟踪,本研究通过相似性引导的层自适应方法精简视觉变换器结构,动态去除冗余层并保留最优单层,实现了高速实时跟踪且保持高精度。
English: Vision transformers are optimized for UAV tracking by dynamically disabling redundant layers and retaining only the most effective one, achieving high-speed real-time performance without compromising accuracy.

Authors:Xiaohai Li, Bineng Zhong, Qihua Liang, Zhiyi Mo, Jian Nong, Shuxiang Song
Title: Dynamic Updates for Language Adaptation in Visual-Language Tracking
Abstract:
The consistency between the semantic information provided by the multi-modal reference and the tracked object is crucial for visual-language (VL) tracking. However, existing VL tracking frameworks rely on static multi-modal references to locate dynamic objects, which can lead to semantic discrepancies and reduce the robustness of the tracker. To address this issue, we propose a novel vision-language tracking framework, named DUTrack, which captures the latest state of the target by dynamically updating multi-modal references to maintain consistency. Specifically, we introduce a Dynamic Language Update Module, which leverages a large language model to generate dynamic language descriptions for the object based on visual features and object category information. Then, we design a Dynamic Template Capture Module, which captures the regions in the image that highly match the dynamic language descriptions. Furthermore, to ensure the efficiency of description generation, we design an update strategy that assesses changes in target displacement, scale, and other factors to decide on updates. Finally, the dynamic template and language descriptions that record the latest state of the target are used to update the multi-modal references, providing more accurate reference information for subsequent inference and enhancing the robustness of the tracker. DUTrack achieves new state-of-the-art performance on four mainstream vision-language and two vision-only tracking benchmarks, including LaSOT, LaSOT$_{\rm{ext}}$, TNL2K, OTB99-Lang, GOT-10K, and UAV123. Code and models are available at https://github.com/GXNU-ZhongLab/DUTrack.
中文: DUTrack提出了一种动态视觉语言跟踪框架,通过大语言模型生成动态描述和自适应模板更新多模态参考信息,以保持语义一致性并提升跟踪器鲁棒性,在多个基准测试中取得了最优性能。
English: DUTrack introduces a dynamic vision-language tracking framework that updates multi-modal references using a large language model and adaptive templates to maintain semantic consistency and enhance tracker robustness, achieving state-of-the-art results on multiple benchmarks.

Authors:Ruchi Bhatt, Shreya Bansal, Amanpreet Chander, Rupinder Kaur, Malya Singh, Mohan Kankanhalli, Abdulmotaleb El Saddik, Mukesh Kumar Saini
Title: GroMo: Plant Growth Modeling with Multiview Images
Abstract:
Understanding plant growth dynamics is essential for applications in agriculture and plant phenotyping. We present the Growth Modelling (GroMo) challenge, which is designed for two primary tasks: (1) plant age prediction and (2) leaf count estimation, both essential for crop monitoring and precision agriculture. For this challenge, we introduce GroMo25, a dataset with images of four crops: radish, okra, wheat, and mustard. Each crop consists of multiple plants (p1, p2, ..., pn) captured over different days (d1, d2, ..., dm) and categorized into five levels (L1, L2, L3, L4, L5). Each plant is captured from 24 different angles with a 15-degree gap between images. Participants are required to perform both tasks for all four crops with these multiview images. We proposed a Multiview Vision Transformer (MVVT) model for the GroMo challenge and evaluated the crop-wise performance on GroMo25. MVVT reports an average MAE of 7.74 for age prediction and an MAE of 5.52 for leaf count. The GroMo Challenge aims to advance plant phenotyping research by encouraging innovative solutions for tracking and predicting plant growth. The GitHub repository is publicly available at https://github.com/mriglab/GroMo-Plant-Growth-Modeling-with-Multiview-Images.
中文: GroMo挑战赛通过提供多视角图像数据集,提出植物年龄预测和叶片计数两项任务以推动农业表型研究,其MVVT模型实现了较低误差率。
English: The GroMo Challenge introduces a multiview image dataset and tasks for predicting plant age and leaf count to advance agricultural phenotyping, with a proposed MVVT model achieving low error rates.

Authors:Yingfeng Luo, Tong Zheng, Yongyu Mu, Bei Li, Qinghong Zhang, Yongqi Gao, Ziqiang Xu, Peinan Feng, Xiaoqian Liu, Tong Xiao, Jingbo Zhu
Title: Beyond Decoder-only: Large Language Models Can be Good Encoders for Machine Translation
Abstract:
The field of neural machine translation (NMT) has changed with the advent of large language models (LLMs). Much of the recent emphasis in natural language processing (NLP) has been on modeling machine translation and many other problems using a single pre-trained Transformer decoder, while encoder-decoder architectures, which were the standard in earlier NMT models, have received relatively less attention. In this paper, we explore translation models that are universal, efficient, and easy to optimize, by marrying the world of LLMs with the world of NMT. We apply LLMs to NMT encoding and leave the NMT decoder unchanged. We also develop methods for adapting LLMs to work better with the NMT decoder. Furthermore, we construct a new dataset involving multiple tasks to assess how well the machine translation system generalizes across various tasks. Evaluations on the WMT and our datasets show that results using our method match or surpass a range of baselines in terms of translation quality, but achieve $2.4 \sim 6.5 \times$ inference speedups and a $75\%$ reduction in the memory footprint of the KV cache. It also demonstrates strong generalization across a variety of translation-related tasks.
中文摘要:本研究将大型语言模型与神经机器翻译相结合,开发出高效通用的翻译系统,在保持跨任务强泛化能力的同时,实现了更优的翻译质量、更快的推理速度和更低的内存占用。
English Summary: This study integrates large language models with neural machine translation to create efficient and universally applicable translation systems, achieving superior translation quality, faster inference speeds, and reduced memory usage while maintaining strong generalization across tasks.

Authors:Yixin Yang, Yang Zhou, Hui Huang
Title: Introducing Unbiased Depth into 2D Gaussian Splatting for High-accuracy Surface Reconstruction
Abstract:
Recently, 2D Gaussian Splatting (2DGS) has demonstrated superior geometry reconstruction quality than the popular 3DGS by using 2D surfels to approximate thin surfaces. However, it falls short when dealing with glossy surfaces, resulting in visible holes in these areas. We find that the reflection discontinuity causes the issue. To fit the jump from diffuse to specular reflection at different viewing angles, depth bias is introduced in the optimized Gaussian primitives. To address that, we first replace the depth distortion loss in 2DGS with a novel depth convergence loss, which imposes a strong constraint on depth continuity. Then, we rectify the depth criterion in determining the actual surface, which fully accounts for all the intersecting Gaussians along the ray. Qualitative and quantitative evaluations across various datasets reveal that our method significantly improves reconstruction quality, with more complete and accurate surfaces than 2DGS. Code is available at https://github.com/XiaoXinyyx/Unbiased_Surfel.
中文摘要:本文针对二维高斯泼溅技术在光泽表面重建中的不足,提出新型深度收敛损失函数和改进的深度判定准则,相比原有方法显著提升了表面重建的完整度与精确度。
English Summary: This paper introduces a novel depth convergence loss and refined depth criterion to address 2D Gaussian Splatting's limitations in reconstructing glossy surfaces, significantly improving reconstruction completeness and accuracy compared to previous methods.

Authors:Yuxiang Zhang, Yuqi Yang, Jiangming Shu, Xinyan Wen, Jitao Sang
Title: Agent models: Internalizing Chain-of-Action Generation into Reasoning models
Abstract:
Traditional agentic workflows rely on external prompts to manage interactions with tools and the environment, which limits the autonomy of reasoning models. We position \emph{Large Agent Models (LAMs)} that internalize the generation of \emph{Chain-of-Action (CoA)}, enabling the model to autonomously decide when and how to use external tools. Our proposed AutoCoA framework combines supervised fine-tuning (SFT) and reinforcement learning (RL), allowing the model to seamlessly switch between reasoning and action while efficiently managing environment interactions. Main components include step-level action triggering, trajectory-level CoA optimization, and an internal world model to reduce real-environment interaction costs. Evaluations on open-domain QA tasks demonstrate that AutoCoA-trained agent models significantly outperform ReAct-based workflows in task completion, especially in tasks that require long-term reasoning and multi-step actions. Code and dataset are available at https://github.com/ADaM-BJTU/AutoCoA
中文:AutoCoA框架通过内化行动链生成增强智能体自主性,实现推理与行动的无缝切换,并通过优化环境交互在复杂任务中显著超越基于ReAct的方法。
English: The AutoCoA framework enhances agent autonomy by internalizing Chain-of-Action generation, enabling seamless reasoning-action switching and outperforming ReAct-based methods in complex tasks through optimized environment interaction.

Authors:Qiyuan He, Angela Yao
Title: Conceptrol: Concept Control of Zero-shot Personalized Image Generation
Abstract:
Personalized image generation with text-to-image diffusion models generates unseen images based on reference image content. Zero-shot adapter methods such as IP-Adapter and OminiControl are especially interesting because they do not require test-time fine-tuning. However, they struggle to balance preserving personalized content and adherence to the text prompt. We identify a critical design flaw resulting in this performance gap: current adapters inadequately integrate personalization images with the textual descriptions. The generated images, therefore, replicate the personalized content rather than adhere to the text prompt instructions. Yet the base text-to-image has strong conceptual understanding capabilities that can be leveraged. We propose Conceptrol, a simple yet effective framework that enhances zero-shot adapters without adding computational overhead. Conceptrol constrains the attention of visual specification with a textual concept mask that improves subject-driven generation capabilities. It achieves as much as 89% improvement on personalization benchmarks over the vanilla IP-Adapter and can even outperform fine-tuning approaches such as Dreambooth LoRA. The source code is available at https://github.com/QY-H00/Conceptrol.
中文: Conceptrol是一种创新框架,通过引入文本概念掩码增强零样本适配器,在无需额外计算开销的情况下显著提升了图像生成中个性化内容与文本指令的协调能力。
English: Conceptrol is a novel framework that enhances zero-shot adapters by integrating textual concept masks to better balance personalized content with text prompts, achieving significant improvements in image generation benchmarks without extra computational cost.

Authors:Jiangdong Cai, Haotian Jiang, Zhenrong Shen, Yonghao Li, Honglin Xiong, Lichi Zhang, Qian Wang
Title: LSA: Latent Style Augmentation Towards Stain-Agnostic Cervical Cancer Screening
Abstract:
The deployment of computer-aided diagnosis systems for cervical cancer screening using whole slide images (WSIs) faces critical challenges due to domain shifts caused by staining variations across different scanners and imaging environments. While existing stain augmentation methods improve patch-level robustness, they fail to scale to WSIs due to two key limitations: (1) inconsistent stain patterns when extending patch operations to gigapixel slides, and (2) prohibitive computational/storage costs from offline processing of augmented WSIs.To address this, we propose Latent Style Augmentation (LSA), a framework that performs efficient, online stain augmentation directly on WSI-level latent features. We first introduce WSAug, a WSI-level stain augmentation method ensuring consistent stain across patches within a WSI. Using offline-augmented WSIs by WSAug, we design and train Stain Transformer, which can simulate targeted style in the latent space, efficiently enhancing the robustness of the WSI-level classifier. We validate our method on a multi-scanner WSI dataset for cervical cancer diagnosis. Despite being trained on data from a single scanner, our approach achieves significant performance improvements on out-of-distribution data from other scanners. Code will be available at https://github.com/caijd2000/LSA.
Chinese: 所提出的潜在样式增强(LSA)框架通过在全切片图像潜在特征上直接进行高效的在线染色增强,有效解决了宫颈癌筛查中的域偏移问题,显著提升了分类器在不同扫描仪间的泛化性能,且无需多扫描仪训练数据。
English: The proposed Latent Style Augmentation (LSA) framework addresses domain shifts in cervical cancer screening by enabling efficient, online stain augmentation directly on whole slide image latent features, significantly improving classifier robustness across different scanners without requiring multi-scanner training data.

Authors:Junyi Wu, Zhiteng Li, Zheng Hui, Yulun Zhang, Linghe Kong, Xiaokang Yang
Title: QuantCache: Adaptive Importance-Guided Quantization with Hierarchical Latent and Layer Caching for Video Generation
Abstract:
Recently, Diffusion Transformers (DiTs) have emerged as a dominant architecture in video generation, surpassing U-Net-based models in terms of performance. However, the enhanced capabilities of DiTs come with significant drawbacks, including increased computational and memory costs, which hinder their deployment on resource-constrained devices. Current acceleration techniques, such as quantization and cache mechanism, offer limited speedup and are often applied in isolation, failing to fully address the complexities of DiT architectures. In this paper, we propose QuantCache, a novel training-free inference acceleration framework that jointly optimizes hierarchical latent caching, adaptive importance-guided quantization, and structural redundancy-aware pruning. QuantCache achieves an end-to-end latency speedup of 6.72$\times$ on Open-Sora with minimal loss in generation quality. Extensive experiments across multiple video generation benchmarks demonstrate the effectiveness of our method, setting a new standard for efficient DiT inference. The code and models will be available at https://github.com/JunyiWuCode/QuantCache.
Chinese: 本文提出QuantCache,一种免训练的推理加速框架,通过联合优化分层潜在缓存、自适应量化与结构剪枝,在保持生成质量的同时实现了DiT视频生成6.72倍的端到端加速。
English: This paper introduces QuantCache, a training-free inference acceleration framework that combines hierarchical latent caching, adaptive quantization, and structural pruning to achieve a 6.72× speedup in DiT-based video generation with minimal quality loss.

Authors:Jianwen Sun, Yukang Feng, Chuanhao Li, Fanrui Zhang, Zizhen Li, Jiaxin Ai, Sizhuo Zhou, Yu Dai, Shenglin Zhang, Kaipeng Zhang
Title: ARMOR: Empowering Multimodal Understanding Model with Interleaved Multimodal Generation Capability
Abstract:
Unified multimodal understanding and generation have recently received much attention in the area of vision and language. Existing UniMs are designed to simultaneously learn both multimodal understanding and generation capabilities, demanding substantial computational resources, and often struggle to generate interleaved text-image. We present ARMOR, a resource-efficient and pure autoregressive framework that achieves both understanding and generation by fine-tuning existing multimodal large language models (MLLMs). Specifically, ARMOR extends existing MLLMs from three perspectives: (1) For model architecture, an asymmetric encoder-decoder architecture with a forward-switching mechanism is introduced to unify embedding space integrating textual and visual modalities for enabling natural text-image interleaved generation with minimal computational overhead. (2) For training data, a meticulously curated, high-quality interleaved dataset is collected for fine-tuning MLLMs. (3) For the training algorithm, we propose a ``what or how to generate'' algorithm to empower existing MLLMs with multimodal generation capabilities while preserving their multimodal understanding capabilities, through three progressive training stages based on the collected dataset. Experimental results demonstrate that ARMOR upgrades existing MLLMs to UniMs with promising image generation capabilities, using limited training resources. Our code will be released soon at https://github.com/finyorko/armor.
Chinese: ARMOR是一种资源高效的自回归框架,通过非对称编码器-解码器架构、精心策划的数据集和渐进式训练算法,将现有多模态大语言模型升级为能实现统一多模态理解与生成的系统,以最小计算开销实现流畅的图文交错生成。
English: ARMOR is a resource-efficient autoregressive framework that enhances existing multimodal large language models to achieve unified multimodal understanding and generation through an asymmetric encoder-decoder architecture, curated dataset, and progressive training algorithm, enabling seamless text-image interleaved generation with minimal computational overhead.

Authors:Xiaoyang Liu, Yuquan Wang, Zheng Chen, Jiezhang Cao, He Zhang, Yulun Zhang, Xiaokang Yang
Title: One-Step Diffusion Model for Image Motion-Deblurring
Abstract:
Currently, methods for single-image deblurring based on CNNs and transformers have demonstrated promising performance. However, these methods often suffer from perceptual limitations, poor generalization ability, and struggle with heavy or complex blur. While diffusion-based methods can partially address these shortcomings, their multi-step denoising process limits their practical usage. In this paper, we conduct an in-depth exploration of diffusion models in deblurring and propose a one-step diffusion model for deblurring (OSDD), a novel framework that reduces the denoising process to a single step, significantly improving inference efficiency while maintaining high fidelity. To tackle fidelity loss in diffusion models, we introduce an enhanced variational autoencoder (eVAE), which improves structural restoration. Additionally, we construct a high-quality synthetic deblurring dataset to mitigate perceptual collapse and design a dynamic dual-adapter (DDA) to enhance perceptual quality while preserving fidelity. Extensive experiments demonstrate that our method achieves strong performance on both full and no-reference metrics. Our code and pre-trained model will be publicly available at https://github.com/xyLiu339/OSDD.
中文: 本文提出了一种用于去模糊的一步扩散模型(OSDD),通过改进的变分自编码器和动态双适配器,在显著提高推理效率的同时保持了高保真度,并在去模糊基准测试中表现出色。
English: This paper introduces a one-step diffusion model for deblurring (OSDD) that enhances inference efficiency and maintains high fidelity through an improved variational autoencoder and a dynamic dual-adapter, validated by strong performance on deblurring benchmarks.

Authors:Chen-Lin Zhang, Lin Sui, Shuming Liu, Fangzhou Mu, Zhangcheng Wang, Bernard Ghanem
Title: TimeLoc: A Unified End-to-End Framework for Precise Timestamp Localization in Long Videos
Abstract:
Temporal localization in untrimmed videos, which aims to identify specific timestamps, is crucial for video understanding but remains challenging. This task encompasses several subtasks, including temporal action localization, temporal video grounding, moment retrieval, and generic event boundary detection. Existing methods in each subfield are typically designed for specific tasks and lack generalizability across domains. In this paper, we propose TimeLoc, a unified end-to-end framework for timestamp localization that can handle multiple tasks. First, our approach employs a simple yet effective one-stage localization model that supports text queries as input and multiple actions as output. Second, we jointly train the video encoder and localization model in an end-to-end manner. To efficiently process long videos, we introduce temporal chunking, enabling the handling of videos with over 30k frames. Third, we find that fine-tuning pre-trained text encoders with a multi-stage training strategy further enhances text-conditioned localization. TimeLoc achieves state-of-the-art results across multiple benchmarks: +1.3% and +1.9% mAP over previous best methods on THUMOS14 and EPIC-Kitchens-100, +1.1% on Kinetics-GEBD, +2.94% mAP on QVHighlights, and significant improvements in temporal video grounding (+11.5% on TACoS and +6.7% on Charades-STA under R1@0.5). Our code and checkpoints will be released at https://github.com/sming256/TimeLoc.
中文: TimeLoc是一个统一的端到端时间定位框架,通过高效的单阶段模型处理未剪辑视频中的多种任务,并在多个基准测试中取得了最先进的性能。
English: TimeLoc is a unified end-to-end framework for temporal localization in untrimmed videos that handles multiple tasks through an efficient one-stage model, achieving state-of-the-art results across various benchmarks.

Authors:Yuqi Liu, Bohao Peng, Zhisheng Zhong, Zihao Yue, Fanbin Lu, Bei Yu, Jiaya Jia
Title: Seg-Zero: Reasoning-Chain Guided Segmentation via Cognitive Reinforcement
Abstract:
Traditional methods for reasoning segmentation rely on supervised fine-tuning with categorical labels and simple descriptions, limiting its out-of-domain generalization and lacking explicit reasoning processes. To address these limitations, we propose Seg-Zero, a novel framework that demonstrates remarkable generalizability and derives explicit chain-of-thought reasoning through cognitive reinforcement. Seg-Zero introduces a decoupled architecture consisting of a reasoning model and a segmentation model. The reasoning model interprets user intentions, generates explicit reasoning chains, and produces positional prompts, which are subsequently used by the segmentation model to generate precious pixel-level masks. We design a sophisticated reward mechanism that integrates both format and accuracy rewards to effectively guide optimization directions. Trained exclusively via reinforcement learning with GRPO and without explicit reasoning data, Seg-Zero achieves robust zero-shot generalization and exhibits emergent test-time reasoning capabilities. Experiments show that Seg-Zero-7B achieves a zero-shot performance of 57.5 on the ReasonSeg benchmark, surpassing the prior LISA-7B by 18\%. This significant improvement highlights Seg-Zero's ability to generalize across domains while presenting an explicit reasoning process. Code is available at https://github.com/dvlab-research/Seg-Zero.
Chinese: Seg-Zero是一种创新框架,通过解耦架构和强化学习解决了传统分割方法的局限性,实现了强大的零样本泛化能力和显式推理链,在基准测试中显著超越先前模型。
English: Seg-Zero is a novel framework that overcomes the limitations of traditional segmentation methods by using a decoupled architecture and reinforcement learning to achieve robust zero-shot generalization and explicit reasoning chains, significantly outperforming previous models on benchmarks.

Authors:Shinnosuke Matsuo, Riku Togashi, Ryoma Bise, Seiichi Uchida, Masahiro Nomura
Title: Instance-wise Supervision-level Optimization in Active Learning
Abstract:
Active learning (AL) is a label-efficient machine learning paradigm that focuses on selectively annotating high-value instances to maximize learning efficiency. Its effectiveness can be further enhanced by incorporating weak supervision, which uses rough yet cost-effective annotations instead of exact (i.e., full) but expensive annotations. We introduce a novel AL framework, Instance-wise Supervision-Level Optimization (ISO), which not only selects the instances to annotate but also determines their optimal annotation level within a fixed annotation budget. Its optimization criterion leverages the value-to-cost ratio (VCR) of each instance while ensuring diversity among the selected instances. In classification experiments, ISO consistently outperforms traditional AL methods and surpasses a state-of-the-art AL approach that combines full and weak supervision, achieving higher accuracy at a lower overall cost. This code is available at https://github.com/matsuo-shinnosuke/ISOAL.
Chinese: ISO框架通过动态选择实例及其标注级别,利用优化的价值成本比和多样性,以更低成本实现了更高的分类准确率。
English: The ISO framework enhances active learning by dynamically selecting both instances and their annotation levels, achieving superior accuracy at lower cost through optimized value-to-cost ratios and diversity.

Authors:Amir Mohammad Izadi, Seyed Mohammad Hadi Hosseini, Soroush Vafaie Tabar, Ali Abdollahi, Armin Saghafian, Mahdieh Soleymani Baghshah
Title: Fine-Grained Alignment and Noise Refinement for Compositional Text-to-Image Generation
Abstract:
Text-to-image generative models have made significant advancements in recent years; however, accurately capturing intricate details in textual prompts-such as entity missing, attribute binding errors, and incorrect relationships remains a formidable challenge. In response, we present an innovative, training-free method that directly addresses these challenges by incorporating tailored objectives to account for textual constraints. Unlike layout-based approaches that enforce rigid structures and limit diversity, our proposed approach offers a more flexible arrangement of the scene by imposing just the extracted constraints from the text, without any unnecessary additions. These constraints are formulated as losses-entity missing, entity mixing, attribute binding, and spatial relationships-integrated into a unified loss that is applied in the first generation stage. Furthermore, we introduce a feedback-driven system for fine-grained initial noise refinement. This system integrates a verifier that evaluates the generated image, identifies inconsistencies, and provides corrective feedback. Leveraging this feedback, our refinement method first targets the unmet constraints by refining the faulty attention maps caused by initial noise, through the optimization of selective losses associated with these constraints. Subsequently, our unified loss function is reapplied to proceed the second generation phase. Experimental results demonstrate that our method, relying solely on our proposed objective functions, significantly enhances compositionality, achieving a 24% improvement in human evaluation and a 25% gain in spatial relationships. Furthermore, our fine-grained noise refinement proves effective, boosting performance by up to 5%. Code is available at \href{https://github.com/hadi-hosseini/noise-refinement}{https://github.com/hadi-hosseini/noise-refinement}.
中文摘要:本文提出一种无需训练的方法,通过将文本约束整合为统一损失函数并采用反馈驱动的噪声优化系统,显著提升了文本到图像生成的组合能力与空间关系准确性。
English Summary: This paper introduces a training-free method that enhances text-to-image generation by integrating textual constraints as unified losses and employing a feedback-driven noise refinement system, significantly improving compositionality and spatial accuracy.

Authors:Xirui Hu, Jiahao Wang, Hao Chen, Weizhan Zhang, Benqi Wang, Yikun Li, Haishun Nan
Title: DynamicID: Zero-Shot Multi-ID Image Personalization with Flexible Facial Editability
Abstract:
Recent advances in text-to-image generation have driven interest in generating personalized human images that depict specific identities from reference images. Although existing methods achieve high-fidelity identity preservation, they are generally limited to single-ID scenarios and offer insufficient facial editability. We present DynamicID, a tuning-free framework that inherently facilitates both single-ID and multi-ID personalized generation with high fidelity and flexible facial editability. Our key innovations include: 1) Semantic-Activated Attention (SAA), which employs query-level activation gating to minimize disruption to the base model when injecting ID features and achieve multi-ID personalization without requiring multi-ID samples during training. 2) Identity-Motion Reconfigurator (IMR), which applies feature-space manipulation to effectively disentangle and reconfigure facial motion and identity features, supporting flexible facial editing. 3) a task-decoupled training paradigm that reduces data dependency, together with VariFace-10k, a curated dataset of 10k unique individuals, each represented by 35 distinct facial images. Experimental results demonstrate that DynamicID outperforms state-of-the-art methods in identity fidelity, facial editability, and multi-ID personalization capability. Our code will be released at https://github.com/ByteCat-bot/DynamicID.
中文: DynamicID框架通过语义激活注意力和身份-运动重构器,无需调优即可实现高保真的单人与多人身份图像生成及灵活面部编辑,性能超越现有方法。
English: DynamicID is a novel framework that enables high-fidelity single and multi-ID image generation with enhanced facial editability through Semantic-Activated Attention and Identity-Motion Reconfigurator, outperforming existing methods without requiring fine-tuning.

Authors:Huaqi Tao, Bingxi Liu, Calvin Chen, Tingjun Huang, He Li, Jinqiang Cui, Hong Zhang
Title: TextInPlace: Indoor Visual Place Recognition in Repetitive Structures with Scene Text Spotting and Verification
Abstract:
Visual Place Recognition (VPR) is a crucial capability for long-term autonomous robots, enabling them to identify previously visited locations using visual information. However, existing methods remain limited in indoor settings due to the highly repetitive structures inherent in such environments. We observe that scene texts frequently appear in indoor spaces and can help distinguish visually similar but different places. This inspires us to propose TextInPlace, a simple yet effective VPR framework that integrates Scene Text Spotting (STS) to mitigate visual perceptual ambiguity in repetitive indoor environments. Specifically, TextInPlace adopts a dual-branch architecture within a local parameter sharing network. The VPR branch employs attention-based aggregation to extract global descriptors for coarse-grained retrieval, while the STS branch utilizes a bridging text spotter to detect and recognize scene texts. Finally, the discriminative texts are filtered to compute text similarity and re-rank the top-K retrieved images. To bridge the gap between current text-based repetitive indoor scene datasets and the typical scenarios encountered in robot navigation, we establish an indoor VPR benchmark dataset, called Maze-with-Text. Extensive experiments on both custom and public datasets demonstrate that TextInPlace achieves superior performance over existing methods that rely solely on appearance information. The dataset, code, and trained models are publicly available at https://github.com/HqiTao/TextInPlace.
中文摘要:TextInPlace是一种创新的视觉位置识别框架,通过集成场景文本检测技术来区分视觉重复的室内环境,采用双分支架构并结合新建基准数据集实现了优越性能。
English Summary: TextInPlace is a novel Visual Place Recognition framework that integrates scene text spotting to distinguish visually repetitive indoor environments, achieving superior performance through a dual-branch architecture and a newly established benchmark dataset.

Authors:Xiao Wang, Yuehang Li, Fuling Wang, Bo Jiang, Yaowei Wang, Yonghong Tian, Jin Tang, Bin Luo
Title: Sign Language Translation using Frame and Event Stream: Benchmark Dataset and Algorithms
Abstract:
Accurate sign language understanding serves as a crucial communication channel for individuals with disabilities. Current sign language translation algorithms predominantly rely on RGB frames, which may be limited by fixed frame rates, variable lighting conditions, and motion blur caused by rapid hand movements. Inspired by the recent successful application of event cameras in other fields, we propose to leverage event streams to assist RGB cameras in capturing gesture data, addressing the various challenges mentioned above. Specifically, we first collect a large-scale RGB-Event sign language translation dataset using the DVS346 camera, termed VECSL, which contains 15,676 RGB-Event samples, 15,191 glosses, and covers 2,568 Chinese characters. These samples were gathered across a diverse range of indoor and outdoor environments, capturing multiple viewing angles, varying light intensities, and different camera motions. Due to the absence of benchmark algorithms for comparison in this new task, we retrained and evaluated multiple state-of-the-art SLT algorithms, and believe that this benchmark can effectively support subsequent related research. Additionally, we propose a novel RGB-Event sign language translation framework (i.e., M$^2$-SLT) that incorporates fine-grained micro-sign and coarse-grained macro-sign retrieval, achieving state-of-the-art results on the proposed dataset. Both the source code and dataset will be released on https://github.com/Event-AHU/OpenESL.
中文摘要:本研究提出一种新颖的RGB-事件手语翻译方法,利用事件相机克服RGB相机的局限性,构建了VECSL数据集和M²-SLT框架,实现了最先进的性能表现。
English Summary: This study introduces a novel RGB-Event sign language translation approach using event cameras to overcome RGB limitations, creating the VECSL dataset and M²-SLT framework that achieves state-of-the-art performance.

Authors:Shijia Zhao, Qiming Xia, Xusheng Guo, Pufan Zou, Maoji Zheng, Hai Wu, Chenglu Wen, Cheng Wang
Title: SP3D: Boosting Sparsely-Supervised 3D Object Detection via Accurate Cross-Modal Semantic Prompts
Abstract:
Recently, sparsely-supervised 3D object detection has gained great attention, achieving performance close to fully-supervised 3D objectors while requiring only a few annotated instances. Nevertheless, these methods suffer challenges when accurate labels are extremely absent. In this paper, we propose a boosting strategy, termed SP3D, explicitly utilizing the cross-modal semantic prompts generated from Large Multimodal Models (LMMs) to boost the 3D detector with robust feature discrimination capability under sparse annotation settings. Specifically, we first develop a Confident Points Semantic Transfer (CPST) module that generates accurate cross-modal semantic prompts through boundary-constrained center cluster selection. Based on these accurate semantic prompts, which we treat as seed points, we introduce a Dynamic Cluster Pseudo-label Generation (DCPG) module to yield pseudo-supervision signals from the geometry shape of multi-scale neighbor points. Additionally, we design a Distribution Shape score (DS score) that chooses high-quality supervision signals for the initial training of the 3D detector. Experiments on the KITTI dataset and Waymo Open Dataset (WOD) have validated that SP3D can enhance the performance of sparsely supervised detectors by a large margin under meager labeling conditions. Moreover, we verified SP3D in the zero-shot setting, where its performance exceeded that of the state-of-the-art methods. The code is available at https://github.com/xmuqimingxia/SP3D.
Chinese: 本文提出SP3D增强策略,利用大型多模态模型生成的跨模态语义提示,在稀疏标注条件下显著提升3D物体检测性能,并在多个基准数据集上验证了其优越性。
English: The paper introduces SP3D, a boosting strategy that leverages cross-modal semantic prompts from Large Multimodal Models to enhance 3D object detection performance under sparse annotation settings, achieving significant improvements on benchmark datasets.

Authors:Yanbiao Ma, Wei Dai, Wenke Huang, Jiayi Chen
Title: Geometric Knowledge-Guided Localized Global Distribution Alignment for Federated Learning
Abstract:
Data heterogeneity in federated learning, characterized by a significant misalignment between local and global distributions, leads to divergent local optimization directions and hinders global model training. Existing studies mainly focus on optimizing local updates or global aggregation, but these indirect approaches demonstrate instability when handling highly heterogeneous data distributions, especially in scenarios where label skew and domain skew coexist. To address this, we propose a geometry-guided data generation method that centers on simulating the global embedding distribution locally. We first introduce the concept of the geometric shape of an embedding distribution and then address the challenge of obtaining global geometric shapes under privacy constraints. Subsequently, we propose GGEUR, which leverages global geometric shapes to guide the generation of new samples, enabling a closer approximation to the ideal global distribution. In single-domain scenarios, we augment samples based on global geometric shapes to enhance model generalization; in multi-domain scenarios, we further employ class prototypes to simulate the global distribution across domains. Extensive experimental results demonstrate that our method significantly enhances the performance of existing approaches in handling highly heterogeneous data, including scenarios with label skew, domain skew, and their coexistence. Code published at: https://github.com/WeiDai-David/2025CVPR_GGEUR
中文摘要:本文提出一种几何引导的数据生成方法,通过全局嵌入形状在本地模拟全局分布,显著提升了联邦学习在标签偏移、域偏移等高度异构数据场景下的性能表现。
English Summary: This paper introduces a geometry-guided data generation method that uses global embedding shapes to simulate the global distribution locally, significantly improving federated learning performance under highly heterogeneous data conditions including label and domain skew.

Authors:Chengxuan Qian, Kai Han, Jingchao Wang, Zhenlong Yuan, Chongwen Lyu, Jun Chen, Zhe Liu
Title: DynCIM: Dynamic Curriculum for Imbalanced Multimodal Learning
Abstract:
Multimodal learning integrates complementary information from diverse modalities to enhance the decision-making process. However, the potential of multimodal collaboration remains under-exploited due to disparities in data quality and modality representation capabilities. To address this, we introduce DynCIM, a novel dynamic curriculum learning framework designed to quantify the inherent imbalances from both sample and modality perspectives. DynCIM employs a sample-level curriculum to dynamically assess each sample's difficulty according to prediction deviation, consistency, and stability, while a modality-level curriculum measures modality contributions from global and local. Furthermore, a gating-based dynamic fusion mechanism is introduced to adaptively adjust modality contributions, minimizing redundancy and optimizing fusion effectiveness. Extensive experiments on six multimodal benchmarking datasets, spanning both bimodal and trimodal scenarios, demonstrate that DynCIM consistently outperforms state-of-the-art methods. Our approach effectively mitigates modality and sample imbalances while enhancing adaptability and robustness in multimodal learning tasks. Our code is available at https://github.com/Raymond-Qiancx/DynCIM.
Chinese: 本文提出DynCIM动态课程学习框架,通过动态评估样本难度和模态贡献度,并自适应融合模态,有效解决了多模态学习中的不平衡问题,在多种数据集上显著提升了性能。
English: This paper introduces DynCIM, a dynamic curriculum learning framework that addresses imbalances in multimodal learning by dynamically assessing sample difficulty and modality contributions, and adaptively fusing modalities to enhance performance across diverse datasets.

Authors:Mingxiang Cao, Weiying Xie, Xin Zhang, Jiaqing Zhang, Kai Jiang, Jie Lei, Yunsong Li
Title: M$^3$amba: CLIP-driven Mamba Model for Multi-modal Remote Sensing Classification
Abstract:
Multi-modal fusion holds great promise for integrating information from different modalities. However, due to a lack of consideration for modal consistency, existing multi-modal fusion methods in the field of remote sensing still face challenges of incomplete semantic information and low computational efficiency in their fusion designs. Inspired by the observation that the visual language pre-training model CLIP can effectively extract strong semantic information from visual features, we propose M$^3$amba, a novel end-to-end CLIP-driven Mamba model for multi-modal fusion to address these challenges. Specifically, we introduce CLIP-driven modality-specific adapters in the fusion architecture to avoid the bias of understanding specific domains caused by direct inference, making the original CLIP encoder modality-specific perception. This unified framework enables minimal training to achieve a comprehensive semantic understanding of different modalities, thereby guiding cross-modal feature fusion. To further enhance the consistent association between modality mappings, a multi-modal Mamba fusion architecture with linear complexity and a cross-attention module Cross-SS2D are designed, which fully considers effective and efficient information interaction to achieve complete fusion. Extensive experiments have shown that M$^3$amba has an average performance improvement of at least 5.98\% compared with the state-of-the-art methods in multi-modal hyperspectral image classification tasks in the remote sensing field, while also demonstrating excellent training efficiency, achieving a double improvement in accuracy and efficiency. The code is released at https://github.com/kaka-Cao/M3amba.
Chinese: 提出的M$^3$amba模型利用CLIP驱动的适配器和多模态Mamba架构,在遥感多模态融合中提升了语义整合能力与计算效率,实现了显著的性能提升和训练效率优化。
English: The proposed M$^3$amba model leverages CLIP-driven adapters and a multi-modal Mamba architecture to enhance semantic integration and computational efficiency in remote sensing fusion, achieving significant performance gains and training efficiency improvements.

Authors:Yu Jin, Jingming Liu, Zhexu Luo, Yifei Peng, Ziang Qin, Wang-Zhou Dai, Yao-Xiang Ding, Kun Zhou
Title: Pre-Training Meta-Rule Selection Policy for Visual Generative Abductive Learning
Abstract:
Visual generative abductive learning studies jointly training symbol-grounded neural visual generator and inducing logic rules from data, such that after learning, the visual generation process is guided by the induced logic rules. A major challenge for this task is to reduce the time cost of logic abduction during learning, an essential step when the logic symbol set is large and the logic rule to induce is complicated. To address this challenge, we propose a pre-training method for obtaining meta-rule selection policy for the recently proposed visual generative learning approach AbdGen [Peng et al., 2023], aiming at significantly reducing the candidate meta-rule set and pruning the search space. The selection model is built based on the embedding representation of both symbol grounding of cases and meta-rules, which can be effectively integrated with both neural model and logic reasoning system. The pre-training process is done on pure symbol data, not involving symbol grounding learning of raw visual inputs, making the entire learning process low-cost. An additional interesting observation is that the selection policy can rectify symbol grounding errors unseen during pre-training, which is resulted from the memorization ability of attention mechanism and the relative stability of symbolic patterns. Experimental results show that our method is able to effectively address the meta-rule selection problem for visual abduction, boosting the efficiency of visual generative abductive learning. Code is available at https://github.com/future-item/metarule-select.
中文摘要:本研究提出一种预训练方法,通过元规则选择策略缩减逻辑规则搜索空间,在无需视觉输入的情况下低成本提升视觉生成溯因学习的效率,并能修正符号基础错误。
English Summary: This study introduces a pre-training method to enhance visual generative abductive learning by reducing logic rule search space through meta-rule selection, significantly improving efficiency without compromising performance.

Authors:Tatsuro Inaba, Kentaro Inui, Yusuke Miyao, Yohei Oseki, Benjamin Heinzerling, Yu Takagi
Title: How LLMs Learn: Tracing Internal Representations with Sparse Autoencoders
Abstract:
Large Language Models (LLMs) demonstrate remarkable multilingual capabilities and broad knowledge. However, the internal mechanisms underlying the development of these capabilities remain poorly understood. To investigate this, we analyze how the information encoded in LLMs' internal representations evolves during the training process. Specifically, we train sparse autoencoders at multiple checkpoints of the model and systematically compare the interpretative results across these stages. Our findings suggest that LLMs initially acquire language-specific knowledge independently, followed by cross-linguistic correspondences. Moreover, we observe that after mastering token-level knowledge, the model transitions to learning higher-level, abstract concepts, indicating the development of more conceptual understanding.
中文: 大语言模型在训练过程中先习得各语言特有知识,随后建立跨语言关联,并从词汇层面学习逐步过渡到掌握更高层次的抽象概念。
English: Large Language Models initially develop language-specific knowledge and then learn cross-linguistic patterns, progressing from token-level information to higher-level abstract concepts during training.

Authors:Guofeng Zhang, Ruyi Zha, Hao He, Yixun Liang, Alan Yuille, Hongdong Li, Yuanhao Cai
Title: X-LRM: X-ray Large Reconstruction Model for Extremely Sparse-View Computed Tomography Recovery in One Second
Abstract:
Sparse-view 3D CT reconstruction aims to recover volumetric structures from a limited number of 2D X-ray projections. Existing feedforward methods are constrained by the limited capacity of CNN-based architectures and the scarcity of large-scale training datasets. In this paper, we propose an X-ray Large Reconstruction Model (X-LRM) for extremely sparse-view (<10 views) CT reconstruction. X-LRM consists of two key components: X-former and X-triplane. Our X-former can handle an arbitrary number of input views using an MLP-based image tokenizer and a Transformer-based encoder. The output tokens are then upsampled into our X-triplane representation, which models the 3D radiodensity as an implicit neural field. To support the training of X-LRM, we introduce Torso-16K, a large-scale dataset comprising over 16K volume-projection pairs of various torso organs. Extensive experiments demonstrate that X-LRM outperforms the state-of-the-art method by 1.5 dB and achieves 27x faster speed and better flexibility. Furthermore, the downstream evaluation of lung segmentation tasks also suggests the practical value of our approach. Our code, pre-trained models, and dataset will be released at https://github.com/caiyuanhao1998/X-LRM
中文: 本文提出X-LRM模型,通过基于Transformer的编码器和三平面表示实现稀疏视图3D CT重建,在性能、速度和灵活性上均超越现有方法,并配套发布了大规模数据集。
English: This paper introduces X-LRM, a novel model for sparse-view 3D CT reconstruction that utilizes a Transformer-based encoder and triplane representation to achieve superior performance, faster speed, and enhanced flexibility, supported by a newly created large-scale dataset.

Authors:Samuel Garcin, Trevor McInroe, Pablo Samuel Castro, Prakash Panangaden, Christopher G. Lucas, David Abel, Stefano V. Albrecht
Title: Studying the Interplay Between the Actor and Critic Representations in Reinforcement Learning
Abstract:
Extracting relevant information from a stream of high-dimensional observations is a central challenge for deep reinforcement learning agents. Actor-critic algorithms add further complexity to this challenge, as it is often unclear whether the same information will be relevant to both the actor and the critic. To this end, we here explore the principles that underlie effective representations for the actor and for the critic in on-policy algorithms. We focus our study on understanding whether the actor and critic will benefit from separate, rather than shared, representations. Our primary finding is that when separated, the representations for the actor and critic systematically specialise in extracting different types of information from the environment -- the actor's representation tends to focus on action-relevant information, while the critic's representation specialises in encoding value and dynamics information. We conduct a rigourous empirical study to understand how different representation learning approaches affect the actor and critic's specialisations and their downstream performance, in terms of sample efficiency and generation capabilities. Finally, we discover that a separated critic plays an important role in exploration and data collection during training. Our code, trained models and data are accessible at https://github.com/francelico/deac-rep.
中文: 本研究探讨了行动者-评论家算法中分离表征的效益,发现独立表征使行动者专注动作相关信息而评论家专精价值和动态信息,从而提升样本效率和探索能力。
English: This study investigates whether actor-critic algorithms benefit from separate representations for the actor and critic, finding that specialized representations enable the actor to focus on action-relevant information while the critic encodes value and dynamics, improving sample efficiency and exploration.

Authors:Qizhe Wu, Huawen Liang, Yuchen Gui, Zhichen Zeng, Zerong He, Linfeng Tao, Xiaotian Wang, Letian Zhao, Zhaoxi Zeng, Wei Yuan, Wei Wu, Xi Jin
Title: Exploring the Performance Improvement of Tensor Processing Engines through Transformation in the Bit-weight Dimension of MACs
Abstract:
General matrix-matrix multiplication (GEMM) is a cornerstone of AI computations, making tensor processing engines (TPEs) increasingly critical in GPUs and domain-specific architectures. Existing architectures primarily optimize dataflow or operand reuse strategies. However, considering the interaction between matrix multiplication and multiply-accumulators (MACs) offers greater optimization potential. This work introduces a novel hardware perspective on matrix multiplication, focusing on the bit-weight dimension of MACs. We propose a finer-grained TPE notation using matrix triple loops as an example, introducing new methods for designing and optimizing PE microarchitectures. Based on this notation and its transformations, we propose four optimization techniques that improve timing, area, and power consumption. Implementing our design in RTL using the SMIC-28nm process, we evaluate its effectiveness across four classic TPE architectures: systolic array, 3D-Cube, multiplier-adder tree, and 2D-Matrix. Our techniques achieve area efficiency improvements of 1.27x, 1.28x, 1.56x, and 1.44x, and energy efficiency gains of 1.04x, 1.56x, 1.49x, and 1.20x, respectively. Applied to a bit-slice architecture, our approach achieves a 12.10x improvement in energy efficiency and 2.85x in area efficiency compared to Laconic. Our Verilog HDL code, along with timing, area, and power reports, is available at https://github.com/wqzustc/High-Performance-Tensor-Processing-Engines
本研究通过聚焦矩阵乘法中的比特权重维度,提出了一种新的张量处理器硬件优化方法,在多种架构上实现了面积和能效的显著提升。
This work introduces a novel hardware optimization approach for tensor processing engines by focusing on bit-weight dimensions in matrix multiplication, achieving significant improvements in area and energy efficiency across multiple architectures.

Authors:Mohit Pandey, Gopeshh Subbaraj, Artem Cherkasov, Martin Ester, Emmanuel Bengio
Title: Pretraining Generative Flow Networks with Inexpensive Rewards for Molecular Graph Generation
Abstract:
Generative Flow Networks (GFlowNets) have recently emerged as a suitable framework for generating diverse and high-quality molecular structures by learning from rewards treated as unnormalized distributions. Previous works in this framework often restrict exploration by using predefined molecular fragments as building blocks, limiting the chemical space that can be accessed. In this work, we introduce Atomic GFlowNets (A-GFNs), a foundational generative model leveraging individual atoms as building blocks to explore drug-like chemical space more comprehensively. We propose an unsupervised pre-training approach using drug-like molecule datasets, which teaches A-GFNs about inexpensive yet informative molecular descriptors such as drug-likeliness, topological polar surface area, and synthetic accessibility scores. These properties serve as proxy rewards, guiding A-GFNs towards regions of chemical space that exhibit desirable pharmacological properties. We further implement a goal-conditioned finetuning process, which adapts A-GFNs to optimize for specific target properties. In this work, we pretrain A-GFN on a subset of ZINC dataset, and by employing robust evaluation metrics we show the effectiveness of our approach when compared to other relevant baseline methods for a wide range of drug design tasks. The code is accessible at https://github.com/diamondspark/AGFN.
Chinese: 原子生成流网络(A-GFNs)提出了一种基础生成模型,以单个原子为构建单元,通过基于代理奖励的无监督预训练和目标导向微调,实现了对类药化学空间更全面的探索。
English: Atomic GFlowNets (A-GFNs) introduce a foundational generative model that uses individual atoms as building blocks, enabling broader exploration of drug-like chemical space through unsupervised pre-training with proxy rewards and goal-conditioned fine-tuning for specific properties.

Authors:Zhitong Xiong, Yi Wang, Weikang Yu, Adam J Stewart, Jie Zhao, Nils Lehmann, Thomas Dujardin, Zhenghang Yuan, Pedram Ghamisi, Xiao Xiang Zhu
Title: DOFA-CLIP: Multimodal Vision-Language Foundation Models for Earth Observation
Abstract:
Earth observation (EO) spans a broad spectrum of modalities, including optical, radar, multispectral, and hyperspectral data, each capturing distinct environmental signals. However, current vision-language models in EO, particularly CLIP-based variants, remain confined to individual modalities, limiting generalization and scalability across diverse tasks. We present DOFA-CLIP (Dynamic-One-For-All CLIP), a unified vision-language foundation model that dynamically adapts to EO modalities with flexible spectral configurations through a single Transformer backbone. Our approach introduces three key contributions: 1) the construction of GeoLangBind-2M, a large-scale EO image-text dataset covering six heterogeneous modalities with rich natural language descriptions; 2) a novel training strategy called VECT (Vision-models Enhanced Contrastive Text-image pretraining), which enhances the spatial awareness of CLIP features with multiple vision foundation models; and 3) a Modality-aware Knowledge Agglomeration (MaKA) module that refines feature distillation with modality-specific awareness. DOFA-CLIP achieves state-of-the-art zero-shot performance across a wide range of EO benchmarks, including unseen modalities and a diverse number of input spectral bands. Together, these contributions establish a scalable foundation for multimodal EO understanding and open new avenues for integrating heterogeneous EO data with large language models. Code and datasets will be released. Code and datasets are publicly available.
中文: DOFA-CLIP 是一个统一的视觉语言基础模型,通过单一Transformer主干动态适应多种地球观测模态,结合大规模数据集和创新训练策略,在各类EO基准测试中实现了最先进的零样本性能。
English: DOFA-CLIP is a unified vision-language foundation model that dynamically adapts to multiple Earth observation modalities through a single Transformer backbone, achieving state-of-the-art zero-shot performance across diverse EO benchmarks by integrating large-scale datasets and novel training strategies.

Authors:Siyi Du, Xinzhe Luo, Declan P. O'Regan, Chen Qin
Title: STiL: Semi-supervised Tabular-Image Learning for Comprehensive Task-Relevant Information Exploration in Multimodal Classification
Abstract:
Multimodal image-tabular learning is gaining attention, yet it faces challenges due to limited labeled data. While earlier work has applied self-supervised learning (SSL) to unlabeled data, its task-agnostic nature often results in learning suboptimal features for downstream tasks. Semi-supervised learning (SemiSL), which combines labeled and unlabeled data, offers a promising solution. However, existing multimodal SemiSL methods typically focus on unimodal or modality-shared features, ignoring valuable task-relevant modality-specific information, leading to a Modality Information Gap. In this paper, we propose STiL, a novel SemiSL tabular-image framework that addresses this gap by comprehensively exploring task-relevant information. STiL features a new disentangled contrastive consistency module to learn cross-modal invariant representations of shared information while retaining modality-specific information via disentanglement. We also propose a novel consensus-guided pseudo-labeling strategy to generate reliable pseudo-labels based on classifier consensus, along with a new prototype-guided label smoothing technique to refine pseudo-label quality with prototype embeddings, thereby enhancing task-relevant information learning in unlabeled data. Experiments on natural and medical image datasets show that STiL outperforms the state-of-the-art supervised/SSL/SemiSL image/multimodal approaches. Our code is available at https://github.com/siyi-wind/STiL.
中文: 多模态图像-表格学习面临标注数据有限和任务无关特征欠佳的挑战,STiL框架通过解耦对比一致性模块和共识引导伪标签策略,有效利用标注与未标注数据中的共享及模态特定信息,从而提升性能。
English: Multimodal image-tabular learning faces challenges from limited labeled data and suboptimal task-agnostic features, which the proposed STiL framework addresses by integrating a disentangled contrastive consistency module and consensus-guided pseudo-labeling to effectively leverage both shared and modality-specific information across labeled and unlabeled data.

Authors:Jeong Hun Yeo, Minsu Kim, Chae Won Kim, Stavros Petridis, Yong Man Ro
Title: Zero-AVSR: Zero-Shot Audio-Visual Speech Recognition with LLMs by Learning Language-Agnostic Speech Representations
Abstract:
We explore a novel zero-shot Audio-Visual Speech Recognition (AVSR) framework, dubbed Zero-AVSR, which enables speech recognition in target languages without requiring any audio-visual speech data in those languages. Specifically, we introduce the Audio-Visual Speech Romanizer (AV-Romanizer), which learns language-agnostic speech representations by predicting Roman text. Then, by leveraging the strong multilingual modeling capabilities of Large Language Models (LLMs), we propose converting the predicted Roman text into language-specific graphemes, forming the proposed Cascaded Zero-AVSR. Taking it a step further, we explore a unified Zero-AVSR approach by directly integrating the audio-visual speech representations encoded by the AV-Romanizer into the LLM. This is achieved through finetuning the adapter and the LLM using our proposed multi-task learning scheme. To capture the wide spectrum of phonetic and linguistic diversity, we also introduce a Multilingual Audio-Visual Romanized Corpus (MARC) consisting of 2,916 hours of audio-visual speech data across 82 languages, along with transcriptions in both language-specific graphemes and Roman text. Extensive analysis and experiments confirm that the proposed Zero-AVSR framework has the potential to expand language support beyond the languages seen during the training of the AV-Romanizer.
中文:Zero-AVSR框架通过生成语言无关的罗马化文本表征,并利用大语言模型将其转换为目标语言,实现了零样本多语言视听语音识别,其有效性已通过大规模多语言实验验证。
English: The Zero-AVSR framework enables zero-shot multilingual audio-visual speech recognition by generating language-agnostic Roman text representations and leveraging large language models for conversion into target languages, validated through extensive multilingual experiments.

Authors:Thomas Winninger, Boussad Addad, Katarzyna Kapusta
Title: Using Mechanistic Interpretability to Craft Adversarial Attacks against Large Language Models
Abstract:
Traditional white-box methods for creating adversarial perturbations against LLMs typically rely only on gradient computation from the targeted model, ignoring the internal mechanisms responsible for attack success or failure. Conversely, interpretability studies that analyze these internal mechanisms lack practical applications beyond runtime interventions. We bridge this gap by introducing a novel white-box approach that leverages mechanistic interpretability techniques to craft practical adversarial inputs. Specifically, we first identify acceptance subspaces - sets of feature vectors that do not trigger the model's refusal mechanisms - then use gradient-based optimization to reroute embeddings from refusal subspaces to acceptance subspaces, effectively achieving jailbreaks. This targeted approach significantly reduces computation cost, achieving attack success rates of 80-95\% on state-of-the-art models including Gemma2, Llama3.2, and Qwen2.5 within minutes or even seconds, compared to existing techniques that often fail or require hours of computation. We believe this approach opens a new direction for both attack research and defense development. Furthermore, it showcases a practical application of mechanistic interpretability where other methods are less efficient, which highlights its utility. The code and generated datasets are available at https://github.com/Sckathach/subspace-rerouting.
中文摘要:本研究提出了一种新颖的白盒对抗攻击方法,通过结合机制可解释性与梯度优化,将模型嵌入从拒绝子空间高效重定向至接受子空间,在先进大语言模型上实现80-95%的攻击成功率,仅需数分钟且显著降低计算成本。
English Summary: This study introduces a novel white-box adversarial attack method that combines mechanistic interpretability with gradient optimization to efficiently reroute model embeddings from refusal to acceptance subspaces, achieving high success rates of 80-95% on advanced LLMs within minutes while significantly reducing computational costs.

Authors:Kun Xiang, Zhili Liu, Zihao Jiang, Yunshuang Nie, Kaixin Cai, Yiyang Yin, Runhui Huang, Haoxiang Fan, Hanhui Li, Weiran Huang, Yihan Zeng, Yu-Jie Yuan, Jianhua Han, Lanqing Hong, Hang Xu, Xiaodan Liang
Title: Can Atomic Step Decomposition Enhance the Self-structured Reasoning of Multimodal Large Models?
Abstract:
In this paper, we address the challenging task of multimodal mathematical reasoning by incorporating the ability of "slow thinking" into multimodal large language models (MLLMs). Our core idea is that different levels of reasoning abilities can be combined dynamically to tackle questions with different complexity. To this end, we propose a paradigm of Self-structured Chain of Thought (SCoT), which is composed of minimal semantic atomic steps. Different from existing methods that rely on structured templates or free-form paradigms, our method can not only generate cognitive CoT structures for various complex tasks but also mitigates the phenomenon of overthinking. To introduce structured reasoning capabilities into visual understanding models, we further design a novel AtomThink framework with four key modules, including (i) a data engine to generate high-quality multimodal reasoning paths; (ii) a supervised fine-tuning process with serialized inference data; (iii) a policy-guided multi-turn inference method; and (iv) an atomic capability metric to evaluate the single step utilization rate. We conduct extensive experiments to show that the proposed AtomThink significantly improves the performance of baseline MLLMs, achieving more than 10\% average accuracy gains on MathVista and MathVerse. Compared to state-of-the-art structured CoT approaches, our method not only achieves higher accuracy but also improves data utilization by 5 times and boosts inference efficiency by 85.3\%. Our code is now public available in https://github.com/Quinn777/AtomThink.
中文: 本文提出自结构化思维链范式和AtomThink框架,通过最小语义原子步骤动态整合多模态大语言模型的推理能力,显著提升了数学推理的准确性和效率。
English: This paper introduces the Self-structured Chain of Thought (SCoT) paradigm and AtomThink framework to enhance multimodal mathematical reasoning in MLLMs by dynamically combining reasoning abilities through minimal semantic atomic steps, achieving significant accuracy gains and efficiency improvements.

Authors:Aditya Shankar, Lydia Y. Chen, Arie van Deursen, Rihan Hai
Title: WaveStitch: Flexible and Fast Conditional Time Series Generation with Diffusion Models
Abstract:
Generating temporal data under conditions is crucial for forecasting, imputation, and generative tasks. Such data often has metadata and partially observed signals that jointly influence the generated values. However, existing methods face three key limitations: (1) they condition on either the metadata or observed values, but rarely both together; (2) they adopt either training-time approaches that fail to generalize to unseen scenarios, or inference-time approaches that ignore metadata; and (3) they suffer from trade-offs between generation speed and temporal coherence across time windows--choosing either slow but coherent autoregressive methods or fast but incoherent parallel ones. We propose WaveStitch, a novel diffusion-based method to overcome these hurdles through: (1) dual-sourced conditioning on both metadata and partially observed signals; (2) a hybrid training-inference architecture, incorporating metadata during training and observations at inference via gradient-based guidance; and (3) a novel pipeline-style paradigm that generates time windows in parallel while preserving coherence through an inference-time conditional loss and a stitching mechanism. Across diverse datasets, WaveStitch demonstrates adaptability to arbitrary patterns of observed signals, achieving 1.81x lower mean-squared-error compared to the state-of-the-art, and generates data up to 166.48x faster than autoregressive methods while maintaining coherence. Our code is available at: https://github.com/adis98/WaveStitch
中文: WaveStitch是一种基于扩散的新方法,通过同时结合元数据和部分观测信号进行双源调节,采用混合架构提升适应性,并在保持时间连贯性的前提下实现快速并行生成,从而在精度和速度上均优于现有技术。
English: WaveStitch is a novel diffusion-based method that overcomes key limitations in temporal data generation by conditioning on both metadata and observed signals, employing a hybrid architecture for adaptability, and enabling fast parallel generation while maintaining temporal coherence, achieving superior accuracy and speed.

Authors:Yixin Wu, Feiran Zhang, Tianyuan Shi, Ruicheng Yin, Zhenghua Wang, Zhenliang Gan, Xiaohua Wang, Changze Lv, Xiaoqing Zheng, Xuanjing Huang
Title: Explainable Synthetic Image Detection through Diffusion Timestep Ensembling
Abstract:
Recent advances in diffusion models have enabled the creation of deceptively real images, posing significant security risks when misused. In this study, we empirically show that different timesteps of DDIM inversion reveal varying subtle distinctions between synthetic and real images that are extractable for detection, in the forms of such as Fourier power spectrum high-frequency discrepancies and inter-pixel variance distributions. Based on these observations, we propose a novel synthetic image detection method that directly utilizes features of intermediately noised images by training an ensemble on multiple noised timesteps, circumventing conventional reconstruction-based strategies. To enhance human comprehension, we introduce a metric-grounded explanation generation and refinement module to identify and explain AI-generated flaws. Additionally, we construct the GenHard and GenExplain benchmarks to provide detection samples of greater difficulty and high-quality rationales for fake images. Extensive experiments show that our method achieves state-of-the-art performance with 98.91% and 95.89% detection accuracy on regular and challenging samples respectively, and demonstrates generalizability and robustness. Our code and datasets are available at https://github.com/Shadowlized/ESIDE.
中文: 本研究提出了一种新颖的合成图像检测方法,通过利用多时间步中间噪声图像特征,在常规样本和挑战性基准上分别实现了98.91%和95.89%的最优检测准确率,同时提供了可解释的AI生成缺陷识别功能。
English: This study introduces a novel synthetic image detection method that leverages intermediate noised image features across multiple timesteps, achieving state-of-the-art detection accuracy of 98.91% on regular samples and 95.89% on challenging benchmarks while providing explainable AI-generated flaw identification.

Authors:YingLiang Ma, Sandra Howell, Aldo Rinaldi, Tarv Dhanjal, Kawal S. Rhode
Title: Attention on the Wires (AttWire): A Foundation Model for Detecting Devices and Catheters in X-ray Fluoroscopic Images
Abstract:
Objective: Interventional devices, catheters and insertable imaging devices such as transesophageal echo (TOE) probes are routinely used in minimally invasive cardiovascular procedures. Detecting their positions and orientations in X-ray fluoroscopic images is important for many clinical applications. Method: In this paper, a novel attention mechanism was designed to guide a convolution neural network (CNN) model to the areas of wires in X-ray images, as nearly all interventional devices and catheters used in cardiovascular procedures contain wires. The attention mechanism includes multi-scale Gaussian derivative filters and a dot-product-based attention layer. By utilizing the proposed attention mechanism, a lightweight foundation model can be created to detect multiple objects simultaneously with higher precision and real-time speed. Results: The proposed model was trained and tested on a total of 12,438 X-ray images. An accuracy of 0.88 was achieved for detecting an echo probe and 0.87 for detecting an artificial valve at 58 FPS. The accuracy was measured by intersection-over-union (IoU). We also achieved a 99.8% success rate in detecting a 10-electrode catheter and a 97.8% success rate in detecting an ablation catheter. Conclusion: Our detection foundation model can simultaneously detect and identify both interventional devices and flexible catheters in real-time X-ray fluoroscopic images. Significance: The proposed model employs a novel attention mechanism to achieve high-performance object detection, making it suitable for various clinical applications and robotic-assisted surgeries. Codes are available at https://github.com/YingLiangMa/AttWire.
中文摘要:本研究提出一种新型注意力机制,通过多尺度高斯导数滤波器实现了对X射线影像中介入器械的实时高精度检测,在58帧/秒速度下对多种心血管器械的检测准确率均超过0.87。
English Summary: A novel attention mechanism using multi-scale Gaussian filters enables real-time, high-precision detection of interventional devices in X-ray images, achieving over 0.87 accuracy for various cardiovascular tools at 58 FPS.

Authors:Shawn Li, Jiashu Qu, Yuxiao Zhou, Yuehan Qin, Tiankai Yang, Yue Zhao
Title: Treble Counterfactual VLMs: A Causal Approach to Hallucination
Abstract:
Vision-Language Models (VLMs) have advanced multi-modal tasks like image captioning, visual question answering, and reasoning. However, they often generate hallucinated outputs inconsistent with the visual context or prompt, limiting reliability in critical applications like autonomous driving and medical imaging. Existing studies link hallucination to statistical biases, language priors, and biased feature learning but lack a structured causal understanding. In this work, we introduce a causal perspective to analyze and mitigate hallucination in VLMs. We hypothesize that hallucination arises from unintended direct influences of either the vision or text modality, bypassing proper multi-modal fusion. To address this, we construct a causal graph for VLMs and employ counterfactual analysis to estimate the Natural Direct Effect (NDE) of vision, text, and their cross-modal interaction on the output. We systematically identify and mitigate these unintended direct effects to ensure that responses are primarily driven by genuine multi-modal fusion. Our approach consists of three steps: (1) designing structural causal graphs to distinguish correct fusion pathways from spurious modality shortcuts, (2) estimating modality-specific and cross-modal NDE using perturbed image representations, hallucinated text embeddings, and degraded visual inputs, and (3) implementing a test-time intervention module to dynamically adjust the model's dependence on each modality. Experimental results demonstrate that our method significantly reduces hallucination while preserving task performance, providing a robust and interpretable framework for improving VLM reliability. To enhance accessibility and reproducibility, our code is publicly available at https://github.com/TREE985/Treble-Counterfactual-VLMs.
中文: 本研究提出一种因果框架,通过反事实估计和动态干预分析并减轻视觉或文本模态的意外直接影响,从而显著减少视觉语言模型的幻觉问题,在保持性能的同时提高其可靠性。
English: This study introduces a causal framework to address hallucination in Vision-Language Models by analyzing and mitigating unintended direct influences of vision or text modalities through counterfactual estimation and dynamic intervention, significantly improving reliability while maintaining performance.

Authors:Shawn Li, Peilin Cai, Yuxiao Zhou, Zhiyu Ni, Renjie Liang, You Qin, Yi Nian, Zhengzhong Tu, Xiyang Hu, Yue Zhao
Title: Secure On-Device Video OOD Detection Without Backpropagation
Abstract:
Out-of-Distribution (OOD) detection is critical for ensuring the reliability of machine learning models in safety-critical applications such as autonomous driving and medical diagnosis. While deploying personalized OOD detection directly on edge devices is desirable, it remains challenging due to large model sizes and the computational infeasibility of on-device training. Federated learning partially addresses this but still requires gradient computation and backpropagation, exceeding the capabilities of many edge devices. To overcome these challenges, we propose SecDOOD, a secure cloud-device collaboration framework for efficient on-device OOD detection without requiring device-side backpropagation. SecDOOD utilizes cloud resources for model training while ensuring user data privacy by retaining sensitive information on-device. Central to SecDOOD is a HyperNetwork-based personalized parameter generation module, which adapts cloud-trained models to device-specific distributions by dynamically generating local weight adjustments, effectively combining central and local information without local fine-tuning. Additionally, our dynamic feature sampling and encryption strategy selectively encrypts only the most informative feature channels, largely reducing encryption overhead without compromising detection performance. Extensive experiments across multiple datasets and OOD scenarios demonstrate that SecDOOD achieves performance comparable to fully fine-tuned models, enabling secure, efficient, and personalized OOD detection on resource-limited edge devices. To enhance accessibility and reproducibility, our code is publicly available at https://github.com/Dystopians/SecDOOD.
中文: SecDOOD是一种安全的云设备协作框架,无需设备端反向传播即可实现高效的分布外检测,在确保数据隐私和降低计算开销的同时,达到了与完全微调模型相当的性能。
English: SecDOOD is a secure cloud-device collaboration framework that enables efficient on-device Out-of-Distribution detection without requiring device-side backpropagation, achieving performance comparable to fully fine-tuned models while ensuring data privacy and reducing computational overhead.

Authors:Zidu Wang, Jiankuo Zhao, Miao Xu, Xiangyu Zhu, Zhen Lei
Title: SRM-Hair: Single Image Head Mesh Reconstruction via 3D Morphable Hair
Abstract:
3D Morphable Models (3DMMs) have played a pivotal role as a fundamental representation or initialization for 3D avatar animation and reconstruction. However, extending 3DMMs to hair remains challenging due to the difficulty of enforcing vertex-level consistent semantic meaning across hair shapes. This paper introduces a novel method, Semantic-consistent Ray Modeling of Hair (SRM-Hair), for making 3D hair morphable and controlled by coefficients. The key contribution lies in semantic-consistent ray modeling, which extracts ordered hair surface vertices and exhibits notable properties such as additivity for hairstyle fusion, adaptability, flipping, and thickness modification. We collect a dataset of over 250 high-fidelity real hair scans paired with 3D face data to serve as a prior for the 3D morphable hair. Based on this, SRM-Hair can reconstruct a hair mesh combined with a 3D head from a single image. Note that SRM-Hair produces an independent hair mesh, facilitating applications in virtual avatar creation, realistic animation, and high-fidelity hair rendering. Both quantitative and qualitative experiments demonstrate that SRM-Hair achieves state-of-the-art performance in 3D mesh reconstruction. Our project is available at https://github.com/wang-zidu/SRM-Hair
中文摘要:本文提出SRM-Hair方法,通过语义一致的光线建模实现三维可变形头发建模,在保持独立网格结构的同时,能够从单张图像实现最先进的头发重建,适用于多种虚拟应用场景。
English Summary: This paper introduces SRM-Hair, a novel method that enables 3D morphable hair modeling through semantic-consistent ray modeling, achieving state-of-the-art hair reconstruction from single images while maintaining independent mesh structures for versatile applications.

Authors:Jian Ma, Qirong Peng, Xu Guo, Chen Chen, Haonan Lu, Zhenyu Yang
Title: X2I: Seamless Integration of Multimodal Understanding into Diffusion Transformer via Attention Distillation
Abstract:
Text-to-image (T2I) models are well known for their ability to produce highly realistic images, while multimodal large language models (MLLMs) are renowned for their proficiency in understanding and integrating multiple modalities. However, currently there is no straightforward and efficient framework to transfer the multimodal comprehension abilities of MLLMs to T2I models to enable them to understand multimodal inputs. In this paper, we propose the X2I framework, which endows Diffusion Transformer (DiT) models with the capability to comprehend various modalities, including multilingual text, screenshot documents, images, videos, and audio. X2I is trained using merely 100K English corpus with 160 GPU hours. Building on the DiT teacher model, we adopt an innovative distillation method to extract the inference capabilities of the teacher model and design a lightweight AlignNet structure to serve as an intermediate bridge. Compared to the teacher model, X2I shows a decrease in performance degradation of less than 1\% while gaining various multimodal understanding abilities, including multilingual to image, image to image, image-text to image, video to image, audio to image, and utilizing creative fusion to enhance imagery. Furthermore, it is applicable for LoRA training in the context of image-text to image generation, filling a void in the industry in this area. We further design a simple LightControl to enhance the fidelity of instructional image editing. Finally, extensive experiments demonstrate the effectiveness, efficiency, multifunctionality, and transferability of our X2I. The open-source code and checkpoints for X2I can be found at the following link: https://github.com/OPPO-Mente-Lab/X2I.
中文: X2I框架将多模态大语言模型的理解能力高效迁移至文本生成图像模型,使其能够处理多种输入并实现创意图像生成,同时保持高性能和多功能性。
English: The X2I framework efficiently transfers multimodal comprehension abilities from MLLMs to T2I models, enabling them to process diverse inputs like text, images, and audio with minimal performance loss and enhanced creative capabilities.

Authors:Xiangxiang Chu, Renda Li, Yong Wang
Title: USP: Unified Self-Supervised Pretraining for Image Generation and Understanding
Abstract:
Recent studies have highlighted the interplay between diffusion models and representation learning. Intermediate representations from diffusion models can be leveraged for downstream visual tasks, while self-supervised vision models can enhance the convergence and generation quality of diffusion models. However, transferring pretrained weights from vision models to diffusion models is challenging due to input mismatches and the use of latent spaces. To address these challenges, we propose Unified Self-supervised Pretraining (USP), a framework that initializes diffusion models via masked latent modeling in a Variational Autoencoder (VAE) latent space. USP achieves comparable performance in understanding tasks while significantly improving the convergence speed and generation quality of diffusion models. Our code will be publicly available at https://github.com/AMAP-ML/USP.
中文摘要:提出的统一自监督预训练(USP)框架通过在VAE潜在空间中进行掩码潜在建模来初始化扩散模型,不仅保持了可比较的理解任务性能,还显著提升了扩散模型的收敛速度与生成质量。
English Summary: The proposed Unified Self-supervised Pretraining (USP) framework overcomes transfer challenges by initializing diffusion models through masked latent modeling in VAE space, achieving competitive understanding performance while significantly boosting diffusion models' convergence speed and generation quality.

Authors:Li weile, Liu Xiao
Title: BlackGoose Rimer: Harnessing RWKV-7 as a Simple yet Superior Replacement for Transformers in Large-Scale Time Series Modeling
Abstract:
Time series models face significant challenges in scaling to handle large and complex datasets, akin to the scaling achieved by large language models (LLMs). The unique characteristics of time series data and the computational demands of model scaling necessitate innovative approaches. While researchers have explored various architectures such as Transformers, LSTMs, and GRUs to address these challenges, we propose a novel solution using RWKV-7, which incorporates meta-learning into its state update mechanism. By integrating RWKV-7's time mix and channel mix components into the transformer-based time series model Timer, we achieve a substantial performance improvement of approximately 1.13 to 43.3x and a 4.5x reduction in training time with 1/23 parameters, all while utilizing fewer parameters. Our code and model weights are publicly available for further research and development at https://github.com/Alic-Li/BlackGoose_Rimer.
中文: 研究者将RWKV-7的元学习架构整合到Timer时序模型中,在减少参数量的同时实现了最高43倍性能提升和4.5倍训练加速。
English: Researchers propose integrating RWKV-7's meta-learning architecture into the Timer model, achieving up to 43x performance gains and 4.5x faster training with significantly fewer parameters.

Authors:Hoang-Thang Ta, Anh Tran
Title: AF-KAN: Activation Function-Based Kolmogorov-Arnold Networks for Efficient Representation Learning
Abstract:
Kolmogorov-Arnold Networks (KANs) have inspired numerous works exploring their applications across a wide range of scientific problems, with the potential to replace Multilayer Perceptrons (MLPs). While many KANs are designed using basis and polynomial functions, such as B-splines, ReLU-KAN utilizes a combination of ReLU functions to mimic the structure of B-splines and take advantage of ReLU's speed. However, ReLU-KAN is not built for multiple inputs, and its limitations stem from ReLU's handling of negative values, which can restrict feature extraction. To address these issues, we introduce Activation Function-Based Kolmogorov-Arnold Networks (AF-KAN), expanding ReLU-KAN with various activations and their function combinations. This novel KAN also incorporates parameter reduction methods, primarily attention mechanisms and data normalization, to enhance performance on image classification datasets. We explore different activation functions, function combinations, grid sizes, and spline orders to validate the effectiveness of AF-KAN and determine its optimal configuration. In the experiments, AF-KAN significantly outperforms MLP, ReLU-KAN, and other KANs with the same parameter count. It also remains competitive even when using fewer than 6 to 10 times the parameters while maintaining the same network structure. However, AF-KAN requires a longer training time and consumes more FLOPs. The repository for this work is available at https://github.com/hoangthangta/All-KAN.
Chinese: AF-KAN提出了一种新型的科尔莫戈罗夫-阿诺德网络,采用多种激活函数和参数精简方法,在图像分类任务中以更少参数显著优于MLP和其他KAN,但需要更长的训练时间。
English: AF-KAN introduces a novel Kolmogorov-Arnold Network that utilizes various activation functions and parameter reduction methods, significantly outperforming MLPs and other KANs in image classification while requiring fewer parameters but longer training times.

Authors:Xianjie Liu, Keren Fu, Qijun Zhao
Title: Patch-Depth Fusion: Dichotomous Image Segmentation via Fine-Grained Patch Strategy and Depth Integrity-Prior
Abstract:
Dichotomous Image Segmentation (DIS) is a high-precision object segmentation task for high-resolution natural images. The current mainstream methods focus on the optimization of local details but overlook the fundamental challenge of modeling the integrity of objects. We have found that the depth integrity-prior implicit in the the pseudo-depth maps generated by Depth Anything Model v2 and the local detail features of image patches can jointly address the above dilemmas. Based on the above findings, we have designed a novel Patch-Depth Fusion Network (PDFNet) for high-precision dichotomous image segmentation. The core of PDFNet consists of three aspects. Firstly, the object perception is enhanced through multi-modal input fusion. By utilizing the patch fine-grained strategy, coupled with patch selection and enhancement, the sensitivity to details is improved. Secondly, by leveraging the depth integrity-prior distributed in the depth maps, we propose an integrity-prior loss to enhance the uniformity of the segmentation results in the depth maps. Finally, we utilize the features of the shared encoder and, through a simple depth refinement decoder, improve the ability of the shared encoder to capture subtle depth-related information in the images. Experiments on the DIS-5K dataset show that PDFNet significantly outperforms state-of-the-art non-diffusion methods. Due to the incorporation of the depth integrity-prior, PDFNet achieves or even surpassing the performance of the latest diffusion-based methods while using less than 11% of the parameters of diffusion-based methods. The source code at https://github.com/Tennine2077/PDFNet
中文: PDFNet提出了一种新颖的块-深度融合网络,通过结合局部图像块细节与深度完整性先验来提升分割精度,在参数量远少于扩散模型的情况下实现了顶尖性能。
English: PDFNet introduces a novel Patch-Depth Fusion Network that enhances segmentation precision by integrating local patch details with depth integrity-priors, achieving state-of-the-art performance with significantly fewer parameters than diffusion-based methods.

Authors:Cheng Hu, Jihao Huang, Wule Mao, Yonghao Fu, Xuemin Chi, Haotong Qin, Nicolas Baumann, Zhitao Liu, Michele Magno, Lei Xie
Title: FSDP: Fast and Safe Data-Driven Overtaking Trajectory Planning for Head-to-Head Autonomous Racing Competitions
Abstract:
Generating overtaking trajectories in autonomous racing is a challenging task, as the trajectory must satisfy the vehicle's dynamics and ensure safety and real-time performance running on resource-constrained hardware. This work proposes the Fast and Safe Data-Driven Planner to address this challenge. Sparse Gaussian predictions are introduced to improve both the computational efficiency and accuracy of opponent predictions. Furthermore, the proposed approach employs a bi-level quadratic programming framework to generate an overtaking trajectory leveraging the opponent predictions. The first level uses polynomial fitting to generate a rough trajectory, from which reference states and control inputs are derived for the second level. The second level formulates a model predictive control optimization problem in the Frenet frame, generating a trajectory that satisfies both kinematic feasibility and safety. Experimental results on the F1TENTH platform show that our method outperforms the State-of-the-Art, achieving an 8.93% higher overtaking success rate, allowing the maximum opponent speed, ensuring a smoother ego trajectory, and reducing 74.04% computational time compared to the Predictive Spliner method. The code is available at: https://github.com/ZJU-DDRX/FSDP.
中文摘要:本研究提出快速安全数据驱动规划器,通过稀疏高斯预测提升对手行为预测效率,并采用双层二次规划框架生成满足动力学约束和安全性的超车轨迹,显著提高成功率和计算性能。
English Summary: This study introduces the Fast and Safe Data-Driven Planner, which enhances autonomous racing overtaking by combining sparse Gaussian predictions for opponent behavior with a bi-level quadratic programming framework to ensure dynamic feasibility, safety, and computational efficiency.

Authors:Xiang Lan, Feng Wu, Kai He, Qinghao Zhao, Shenda Hong, Mengling Feng
Title: GEM: Empowering MLLM for Grounded ECG Understanding with Time Series and Images
Abstract:
While recent multimodal large language models (MLLMs) have advanced automated ECG interpretation, they still face two key limitations: (1) insufficient multimodal synergy between time series signals and visual ECG representations, and (2) limited explainability in linking diagnoses to granular waveform evidence. We introduce GEM, the first MLLM unifying ECG time series, 12-lead ECG images and text for grounded and clinician-aligned ECG interpretation. GEM enables feature-grounded analysis, evidence-driven reasoning, and a clinician-like diagnostic process through three core innovations: a dual-encoder framework extracting complementary time series and image features, cross-modal alignment for effective multimodal understanding, and knowledge-guided instruction generation for generating high-granularity grounding data (ECG-Grounding) linking diagnoses to measurable parameters ($e.g.$, QRS/PR Intervals). Additionally, we propose the Grounded ECG Understanding task, a clinically motivated benchmark designed to comprehensively assess the MLLM's capability in grounded ECG understanding. Experimental results on both existing and our proposed benchmarks show GEM significantly improves predictive performance (CSN $7.4\% \uparrow$), explainability ($22.7\% \uparrow$), and grounding ($24.8\% \uparrow$), making it more suitable for real-world clinical applications. GitHub repository: https://github.com/lanxiang1017/GEM.git
中文: GEM作为首个融合心电时间序列、12导联图像与文本的多模态大模型,通过跨模态特征对齐和知识引导实现了可解释的心电分析,显著提升了诊断性能和临床适用性。
English: GEM is a novel multimodal large language model that integrates ECG time series, images, and text to enhance diagnostic accuracy, explainability, and clinical alignment through cross-modal feature extraction and evidence-driven reasoning.

Authors:Junyan Lin, Haoran Chen, Yue Fan, Yingqi Fan, Xin Jin, Hui Su, Jinlan Fu, Xiaoyu Shen
Title: Multi-Layer Visual Feature Fusion in Multimodal LLMs: Methods, Analysis, and Best Practices
Abstract:
Multimodal Large Language Models (MLLMs) have made significant advancements in recent years, with visual features playing an increasingly critical role in enhancing model performance. However, the integration of multi-layer visual features in MLLMs remains underexplored, particularly with regard to optimal layer selection and fusion strategies. Existing methods often rely on arbitrary design choices, leading to suboptimal outcomes. In this paper, we systematically investigate two core aspects of multi-layer visual feature fusion: (1) selecting the most effective visual layers and (2) identifying the best fusion approach with the language model. Our experiments reveal that while combining visual features from multiple stages improves generalization, incorporating additional features from the same stage typically leads to diminished performance. Furthermore, we find that direct fusion of multi-layer visual features at the input stage consistently yields superior and more stable performance across various configurations. We make all our code publicly available: https://github.com/EIT-NLP/Layer_Select_Fuse_for_MLLM.
Chinese: 本研究系统探索了多模态大语言模型中多层视觉特征的最佳层级选择与融合策略,发现跨阶段特征组合能提升泛化能力,而输入层直接融合则能实现更优且稳定的性能表现。
English: This study systematically explores optimal layer selection and fusion strategies for multi-layer visual features in Multimodal Large Language Models, finding that combining features from different stages enhances generalization while direct input-stage fusion delivers superior performance.

Authors:Wenjie Tang, Yuan Zhou, Erqiang Xu, Keyan Cheng, Minne Li, Liquan Xiao
Title: DSGBench: A Diverse Strategic Game Benchmark for Evaluating LLM-based Agents in Complex Decision-Making Environments
Abstract:
Large Language Model~(LLM) based agents have been increasingly popular in solving complex and dynamic tasks, which requires proper evaluation systems to assess their capabilities. Nevertheless, existing benchmarks usually either focus on single-objective tasks or use overly broad assessing metrics, failing to provide a comprehensive inspection of the actual capabilities of LLM-based agents in complicated decision-making tasks. To address these issues, we introduce DSGBench, a more rigorous evaluation platform for strategic decision-making. Firstly, it incorporates six complex strategic games which serve as ideal testbeds due to their long-term and multi-dimensional decision-making demands and flexibility in customizing tasks of various difficulty levels or multiple targets. Secondly, DSGBench employs a fine-grained evaluation scoring system which examines the decision-making capabilities by looking into the performance in five specific dimensions and offering a comprehensive assessment in a well-designed way. Furthermore, DSGBench also incorporates an automated decision-tracking mechanism which enables in-depth analysis of agent behaviour patterns and the changes in their strategies. We demonstrate the advances of DSGBench by applying it to multiple popular LLM-based agents and our results suggest that DSGBench provides valuable insights in choosing LLM-based agents as well as improving their future development. DSGBench is available at https://github.com/DeciBrain-Group/DSGBench.
中文: DSGBench是一个针对基于大语言模型的智能体推出的严格评估平台,它通过六种复杂策略游戏和细粒度评分系统,全面检验多维度决策能力,并利用自动化追踪机制深入分析行为模式。
English: DSGBench is introduced as a rigorous evaluation platform for LLM-based agents, featuring six complex strategic games and a fine-grained scoring system to comprehensively assess decision-making capabilities across multiple dimensions, with automated tracking for in-depth behavioral analysis.

Authors:Yuheng Li, Yuxiang Lai, Maria Thor, Deborah Marshall, Zachary Buchwald, David S. Yu, Xiaofeng Yang
Title: Towards Universal Text-driven CT Image Segmentation
Abstract:
Computed tomography (CT) is extensively used for accurate visualization and segmentation of organs and lesions. While deep learning models such as convolutional neural networks (CNNs) and vision transformers (ViTs) have significantly improved CT image analysis, their performance often declines when applied to diverse, real-world clinical data. Although foundation models offer a broader and more adaptable solution, their potential is limited due to the challenge of obtaining large-scale, voxel-level annotations for medical images. In response to these challenges, prompting-based models using visual or text prompts have emerged. Visual-prompting methods, such as the Segment Anything Model (SAM), still require significant manual input and can introduce ambiguity when applied to clinical scenarios. Instead, foundation models that use text prompts offer a more versatile and clinically relevant approach. Notably, current text-prompt models, such as the CLIP-Driven Universal Model, are limited to text prompts already encountered during training and struggle to process the complex and diverse scenarios of real-world clinical applications. Instead of fine-tuning models trained from natural imaging, we propose OpenVocabCT, a vision-language model pretrained on large-scale 3D CT images for universal text-driven segmentation. Using the large-scale CT-RATE dataset, we decompose the diagnostic reports into fine-grained, organ-level descriptions using large language models for multi-granular contrastive learning. We evaluate our OpenVocabCT on downstream segmentation tasks across nine public datasets for organ and tumor segmentation, demonstrating the superior performance of our model compared to existing methods. All code, datasets, and models will be publicly released at https://github.com/ricklisz/OpenVocabCT.
Chinese: OpenVocabCT是一种基于大规模3D CT图像预训练的视觉语言模型,通过利用细粒度诊断报告进行多粒度对比学习,在文本驱动的通用分割任务中表现出优于现有方法的性能。
English: OpenVocabCT is a vision-language model pretrained on large-scale 3D CT images that enables universal text-driven segmentation, outperforming existing methods by leveraging fine-grained diagnostic reports for multi-granular contrastive learning.

Authors:Xudong Lu, Haohao Gao, Renshou Wu, Shuai Ren, Xiaoxin Chen, Hongsheng Li, Fangyuan Li
Title: SmartBench: Is Your LLM Truly a Good Chinese Smartphone Assistant?
Abstract:
Large Language Models (LLMs) have become integral to daily life, especially advancing as intelligent assistants through on-device deployment on smartphones. However, existing LLM evaluation benchmarks predominantly focus on objective tasks like mathematics and coding in English, which do not necessarily reflect the practical use cases of on-device LLMs in real-world mobile scenarios, especially for Chinese users. To address these gaps, we introduce SmartBench, the first benchmark designed to evaluate the capabilities of on-device LLMs in Chinese mobile contexts. We analyze functionalities provided by representative smartphone manufacturers and divide them into five categories: text summarization, text Q&A, information extraction, content creation, and notification management, further detailed into 20 specific tasks. For each task, we construct high-quality datasets comprising 50 to 200 question-answer pairs that reflect everyday mobile interactions, and we develop automated evaluation criteria tailored for these tasks. We conduct comprehensive evaluations of on-device LLMs and MLLMs using SmartBench and also assess their performance after quantized deployment on real smartphone NPUs. Our contributions provide a standardized framework for evaluating on-device LLMs in Chinese, promoting further development and optimization in this critical area. Code and data will be available at https://github.com/vivo-ai-lab/SmartBench.
中文: SmartBench是首个针对中文移动场景下设备端大语言模型的评估基准,涵盖五大功能类别并提供自动化评估标准,旨在弥补实际应用场景中的评估空白。
English: SmartBench is the first benchmark designed to evaluate on-device LLMs in Chinese mobile contexts, covering five key functional categories and providing automated evaluation criteria to address the gap in practical usage scenarios.

Authors:Xiaohao Xu, Feng Xue, Xiang Li, Haowei Li, Shusheng Yang, Tianyi Zhang, Matthew Johnson-Roberson, Xiaonan Huang
Title: Towards Ambiguity-Free Spatial Foundation Model: Rethinking and Decoupling Depth Ambiguity
Abstract:
Depth ambiguity is a fundamental challenge in spatial scene understanding, especially in transparent scenes where single-depth estimates fail to capture full 3D structure. Existing models, limited to deterministic predictions, overlook real-world multi-layer depth. To address this, we introduce a paradigm shift from single-prediction to multi-hypothesis spatial foundation models. We first present \texttt{MD-3k}, a benchmark exposing depth biases in expert and foundational models through multi-layer spatial relationship labels and new metrics. To resolve depth ambiguity, we propose Laplacian Visual Prompting (LVP), a training-free spectral prompting technique that extracts hidden depth from pre-trained models via Laplacian-transformed RGB inputs. By integrating LVP-inferred depth with standard RGB-based estimates, our approach elicits multi-layer depth without model retraining. Extensive experiments validate the effectiveness of LVP in zero-shot multi-layer depth estimation, unlocking more robust and comprehensive geometry-conditioned visual generation, 3D-grounded spatial reasoning, and temporally consistent video-level depth inference. Our benchmark and code will be available at https://github.com/Xiaohao-Xu/Ambiguity-in-Space.
中文摘要:本研究通过提出免训练的拉普拉斯视觉提示技术,从预训练模型中提取多层深度信息以解决透明场景中的深度模糊问题,并基于新基准验证了其在三维空间理解中的有效性。
English Summary: This research introduces a training-free Laplacian Visual Prompting technique to address depth ambiguity in transparent scenes by extracting multi-layer depth from pre-trained models, validated through a new benchmark for robust 3D spatial understanding.

Authors:Shan An, Shipeng Dai, Mahrukh Ansari, Yu Liang, Ming Zeng, Konstantinos A. Tsintotas, Changhong Fu, Hong Zhang
Title: ReJSHand: Efficient Real-Time Hand Pose Estimation and Mesh Reconstruction Using Refined Joint and Skeleton Features
Abstract:
Accurate hand pose estimation is vital in robotics, advancing dexterous manipulation in human-computer interaction. Toward this goal, this paper presents ReJSHand (which stands for Refined Joint and Skeleton Features), a cutting-edge network formulated for real-time hand pose estimation and mesh reconstruction. The proposed framework is designed to accurately predict 3D hand gestures under real-time constraints, which is essential for systems that demand agile and responsive hand motion tracking. The network's design prioritizes computational efficiency without compromising accuracy, a prerequisite for instantaneous robotic interactions. Specifically, ReJSHand comprises a 2D keypoint generator, a 3D keypoint generator, an expansion block, and a feature interaction block for meticulously reconstructing 3D hand poses from 2D imagery. In addition, the multi-head self-attention mechanism and a coordinate attention layer enhance feature representation, streamlining the creation of hand mesh vertices through sophisticated feature mapping and linear transformation. Regarding performance, comprehensive evaluations on the FreiHand dataset demonstrate ReJSHand's computational prowess. It achieves a frame rate of 72 frames per second while maintaining a PA-MPJPE (Position-Accurate Mean Per Joint Position Error) of 6.3 mm and a PA-MPVPE (Position-Accurate Mean Per Vertex Position Error) of 6.4 mm. Moreover, our model reaches scores of 0.756 for F@05 and 0.984 for F@15, surpassing modern pipelines and solidifying its position at the forefront of robotic hand pose estimators. To facilitate future studies, we provide our source code at ~\url{https://github.com/daishipeng/ReJSHand}.
中文: 本文提出ReJSHand网络,通过实时三维手势估计和网格重构技术,在FreiHand数据集上以72帧/秒的速度实现高精度追踪,显著提升了机器人交互性能。
English: This paper introduces ReJSHand, an efficient network for real-time 3D hand pose estimation and mesh reconstruction that achieves high accuracy with 72 FPS performance on the FreiHand dataset.

Authors:Beyza Kalkanli, Tales Imbiriba, Stratis Ioannidis, Deniz Erdogmus, Jennifer Dy
Title: Dependency-aware Maximum Likelihood Estimation for Active Learning
Abstract:
Active learning aims to efficiently build a labeled training set by strategically selecting samples to query labels from annotators. In this sequential process, each sample acquisition influences subsequent selections, causing dependencies among samples in the labeled set. However, these dependencies are overlooked during the model parameter estimation stage when updating the model using Maximum Likelihood Estimation (MLE), a conventional method that assumes independent and identically distributed (i.i.d.) data. We propose Dependency-aware MLE (DMLE), which corrects MLE within the active learning framework by addressing sample dependencies typically neglected due to the i.i.d. assumption, ensuring consistency with active learning principles in the model parameter estimation process. This improved method achieves superior performance across multiple benchmark datasets, reaching higher performance in earlier cycles compared to conventional MLE. Specifically, we observe average accuracy improvements of 6%, 8.6%, and 10.5% for k=1, k=5, and k=10 respectively, after collecting the first 100 samples, where entropy is the acquisition function and k is the query batch size acquired at every active learning cycle. Our implementation is publicly available at: https://github.com/neu-spiral/DMLEforAL
主动学习旨在高效选择样本进行标注,但传统最大似然估计忽略了顺序选择样本间的依赖性;我们提出的DMLE方法通过引入依赖性修正这一问题,在多个基准数据集上实现了更早且更高的性能提升。
Active learning efficiently selects samples for labeling, but traditional MLE ignores the dependencies among these sequentially selected samples; our proposed DMLE method corrects this by incorporating dependency awareness, leading to earlier and higher performance gains across multiple datasets.

Authors:Nils Graef, Andrew Wasielewski
Title: Slim attention: cut your context memory in half without loss -- K-cache is all you need for MHA
Abstract:
Slim attention shrinks the context memory size by 2x for transformer models with MHA (multi-head attention), which can speed up inference by up to 2x for large context windows. Slim attention is an exact, mathematically identical implementation of the standard attention mechanism and therefore doesn't compromise model accuracy. In other words, slim attention losslessly compresses the context memory by a factor of 2. For encoder-decoder transformers, the context memory size can be reduced even further: For the Whisper models for example, slim attention reduces the context memory by 8x, which can speed up token generation by 5x for batch size 64 for example. And for the T5-11B model for example, the memory can be reduced by 32x because its MHA projection dimension is larger than the embedding dimension. See https://github.com/OpenMachine-ai/transformer-tricks for code and more transformer tricks, and https://www.youtube.com/watch?v=uVtk3B6YO4Y for this paper's YouTube video.
Chinese: Slim注意力是一种内存高效的多头注意力实现,可将标准Transformer的上下文内存减少2倍,特定模型最多减少32倍,在保持完全准确性的同时显著加速推理。
English: Slim attention is a memory-efficient implementation of multi-head attention that reduces context memory size by up to 2x for standard transformers and up to 32x for certain models while maintaining full accuracy and accelerating inference.

Authors:Yiming Li, Kaiying Yan, Shuo Shao, Tongqing Zhai, Shu-Tao Xia, Zhan Qin, Dacheng Tao
Title: CBW: Towards Dataset Ownership Verification for Speaker Verification via Clustering-based Backdoor Watermarking
Abstract:
With the increasing adoption of deep learning in speaker verification, large-scale speech datasets have become valuable intellectual property. To audit and prevent the unauthorized usage of these valuable released datasets, especially in commercial or open-source scenarios, we propose a novel dataset ownership verification method. Our approach introduces a clustering-based backdoor watermark (CBW), enabling dataset owners to determine whether a suspicious third-party model has been trained on a protected dataset under a black-box setting. The CBW method consists of two key stages: dataset watermarking and ownership verification. During watermarking, we implant multiple trigger patterns in the dataset to make similar samples (measured by their feature similarities) close to the same trigger while dissimilar samples are near different triggers. This ensures that any model trained on the watermarked dataset exhibits specific misclassification behaviors when exposed to trigger-embedded inputs. To verify dataset ownership, we design a hypothesis-test-based framework that statistically evaluates whether a suspicious model exhibits the expected backdoor behavior. We conduct extensive experiments on benchmark datasets, verifying the effectiveness and robustness of our method against potential adaptive attacks. The code for reproducing main experiments is available at https://github.com/Radiant0726/CBW
中文: 本文提出了一种基于聚类的后门水印方法,通过在数据集中植入触发模式并利用假设检验来验证黑盒场景下第三方模型是否未经授权使用了受保护数据集。
English: This paper introduces a clustering-based backdoor watermark (CBW) method for verifying dataset ownership in speaker verification, which implants triggers during dataset watermarking and uses hypothesis testing to detect unauthorized model usage under black-box settings.

Authors:Yubin Kim, Hyewon Jeong, Shan Chen, Shuyue Stella Li, Mingyu Lu, Kumail Alhamoud, Jimin Mun, Cristina Grau, Minseok Jung, Rodrigo Gameiro, Lizhou Fan, Eugene Park, Tristan Lin, Joonsik Yoon, Wonjin Yoon, Maarten Sap, Yulia Tsvetkov, Paul Liang, Xuhai Xu, Xin Liu, Daniel McDuff, Hyeonhoon Lee, Hae Won Park, Samir Tulebaev, Cynthia Breazeal
Title: Medical Hallucinations in Foundation Models and Their Impact on Healthcare
Abstract:
Foundation Models that are capable of processing and generating multi-modal data have transformed AI's role in medicine. However, a key limitation of their reliability is hallucination, where inaccurate or fabricated information can impact clinical decisions and patient safety. We define medical hallucination as any instance in which a model generates misleading medical content. This paper examines the unique characteristics, causes, and implications of medical hallucinations, with a particular focus on how these errors manifest themselves in real-world clinical scenarios. Our contributions include (1) a taxonomy for understanding and addressing medical hallucinations, (2) benchmarking models using medical hallucination dataset and physician-annotated LLM responses to real medical cases, providing direct insight into the clinical impact of hallucinations, and (3) a multi-national clinician survey on their experiences with medical hallucinations. Our results reveal that inference techniques such as Chain-of-Thought (CoT) and Search Augmented Generation can effectively reduce hallucination rates. However, despite these improvements, non-trivial levels of hallucination persist. These findings underscore the ethical and practical imperative for robust detection and mitigation strategies, establishing a foundation for regulatory policies that prioritize patient safety and maintain clinical integrity as AI becomes more integrated into healthcare. The feedback from clinicians highlights the urgent need for not only technical advances but also for clearer ethical and regulatory guidelines to ensure patient safety. A repository organizing the paper resources, summaries, and additional information is available at https://github.com/mitmedialab/medical hallucination.
中文: 医学基础模型因产生误导性医疗内容的幻觉而存在可靠性问题,本研究通过提出分类法、基准测试和临床医生调查,强调需要检测策略和伦理指南以确保患者安全。
English: Foundation models in medicine face reliability issues due to hallucinations, which generate misleading medical content, and this study proposes a taxonomy, benchmarks models, and surveys clinicians to highlight the need for detection strategies and ethical guidelines to ensure patient safety.

Authors:Yihang Wu, Ahmad Chaddad, Christian Desrosiers, Tareef Daqqaq, Reem Kateb
Title: FAA-CLIP: Federated Adversarial Adaptation of CLIP
Abstract:
Despite the remarkable performance of vision language models (VLMs) such as Contrastive Language Image Pre-training (CLIP), the large size of these models is a considerable obstacle to their use in federated learning (FL) systems where the parameters of local client models need to be transferred to a global server for aggregation. Another challenge in FL is the heterogeneity of data from different clients, which affects the generalization performance of the solution. In addition, natural pre-trained VLMs exhibit poor generalization ability in the medical datasets, suggests there exists a domain gap. To solve these issues, we introduce a novel method for the Federated Adversarial Adaptation (FAA) of CLIP. Our method, named FAA-CLIP, handles the large communication costs of CLIP using a light-weight feature adaptation module (FAM) for aggregation, effectively adapting this VLM to each client's data while greatly reducing the number of parameters to transfer. By keeping CLIP frozen and only updating the FAM parameters, our method is also computationally efficient. Unlike existing approaches, our FAA-CLIP method directly addresses the problem of domain shifts across clients via a domain adaptation (DA) module. This module employs a domain classifier to predict if a given sample is from the local client or the global server, allowing the model to learn domain-invariant representations. Extensive experiments on six different datasets containing both natural and medical images demonstrate that FAA-CLIP can generalize well on both natural and medical datasets compared to recent FL approaches. Our codes are available at https://github.com/AIPMLab/FAA-CLIP.
Chinese: FAA-CLIP通过轻量级特征适配模块降低联邦学习中的通信成本,并利用域适配模块处理数据异构性和领域差异,在自然与医学数据集上均实现了优异的泛化性能。
English: FAA-CLIP introduces a lightweight feature adaptation module to reduce communication costs in federated learning and employs a domain adaptation module to handle data heterogeneity and domain shifts, achieving superior generalization on both natural and medical datasets.

Authors:Mst. Fahmida Sultana Naznin, Adnan Ibney Faruq, Mostafa Rifat Tazwar, Md Jobayer, Md. Mehedi Hasan Shawon, Md Rakibul Hasan
Title: CSTRL: Context-Driven Sequential Transfer Learning for Abstractive Radiology Report Summarization
Abstract:
A radiology report comprises several sections, including the Findings and Impression of the diagnosis. Automatically generating the Impression from the Findings is crucial for reducing radiologists' workload and improving diagnostic accuracy. Pretrained models that excel in common abstractive summarization problems encounter challenges when applied to specialized medical domains largely due to the complex terminology and the necessity for accurate clinical context. Such tasks in medical domains demand extracting core information, avoiding context shifts, and maintaining proper flow. Misuse of medical terms can lead to drastic clinical errors. To address these issues, we introduce a sequential transfer learning that ensures key content extraction and coherent summarization. Sequential transfer learning often faces challenges like initial parameter decay and knowledge loss, which we resolve with the Fisher matrix regularization. Using MIMIC-CXR and Open-I datasets, our model, CSTRL - Context-driven Sequential TRansfer Learning - achieved state-of-the-art performance, showing 56.2% improvement in BLEU-1, 40.5% in BLEU-2, 84.3% in BLEU-3, 28.9% in ROUGE-1, 41.0% in ROUGE-2 and 26.5% in ROGUE-3 score over benchmark studies. We also analyze factual consistency scores while preserving the medical context. Our code is publicly available at https://github.com/fahmidahossain/Report_Summarization.
中文: 本研究提出CSTRL模型,通过Fisher矩阵正则化的顺序迁移学习方法解决医学报告摘要中的专业术语和临床语境难题,在MIMIC-CXR和Open-I数据集上实现了最先进的性能,各项评估指标显著提升。
English: This study introduces CSTRL, a context-driven sequential transfer learning model that overcomes challenges in medical report summarization by using Fisher matrix regularization to prevent knowledge loss, achieving state-of-the-art performance on MIMIC-CXR and Open-I datasets with significant improvements in BLEU and ROUGE scores.

Authors:Mst. Fahmida Sultana Naznin, Adnan Ibney Faruq, Mostafa Rifat Tazwar, Md Jobayer, Md. Mehedi Hasan Shawon, Md Rakibul Hasan
Title: CSTRL: Context-Driven Sequential Transfer Learning for Abstractive Radiology Report Summarization
Abstract:
A radiology report comprises several sections, including the Findings and Impression of the diagnosis. Automatically generating the Impression from the Findings is crucial for reducing radiologists' workload and improving diagnostic accuracy. Pretrained models that excel in common abstractive summarization problems encounter challenges when applied to specialized medical domains largely due to the complex terminology and the necessity for accurate clinical context. Such tasks in medical domains demand extracting core information, avoiding context shifts, and maintaining proper flow. Misuse of medical terms can lead to drastic clinical errors. To address these issues, we introduce a sequential transfer learning that ensures key content extraction and coherent summarization. Sequential transfer learning often faces challenges like initial parameter decay and knowledge loss, which we resolve with the Fisher matrix regularization. Using MIMIC-CXR and Open-I datasets, our model, CSTRL - Context-driven Sequential TRansfer Learning - achieved state-of-the-art performance, showing 56.2% improvement in BLEU-1, 40.5% in BLEU-2, 84.3% in BLEU-3, 28.9% in ROUGE-1, 41.0% in ROUGE-2 and 26.5% in ROGUE-3 score over benchmark studies. We also analyze factual consistency scores while preserving the medical context. Our code is publicly available at https://github.com/fahmidahossain/Report_Summarization.
中文: 本研究提出CSTRL模型,通过Fisher矩阵正则化的顺序迁移学习方法解决医学报告摘要中的专业术语和临床语境难题,在MIMIC-CXR和Open-I数据集上实现了最先进的性能,各项评估指标显著提升。
English: This study introduces CSTRL, a context-driven sequential transfer learning model that overcomes challenges in medical report summarization by using Fisher matrix regularization to prevent knowledge loss, achieving state-of-the-art performance on MIMIC-CXR and Open-I datasets with significant improvements in BLEU and ROUGE scores.

Authors:Jillian Fisher, Ruth E. Appel, Chan Young Park, Yujin Potter, Liwei Jiang, Taylor Sorensen, Shangbin Feng, Yulia Tsvetkov, Margaret E. Roberts, Jennifer Pan, Dawn Song, Yejin Choi
Title: Political Neutrality in AI Is Impossible- But Here Is How to Approximate It
Abstract:
AI systems often exhibit political bias, influencing users' opinions and decisions. While political neutrality-defined as the absence of bias-is often seen as an ideal solution for fairness and safety, this position paper argues that true political neutrality is neither feasible nor universally desirable due to its subjective nature and the biases inherent in AI training data, algorithms, and user interactions. However, inspired by Joseph Raz's philosophical insight that "neutrality [...] can be a matter of degree" (Raz, 1986), we argue that striving for some neutrality remains essential for promoting balanced AI interactions and mitigating user manipulation. Therefore, we use the term "approximation" of political neutrality to shift the focus from unattainable absolutes to achievable, practical proxies. We propose eight techniques for approximating neutrality across three levels of conceptualizing AI, examining their trade-offs and implementation strategies. In addition, we explore two concrete applications of these approximations to illustrate their practicality. Finally, we assess our framework on current large language models (LLMs) at the output level, providing a demonstration of how it can be evaluated. This work seeks to advance nuanced discussions of political neutrality in AI and promote the development of responsible, aligned language models.
中文: 本文主张AI的绝对政治中立不可实现,但提出包含八种技术的"近似中立"框架以减少偏见并促进平衡交互,同时展示了该框架在大型语言模型上的应用评估。
English: This paper argues that absolute political neutrality in AI is unattainable but proposes an "approximation" framework with eight techniques to mitigate bias and promote balanced interactions, demonstrating its application on large language models.

Authors:Zebin Xing, Xingyu Zhang, Yang Hu, Bo Jiang, Tong He, Qian Zhang, Xiaoxiao Long, Wei Yin
Title: GoalFlow: Goal-Driven Flow Matching for Multimodal Trajectories Generation in End-to-End Autonomous Driving
Abstract:
We propose GoalFlow, an end-to-end autonomous driving method for generating high-quality multimodal trajectories. In autonomous driving scenarios, there is rarely a single suitable trajectory. Recent methods have increasingly focused on modeling multimodal trajectory distributions. However, they suffer from trajectory selection complexity and reduced trajectory quality due to high trajectory divergence and inconsistencies between guidance and scene information. To address these issues, we introduce GoalFlow, a novel method that effectively constrains the generative process to produce high-quality, multimodal trajectories. To resolve the trajectory divergence problem inherent in diffusion-based methods, GoalFlow constrains the generated trajectories by introducing a goal point. GoalFlow establishes a novel scoring mechanism that selects the most appropriate goal point from the candidate points based on scene information. Furthermore, GoalFlow employs an efficient generative method, Flow Matching, to generate multimodal trajectories, and incorporates a refined scoring mechanism to select the optimal trajectory from the candidates. Our experimental results, validated on the Navsim\cite{Dauner2024_navsim}, demonstrate that GoalFlow achieves state-of-the-art performance, delivering robust multimodal trajectories for autonomous driving. GoalFlow achieved PDMS of 90.3, significantly surpassing other methods. Compared with other diffusion-policy-based methods, our approach requires only a single denoising step to obtain excellent performance. The code is available at https://github.com/YvanYin/GoalFlow.
Chinese: GoalFlow是一种端到端的自动驾驶方法,通过引入目标点约束和新颖的评分机制生成高质量多模态轨迹,仅需单步去噪即可实现最优性能。
English: GoalFlow is an end-to-end autonomous driving method that generates high-quality multimodal trajectories by introducing goal point constraints and a novel scoring mechanism, achieving state-of-the-art performance with only a single denoising step.

Authors:Zebin Xing, Xingyu Zhang, Yang Hu, Bo Jiang, Tong He, Qian Zhang, Xiaoxiao Long, Wei Yin
Title: GoalFlow: Goal-Driven Flow Matching for Multimodal Trajectories Generation in End-to-End Autonomous Driving
Abstract:
We propose GoalFlow, an end-to-end autonomous driving method for generating high-quality multimodal trajectories. In autonomous driving scenarios, there is rarely a single suitable trajectory. Recent methods have increasingly focused on modeling multimodal trajectory distributions. However, they suffer from trajectory selection complexity and reduced trajectory quality due to high trajectory divergence and inconsistencies between guidance and scene information. To address these issues, we introduce GoalFlow, a novel method that effectively constrains the generative process to produce high-quality, multimodal trajectories. To resolve the trajectory divergence problem inherent in diffusion-based methods, GoalFlow constrains the generated trajectories by introducing a goal point. GoalFlow establishes a novel scoring mechanism that selects the most appropriate goal point from the candidate points based on scene information. Furthermore, GoalFlow employs an efficient generative method, Flow Matching, to generate multimodal trajectories, and incorporates a refined scoring mechanism to select the optimal trajectory from the candidates. Our experimental results, validated on the Navsim\cite{Dauner2024_navsim}, demonstrate that GoalFlow achieves state-of-the-art performance, delivering robust multimodal trajectories for autonomous driving. GoalFlow achieved PDMS of 90.3, significantly surpassing other methods. Compared with other diffusion-policy-based methods, our approach requires only a single denoising step to obtain excellent performance. The code is available at https://github.com/YvanYin/GoalFlow.
Chinese: GoalFlow是一种端到端的自动驾驶方法,通过引入目标点约束和新颖的评分机制生成高质量多模态轨迹,仅需单步去噪即可实现最优性能。
English: GoalFlow is an end-to-end autonomous driving method that generates high-quality multimodal trajectories by introducing goal point constraints and a novel scoring mechanism, achieving state-of-the-art performance with only a single denoising step.

Authors:Zhenxuan Zhang, Hongjie Wu, Jiahao Huang, Baihong Xie, Zhifan Gao, Junxian Du, Pete Lally, Guang Yang
Title: Task-oriented Uncertainty Collaborative Learning for Label-Efficient Brain Tumor Segmentation
Abstract:
Multi-contrast magnetic resonance imaging (MRI) plays a vital role in brain tumor segmentation and diagnosis by leveraging complementary information from different contrasts. Each contrast highlights specific tumor characteristics, enabling a comprehensive understanding of tumor morphology, edema, and pathological heterogeneity. However, existing methods still face the challenges of multi-level specificity perception across different contrasts, especially with limited annotations. These challenges include data heterogeneity, granularity differences, and interference from redundant information. To address these limitations, we propose a Task-oriented Uncertainty Collaborative Learning (TUCL) framework for multi-contrast MRI segmentation. TUCL introduces a task-oriented prompt attention (TPA) module with intra-prompt and cross-prompt attention mechanisms to dynamically model feature interactions across contrasts and tasks. Additionally, a cyclic process is designed to map the predictions back to the prompt to ensure that the prompts are effectively utilized. In the decoding stage, the TUCL framework proposes a dual-path uncertainty refinement (DUR) strategy which ensures robust segmentation by refining predictions iteratively. Extensive experimental results on limited labeled data demonstrate that TUCL significantly improves segmentation accuracy (88.2\% in Dice and 10.853 mm in HD95). It shows that TUCL has the potential to extract multi-contrast information and reduce the reliance on extensive annotations. The code is available at: https://github.com/Zhenxuan-Zhang/TUCL_BrainSeg.
中文: 提出的任务导向不确定性协同学习(TUCL)框架通过动态建模特征交互和迭代优化预测,有效提升了多对比度MRI脑肿瘤分割的精度,并降低了对大量标注数据的依赖。
English: The proposed Task-oriented Uncertainty Collaborative Learning (TUCL) framework enhances multi-contrast MRI brain tumor segmentation by dynamically modeling feature interactions and refining predictions, achieving higher accuracy with limited annotations.

Authors:Zhongyi Shui, Ruizhe Guo, Honglin Li, Yuxuan Sun, Yunlong Zhang, Chenglu Zhu, Jiatong Cai, Pingyi Chen, Yanzhou Su, Lin Yang
Title: Towards Effective and Efficient Context-aware Nucleus Detection in Histopathology Whole Slide Images
Abstract:
Nucleus detection in histopathology whole slide images (WSIs) is crucial for a broad spectrum of clinical applications. Current approaches for nucleus detection in gigapixel WSIs utilize a sliding window methodology, which overlooks boarder contextual information (eg, tissue structure) and easily leads to inaccurate predictions. To address this problem, recent studies additionally crops a large Filed-of-View (FoV) region around each sliding window to extract contextual features. However, such methods substantially increases the inference latency. In this paper, we propose an effective and efficient context-aware nucleus detection algorithm. Specifically, instead of leveraging large FoV regions, we aggregate contextual clues from off-the-shelf features of historically visited sliding windows. This design greatly reduces computational overhead. Moreover, compared to large FoV regions at a low magnification, the sliding window patches have higher magnification and provide finer-grained tissue details, thereby enhancing the detection accuracy. To further improve the efficiency, we propose a grid pooling technique to compress dense feature maps of each patch into a few contextual tokens. Finally, we craft OCELOT-seg, the first benchmark dedicated to context-aware nucleus instance segmentation. Code, dataset, and model checkpoints will be available at https://github.com/windygoo/PathContext.
Chinese: 本文提出了一种高效的上下文感知细胞核检测算法,通过聚合历史访问滑动窗口的现成特征来整合上下文线索,不仅降低了计算开销,还利用高倍率切片和网格池化技术提升了检测精度。
English: The paper introduces an efficient context-aware nucleus detection algorithm that aggregates contextual clues from previously visited sliding windows, reducing computational overhead and improving accuracy by utilizing higher magnification patches and a grid pooling technique.

Authors:Zengqun Zhao, Ziquan Liu, Yu Cao, Shaogang Gong, Ioannis Patras
Title: AIM-Fair: Advancing Algorithmic Fairness via Selectively Fine-Tuning Biased Models with Contextual Synthetic Data
Abstract:
Recent advances in generative models have sparked research on improving model fairness with AI-generated data. However, existing methods often face limitations in the diversity and quality of synthetic data, leading to compromised fairness and overall model accuracy. Moreover, many approaches rely on the availability of demographic group labels, which are often costly to annotate. This paper proposes AIM-Fair, aiming to overcome these limitations and harness the potential of cutting-edge generative models in promoting algorithmic fairness. We investigate a fine-tuning paradigm starting from a biased model initially trained on real-world data without demographic annotations. This model is then fine-tuned using unbiased synthetic data generated by a state-of-the-art diffusion model to improve its fairness. Two key challenges are identified in this fine-tuning paradigm, 1) the low quality of synthetic data, which can still happen even with advanced generative models, and 2) the domain and bias gap between real and synthetic data. To address the limitation of synthetic data quality, we propose Contextual Synthetic Data Generation (CSDG) to generate data using a text-to-image diffusion model (T2I) with prompts generated by a context-aware LLM, ensuring both data diversity and control of bias in synthetic data. To resolve domain and bias shifts, we introduce a novel selective fine-tuning scheme in which only model parameters more sensitive to bias and less sensitive to domain shift are updated. Experiments on CelebA and UTKFace datasets show that our AIM-Fair improves model fairness while maintaining utility, outperforming both fully and partially fine-tuned approaches to model fairness.
中文: 本文提出AIM-Fair方法,通过使用上下文感知提示生成的高质量合成数据对偏置模型进行选择性参数微调,在保持模型效用的同时显著提升了算法公平性。
English: This paper introduces AIM-Fair, a method that enhances algorithmic fairness by fine-tuning biased models with high-quality synthetic data generated via context-aware prompts and a selective parameter update scheme, effectively improving fairness without compromising utility.

Authors:Yu Zhang, Shutong Qiao, Jiaqi Zhang, Tzu-Heng Lin, Chen Gao, Yong Li
Title: A Survey of Large Language Model Empowered Agents for Recommendation and Search: Towards Next-Generation Information Retrieval
Abstract:
Information technology has profoundly altered the way humans interact with information. The vast amount of content created, shared, and disseminated online has made it increasingly difficult to access relevant information. Over the past two decades, recommender systems and search (collectively referred to as information retrieval systems) have evolved significantly to address these challenges. Recent advances in large language models (LLMs) have demonstrated capabilities that surpass human performance in various language-related tasks and exhibit general understanding, reasoning, and decision-making abilities. This paper explores the transformative potential of LLM agents in enhancing recommender and search systems. We discuss the motivations and roles of LLM agents, and establish a classification framework to elaborate on the existing research. We highlight the immense potential of LLM agents in addressing current challenges in recommendation and search, providing insights into future research directions. This paper is the first to systematically review and classify the research on LLM agents in these domains, offering a novel perspective on leveraging this advanced AI technology for information retrieval. To help understand the existing works, we list the existing papers on LLM agent based recommendation and search at this link: https://github.com/tsinghua-fib-lab/LLM-Agent-for-Recommendation-and-Search.
中文: 本文探讨了大语言模型智能体如何通过解决信息过载问题来革新推荐与搜索系统,建立了分类框架并指明了未来研究方向。
English: This paper explores how large language model agents can revolutionize recommender and search systems by addressing information overload challenges, establishing a classification framework and highlighting future research directions.

Authors:Hiroki Tomioka, Katsuma Inoue, Yasuo Kuniyoshi, Kohei Nakajima
Title: Backpropagation through Soft Body: Investigating Information Processing in Brain-Body Coupling Systems
Abstract:
Animals achieve sophisticated behavioral control through dynamic coupling of the brain, body, and environment. Accordingly, the co-design approach, in which both the controllers and the physical properties are optimized simultaneously, has been suggested for generating refined agents without designing each component separately. In this study, we aim to reveal how the function of the information processing is distributed between brains and bodies while applying the co-design approach. Using a framework called ``backpropagation through soft body," we developed agents to perform specified tasks and analyzed their mechanisms. The tasks included classification and corresponding behavioral association, nonlinear dynamical system emulation, and autonomous behavioral generation. In each case, our analyses revealed reciprocal relationships between the brains and bodies. In addition, we show that optimized brain functionalities can be embedded into bodies using physical reservoir computing techniques. Our results pave the way for efficient designs of brain--body coupling systems.
中文:动物通过大脑、身体和环境的动态耦合实现复杂行为控制,而协同设计方法同时优化控制器与物理特性以生成精细化智能体,揭示了大脑与身体间的互惠关系,为高效设计脑体耦合系统开辟了新途径。
English: Animals achieve sophisticated behaviors through dynamic brain-body-environment interactions, and the co-design approach simultaneously optimizes both controllers and physical properties to create refined agents, revealing reciprocal brain-body relationships and enabling efficient brain-body coupling system designs.

Authors:Mufan Liu, Qi Yang, Miaoran Zhao, He Huang, Le Yang, Zhu Li, Yiling Xu
Title: D2GV: Deformable 2D Gaussian Splatting for Video Representation in 400FPS
Abstract:
Implicit Neural Representations (INRs) have emerged as a powerful approach for video representation, offering versatility across tasks such as compression and inpainting. However, their implicit formulation limits both interpretability and efficacy, undermining their practicality as a comprehensive solution. We propose a novel video representation based on deformable 2D Gaussian splatting, dubbed D2GV, which aims to achieve three key objectives: 1) improved efficiency while delivering superior quality; 2) enhanced scalability and interpretability; and 3) increased friendliness for downstream tasks. Specifically, we initially divide the video sequence into fixed-length Groups of Pictures (GoP) to allow parallel training and linear scalability with video length. For each GoP, D2GV represents video frames by applying differentiable rasterization to 2D Gaussians, which are deformed from a canonical space into their corresponding timestamps. Notably, leveraging efficient CUDA-based rasterization, D2GV converges fast and decodes at speeds exceeding 400 FPS, while delivering quality that matches or surpasses state-of-the-art INRs. Moreover, we incorporate a learnable pruning and quantization strategy to streamline D2GV into a more compact representation. We demonstrate D2GV's versatility in tasks including video interpolation, inpainting and denoising, underscoring its potential as a promising solution for video representation. Code is available at: https://github.com/Evan-sudo/D2GV.
中文: 提出的D2GV视频表示方法通过可变形二维高斯溅射克服了隐式神经表示的局限性,在多种视频任务中实现了超过400 FPS的高效处理、优越的可扩展性,同时保持了卓越的质量表现。
English: The proposed D2GV video representation method overcomes the limitations of Implicit Neural Representations by using deformable 2D Gaussian splatting, achieving superior efficiency, scalability, and performance exceeding 400 FPS while maintaining high quality across various video tasks.

Authors:Prashant K. Jha
Title: From Theory to Application: A Practical Introduction to Neural Operators in Scientific Computing
Abstract:
This focused review explores a range of neural operator architectures for approximating solutions to parametric partial differential equations (PDEs), emphasizing high-level concepts and practical implementation strategies. The study covers foundational models such as Deep Operator Networks (DeepONet), Principal Component Analysis-based Neural Networks (PCANet), and Fourier Neural Operators (FNO), providing comparative insights into their core methodologies and performance. These architectures are demonstrated on two classical linear parametric PDEs: the Poisson equation and linear elastic deformation. Beyond forward problem-solving, the review delves into applying neural operators as surrogates in Bayesian inference problems, showcasing their effectiveness in accelerating posterior inference while maintaining accuracy. The paper concludes by discussing current challenges, particularly in controlling prediction accuracy and generalization. It outlines emerging strategies to address these issues, such as residual-based error correction and multi-level training. This review can be seen as a comprehensive guide to implementing neural operators and integrating them into scientific computing workflows.
中文: 本综述探讨了用于求解参数偏微分方程的神经算子架构,比较了DeepONet和FNO等方法在经典方程及贝叶斯推断中的应用,并针对精度和泛化挑战提出了残差修正等多层次解决策略。
English: This review examines neural operator architectures for solving parametric PDEs, comparing methods like DeepONet and FNO on classical equations and their use in Bayesian inference, while addressing challenges in accuracy and generalization with emerging strategies.

Authors:Shiping Yang, Jie Wu, Wenbiao Ding, Ning Wu, Shining Liang, Ming Gong, Hengyuan Zhang, Dongmei Zhang
Title: Quantifying the Robustness of Retrieval-Augmented Language Models Against Spurious Features in Grounding Data
Abstract:
Robustness has become a critical attribute for the deployment of RAG systems in real-world applications. Existing research focuses on robustness to explicit noise (e.g., document semantics) but overlooks spurious features (a.k.a. implicit noise). While previous works have explored spurious features in LLMs, they are limited to specific features (e.g., formats) and narrow scenarios (e.g., ICL). In this work, we statistically confirm the presence of spurious features in the RAG paradigm, a robustness problem caused by the sensitivity of LLMs to semantic-agnostic features. Moreover, we provide a comprehensive taxonomy of spurious features and empirically quantify their impact through controlled experiments. Further analysis reveals that not all spurious features are harmful and they can even be beneficial sometimes. Extensive evaluation results across multiple LLMs suggest that spurious features are a widespread and challenging problem in the field of RAG. The code and dataset will be released to facilitate future research. We release all codes and data at: $\\\href{https://github.com/maybenotime/RAG-SpuriousFeatures}{https://github.com/maybenotime/RAG-SpuriousFeatures}$.
中文: 本研究通过系统分类和实证评估,揭示了RAG系统中普遍存在的虚假特征问题,并发现这些特征具有既可能有害也可能有益的双重性质。
English: This study identifies spurious features as a widespread robustness issue in RAG systems, revealing their dual nature of being both harmful and beneficial through comprehensive taxonomy and empirical evaluation.

Authors:Libo Zhu, Haotong Qin, Kaicheng Yang, Wenbo Li, Yong Guo, Yulun Zhang, Susanto Rahardja, Xiaokang Yang
Title: QArtSR: Quantization via Reverse-Module and Timestep-Retraining in One-Step Diffusion based Image Super-Resolution
Abstract:
One-step diffusion-based image super-resolution (OSDSR) models are showing increasingly superior performance nowadays. However, although their denoising steps are reduced to one and they can be quantized to 8-bit to reduce the costs further, there is still significant potential for OSDSR to quantize to lower bits. To explore more possibilities of quantized OSDSR, we propose an efficient method, Quantization via reverse-module and timestep-retraining for OSDSR, named QArtSR. Firstly, we investigate the influence of timestep value on the performance of quantized models. Then, we propose Timestep Retraining Quantization (TRQ) and Reversed Per-module Quantization (RPQ) strategies to calibrate the quantized model. Meanwhile, we adopt the module and image losses to update all quantized modules. We only update the parameters in quantization finetuning components, excluding the original weights. To ensure that all modules are fully finetuned, we add extended end-to-end training after per-module stage. Our 4-bit and 2-bit quantization experimental results indicate that QArtSR obtains superior effects against the recent leading comparison methods. The performance of 4-bit QArtSR is close to the full-precision one. Our code will be released at https://github.com/libozhu03/QArtSR.
中文: 提出的QArtSR方法通过时间步重训练和反向逐模块量化优化了一步扩散图像超分辨率模型,在4位和2位量化中实现了接近全精度的卓越性能。
English: The proposed QArtSR method enhances one-step diffusion-based image super-resolution (OSDSR) models through timestep retraining and reversed per-module quantization, achieving superior 4-bit and 2-bit performance that rivals full-precision models.

Authors:Jian Liu, Wei Sun, Kai Zeng, Jin Zheng, Hui Yang, Hossein Rahmani, Ajmal Mian, Lin Wang
Title: Novel Object 6D Pose Estimation with a Single Reference View
Abstract:
Existing novel object 6D pose estimation methods typically rely on CAD models or dense reference views, which are both difficult to acquire. Using only a single reference view is more scalable, but challenging due to large pose discrepancies and limited geometric and spatial information. To address these issues, we propose a Single-Reference-based novel object 6D (SinRef-6D) pose estimation method. Our key idea is to iteratively establish point-wise alignment in a common coordinate system based on state space models (SSMs). Specifically, iterative object-space point-wise alignment can effectively handle large pose discrepancies, while our proposed RGB and Points SSMs can capture long-range dependencies and spatial information from a single view, offering linear complexity and superior spatial modeling capability. Once pre-trained on synthetic data, SinRef-6D can estimate the 6D pose of a novel object using only a single reference view, without requiring retraining or a CAD model. Extensive experiments on six popular datasets and real-world robotic scenes demonstrate that we achieve on-par performance with CAD-based and dense reference view-based methods, despite operating in the more challenging single reference setting. Code will be released at https://github.com/CNJianLiu/SinRef-6D.
中文: 提出的SinRef-6D方法通过基于状态空间模型的迭代点对点对齐,仅需单张参考视图即可实现新物体的6D姿态估计,在无需重新训练或CAD模型的情况下达到了与基于CAD方法相当的性能。
English: The proposed SinRef-6D method enables novel object 6D pose estimation using only a single reference view through iterative point-wise alignment with state space models, achieving performance comparable to CAD-based methods without requiring retraining or CAD models.

Authors:Jian Liu, Wei Sun, Kai Zeng, Jin Zheng, Hui Yang, Hossein Rahmani, Ajmal Mian, Lin Wang
Title: Novel Object 6D Pose Estimation with a Single Reference View
Abstract:
Existing novel object 6D pose estimation methods typically rely on CAD models or dense reference views, which are both difficult to acquire. Using only a single reference view is more scalable, but challenging due to large pose discrepancies and limited geometric and spatial information. To address these issues, we propose a Single-Reference-based novel object 6D (SinRef-6D) pose estimation method. Our key idea is to iteratively establish point-wise alignment in a common coordinate system based on state space models (SSMs). Specifically, iterative object-space point-wise alignment can effectively handle large pose discrepancies, while our proposed RGB and Points SSMs can capture long-range dependencies and spatial information from a single view, offering linear complexity and superior spatial modeling capability. Once pre-trained on synthetic data, SinRef-6D can estimate the 6D pose of a novel object using only a single reference view, without requiring retraining or a CAD model. Extensive experiments on six popular datasets and real-world robotic scenes demonstrate that we achieve on-par performance with CAD-based and dense reference view-based methods, despite operating in the more challenging single reference setting. Code will be released at https://github.com/CNJianLiu/SinRef-6D.
中文: 提出的SinRef-6D方法通过基于状态空间模型的迭代点对点对齐,仅需单张参考视图即可实现新物体的6D姿态估计,在无需重新训练或CAD模型的情况下达到了与基于CAD方法相当的性能。
English: The proposed SinRef-6D method enables novel object 6D pose estimation using only a single reference view through iterative point-wise alignment with state space models, achieving performance comparable to CAD-based methods without requiring retraining or CAD models.

Authors:Xiaobei Zhao, Xiangrong Zeng, Yihang Ma, Pengjin Tang, Xiang Li
Title: TomatoScanner: phenotyping tomato fruit based on only RGB image
Abstract:
In tomato greenhouse, phenotypic measurement is meaningful for researchers and farmers to monitor crop growth, thereby precisely control environmental conditions in time, leading to better quality and higher yield. Traditional phenotyping mainly relies on manual measurement, which is accurate but inefficient, more importantly, endangering the health and safety of people. Several studies have explored computer vision-based methods to replace manual phenotyping. However, the 2D-based need extra calibration, or cause destruction to fruit, or can only measure limited and meaningless traits. The 3D-based need extra depth camera, which is expensive and unacceptable for most farmers. In this paper, we propose a non-contact tomato fruit phenotyping method, titled TomatoScanner, where RGB image is all you need for input. First, pixel feature is extracted by instance segmentation of our proposed EdgeYOLO with preprocessing of individual separation and pose correction. Second, depth feature is extracted by depth estimation of Depth Pro. Third, pixel and depth feature are fused to output phenotype results in reality. We establish self-built Tomato Phenotype Dataset to test TomatoScanner, which achieves excellent phenotyping on width, height, vertical area and volume, with median relative error of 5.63%, 7.03%, -0.64% and 37.06%, respectively. We propose and add three innovative modules - EdgeAttention, EdgeLoss and EdgeBoost - into EdgeYOLO, to enhance the segmentation accuracy on edge portion. Precision and mean Edge Error greatly improve from 0.943 and 5.641% to 0.986 and 2.963%, respectively. Meanwhile, EdgeYOLO keeps lightweight and efficient, with 48.7 M weights size and 76.34 FPS. Codes and datasets: https://github.com/AlexTraveling/TomatoScanner.
中文: 本文提出的TomatoScanner方法仅需RGB图像,通过实例分割与深度估计实现非接触式番茄表型测量,在保持轻量化的同时显著提升了边缘分割精度,且无需昂贵设备。
English: This paper introduces TomatoScanner, a non-contact method using only RGB images to accurately measure tomato fruit phenotypes through instance segmentation and depth estimation, achieving low error rates without expensive equipment.

Authors:Raphael Trumpp, Ansgar Schäfftlein, Mirco Theile, Marco Caccamo
Title: Impoola: The Power of Average Pooling for Image-Based Deep Reinforcement Learning
Abstract:
As image-based deep reinforcement learning tackles more challenging tasks, increasing model size has become an important factor in improving performance. Recent studies achieved this by focusing on the parameter efficiency of scaled networks, typically using Impala-CNN, a 15-layer ResNet-inspired network, as the image encoder. However, while Impala-CNN evidently outperforms older CNN architectures, potential advancements in network design for deep reinforcement learning-specific image encoders remain largely unexplored. We find that replacing the flattening of output feature maps in Impala-CNN with global average pooling leads to a notable performance improvement. This approach outperforms larger and more complex models in the Procgen Benchmark, particularly in terms of generalization. We call our proposed encoder model Impoola-CNN. A decrease in the network's translation sensitivity may be central to this improvement, as we observe the most significant gains in games without agent-centered observations. Our results demonstrate that network scaling is not just about increasing model size - efficient network design is also an essential factor. We make our code available at https://github.com/raphajaner/impoola.
中文摘要:该研究提出Impoola-CNN这一改进的深度强化学习图像编码器,通过在Impala-CNN中用全局平均池化替代展平操作,以更高效的网络设计而非单纯扩大模型规模,在Procgen基准测试中实现了更优的性能和泛化能力。
English Summary: The study introduces Impoola-CNN, an improved image encoder for deep reinforcement learning that replaces flattening with global average pooling in Impala-CNN, achieving superior performance and generalization on the Procgen Benchmark through more efficient network design rather than simply increasing model size.

Authors:Juan Miguel Valverde, Maja Østergaard, Adrian Rodriguez-Palomo, Peter Alling Strange Vibe, Nina Kølln Wittig, Henrik Birkedal, Anders Bjorholm Dahl
Title: Disconnect to Connect: A Data Augmentation Method for Improving Topology Accuracy in Image Segmentation
Abstract:
Accurate segmentation of thin, tubular structures (e.g., blood vessels) is challenging for deep neural networks. These networks classify individual pixels, and even minor misclassifications can break the thin connections within these structures. Existing methods for improving topology accuracy, such as topology loss functions, rely on very precise, topologically-accurate training labels, which are difficult to obtain. This is because annotating images, especially 3D images, is extremely laborious and time-consuming. Low image resolution and contrast further complicates the annotation by causing tubular structures to appear disconnected. We present CoLeTra, a data augmentation strategy that integrates to the models the prior knowledge that structures that appear broken are actually connected. This is achieved by creating images with the appearance of disconnected structures while maintaining the original labels. Our extensive experiments, involving different architectures, loss functions, and datasets, demonstrate that CoLeTra leads to segmentations topologically more accurate while often improving the Dice coefficient and Hausdorff distance. CoLeTra's hyper-parameters are intuitive to tune, and our sensitivity analysis shows that CoLeTra is robust to changes in these hyper-parameters. We also release a dataset specifically suited for image segmentation methods with a focus on topology accuracy. CoLetra's code can be found at https://github.com/jmlipman/CoLeTra.
中文: CoLeTra是一种数据增强策略,通过使用看似断开但实际保持连接标签的图像进行训练,提高了深度学习网络对管状结构分割的拓扑准确性,无需精确标注即可获得更优的分割效果。
English: CoLeTra is a data augmentation technique that enhances the topological accuracy of tubular structure segmentation in deep neural networks by training with intentionally disconnected-looking images while preserving correct connectivity labels, achieving better results without requiring precise annotations.

Authors:Haotian Hu, Jingwei Xu, Fanyi Wang, Toyota Li, Yaonong Wang, Laifeng Hu, Zhiwang Zhang
Title: FastMap: Fast Queries Initialization Based Vectorized HD Map Reconstruction Framework
Abstract:
Reconstruction of high-definition maps is a crucial task in perceiving the autonomous driving environment, as its accuracy directly impacts the reliability of prediction and planning capabilities in downstream modules. Current vectorized map reconstruction methods based on the DETR framework encounter limitations due to the redundancy in the decoder structure, necessitating the stacking of six decoder layers to maintain performance, which significantly hampers computational efficiency. To tackle this issue, we introduce FastMap, an innovative framework designed to reduce decoder redundancy in existing approaches. FastMap optimizes the decoder architecture by employing a single-layer, two-stage transformer that achieves multilevel representation capabilities. Our framework eliminates the conventional practice of randomly initializing queries and instead incorporates a heatmap-guided query generation module during the decoding phase, which effectively maps image features into structured query vectors using learnable positional encoding. Additionally, we propose a geometry-constrained point-to-line loss mechanism for FastMap, which adeptly addresses the challenge of distinguishing highly homogeneous features that often arise in traditional point-to-point loss computations. Extensive experiments demonstrate that FastMap achieves state-of-the-art performance in both nuScenes and Argoverse2 datasets, with its decoder operating 3.2 faster than the baseline. Code and more demos are available at https://github.com/hht1996ok/FastMap.
Chinese: FastMap提出了一种创新框架,通过热图引导的查询生成模块和几何约束损失机制,有效减少矢量地图重建中的解码器冗余,在保持顶尖性能的同时将解码速度提升了3.2倍。
English: FastMap introduces a novel framework that reduces decoder redundancy in vectorized map reconstruction by employing a heatmap-guided query generation module and a geometry-constrained loss, achieving state-of-the-art performance with a 3.2 times faster decoder.

Authors:Weigao Sun, Disen Lan, Tong Zhu, Xiaoye Qu, Yu Cheng
Title: Linear-MoE: Linear Sequence Modeling Meets Mixture-of-Experts
Abstract:
Linear Sequence Modeling (LSM) like linear attention, state space models and linear RNNs, and Mixture-of-Experts (MoE) have recently emerged as significant architectural improvements. In this paper, we introduce Linear-MoE, a production-level system for modeling and training large-scale models that integrate LSM with MoE. Linear-MoE leverages the advantages of both LSM modules for linear-complexity sequence modeling and MoE layers for sparsely activation, aiming to offer high performance with efficient training. The Linear-MoE system comprises: 1) Modeling subsystem, which provides a unified framework supporting all instances of LSM. and 2) Training subsystem, which facilitates efficient training by incorporating various advanced parallelism technologies, particularly Sequence Parallelism designed for Linear-MoE models. Additionally, we explore hybrid models that combine Linear-MoE layers with standard Transformer-MoE layers with its Sequence Parallelism to further enhance model flexibility and performance. Evaluations on two model series, A0.3B-2B and A1B-7B, demonstrate Linear-MoE achieves efficiency gains while maintaining competitive performance on various benchmarks, showcasing its potential as a next-generation foundational model architecture. Code: https://github.com/OpenSparseLLMs/Linear-MoE.
中文: 本文提出Linear-MoE系统,通过整合线性序列建模的线性复杂度优势与混合专家的稀疏激活特性,在多种基准测试中实现了高效训练与优异性能。
English: This paper introduces Linear-MoE, a production-level system that integrates Linear Sequence Modeling for linear-complexity sequence processing and Mixture-of-Experts for sparse activation, achieving efficient training and competitive performance across various benchmarks.

Authors:Run He, Di Fang, Yicheng Xu, Yawen Cui, Ming Li, Cen Chen, Ziqian Zeng, Huiping Zhuang
Title: Semantic Shift Estimation via Dual-Projection and Classifier Reconstruction for Exemplar-Free Class-Incremental Learning
Abstract:
Exemplar-Free Class-Incremental Learning (EFCIL) aims to sequentially learn from distinct categories without retaining exemplars but easily suffers from catastrophic forgetting of learned knowledge. While existing EFCIL methods leverage knowledge distillation to alleviate forgetting, they still face two critical challenges: semantic shift and decision bias. Specifically, the embeddings of old tasks shift in the embedding space after learning new tasks, and the classifier becomes biased towards new tasks due to training solely with new data, hindering the balance between old and new knowledge. To address these issues, we propose the Dual-Projection Shift Estimation and Classifier Reconstruction (DPCR) approach for EFCIL. DPCR effectively estimates semantic shift through a dual-projection, which combines a learnable transformation with a row-space projection to capture both task-wise and category-wise shifts. Furthermore, to mitigate decision bias, DPCR employs ridge regression to reformulate a classifier reconstruction process. This reconstruction exploits previous in covariance and prototype of each class after calibration with estimated shift, thereby reducing decision bias. Extensive experiments demonstrate that, on various datasets, DPCR effectively balances old and new tasks, outperforming state-of-the-art EFCIL methods. Our codes are available at https://github.com/RHe502/ICML25-DPCR.
中文: 提出的双投影偏移估计与分类器重构(DPCR)方法通过双投影估计语义偏移并利用岭回归重构分类器,有效解决了无范例类增量学习中的语义偏移和决策偏差问题,在多个数据集上实现了最优性能。
English: The proposed Dual-Projection Shift Estimation and Classifier Reconstruction (DPCR) approach effectively addresses semantic shift and decision bias in exemplar-free class-incremental learning by estimating shifts through dual projections and reconstructing classifiers using ridge regression, achieving superior performance across multiple datasets.

Authors:Weiyu Ma, Yuqian Fu, Zecheng Zhang, Bernard Ghanem, Guohao Li
Title: AVA: Attentive VLM Agent for Mastering StarCraft II
Abstract:
We introduce Attentive VLM Agent (AVA), a multimodal StarCraft II agent that aligns artificial agent perception with the human gameplay experience. Traditional frameworks such as SMAC rely on abstract state representations that diverge significantly from human perception, limiting the ecological validity of agent behavior. Our agent addresses this limitation by incorporating RGB visual inputs and natural language observations that more closely simulate human cognitive processes during gameplay. The AVA architecture consists of three integrated components: (1) a vision-language model enhanced with specialized self-attention mechanisms for strategic unit targeting and battlefield assessment, (2) a retrieval-augmented generation system that leverages domain-specific StarCraft II knowledge to inform tactical decisions, and (3) a dynamic role-based task distribution system that enables coordinated multi-agent behavior. The experimental evaluation in our proposed AVACraft environment, which contains 21 multimodal StarCraft II scenarios, demonstrates that AVA powered by foundation models (specifically Qwen-VL and GPT-4o) can execute complex tactical maneuvers without explicit training, achieving comparable performance to traditional MARL methods that require substantial training iterations. This work establishes a foundation for developing human-aligned StarCraft II agents and advances the broader research agenda of multimodal game AI. Our implementation is available at https://github.com/camel-ai/VLM-Play-StarCraft2.
Chinese: 注意力视觉语言模型代理(AVA)是一种多模态《星际争霸II》代理,通过结合RGB视觉输入和自然语言观察来模拟人类认知,利用视觉语言模型和检索增强生成系统,无需显式训练即可实现与传统方法相媲美的性能。
English: The Attentive VLM Agent (AVA) is a multimodal StarCraft II agent that uses RGB visual inputs and natural language observations to align with human perception, integrating vision-language models and retrieval-augmented generation to achieve competitive performance without explicit training.

Authors:Zhenxuan Zhang, Kinhei Lee, Peiyuan Jing, Weihang Deng, Huichi Zhou, Zihao Jin, Jiahao Huang, Zhifan Gao, Dominic C Marshall, Yingying Fang, Guang Yang
Title: GEMA-Score: Granular Explainable Multi-Agent Scoring Framework for Radiology Report Evaluation
Abstract:
Automatic medical report generation has the potential to support clinical diagnosis, reduce the workload of radiologists, and demonstrate potential for enhancing diagnostic consistency. However, current evaluation metrics often fail to reflect the clinical reliability of generated reports. Early overlap-based methods focus on textual matches between predicted and ground-truth entities but miss fine-grained clinical details (e.g., anatomical location, severity). Some diagnostic metrics are limited by fixed vocabularies or templates, reducing their ability to capture diverse clinical expressions. LLM-based approaches further lack interpretable reasoning steps, making it hard to assess or trust their behavior in safety-critical settings. These limitations hinder the comprehensive assessment of the reliability of generated reports and pose risks in their selection for clinical use. Therefore, we propose a Granular Explainable Multi-Agent Score (GEMA-Score) in this paper, which conducts both objective quantification and subjective evaluation through a large language model-based multi-agent workflow. Our GEMA-Score parses structured reports and employs stable calculations through interactive exchanges of information among agents to assess disease diagnosis, location, severity, and uncertainty. Additionally, an LLM-based scoring agent evaluates completeness, readability, and clinical terminology while providing explanatory feedback. Extensive experiments validate that GEMA-Score achieves the highest correlation with human expert evaluations on a public dataset, demonstrating its effectiveness in clinical scoring (Kendall coefficient = $0.69$ for ReXVal dataset and Kendall coefficient = $0.45$ for RadEvalX dataset). The anonymous project demo is available at: https://github.com/Zhenxuan-Zhang/GEMA_score.
中文:本文提出GEMA-Score这一新型多智能体评估框架,通过客观量化临床可靠性与主观评价报告质量,在医疗报告生成任务中实现了与专家评估的最佳吻合度。
English: This paper introduces GEMA-Score, a novel multi-agent evaluation framework that objectively quantifies clinical reliability and subjectively assesses report quality, achieving superior alignment with expert judgments in medical report generation.

Authors:Zhenxuan Zhang, Peiyuan Jing, Coraline Beitone, Jiahao Huang, Zhifan Gao, Guang Yang, Pete Lally
Title: Pretext Task Adversarial Learning for Unpaired Low-field to Ultra High-field MRI Synthesis
Abstract:
Given the scarcity and cost of high-field MRI, the synthesis of high-field MRI from low-field MRI holds significant potential when there is limited data for training downstream tasks (e.g. segmentation). Low-field MRI often suffers from a reduced signal-to-noise ratio (SNR) and spatial resolution compared to high-field MRI. However, synthesizing high-field MRI data presents challenges. These involve aligning image features across domains while preserving anatomical accuracy and enhancing fine details. To address these challenges, we propose a Pretext Task Adversarial (PTA) learning framework for high-field MRI synthesis from low-field MRI data. The framework comprises three processes: (1) The slice-wise gap perception (SGP) network aligns the slice inconsistencies of low-field and high-field datasets based on contrastive learning. (2) The local structure correction (LSC) network extracts local structures by restoring the locally rotated and masked images. (3) The pretext task-guided adversarial training process introduces additional supervision and incorporates a discriminator to improve image realism. Extensive experiments on low-field to ultra high-field task demonstrate the effectiveness of our method, achieving state-of-the-art performance (16.892 in FID, 1.933 in IS, and 0.324 in MS-SSIM). This enables the generation of high-quality high-field-like MRI data from low-field MRI data to augment training datasets for downstream tasks. The code is available at: https://github.com/Zhenxuan-Zhang/PTA4Unpaired_HF_MRI_SYN.
中文: 提出的前置任务对抗(PTA)学习框架通过特征对齐和细节增强,有效从低场MRI数据合成高场MRI,实现了顶尖性能,为下游任务扩充训练数据集。
English: The proposed Pretext Task Adversarial (PTA) learning framework effectively synthesizes high-field-like MRI from low-field data by aligning features and enhancing details, achieving state-of-the-art performance to augment training datasets for downstream tasks.

Authors:Nikolai Ilinykh, Shalom Lappin, Asad Sayeed, Sharid Loáiciga
Title: Coreference as an indicator of context scope in multimodal narrative
Abstract:
We demonstrate that large multimodal language models differ substantially from humans in the distribution of coreferential expressions in a visual storytelling task. We introduce a number of metrics to quantify the characteristics of coreferential patterns in both human- and machine-written texts. Humans distribute coreferential expressions in a way that maintains consistency across texts and images, interleaving references to different entities in a highly varied way. Machines are less able to track mixed references, despite achieving perceived improvements in generation quality. Materials, metrics, and code for our study are available at https://github.com/GU-CLASP/coreference-context-scope.
中文摘要:大型多模态模型在视觉叙事任务中指代表达分布上与人类存在显著差异,难以有效追踪混合实体指代,尽管生成质量有所提升。
English Summary: Large multimodal models differ from humans in managing coreference distribution during visual storytelling, as they struggle with tracking mixed entity references despite improved generation quality.

Authors:Souhail Hadgi, Luca Moschella, Andrea Santilli, Diego Gomez, Qixing Huang, Emanuele RodolÃ, Simone Melzi, Maks Ovsjanikov
Title: Escaping Plato's Cave: Towards the Alignment of 3D and Text Latent Spaces
Abstract:
Recent works have shown that, when trained at scale, uni-modal 2D vision and text encoders converge to learned features that share remarkable structural properties, despite arising from different representations. However, the role of 3D encoders with respect to other modalities remains unexplored. Furthermore, existing 3D foundation models that leverage large datasets are typically trained with explicit alignment objectives with respect to frozen encoders from other representations. In this work, we investigate the possibility of a posteriori alignment of representations obtained from uni-modal 3D encoders compared to text-based feature spaces. We show that naive post-training feature alignment of uni-modal text and 3D encoders results in limited performance. We then focus on extracting subspaces of the corresponding feature spaces and discover that by projecting learned representations onto well-chosen lower-dimensional subspaces the quality of alignment becomes significantly higher, leading to improved accuracy on matching and retrieval tasks. Our analysis further sheds light on the nature of these shared subspaces, which roughly separate between semantic and geometric data representations. Overall, ours is the first work that helps to establish a baseline for post-training alignment of 3D uni-modal and text feature spaces, and helps to highlight both the shared and unique properties of 3D data compared to other representations. Our code and weights are available at https://github.com/Souhail-01/3d-text-alignment
中文: 研究表明,三维与文本编码器的简单后训练对齐效果有限,但将特征投影到精选的低维子空间可显著提升对齐质量与任务性能,同时揭示了语义与几何数据表征的分离特性。
English: This study demonstrates that naive post-training alignment of 3D and text encoders yields limited results, but projecting features onto carefully selected lower-dimensional subspaces significantly enhances alignment quality and task performance, revealing distinct semantic and geometric data representations.

Authors:Neemesh Yadav, Jiarui Liu, Francesco Ortu, Roya Ensafi, Zhijing Jin, Rada Mihalcea
Title: Revealing Hidden Mechanisms of Cross-Country Content Moderation with Natural Language Processing
Abstract:
The ability of Natural Language Processing (NLP) methods to categorize text into multiple classes has motivated their use in online content moderation tasks, such as hate speech and fake news detection. However, there is limited understanding of how or why these methods make such decisions, or why certain content is moderated in the first place. To investigate the hidden mechanisms behind content moderation, we explore multiple directions: 1) training classifiers to reverse-engineer content moderation decisions across countries; 2) explaining content moderation decisions by analyzing Shapley values and LLM-guided explanations. Our primary focus is on content moderation decisions made across countries, using pre-existing corpora sampled from the Twitter Stream Grab. Our experiments reveal interesting patterns in censored posts, both across countries and over time. Through human evaluations of LLM-generated explanations across three LLMs, we assess the effectiveness of using LLMs in content moderation. Finally, we discuss potential future directions, as well as the limitations and ethical considerations of this work. Our code and data are available at https://github.com/causalNLP/censorship
中文摘要:本研究通过逆向工程分析多国内容审查决策,结合Shapley值和LLM生成解释来探究内容审核的内在机制,揭示了不同国家审查内容的特征模式,并评估了大语言模型在内容审核中的应用效果。
English Summary: This study investigates the hidden mechanisms of content moderation by reverse-engineering censorship decisions across countries and analyzing them through Shapley values and LLM-generated explanations, revealing distinct patterns in moderated content while evaluating LLMs' effectiveness in this domain.

Authors:Bowen Pang, Kai Li, Feifan Wang
Title: Optimizing LLM Inference Throughput via Memory-aware and SLA-constrained Dynamic Batching
Abstract:
The increasing adoption of large language models (LLMs) necessitates inference serving systems that can deliver both high throughput and low latency. Deploying LLMs with hundreds of billions of parameters on memory-constrained GPUs exposes significant limitations in static batching methods. Current inference serving systems often treat batch sizes as fixed hyper-parameters, hindering real-time adaptation to varying system conditions. In this paper, we propose a dynamic batching method that continuously monitors memory utilization and adheres to service-level agreements (SLAs) to enable real-time batch size configuration adjustment. The method comprises two core components: a memory-aware batch scheduler that dynamically allocates GPU resources and a latency feedback mechanism that optimizes decoding processes under SLA constraints. The numerical experiments demonstrate throughput gains of 8% to 28% and capacity improvements of 22% compared to traditional static batching methods, while maintaining full compatibility with existing inference infrastructure. These results highlight the effectiveness of dynamic batching in balancing computational efficiency and quality-of-service requirements for contemporary LLM deployment scenarios. The source code of this work is publicly available at https://github.com/KevinLee1110/dynamic-batching.
中文: 本文提出了一种动态批处理方法,可根据内存使用和服务级别协议实时调整批处理规模,相比静态方法实现了8-28%的吞吐量提升和22%的容量增长,同时保持与现有基础设施的完全兼容。
English: This paper introduces a dynamic batching method that adapts batch sizes in real-time based on memory usage and service-level agreements, achieving 8-28% higher throughput and 22% greater capacity than static approaches while maintaining compatibility with existing infrastructure.

Authors:Chengqi Zheng, Haiyan Yin, Jianda Chen, Terence Ng, Yew-Soon Ong, Ivor Tsang
Title: Mastering Continual Reinforcement Learning through Fine-Grained Sparse Network Allocation and Dormant Neuron Exploration
Abstract:
Continual Reinforcement Learning (CRL) is essential for developing agents that can learn, adapt, and accumulate knowledge over time. However, a fundamental challenge persists as agents must strike a delicate balance between plasticity, which enables rapid skill acquisition, and stability, which ensures long-term knowledge retention while preventing catastrophic forgetting. In this paper, we introduce SSDE, a novel structure-based approach that enhances plasticity through a fine-grained allocation strategy with Structured Sparsity and Dormant-guided Exploration. SSDE decomposes the parameter space into forward-transfer (frozen) parameters and task-specific (trainable) parameters. Crucially, these parameters are allocated by an efficient co-allocation scheme under sparse coding, ensuring sufficient trainable capacity for new tasks while promoting efficient forward transfer through frozen parameters. However, structure-based methods often suffer from rigidity due to the accumulation of non-trainable parameters, limiting exploration and adaptability. To address this, we further introduce a sensitivity-guided neuron reactivation mechanism that systematically identifies and resets dormant neurons, which exhibit minimal influence in the sparse policy network during inference. This approach effectively enhance exploration while preserving structural efficiency. Extensive experiments on the CW10-v1 Continual World benchmark demonstrate that SSDE achieves state-of-the-art performance, reaching a success rate of 95%, surpassing prior methods significantly in both plasticity and stability trade-offs (code is available at: https://github.com/chengqiArchy/SSDE).
中文: SSDE是一种新颖的持续强化学习方法,通过结构化稀疏性和休眠神经元激活机制在保持稳定性的同时增强可塑性,在基准测试中达到了95%的最优成功率。
English: SSDE is a novel continual reinforcement learning method that enhances plasticity through structured sparsity and dormant neuron reactivation while maintaining stability, achieving state-of-the-art 95% success rate on benchmarks.

Authors:Ruoxuan Zhang, Hongxia Xie, Yi Yao, Jian-Yu Jiang-Lin, Bin Wen, Ling Lo, Hong-Han Shuai, Yung-Hui Li, Wen-Huang Cheng
Title: RecipeGen: A Benchmark for Real-World Recipe Image Generation
Abstract:
Recipe image generation is an important challenge in food computing, with applications from culinary education to interactive recipe platforms. However, there is currently no real-world dataset that comprehensively connects recipe goals, sequential steps, and corresponding images. To address this, we introduce RecipeGen, the first real-world goal-step-image benchmark for recipe generation, featuring diverse ingredients, varied recipe steps, multiple cooking styles, and a broad collection of food categories. Data is in https://github.com/zhangdaxia22/RecipeGen.
Chinese: RecipeGen 是首个真实世界的菜谱图像生成基准数据集,它通过涵盖多样食材、烹饪风格和食物类别,解决了菜谱目标、步骤与图像之间缺乏全面关联的问题。
English: RecipeGen is introduced as the first real-world benchmark dataset for recipe image generation, addressing the lack of comprehensive connections between recipe goals, steps, and images across diverse ingredients, cooking styles, and food categories.

Authors:Baris Yilmaz, Erdem Akagündüz, Salih Tileylioglu
Title: Deep Sequence Models for Predicting Average Shear Wave Velocity from Strong Motion Records
Abstract:
This study explores the use of deep learning for predicting the time averaged shear wave velocity in the top 30 m of the subsurface ($V_{s30}$) at strong motion recording stations in Türkiye. $V_{s30}$ is a key parameter in site characterization and, as a result for seismic hazard assessment. However, it is often unavailable due to the lack of direct measurements and is therefore estimated using empirical correlations. Such correlations however are commonly inadequate in capturing complex, site-specific variability and this motivates the need for data-driven approaches. In this study, we employ a hybrid deep learning model combining convolutional neural networks (CNNs) and long short-term memory (LSTM) networks to capture both spatial and temporal dependencies in strong motion records. Furthermore, we explore how using different parts of the signal influence our deep learning model. Our results suggest that the hybrid approach effectively learns complex, nonlinear relationships within seismic signals. We observed that an improved P-wave arrival time model increased the prediction accuracy of $V_{s30}$. We believe the study provides valuable insights into improving $V_{s30}$ predictions using a CNN-LSTM framework, demonstrating its potential for improving site characterization for seismic studies. Our codes are available via this repo: https://github.com/brsylmz23/CNNLSTM_DeepEQ
本研究开发了一种结合卷积神经网络和长短期记忆网络的混合深度学习模型,通过改进P波分析并捕捉复杂信号关系,有效预测了土耳其地震记录中的地下剪切波速度(Vs30)。
This study develops a hybrid deep learning model combining CNNs and LSTMs to predict subsurface shear wave velocity (Vs30) from seismic records in Türkiye, demonstrating improved accuracy through enhanced P-wave analysis and capturing complex signal relationships.

Authors:Yifan Liu, Yu Fang, Zhouhan Lin
Title: DiVISe: Direct Visual-Input Speech Synthesis Preserving Speaker Characteristics And Intelligibility
Abstract:
Video-to-speech (V2S) synthesis, the task of generating speech directly from silent video input, is inherently more challenging than other speech synthesis tasks due to the need to accurately reconstruct both speech content and speaker characteristics from visual cues alone. Recently, audio-visual pre-training has eliminated the need for additional acoustic hints in V2S, which previous methods often relied on to ensure training convergence. However, even with pre-training, existing methods continue to face challenges in achieving a balance between acoustic intelligibility and the preservation of speaker-specific characteristics. We analyzed this limitation and were motivated to introduce DiVISe (Direct Visual-Input Speech Synthesis), an end-to-end V2S model that predicts Mel-spectrograms directly from video frames alone. Despite not taking any acoustic hints, DiVISe effectively preserves speaker characteristics in the generated audio, and achieves superior performance on both objective and subjective metrics across the LRS2 and LRS3 datasets. Our results demonstrate that DiVISe not only outperforms existing V2S models in acoustic intelligibility but also scales more effectively with increased data and model parameters. Code and weights can be found at https://github.com/PussyCat0700/DiVISe.
中文摘要:DiVISe是一种端到端的视频语音合成模型,仅从无声视频直接生成梅尔频谱图,无需声学提示即可在保持说话人特征的同时实现卓越的语音可懂度。
English Summary: DiVISe is an end-to-end video-to-speech model that directly generates Mel-spectrograms from silent video, achieving superior acoustic intelligibility and speaker characteristic preservation without requiring acoustic hints.

Authors:Bill Cassidy, Christian McBride, Connah Kendrick, Neil D. Reeves, Joseph M. Pappachan, Shaghayegh Raad, Moi Hoon Yap
Title: Gaussian Random Fields as an Abstract Representation of Patient Metadata for Multimodal Medical Image Segmentation
Abstract:
The growing rate of chronic wound occurrence, especially in patients with diabetes, has become a concerning trend in recent years. Chronic wounds are difficult and costly to treat, and have become a serious burden on health care systems worldwide. Chronic wounds can have devastating consequences for the patient, with infection often leading to reduced quality of life and increased mortality risk. Innovative deep learning methods for the detection and monitoring of such wounds have the potential to reduce the impact to both patient and clinician. We present a novel multimodal segmentation method which allows for the introduction of patient metadata into the training workflow whereby the patient data are expressed as Gaussian random fields. Our results indicate that the proposed method improved performance when utilising multiple models, each trained on different metadata categories. Using the Diabetic Foot Ulcer Challenge 2022 test set, when compared to the baseline results (intersection over union = 0.4670, Dice similarity coefficient = 0.5908) we demonstrate improvements of +0.0220 and +0.0229 for intersection over union and Dice similarity coefficient respectively. This paper presents the first study to focus on integrating patient data into a chronic wound segmentation workflow. Our results show significant performance gains when training individual models using specific metadata categories, followed by average merging of prediction masks using distance transforms. All source code for this study is available at: https://github.com/mmu-dermatology-research/multimodal-grf
中文: 本研究提出了一种新颖的多模态分割方法,将患者元数据以高斯随机场形式融入训练流程,在慢性伤口分析中通过深度学习技术显著提升了性能表现。
English: This study introduces a novel multimodal segmentation method that integrates patient metadata as Gaussian random fields, demonstrating improved performance in chronic wound analysis using deep learning techniques.

Authors:Yunkai Gao, Jiaming Guo, Fan Wu, Rui Zhang
Title: Policy Constraint by Only Support Constraint for Offline Reinforcement Learning
Abstract:
Offline reinforcement learning (RL) aims to optimize a policy by using pre-collected datasets, to maximize cumulative rewards. However, offline reinforcement learning suffers challenges due to the distributional shift between the learned and behavior policies, leading to errors when computing Q-values for out-of-distribution (OOD) actions. To mitigate this issue, policy constraint methods aim to constrain the learned policy's distribution with the distribution of the behavior policy or confine action selection within the support of the behavior policy. However, current policy constraint methods tend to exhibit excessive conservatism, hindering the policy from further surpassing the behavior policy's performance. In this work, we present Only Support Constraint (OSC) which is derived from maximizing the total probability of learned policy in the support of behavior policy, to address the conservatism of policy constraint. OSC presents a regularization term that only restricts policies to the support without imposing extra constraints on actions within the support. Additionally, to fully harness the performance of the new policy constraints, OSC utilizes a diffusion model to effectively characterize the support of behavior policies. Experimental evaluations across a variety of offline RL benchmarks demonstrate that OSC significantly enhances performance, alleviating the challenges associated with distributional shifts and mitigating conservatism of policy constraints. Code is available at https://github.com/MoreanP/OSC.
中文摘要:离线强化学习面临分布偏移和策略约束过度保守的问题,本文提出的仅支持约束(OSC)方法通过将策略限制在行为策略支持集内并利用扩散模型,有效缓解了这些问题,在多个基准测试中显著提升了性能。
English Summary: Offline reinforcement learning faces challenges from distributional shifts and excessive conservatism in policy constraints, which the proposed Only Support Constraint (OSC) method addresses by restricting policies to behavior policy support using a diffusion model, significantly improving performance across benchmarks.

Authors:Orestis Tsirakis, Konstantinos Fysarakis, Vasileios Mavroeidis, Ioannis Papaefstathiou
Title: Operationalizing Cybersecurity Knowledge: Design, Implementation & Evaluation of a Knowledge Management System for CACAO Playbooks
Abstract:
Modern cybersecurity threats are growing in complexity, targeting increasingly intricate & interconnected systems. To effectively defend against these evolving threats, security teams utilize automation & orchestration to enhance response efficiency and consistency. In that sense, cybersecurity playbooks are key enablers, providing a structured, reusable, and continuously improving approach to incident response, enabling organizations to codify requirements, domain expertise, and best practices and automate decision-making processes to the extent possible. The emerging Collaborative Automated Course of Action Operations (CACAO) standard defines a common machine-processable schema for cybersecurity playbooks, facilitating interoperability for their exchange and ensuring the ability to orchestrate and automate cybersecurity operations. However, despite its potential and the fact that it is a relatively new standardization work, there is a lack of tools to support its adoption and, in particular, the management & lifecycle development of CACAO playbooks, limiting their practical deployment. Motivated by the above, this work presents the design, development, and evaluation of a Knowledge Management System (KMS) for managing CACAO cybersecurity playbooks throughout their lifecycle, providing essential tools to streamline playbook management. Using open technologies & standards, the proposed approach fosters standards-based interoperability & enhances the usability of state-of-the-art cybersecurity orchestration & automation primitives. To encourage adoption, the resulting implementation is released as open-source, which, to the extent of our knowledge, comprises the first publicly available & documented work in this domain, supporting the broader uptake of CACAO playbooks & promoting the widespread use of interoperable automation and orchestration mechanisms in cybersecurity operations.
中文摘要:CACAO网络安全剧本标准虽能提升自动化事件响应能力,但因缺乏管理工具限制了实际应用,为此本研究开发了开源知识管理系统,通过标准化工具实现剧本全生命周期管理并推动协同运作。
English Summary: The CACAO cybersecurity playbook standard enhances automated incident response, but adoption is hindered by a lack of management tools, prompting the development of an open-source Knowledge Management System to streamline playbook lifecycle management and promote interoperability.

Authors:Orestis Tsirakis, Konstantinos Fysarakis, Vasileios Mavroeidis, Ioannis Papaefstathiou
Title: Operationalizing Cybersecurity Knowledge: Design, Implementation & Evaluation of a Knowledge Management System for CACAO Playbooks
Abstract:
Modern cybersecurity threats are growing in complexity, targeting increasingly intricate & interconnected systems. To effectively defend against these evolving threats, security teams utilize automation & orchestration to enhance response efficiency and consistency. In that sense, cybersecurity playbooks are key enablers, providing a structured, reusable, and continuously improving approach to incident response, enabling organizations to codify requirements, domain expertise, and best practices and automate decision-making processes to the extent possible. The emerging Collaborative Automated Course of Action Operations (CACAO) standard defines a common machine-processable schema for cybersecurity playbooks, facilitating interoperability for their exchange and ensuring the ability to orchestrate and automate cybersecurity operations. However, despite its potential and the fact that it is a relatively new standardization work, there is a lack of tools to support its adoption and, in particular, the management & lifecycle development of CACAO playbooks, limiting their practical deployment. Motivated by the above, this work presents the design, development, and evaluation of a Knowledge Management System (KMS) for managing CACAO cybersecurity playbooks throughout their lifecycle, providing essential tools to streamline playbook management. Using open technologies & standards, the proposed approach fosters standards-based interoperability & enhances the usability of state-of-the-art cybersecurity orchestration & automation primitives. To encourage adoption, the resulting implementation is released as open-source, which, to the extent of our knowledge, comprises the first publicly available & documented work in this domain, supporting the broader uptake of CACAO playbooks & promoting the widespread use of interoperable automation and orchestration mechanisms in cybersecurity operations.
中文摘要:CACAO网络安全剧本标准虽能提升自动化事件响应能力,但因缺乏管理工具限制了实际应用,为此本研究开发了开源知识管理系统,通过标准化工具实现剧本全生命周期管理并推动协同运作。
English Summary: The CACAO cybersecurity playbook standard enhances automated incident response, but adoption is hindered by a lack of management tools, prompting the development of an open-source Knowledge Management System to streamline playbook lifecycle management and promote interoperability.

Authors:Qingyuan Zhou, Yuehu Gong, Weidong Yang, Jiaze Li, Yeqi Luo, Baixin Xu, Shuhao Li, Ben Fei, Ying He
Title: MGSR: 2D/3D Mutual-boosted Gaussian Splatting for High-fidelity Surface Reconstruction under Various Light Conditions
Abstract:
Novel view synthesis (NVS) and surface reconstruction (SR) are essential tasks in 3D Gaussian Splatting (3D-GS). Despite recent progress, these tasks are often addressed independently, with GS-based rendering methods struggling under diverse light conditions and failing to produce accurate surfaces, while GS-based reconstruction methods frequently compromise rendering quality. This raises a central question: must rendering and reconstruction always involve a trade-off? To address this, we propose MGSR, a 2D/3D Mutual-boosted Gaussian splatting for Surface Reconstruction that enhances both rendering quality and 3D reconstruction accuracy. MGSR introduces two branches--one based on 2D-GS and the other on 3D-GS. The 2D-GS branch excels in surface reconstruction, providing precise geometry information to the 3D-GS branch. Leveraging this geometry, the 3D-GS branch employs a geometry-guided illumination decomposition module that captures reflected and transmitted components, enabling realistic rendering under varied light conditions. Using the transmitted component as supervision, the 2D-GS branch also achieves high-fidelity surface reconstruction. Throughout the optimization process, the 2D-GS and 3D-GS branches undergo alternating optimization, providing mutual supervision. Prior to this, each branch completes an independent warm-up phase, with an early stopping strategy implemented to reduce computational costs. We evaluate MGSR on a diverse set of synthetic and real-world datasets, at both object and scene levels, demonstrating strong performance in rendering and surface reconstruction. Code is available at https://github.com/TsingyuanChou/MGSR.
中文: 提出的MGSR方法通过2D和3D高斯泼溅分支的相互促进框架,利用交替优化和几何引导的光照分解,在多变光照条件下同时提升了渲染质量和表面重建精度。
English: The proposed MGSR method introduces a mutual-boosted framework between 2D and 3D Gaussian splatting branches that enhances both rendering quality under varied lighting and surface reconstruction accuracy through alternating optimization and geometry-guided illumination decomposition.

Authors:Ruixi Lin, Ziqiao Wang, Yang You
Title: Ensemble Debiasing Across Class and Sample Levels for Fairer Prompting Accuracy
Abstract:
Language models are strong few-shot learners and achieve good overall accuracy in text classification tasks, masking the fact that their results suffer from great class accuracy imbalance. We believe that the pursuit of overall accuracy should not come from enriching the strong classes, but from raising up the weak ones. To address the imbalance, we propose a Heaviside step function based ensemble debiasing method, which enables flexible rectifications of in-context learned class probabilities at both class and sample levels. Evaluations with Llama-2-13B on seven text classification benchmarks show that our approach achieves state-of-the-art overall accuracy gains with balanced class accuracies. More importantly, we perform analyses on the resulted probability correction scheme, showing that sample-level corrections are necessary to elevate weak classes. Due to effectively correcting weak classes, our method also brings significant performance gains to a larger model variant, Llama-2-70B, especially on a biomedical domain task, further demonstrating the necessity of ensemble debiasing at both levels. Our source code is available at https://github.com/NUS-HPC-AI-Lab/DCS.
Chinese: 语言模型在文本分类中总体准确率高但存在类别不平衡问题,我们提出的基于Heaviside阶跃函数的集成去偏方法通过在类别和样本层面灵活修正概率,有效提升了弱类别性能,实现了均衡且领先的分类效果。
English: Language models achieve high overall accuracy in text classification but suffer from class imbalance, which our proposed Heaviside step function-based ensemble debiasing method effectively addresses by correcting probabilities at both class and sample levels, leading to state-of-the-art balanced performance.

Authors:Junxiang Qiu, Lin Liu, Shuo Wang, Jinda Lu, Kezhou Chen, Yanbin Hao
Title: Accelerating Diffusion Transformer via Gradient-Optimized Cache
Abstract:
Feature caching has emerged as an effective strategy to accelerate diffusion transformer (DiT) sampling through temporal feature reuse. It is a challenging problem since (1) Progressive error accumulation from cached blocks significantly degrades generation quality, particularly when over 50\% of blocks are cached; (2) Current error compensation approaches neglect dynamic perturbation patterns during the caching process, leading to suboptimal error correction. To solve these problems, we propose the Gradient-Optimized Cache (GOC) with two key innovations: (1) Cached Gradient Propagation: A gradient queue dynamically computes the gradient differences between cached and recomputed features. These gradients are weighted and propagated to subsequent steps, directly compensating for the approximation errors introduced by caching. (2) Inflection-Aware Optimization: Through statistical analysis of feature variation patterns, we identify critical inflection points where the denoising trajectory changes direction. By aligning gradient updates with these detected phases, we prevent conflicting gradient directions during error correction. Extensive evaluations on ImageNet demonstrate GOC's superior trade-off between efficiency and quality. With 50\% cached blocks, GOC achieves IS 216.28 (26.3\% higher) and FID 3.907 (43\% lower) compared to baseline DiT, while maintaining identical computational costs. These improvements persist across various cache ratios, demonstrating robust adaptability to different acceleration requirements. Code is available at https://github.com/qiujx0520/GOC_ICCV2025.git.
中文: 提出的梯度优化缓存(GOC)通过动态传播梯度差异和对齐拐点更新,解决了扩散变换器采样中的误差累积问题,在保持计算效率的同时显著提升了生成质量。
English: The proposed Gradient-Optimized Cache (GOC) addresses error accumulation in diffusion transformer sampling by dynamically propagating gradient differences and aligning updates with inflection points, achieving significant quality improvements while maintaining computational efficiency.

Authors:Bowen Wu, Wenqing Wang, Haoran Li, Ying Li, Jingsong Yu, Baoxun Wang
Title: Interpersonal Memory Matters: A New Task for Proactive Dialogue Utilizing Conversational History
Abstract:
Proactive dialogue systems aim to empower chatbots with the capability of leading conversations towards specific targets, thereby enhancing user engagement and service autonomy. Existing systems typically target pre-defined keywords or entities, neglecting user attributes and preferences implicit in dialogue history, hindering the development of long-term user intimacy. To address these challenges, we take a radical step towards building a more human-like conversational agent by integrating proactive dialogue systems with long-term memory into a unified framework. Specifically, we define a novel task named Memory-aware Proactive Dialogue (MapDia). By decomposing the task, we then propose an automatic data construction method and create the first Chinese Memory-aware Proactive Dataset (ChMapData). Furthermore, we introduce a joint framework based on Retrieval Augmented Generation (RAG), featuring three modules: Topic Summarization, Topic Retrieval, and Proactive Topic-shifting Detection and Generation, designed to steer dialogues towards relevant historical topics at the right time. The effectiveness of our dataset and models is validated through both automatic and human evaluations. We release the open-source framework and dataset at https://github.com/FrontierLabs/MapDia.
Chinese Summary: 本研究提出将主动对话系统与长期记忆相结合的统一框架,定义了MapDia新任务并构建首个中文记忆感知数据集ChMapData,使聊天机器人能在适当时机引导对话转向相关历史话题。
English Summary: This study introduces a unified framework integrating proactive dialogue systems with long-term memory, proposing the novel MapDia task and creating the first Chinese dataset (ChMapData) to enable chatbots to steer conversations toward relevant historical topics at appropriate moments.

Authors:Wenhao Wang, Zijie Yu, Rui Ye, Jianqing Zhang, Siheng Chen, Yanfeng Wang
Title: FedMABench: Benchmarking Mobile Agents on Decentralized Heterogeneous User Data
Abstract:
Mobile agents have attracted tremendous research participation recently. Traditional approaches to mobile agent training rely on centralized data collection, leading to high cost and limited scalability. Distributed training utilizing federated learning offers an alternative by harnessing real-world user data, providing scalability and reducing costs. However, pivotal challenges, including the absence of standardized benchmarks, hinder progress in this field. To tackle the challenges, we introduce FedMABench, the first benchmark for federated training and evaluation of mobile agents, specifically designed for heterogeneous scenarios. FedMABench features 6 datasets with 30+ subsets, 8 federated algorithms, 10+ base models, and over 800 apps across 5 categories, providing a comprehensive framework for evaluating mobile agents across diverse environments. Through extensive experiments, we uncover several key insights: federated algorithms consistently outperform local training; the distribution of specific apps plays a crucial role in heterogeneity; and, even apps from distinct categories can exhibit correlations during training. FedMABench is publicly available at: https://github.com/wwh0411/FedMABench with the datasets at: https://huggingface.co/datasets/wwh0411/FedMABench.
中文摘要:为解决移动智能体联邦学习领域缺乏标准化基准的问题,FedMABench作为首个综合性评估框架被提出,其通过多组实验证实联邦算法优于本地训练,并揭示了应用分布对异构环境的关键影响及跨类别应用间的训练关联性。
English Summary: To address the lack of standardized benchmarks in federated learning for mobile agents, FedMABench is introduced as the first comprehensive evaluation framework, featuring diverse datasets and algorithms that demonstrate federated training's superiority over local methods while revealing key insights about app distribution and inter-category correlations.

Authors:Tianjun Wei, Wei Wen, Ruizhi Qiao, Xing Sun, Jianghong Ma
Title: RocketEval: Efficient Automated LLM Evaluation via Grading Checklist
Abstract:
Evaluating large language models (LLMs) in diverse and challenging scenarios is essential to align them with human preferences. To mitigate the prohibitive costs associated with human evaluations, utilizing a powerful LLM as a judge has emerged as a favored approach. Nevertheless, this methodology encounters several challenges, including substantial expenses, concerns regarding privacy and security, and reproducibility. In this paper, we propose a straightforward, replicable, and accurate automated evaluation method by leveraging a lightweight LLM as the judge, named RocketEval. Initially, we identify that the performance disparity between lightweight and powerful LLMs in evaluation tasks primarily stems from their ability to conduct comprehensive analyses, which is not easily enhanced through techniques such as chain-of-thought reasoning. By reframing the evaluation task as a multi-faceted Q&A using an instance-specific checklist, we demonstrate that the limited judgment accuracy of lightweight LLMs is largely attributes to high uncertainty and positional bias. To address these challenges, we introduce an automated evaluation process grounded in checklist grading, which is designed to accommodate a variety of scenarios and questions. This process encompasses the creation of checklists, the grading of these checklists by lightweight LLMs, and the reweighting of checklist items to align with the supervised annotations. Our experiments carried out on the automated evaluation benchmarks, MT-Bench and WildBench datasets, reveal that RocketEval, when using Gemma-2-2B as the judge, achieves a high correlation (0.965) with human preferences, which is comparable to GPT-4o. Moreover, RocketEval provides a cost reduction exceeding 50-fold for large-scale evaluation and comparison scenarios. Our code is available at https://github.com/Joinn99/RocketEval-ICLR .
中文摘要:RocketEval提出了一种采用轻量级大语言模型作为评估者的自动化评估方法,通过基于清单的评分机制解决不确定性和位置偏差问题,在实现与人类评估高度相关的同时将大规模评估成本降低了50倍以上。
English Summary: RocketEval introduces a cost-effective automated evaluation method using lightweight LLMs as judges, achieving human-level correlation while reducing costs by over 50 times through checklist-based grading that addresses uncertainty and bias.

Authors:Hengguang Zhou, Xirui Li, Ruochen Wang, Minhao Cheng, Tianyi Zhou, Cho-Jui Hsieh
Title: R1-Zero's "Aha Moment" in Visual Reasoning on a 2B Non-SFT Model
Abstract:
Recently DeepSeek R1 demonstrated how reinforcement learning with simple rule-based incentives can enable autonomous development of complex reasoning in large language models, characterized by the "aha moment", in which the model manifest self-reflection and increased response length during training. However, attempts to extend this success to multimodal reasoning often failed to reproduce these key characteristics. In this report, we present the first successful replication of these emergent characteristics for multimodal reasoning on only a non-SFT 2B model. Starting with Qwen2-VL-2B and applying reinforcement learning directly on the SAT dataset, our model achieves 59.47% accuracy on CVBench, outperforming the base model by approximately ~30% and exceeding both SFT setting by ~2%. In addition, we share our failed attempts and insights in attempting to achieve R1-like reasoning using RL with instruct models. aiming to shed light on the challenges involved. Our key observations include: (1) applying RL on instruct model often results in trivial reasoning trajectories, and (2) naive length reward are ineffective in eliciting reasoning capabilities. The project code is available at https://github.com/turningpoint-ai/VisualThinker-R1-Zero
中文摘要:DeepSeek R1首次在20亿参数多模态模型上成功复现了复杂推理能力的涌现,通过SAT数据的强化学习实现了准确率大幅提升,同时揭示了指令模型和简单奖励机制在激发推理能力方面的局限性。
English Summary: DeepSeek R1 successfully replicated complex reasoning emergence in a 2B multimodal model using reinforcement learning on SAT data, achieving significant accuracy gains while revealing challenges with instruct models and naive rewards.

Authors:Xi Li, Tong Rao, Cihui Pan
Title: EDM: Efficient Deep Feature Matching
Abstract:
Recent feature matching methods have achieved remarkable performance but lack efficiency consideration. In this paper, we revisit the mainstream detector-free matching pipeline and improve all its stages considering both accuracy and efficiency. We propose an Efficient Deep feature Matching network, EDM. We first adopt a deeper CNN with fewer dimensions to extract multi-level features. Then we present a Correlation Injection Module that conducts feature transformation on high-level deep features, and progressively injects feature correlations from global to local for efficient multi-scale feature aggregation, improving both speed and performance. In the refinement stage, a novel lightweight bidirectional axis-based regression head is designed to directly predict subpixel-level correspondences from latent features, avoiding the significant computational cost of explicitly locating keypoints on high-resolution local feature heatmaps. Moreover, effective selection strategies are introduced to enhance matching accuracy. Extensive experiments show that our EDM achieves competitive matching accuracy on various benchmarks and exhibits excellent efficiency, offering valuable best practices for real-world applications. The code is available at https://github.com/chicleee/EDM.
中文: 提出的高效深度特征匹配网络(EDM)通过采用更深层CNN、关联注入模块和轻量级回归头,在保证精度的同时显著提升了特征匹配效率,在多个基准测试中表现优异。
English: The proposed Efficient Deep feature Matching network (EDM) enhances detector-free feature matching by integrating a deeper CNN with a correlation injection module and a lightweight regression head, achieving both high accuracy and superior efficiency across benchmarks.

Authors:Shufang Zhang, Jiazheng Wu, Jiacheng He, Kaiyi Wang, Shan An
Title: HyperGraph ROS: An Open-Source Robot Operating System for Hybrid Parallel Computing based on Computational HyperGraph
Abstract:
This paper presents HyperGraph ROS, an open-source robot operating system that unifies intra-process, inter-process, and cross-device computation into a computational hypergraph for efficient message passing and parallel execution. In order to optimize communication, HyperGraph ROS dynamically selects the optimal communication mechanism while maintaining a consistent API. For intra-process messages, Intel-TBB Flow Graph is used with C++ pointer passing, which ensures zero memory copying and instant delivery. Meanwhile, inter-process and cross-device communication seamlessly switch to ZeroMQ. When a node receives a message from any source, it is immediately activated and scheduled for parallel execution by Intel-TBB. The computational hypergraph consists of nodes represented by TBB flow graph nodes and edges formed by TBB pointer-based connections for intra-process communication, as well as ZeroMQ links for inter-process and cross-device communication. This structure enables seamless distributed parallelism. Additionally, HyperGraph ROS provides ROS-like utilities such as a parameter server, a coordinate transformation tree, and visualization tools. Evaluation in diverse robotic scenarios demonstrates significantly higher transmission and throughput efficiency compared to ROS 2. Our work is available at https://github.com/wujiazheng2020a/hyper_graph_ros.
中文: HyperGraph ROS 是一种开源机器人操作系统,它将进程内、进程间和跨设备计算统一为计算超图,通过动态选择通信机制优化消息传递,实现高效并行执行,性能显著优于ROS 2。
English: HyperGraph ROS is an open-source robot operating system that integrates intra-process, inter-process, and cross-device computation into a computational hypergraph, optimizing communication with dynamic mechanism selection and enabling efficient parallel execution for enhanced performance over ROS 2.

Authors:Shibo Feng, Wanjin Feng, Xingyu Gao, Peilin Zhao, Zhiqi Shen
Title: TS-LIF: A Temporal Segment Spiking Neuron Network for Time Series Forecasting
Abstract:
Spiking Neural Networks (SNNs) offer a promising, biologically inspired approach for processing spatiotemporal data, particularly for time series forecasting. However, conventional neuron models like the Leaky Integrate-and-Fire (LIF) struggle to capture long-term dependencies and effectively process multi-scale temporal dynamics. To overcome these limitations, we introduce the Temporal Segment Leaky Integrate-and-Fire (TS-LIF) model, featuring a novel dual-compartment architecture. The dendritic and somatic compartments specialize in capturing distinct frequency components, providing functional heterogeneity that enhances the neuron's ability to process both low- and high-frequency information. Furthermore, the newly introduced direct somatic current injection reduces information loss during intra-neuronal transmission, while dendritic spike generation improves multi-scale information extraction. We provide a theoretical stability analysis of the TS-LIF model and explain how each compartment contributes to distinct frequency response characteristics. Experimental results show that TS-LIF outperforms traditional SNNs in time series forecasting, demonstrating better accuracy and robustness, even with missing data. TS-LIF advances the application of SNNs in time-series forecasting, providing a biologically inspired approach that captures complex temporal dynamics and offers potential for practical implementation in diverse forecasting scenarios. The source code is available at https://github.com/kkking-kk/TS-LIF.
Chinese: TS-LIF模型采用双区室架构,通过树突和胞体分别处理不同频率信息,显著提升了脉冲神经网络处理多尺度时间动态的能力,在时间序列预测中展现出优于传统模型的精度和鲁棒性。
English: The TS-LIF model introduces a dual-compartment architecture that enhances spiking neural networks' ability to process multi-scale temporal dynamics, demonstrating superior accuracy and robustness in time series forecasting compared to traditional models.

Authors:Wenhao Liang, Wei Zhang, Lin Yue, Miao Xu, Olaf Maennel, Weitong Chen
Title: We Care Each Pixel: Calibrating on Medical Segmentation Model
Abstract:
Medical image segmentation is fundamental for computer-aided diagnostics, providing accurate delineation of anatomical structures and pathological regions. While common metrics such as Accuracy, DSC, IoU, and HD primarily quantify spatial agreement between predictions and ground-truth labels, they do not assess the calibration quality of segmentation models, which is crucial for clinical reliability. To address this limitation, we propose pixel-wise Expected Calibration Error (pECE), a novel metric that explicitly measures miscalibration at the pixel level, thereby ensuring both spatial precision and confidence reliability. We further introduce a morphological adaptation strategy that applies morphological operations to ground-truth masks before computing calibration losses, particularly benefiting margin-based losses such as Margin SVLS and NACL. Additionally, we present the Signed Distance Calibration Loss (SDC), which aligns boundary geometry with calibration objectives by penalizing discrepancies between predicted and ground-truth signed distance functions (SDFs). Extensive experiments demonstrate that our method not only enhances segmentation performance but also improves calibration quality, yielding more trustworthy confidence estimates. Code is available at: https://github.com/EagleAdelaide/SDC-Loss.
Chinese Summary: 本文提出像素级期望校准误差(pECE)和符号距离校准损失(SDC),通过形态学自适应策略提升医学图像分割的校准质量与空间精度,弥补传统指标不足。
English Summary: This paper introduces pixel-wise Expected Calibration Error (pECE) and Signed Distance Calibration Loss (SDC) to improve both segmentation accuracy and confidence calibration in medical imaging, addressing limitations of traditional metrics.

Authors:Chengwei Zhao, Kun Hu, Jie Xu, Lijun Zhao, Baiwen Han, Kaidi Wu, Maoshan Tian, Shenghai Yuan
Title: Adaptive-LIO: Enhancing Robustness and Precision through Environmental Adaptation in LiDAR Inertial Odometry
Abstract:
The emerging Internet of Things (IoT) applications, such as driverless cars, have a growing demand for high-precision positioning and navigation. Nowadays, LiDAR inertial odometry becomes increasingly prevalent in robotics and autonomous driving. However, many current SLAM systems lack sufficient adaptability to various scenarios. Challenges include decreased point cloud accuracy with longer frame intervals under the constant velocity assumption, coupling of erroneous IMU information when IMU saturation occurs, and decreased localization accuracy due to the use of fixed-resolution maps during indoor-outdoor scene transitions. To address these issues, we propose a loosely coupled adaptive LiDAR-Inertial-Odometry named \textbf{Adaptive-LIO}, which incorporates adaptive segmentation to enhance mapping accuracy, adapts motion modality through IMU saturation and fault detection, and adjusts map resolution adaptively using multi-resolution voxel maps based on the distance from the LiDAR center. Our proposed method has been tested in various challenging scenarios, demonstrating the effectiveness of the improvements we introduce. The code is open-source on GitHub: \href{https://github.com/chengwei0427/adaptive_lio}{Adaptive-LIO}.
中文摘要:提出的Adaptive-LIO系统通过自适应分割、IMU故障检测和多分辨率地图,解决了现有激光雷达惯性里程计在多种场景下的精度不足问题。
English Summary: The proposed Adaptive-LIO system addresses limitations in current LiDAR-inertial odometry by incorporating adaptive segmentation, IMU fault detection, and multi-resolution mapping to improve accuracy across diverse scenarios.

Authors:Reshabh K Sharma, Jonathan De Halleux, Shraddha Barke, Benjamin Zorn
Title: PromptPex: Automatic Test Generation for Language Model Prompts
Abstract:
Large language models (LLMs) are being used in many applications and prompts for these models are integrated into software applications as code-like artifacts. These prompts behave much like traditional software in that they take inputs, generate outputs, and perform some specific function. However, prompts differ from traditional code in many ways and require new approaches to ensure that they are robust. For example, unlike traditional software the output of a prompt depends on the AI model that interprets it. Also, while natural language prompts are easy to modify, the impact of updates is harder to predict. New approaches to testing, debugging, and modifying prompts with respect to the model running them are required. To address some of these issues, we developed PromptPex, an LLM-based tool to automatically generate and evaluate unit tests for a given prompt. PromptPex extracts input and output specifications from a prompt and uses them to generate diverse, targeted, and valid unit tests. These tests are instrumental in identifying regressions when a prompt is changed and also serve as a tool to understand how prompts are interpreted by different models. We use PromptPex to generate tests for eight benchmark prompts and evaluate the quality of the generated tests by seeing if they can cause each of four diverse models to produce invalid output. PromptPex consistently creates tests that result in more invalid model outputs than a carefully constructed baseline LLM-based test generator. Furthermore, by extracting concrete specifications from the input prompt, PromptPex allows prompt writers to clearly understand and test specific aspects of their prompts. The source code of PromptPex is available at https://github.com/microsoft/promptpex.
中文: 大语言模型提示虽类似软件但需新的稳健性方法,为此开发了PromptPex工具,能自动生成和评估单元测试,确保提示在不同模型中的可靠性。
English: Large language model prompts function like software but require new robustness approaches, leading to the development of PromptPex, an automated tool for generating and evaluating unit tests to ensure prompt reliability across different models.

Authors:Chang Yu, Wenxin Du, Zeshun Zong, Alejandro Castro, Chenfanfu Jiang, Xuchen Han
Title: A Convex Formulation of Material Points and Rigid Bodies with GPU-Accelerated Async-Coupling for Interactive Simulation
Abstract:
We present a novel convex formulation that weakly couples the Material Point Method (MPM) with rigid body dynamics through frictional contact, optimized for efficient GPU parallelization. Our approach features an asynchronous time-splitting scheme to integrate MPM and rigid body dynamics under different time step sizes. We develop a globally convergent quasi-Newton solver tailored for massive parallelization, achieving up to 500x speedup over previous convex formulations without sacrificing stability. Our method enables interactive-rate simulations of robotic manipulation tasks with diverse deformable objects including granular materials and cloth, with strong convergence guarantees. We detail key implementation strategies to maximize performance and validate our approach through rigorous experiments, demonstrating superior speed, accuracy, and stability compared to state-of-the-art MPM simulators for robotics. We make our method available in the open-source robotics toolkit, Drake.
中文: 我们提出了一种新颖的凸优化公式,通过摩擦接触将物质点法与刚体动力学弱耦合,实现了高效的GPU并行化,在保持稳定性的同时速度提升高达500倍,适用于机器人交互模拟。
English: We introduce a convex formulation that efficiently couples MPM with rigid body dynamics via frictional contact, optimized for GPU parallelization and achieving up to 500x speedup while maintaining stability for interactive robotic simulations.

Authors:Yordan P. Raykov, Hengrui Luo, Justin D. Strait, Wasiur R. KhudaBukhsh
Title: Kernel-based estimators for functional causal effects
Abstract:
We propose causal effect estimators based on empirical Fréchet means and operator-valued kernels, tailored to functional data spaces. These methods address the challenges of high-dimensionality, sequential ordering, and model complexity while preserving robustness to treatment misspecification. Using structural assumptions, we obtain compact representations of potential outcomes, enabling scalable estimation of causal effects over time and across covariates. We provide both theoretical, regarding the consistency of functional causal effects, as well as empirical comparison of a range of proposed causal effect estimators. Applications to binary treatment settings with functional outcomes illustrate the framework's utility in biomedical monitoring, where outcomes exhibit complex temporal dynamics. Our estimators accommodate scenarios with registered covariates and outcomes, aligning them to the Fréchet means, as well as cases requiring higher-order representations to capture intricate covariate-outcome interactions. These advancements extend causal inference to dynamic and non-linear domains, offering new tools for understanding complex treatment effects in functional data settings.
中文: 本研究提出基于经验Fréchet均值和算子值核的因果效应估计方法,针对函数型数据的高维性和时序动态特性,在生物医学监测中实现了鲁棒且可扩展的因果推断。
English: This study introduces causal effect estimators using empirical Fréchet means and operator-valued kernels to handle functional data's high dimensionality and temporal dynamics, ensuring robustness and scalability in biomedical applications.

Authors:Xuheng Cai, Erica Zhang
Title: HieroLM: Egyptian Hieroglyph Recovery with Next Word Prediction Language Model
Abstract:
Egyptian hieroglyphs are found on numerous ancient Egyptian artifacts, but it is common that they are blurry or even missing due to erosion. Existing efforts to restore blurry hieroglyphs adopt computer vision techniques such as CNNs and model hieroglyph recovery as an image classification task, which suffers from two major limitations: (i) They cannot handle severely damaged or completely missing hieroglyphs. (ii) They make predictions based on a single hieroglyph without considering contextual and grammatical information. This paper proposes a novel approach to model hieroglyph recovery as a next word prediction task and use language models to address it. We compare the performance of different SOTA language models and choose LSTM as the architecture of our HieroLM due to the strong local affinity of semantics in Egyptian hieroglyph texts. Experiments show that HieroLM achieves over 44% accuracy and maintains notable performance on multi-shot predictions and scarce data, which makes it a pragmatic tool to assist scholars in inferring missing hieroglyphs. It can also complement CV-based models to significantly reduce perplexity in recognizing blurry hieroglyphs. Our code is available at https://github.com/Rick-Cai/HieroLM/.
中文:本文提出HieroLM新方法,将象形文字修复建模为语言模型的下一个词预测任务,通过利用上下文信息实现了超过44%的准确率,能有效处理严重损坏或缺失的象形文字,其性能优于传统计算机视觉方法。
English: This paper introduces HieroLM, a novel approach that models hieroglyph recovery as a next word prediction task using language models, achieving over 44% accuracy and effectively handling severely damaged or missing hieroglyphs by leveraging contextual information, outperforming traditional computer vision methods.

Authors:Souvik Kundu, Anahita Bhiwandiwalla, Sungduk Yu, Phillip Howard, Tiep Le, Sharath Nittur Sridhar, David Cobbley, Hao Kang, Vasudev Lal
Title: LVLM-Compress-Bench: Benchmarking the Broader Impact of Large Vision-Language Model Compression
Abstract:
Despite recent efforts in understanding the compression impact on large language models (LLMs) in terms of their downstream task performance and trustworthiness on relatively simpler uni-modal benchmarks (for example, question answering, common sense reasoning), their detailed study on multi-modal Large Vision-Language Models (LVLMs) is yet to be unveiled. Towards mitigating this gap, we present LVLM-Compress-Bench, a framework to first thoroughly study the broad impact of compression on the generative performance of LVLMs with multi-modal input driven tasks. In specific, we consider two major classes of compression for autoregressive models, namely KV cache and weight compression, for the dynamically growing intermediate cache and static weights, respectively. We use four LVLM variants of the popular LLaVA framework to present our analysis via integrating various state-of-the-art KV and weight compression methods including uniform, outlier-reduced, and group quantization for the KV cache and weights. With this framework we demonstrate on ten different multi-modal datasets with different capabilities including recognition, knowledge, language generation, spatial awareness, visual reasoning, hallucination and visual illusion identification, toxicity, stereotypes and bias. In specific, our framework demonstrates the compression impact on both general and ethically critical metrics leveraging a combination of real world and synthetic datasets to encompass diverse societal intersectional attributes. Extensive experimental evaluations yield diverse and intriguing observations on the behavior of LVLMs at different quantization budget of KV and weights, in both maintaining and losing performance as compared to the baseline model with FP16 data format. Code will be open-sourced at https://github.com/opengear-project/LVLM-compress-bench.
中文: 本研究提出LVLM-Compress-Bench框架,全面评估压缩技术对多模态大视觉语言模型在各项任务中性能的影响,揭示了其在通用指标和伦理指标上的多样化表现。
English: This study introduces LVLM-Compress-Bench, a framework to comprehensively evaluate how compression techniques affect the performance of multimodal large vision-language models across various tasks, revealing diverse impacts on both general and ethical metrics.

Authors:Hanene F. Z. Brachemi Meftah, Wassim Hamidouche, Sid Ahmed Fezza, Olivier Deforges
Title: Energy-Latency Attacks: A New Adversarial Threat to Deep Learning
Abstract:
The growing computational demand for deep neural networks ( DNNs) has raised concerns about their energy consumption and carbon footprint, particularly as the size and complexity of the models continue to increase. To address these challenges, energy-efficient hardware and custom accelerators have become essential. Additionally, adaptable DNN s are being developed to dynamically balance performance and efficiency. The use of these strategies became more common to enable sustainable AI deployment. However, these efficiency-focused designs may also introduce vulnerabilities, as attackers can potentially exploit them to increase latency and energy usage by triggering their worst-case-performance scenarios. This new type of attack, called energy-latency attacks, has recently gained significant research attention, focusing on the vulnerability of DNN s to this emerging attack paradigm, which can trigger denial-of-service ( DoS) attacks. This paper provides a comprehensive overview of current research on energy-latency attacks, categorizing them using the established taxonomy for traditional adversarial attacks. We explore different metrics used to measure the success of these attacks and provide an analysis and comparison of existing attack strategies. We also analyze existing defense mechanisms and highlight current challenges and potential areas for future research in this developing field. The GitHub page for this work can be accessed at https://github.com/hbrachemi/Survey_energy_attacks/
中文摘要:本文综述了针对深度神经网络的能量-延迟攻击,这类攻击通过利用高效能设计触发最差性能场景,并系统分析了攻击策略、防御机制及研究挑战。
English Summary: This paper surveys energy-latency attacks on deep neural networks, which exploit efficiency-focused designs to trigger worst-case performance scenarios, and systematically analyzes attack strategies, defenses, and research challenges.

Authors:Armin Ariamajd, Raquel López-Ríos de Castro, Andrea Volkamer
Title: PyPackIT: Automated Research Software Engineering for Scientific Python Applications on GitHub
Abstract:
The increasing importance of Computational Science and Engineering has highlighted the need for high-quality scientific software. However, research software development is often hindered by limited funding, time, staffing, and technical resources. To address these challenges, we introduce PyPackIT, a cloud-based automation tool designed to streamline research software engineering in accordance with FAIR (Findable, Accessible, Interoperable, and Reusable) and Open Science principles. PyPackIT is a user-friendly, ready-to-use software that enables scientists to focus on the scientific aspects of their projects while automating repetitive tasks and enforcing best practices throughout the software development life cycle. Using modern Continuous software engineering and DevOps methodologies, PyPackIT offers a robust project infrastructure including a build-ready Python package skeleton, a fully operational documentation and test suite, and a control center for dynamic project management and customization. PyPackIT integrates seamlessly with GitHub's version control system, issue tracker, and pull-based model to establish a fully-automated software development workflow. Exploiting GitHub Actions, PyPackIT provides a cloud-native Agile development environment using containerization, Configuration-as-Code, and Continuous Integration, Deployment, Testing, Refactoring, and Maintenance pipelines. PyPackIT is an open-source software suite that seamlessly integrates with both new and existing projects via a public GitHub repository template at https://github.com/repodynamics/pypackit.
中文: PyPackIT 是一款基于云的自动化工具,通过实施FAIR原则和DevOps方法简化研究软件开发,让科研人员能专注于研究,同时自动化重复性任务。
English: PyPackIT is a cloud-based automation tool that streamlines research software development by implementing FAIR principles and DevOps methodologies, enabling scientists to focus on research while automating repetitive tasks.

Authors:Thilo Reinold, Suman Ghosh, Guillermo Gallego
Title: Combined Physics and Event Camera Simulator for Slip Detection
Abstract:
Robot manipulation is a common task in fields like industrial manufacturing. Detecting when objects slip from a robot's grasp is crucial for safe and reliable operation. Event cameras, which register pixel-level brightness changes at high temporal resolution (called ``events''), offer an elegant feature when mounted on a robot's end effector: since they only detect motion relative to their viewpoint, a properly grasped object produces no events, while a slipping object immediately triggers them. To research this feature, representative datasets are essential, both for analytic approaches and for training machine learning models. The majority of current research on slip detection with event-based data is done on real-world scenarios and manual data collection, as well as additional setups for data labeling. This can result in a significant increase in the time required for data collection, a lack of flexibility in scene setups, and a high level of complexity in the repetition of experiments. This paper presents a simulation pipeline for generating slip data using the described camera-gripper configuration in a robot arm, and demonstrates its effectiveness through initial data-driven experiments. The use of a simulator, once it is set up, has the potential to reduce the time spent on data collection, provide the ability to alter the setup at any time, simplify the process of repetition and the generation of arbitrarily large data sets. Two distinct datasets were created and validated through visual inspection and artificial neural networks (ANNs). Visual inspection confirmed photorealistic frame generation and accurate slip modeling, while three ANNs trained on this data achieved high validation accuracy and demonstrated good generalization capabilities on a separate test set, along with initial applicability to real-world data. Project page: https://github.com/tub-rip/event_slip
中文: 本文提出了一种在机器人手臂上使用事件相机生成滑移检测数据的仿真流程,该方法通过视觉检查和神经网络验证,能有效减少数据采集时间并提高实验灵活性。
English: This paper introduces a simulation pipeline for generating slip detection data using event cameras on robot arms, which reduces data collection time and enhances flexibility while validating the approach through visual inspection and neural networks.

Authors:Donghyeok Shin, HeeSun Bae, Gyuwon Sim, Wanmo Kang, Il-Chul Moon
Title: Distilling Dataset into Neural Field
Abstract:
Utilizing a large-scale dataset is essential for training high-performance deep learning models, but it also comes with substantial computation and storage costs. To overcome these challenges, dataset distillation has emerged as a promising solution by compressing the large-scale dataset into a smaller synthetic dataset that retains the essential information needed for training. This paper proposes a novel parameterization framework for dataset distillation, coined Distilling Dataset into Neural Field (DDiF), which leverages the neural field to store the necessary information of the large-scale dataset. Due to the unique nature of the neural field, which takes coordinates as input and output quantity, DDiF effectively preserves the information and easily generates various shapes of data. We theoretically confirm that DDiF exhibits greater expressiveness than some previous literature when the utilized budget for a single synthetic instance is the same. Through extensive experiments, we demonstrate that DDiF achieves superior performance on several benchmark datasets, extending beyond the image domain to include video, audio, and 3D voxel. We release the code at https://github.com/aailab-kaist/DDiF.
中文: 本文提出DDiF这一新型数据集蒸馏框架,通过神经场将大规模数据集高效压缩为紧凑的合成数据并保留关键信息,在图像、视频、音频和3D体素等多个领域展现出卓越性能。
English: This paper introduces DDiF, a novel dataset distillation framework that uses neural fields to efficiently compress large datasets into compact synthetic forms while preserving essential information, demonstrating superior performance across multiple domains including images, video, audio, and 3D data.

Authors:Mahfuz Ahmed Anik, Abdur Rahman, Azmine Toushik Wasi, Md Manjurul Ahsan
Title: Preserving Cultural Identity with Context-Aware Translation Through Multi-Agent AI Systems
Abstract:
Language is a cornerstone of cultural identity, yet globalization and the dominance of major languages have placed nearly 3,000 languages at risk of extinction. Existing AI-driven translation models prioritize efficiency but often fail to capture cultural nuances, idiomatic expressions, and historical significance, leading to translations that marginalize linguistic diversity. To address these challenges, we propose a multi-agent AI framework designed for culturally adaptive translation in underserved language communities. Our approach leverages specialized agents for translation, interpretation, content synthesis, and bias evaluation, ensuring that linguistic accuracy and cultural relevance are preserved. Using CrewAI and LangChain, our system enhances contextual fidelity while mitigating biases through external validation. Comparative analysis shows that our framework outperforms GPT-4o, producing contextually rich and culturally embedded translations, a critical advancement for Indigenous, regional, and low-resource languages. This research underscores the potential of multi-agent AI in fostering equitable, sustainable, and culturally sensitive NLP technologies, aligning with the AI Governance, Cultural NLP, and Sustainable NLP pillars of Language Models for Underserved Communities. Our full experimental codebase is publicly available at: https://github.com/ciol-researchlab/Context-Aware_Translation_MAS
中文摘要:本研究提出了一种多智能体AI框架,通过专业代理和偏见缓解技术,为弱势语言群体提供文化适应性翻译,在保持语言准确性和文化相关性方面优于GPT-4o。
English Summary: This research introduces a multi-agent AI framework that enhances culturally adaptive translation for underserved languages, outperforming GPT-4o by preserving linguistic accuracy and cultural relevance through specialized agents and bias mitigation.

Authors:Diyaz Yakubov, David Hästbacka
Title: Comparative Analysis of Lightweight Kubernetes Distributions for Edge Computing: Security, Resilience and Maintainability
Abstract:
The increasing demand for real-time data processing in Internet of Things (IoT) devices has elevated the importance of edge computing, necessitating efficient and secure deployment of applications on resource-constrained devices. Kubernetes and its lightweight distributions (k0s, k3s, KubeEdge, and OpenYurt) extend container orchestration to edge environments, but their security, reliability, and maintainability have not been comprehensively analyzed. This study compares Kubernetes and these lightweight distributions by evaluating security compliance using kube-bench, simulating network outages to assess resiliency, and documenting maintainability. Results indicate that while k3s and k0s offer superior ease of development due to their simplicity, they have lower security compliance compared to Kubernetes, KubeEdge, and OpenYurt. Kubernetes provides a balanced approach but may be resource-intensive for edge deployments. KubeEdge and OpenYurt enhance security features and reliability under network outages but increase complexity and resource consumption. The findings highlight trade-offs between performance, security, resiliency, and maintainability, offering insights for practitioners deploying Kubernetes in edge environments.
中文: 本研究评估了Kubernetes及其轻量级发行版在边缘计算中的应用,揭示了性能与安全之间的权衡:k3s和k0s更易使用但安全性较低,而KubeEdge和OpenYurt在提供更好安全性和可靠性的同时增加了复杂性和资源消耗。
English: This study evaluates Kubernetes and its lightweight distributions for edge computing, revealing trade-offs where simpler options like k3s and k0s offer ease of use but lower security, while KubeEdge and OpenYurt provide better security and reliability at the cost of complexity and resources.

Authors:Stephen Chung, Wenyu Du, Jie Fu
Title: Learning from Failures in Multi-Attempt Reinforcement Learning
Abstract:
Recent advancements in reinforcement learning (RL) for large language models (LLMs), exemplified by DeepSeek R1, have shown that even a simple question-answering task can substantially improve an LLM's reasoning capabilities. In this work, we extend this approach by modifying the task into a multi-attempt setting. Instead of generating a single response per question, the model is given multiple attempts, with feedback provided after incorrect responses. The multi-attempt task encourages the model to refine its previous attempts and improve search efficiency. Experimental results show that even a small LLM trained on a multi-attempt task achieves significantly higher accuracy when evaluated with more attempts, improving from 45.6% with 1 attempt to 52.5% with 2 attempts on the math benchmark. In contrast, the same LLM trained on a standard single-turn task exhibits only a marginal improvement, increasing from 42.3% to 43.2% when given more attempts during evaluation. The results indicate that, compared to the standard single-turn task, an LLM trained on a multi-attempt task achieves slightly better performance on math benchmarks while also learning to refine its responses more effectively based on user feedback. Full code is available at https://github.com/DualityRL/multi-attempt
中文:最新研究表明,通过在多轮尝试任务中结合反馈来训练大语言模型,可显著提升其推理准确性和答案优化能力,效果优于单轮训练方法。
English: Recent research demonstrates that training large language models on multi-attempt tasks with feedback significantly enhances their reasoning accuracy and ability to refine responses, outperforming single-turn training methods.

Authors:Jie Ouyang, Tingyue Pan, Mingyue Cheng, Ruiran Yan, Yucong Luo, Jiaying Lin, Qi Liu
Title: HoH: A Dynamic Benchmark for Evaluating the Impact of Outdated Information on Retrieval-Augmented Generation
Abstract:
While Retrieval-Augmented Generation (RAG) has emerged as an effective approach for addressing the knowledge outdating problem in Large Language Models (LLMs), it still faces a critical challenge: the prevalence of outdated information in knowledge bases. Current research primarily focuses on incorporating up-to-date information, yet the impact of outdated information coexisting in retrieval sources remains inadequately addressed. To bridge this gap, we introduce HoH, the first benchmark specifically designed to evaluate the impact of outdated information on RAG. Our benchmark leverages token-level diff algorithms combined with LLM pipelines to efficiently create a large-scale QA dataset that accurately captures the evolution of temporal knowledge in real-world facts. Through comprehensive experiments, we reveal that outdated information significantly degrades RAG performance in two critical ways: (1) it substantially reduces response accuracy by distracting models from correct information, and (2) it can mislead models into generating potentially harmful outputs, even when current information is available. Current RAG approaches struggle with both retrieval and generation aspects when handling outdated information. These findings highlight the urgent need for innovative solutions to address the temporal challenges in RAG. Our code and data are available at: https://github.com/0russwest0/HoH.
中文:HoH基准测试揭示,检索源中的过时信息会显著降低RAG系统性能,不仅削弱回答准确性还可能产生有害输出,这凸显了解决RAG时序挑战的迫切需求。
English: The HoH benchmark reveals that outdated information in retrieval sources significantly reduces RAG performance by lowering response accuracy and potentially causing harmful outputs, highlighting the need for solutions to temporal challenges in RAG systems.

Authors:Jingtian Yan, Zhifei Li, William Kang, Kevin Zheng, Yulun Zhang, Zhe Chen, Yue Zhang, Daniel Harabor, Stephen F. Smith, Jiaoyang Li
Title: Advancing MAPF towards the Real World: A Scalable Multi-Agent Realistic Testbed (SMART)
Abstract:
We present Scalable Multi-Agent Realistic Testbed (SMART), a realistic and efficient software tool for evaluating Multi-Agent Path Finding (MAPF) algorithms. MAPF focuses on planning collision-free paths for a group of agents. While state-ofthe-art MAPF algorithms can plan paths for hundreds of robots in seconds, they often rely on simplified robot models, making their real-world performance unclear. Researchers typically lack access to hundreds of physical robots in laboratory settings to evaluate the algorithms. Meanwhile, industrial professionals who lack expertise in MAPF require an easy-to-use simulator to efficiently test and understand the performance of MAPF algorithms in their specific settings. SMART fills this gap with several advantages: (1) SMART uses physics-engine-based simulators to create realistic simulation environments, accounting for complex real-world factors such as robot kinodynamics and execution uncertainties, (2) SMART uses an execution monitor framework based on the Action Dependency Graph, facilitating seamless integration with various MAPF algorithms and robot models, and (3) SMART scales to thousands of robots. The code is publicly available at https://github.com/smart-mapf/smart.
中文:SMART是一款可扩展的多智能体真实测试平台,它通过基于物理引擎的仿真环境,能够评估数千个机器人的路径规划算法,有效弥合了理论算法与实际应用之间的差距。
English: SMART is a realistic and scalable software tool designed to evaluate Multi-Agent Path Finding algorithms by incorporating physics-based simulations and handling thousands of robots, addressing the gap between theoretical algorithms and real-world applications.

Authors:Sumin Ha, Jun Hyeong Kim, Yinhua Piao, Sun Kim
Title: MV-CLAM: Multi-View Molecular Interpretation with Cross-Modal Projection via Language Model
Abstract:
Human expertise in chemistry and biomedicine relies on contextual molecular understanding, a capability that large language models (LLMs) can extend through fine-grained alignment between molecular structures and text. Recent multimodal learning advances focus on cross-modal alignment, but existing molecule-text models ignore complementary information in different molecular views and rely on single-view representations, limiting molecular understanding. Moreover, naïve multi-view alignment strategies face two challenges: (1) separate aligned spaces with inconsistent mappings between molecule and text embeddings, and that (2) existing loss objectives fail to preserve complementary information for fine-grained alignment. This can limit the LLM's ability to fully understand the molecular properties. To address these issues, we propose MV-CLAM, a novel framework that aligns multi-view molecular representations into a unified textual space using a multi-query transformer (MQ-Former). Our approach ensures cross-view consistency while a token-level contrastive loss preserves diverse molecular features across textual queries. MV-CLAM enhances molecular reasoning, improving retrieval and captioning accuracy. The source code of MV-CLAM is available in https://github.com/sumin124/mv-clam.git.
中文:MV-CLAM是一个新颖的框架,它通过多查询变换器将多视角分子表征对齐到统一的文本空间,从而通过提升检索和描述准确性来增强分子推理能力。
English: MV-CLAM is a novel framework that aligns multi-view molecular representations into a unified textual space using a multi-query transformer, enhancing molecular reasoning by improving retrieval and captioning accuracy.

Authors:Yansong Ning, Shuowei Cai, Wei Li, Jun Fang, Naiqiang Tan, Hua Chai, Hao Liu
Title: DiMA: An LLM-Powered Ride-Hailing Assistant at DiDi
Abstract:
On-demand ride-hailing services like DiDi, Uber, and Lyft have transformed urban transportation, offering unmatched convenience and flexibility. In this paper, we introduce DiMA, an LLM-powered ride-hailing assistant deployed in DiDi Chuxing. Its goal is to provide seamless ride-hailing services and beyond through a natural and efficient conversational interface under dynamic and complex spatiotemporal urban contexts. To achieve this, we propose a spatiotemporal-aware order planning module that leverages external tools for precise spatiotemporal reasoning and progressive order planning. Additionally, we develop a cost-effective dialogue system that integrates multi-type dialog repliers with cost-aware LLM configurations to handle diverse conversation goals and trade-off response quality and latency. Furthermore, we introduce a continual fine-tuning scheme that utilizes real-world interactions and simulated dialogues to align the assistant's behavior with human preferred decision-making processes. Since its deployment in the DiDi application, DiMA has demonstrated exceptional performance, achieving 93% accuracy in order planning and 92% in response generation during real-world interactions. Offline experiments further validate DiMA capabilities, showing improvements of up to 70.23% in order planning and 321.27% in response generation compared to three state-of-the-art agent frameworks, while reducing latency by $0.72\times$ to $5.47\times$. These results establish DiMA as an effective, efficient, and intelligent mobile assistant for ride-hailing services. Our project is released at https://github.com/usail-hkust/DiMA and we also release the MCP service (https://mcp.didichuxing.com/api) to foster the ride-hailing research community.
中文: 本文介绍了在滴滴出行中部署的基于大语言模型的网约车助手DiMA,它通过时空感知的订单规划和高效对话系统,在实际应用中实现了高精度的订单规划和响应生成,相比现有框架有显著提升。
English: This paper introduces DiMA, an LLM-powered ride-hailing assistant deployed in DiDi Chuxing that achieves high accuracy in order planning and response generation through spatiotemporal reasoning and cost-effective dialogue systems, demonstrating significant improvements over existing frameworks.

Authors:Jules Viennot, Guillaume Baudart, Emilio Jesùs Gallego Arias, Marc Lelarge
Title: MiniF2F in Rocq: Automatic Translation Between Proof Assistants -- A Case Study
Abstract:
In this work, we conduct an experiment using state-of-the-art LLMs to translate MiniF2F into Rocq. The translation task focuses on generating a Rocq theorem based on three sources: a natural language description, the Lean formalization, and the Isabelle formalization. We conducted our experiment in 3 stages of increasing complexity, from basic one-shot prompting to multi-turn conversations that incorporate feedback from unsuccessful attempts. At each stage, we perform multiple rounds of translation using increasingly advanced models: GPT-4o mini, Claude 3.5 Sonnet, o1 mini, and o1. We successfully translated 478 out of 488 theorems. The dataset is opensource: https://github.com/LLM4Rocq/miniF2F-rocq.
中文: 本研究通过逐步复杂的提示阶段,利用先进的大语言模型成功将488个定理中的478个从MiniF2F翻译为Rocq,并公开了数据集。
English: This study successfully translated 478 out of 488 theorems from MiniF2F to Rocq using advanced LLMs through progressively complex prompting stages, with the dataset made publicly available.

Authors:Zheng Hui, Yinheng Li, Dan zhao, Tianyi Chen, Colby Banbury, Kazuhito Koishida
Title: WinClick: GUI Grounding with Multimodal Large Language Models
Abstract:
Graphical User Interface (GUI) tasks are vital for automating workflows such as software testing, user interface navigation. For users, the GUI is the most intuitive platform for interacting with a computer. Previous work identified a key challenge in developing visual GUI agents: GUI grounding - the ability to accurately locate screen elements based on instructions. However, most existing GUI agents rely on structured data formats like DOM or HTML files in training or inferencing, which are inaccessible across all applications, particular in a general desktop environments such as Windows OS. To address this, we introduce WinClick, a novel visual GUI agent developed in Windows platform. WinClick leverages screenshots to detect actionable regions. To overcome the challenge of GUI grounding, we enhance WinClick with GUI grounding pre-training and propose an LLM-based method for aligning GUI grounding data. Additionally, we introduce WinSpot, the first comprehensive benchmark for GUI grounding on Windows. Our experiments demonstrate that WinClick, combined with GUI grounding pre-training, significantly outperforms existing baselines, offering a scalable solution for GUI automation in desktop environments. WinSpot is publicly available at https://github.com/zackhuiiiii/WinSpot.
中文:WinClick是一种创新的Windows视觉GUI代理,通过截图和增强的GUI基础预训练来自动化桌面任务,其性能优于现有方法,并得到WinSpot基准测试的支持。
English: WinClick is a novel visual GUI agent for Windows that uses screenshots and enhanced GUI grounding pre-training to automate desktop tasks, outperforming existing methods and supported by the WinSpot benchmark.

Authors:Houyi Li, Wenzhen Zheng, Qiufeng Wang, Hanshan Zhang, Zili Wang, Shijie Xuyang, Yuantao Fan, Zhenyu Ding, Haoying Wang, Ning Ding, Shuigeng Zhou, Xiangyu Zhang, Daxin Jiang
Title: Predictable Scale: Part I, Step Law -- Optimal Hyperparameter Scaling Law in Large Language Model Pretraining
Abstract:
The impressive capabilities of Large Language Models (LLMs) across diverse tasks are now well established, yet their effective deployment necessitates careful hyperparameter optimization. Although existing methods have explored the influence of hyperparameters on model performance, a principled and generalizable framework across model architectures and data recipes remains absent. In this study, we conduct an unprecedented empirical investigation training over 3,700 LLMs from scratch across 100 trillion tokens, consuming nearly one million NVIDIA H800 GPU hours to establish a universal Scaling Law for hyperparameter optimization in LLM Pre-training, called Step Law. We empirically observe that, under fixed model size ($N$) and dataset size ($D$), the hyperparameter landscape exhibits convexity with a broad optimum, substantially reducing the complexity of hyperparameter search. Building on this insight, we formally define and empirically validate the Step Law: The optimal learning rate follows a power-law relationship with $N$ and $D$, while the optimal batch size is primarily influenced by $D$ and remains largely invariant to $N$.Notably, our estimated optima deviate from the global best performance found via exhaustive search by merely 0.094\% on the test set. To our best known, Step Law is the first that unifies different model shapes and structures, such as Mixture-of-Experts models and dense transformers, as well as establishes optimal hyperparameter scaling laws across diverse data recipes. We contribute a universal, plug-and-play optimal hyperparameter tool for the community, which is expected to advance efficient LLM training at scale. All experimental code, data and checkpoints are publicly available at https://github.com/step-law/steplaw
中文摘要:本研究提出了Step Law这一适用于大语言模型预训练的超参数优化通用缩放定律,通过揭示超参数空间的凸性特征及学习率与模型/数据规模间的幂律关系,大幅简化了超参数搜索复杂度,其估计最优值与全局最优解的测试集差异仅为0.094%。
English Summary: This study introduces Step Law, a universal scaling law for hyperparameter optimization in LLM pre-training, which simplifies hyperparameter search by revealing convex landscapes and power-law relationships between optimal learning rates and model/data sizes, achieving near-optimal performance with minimal deviation.

Authors:Anuj Diwan, Zhisheng Zheng, David Harwath, Eunsol Choi
Title: Scaling Rich Style-Prompted Text-to-Speech Datasets
Abstract:
We introduce Paralinguistic Speech Captions (ParaSpeechCaps), a large-scale dataset that annotates speech utterances with rich style captions. While rich abstract tags (e.g. guttural, nasal, pained) have been explored in small-scale human-annotated datasets, existing large-scale datasets only cover basic tags (e.g. low-pitched, slow, loud). We combine off-the-shelf text and speech embedders, classifiers and an audio language model to automatically scale rich tag annotations for the first time. ParaSpeechCaps covers a total of 59 style tags, including both speaker-level intrinsic tags and utterance-level situational tags. It consists of 342 hours of human-labelled data (PSC-Base) and 2427 hours of automatically annotated data (PSC-Scaled). We finetune Parler-TTS, an open-source style-prompted TTS model, on ParaSpeechCaps, and achieve improved style consistency (+7.9% Consistency MOS) and speech quality (+15.5% Naturalness MOS) over the best performing baseline that combines existing rich style tag datasets. We ablate several of our dataset design choices to lay the foundation for future work in this space. Our dataset, models and code are released at https://github.com/ajd12342/paraspeechcaps .
中文: 我们推出ParaSpeechCaps大规模语音风格标注数据集,结合人工标注与自动扩展数据,显著提升了语音合成模型的风格一致性和自然度表现。
English: We introduce ParaSpeechCaps, a large-scale dataset with rich style annotations for speech, combining human-labeled and automatically scaled data to significantly improve text-to-speech model performance in style consistency and naturalness.

Authors:Dou Hu, Lingwei Wei, Wei Zhou, Songlin Hu
Title: An Information-theoretic Multi-task Representation Learning Framework for Natural Language Understanding
Abstract:
This paper proposes a new principled multi-task representation learning framework (InfoMTL) to extract noise-invariant sufficient representations for all tasks. It ensures sufficiency of shared representations for all tasks and mitigates the negative effect of redundant features, which can enhance language understanding of pre-trained language models (PLMs) under the multi-task paradigm. Firstly, a shared information maximization principle is proposed to learn more sufficient shared representations for all target tasks. It can avoid the insufficiency issue arising from representation compression in the multi-task paradigm. Secondly, a task-specific information minimization principle is designed to mitigate the negative effect of potential redundant features in the input for each task. It can compress task-irrelevant redundant information and preserve necessary information relevant to the target for multi-task prediction. Experiments on six classification benchmarks show that our method outperforms 12 comparative multi-task methods under the same multi-task settings, especially in data-constrained and noisy scenarios. Extensive experiments demonstrate that the learned representations are more sufficient, data-efficient, and robust.
中文摘要:本文提出InfoMTL多任务学习框架,通过共享信息最大化和任务特定信息最小化原则,提取噪声不变的充分表征,增强预训练语言模型在多任务场景下的理解能力,在多个基准测试中优于现有方法。
English Summary: The paper introduces InfoMTL, a multi-task learning framework that enhances language models by extracting noise-invariant sufficient representations through shared information maximization and task-specific information minimization, outperforming existing methods in various scenarios.

Authors:Emanuele Bugliarello, Anurag Arnab, Roni Paiss, Pieter-Jan Kindermans, Cordelia Schmid
Title: What Are You Doing? A Closer Look at Controllable Human Video Generation
Abstract:
High-quality benchmarks are crucial for driving progress in machine learning research. However, despite the growing interest in video generation, there is no comprehensive dataset to evaluate human generation. Humans can perform a wide variety of actions and interactions, but existing datasets, like TikTok and TED-Talks, lack the diversity and complexity to fully capture the capabilities of video generation models. We close this gap by introducing `What Are You Doing?' (WYD): a new benchmark for fine-grained evaluation of controllable image-to-video generation of humans. WYD consists of 1{,}544 captioned videos that have been meticulously collected and annotated with 56 fine-grained categories. These allow us to systematically measure performance across 9 aspects of human generation, including actions, interactions and motion. We also propose and validate automatic metrics that leverage our annotations and better capture human evaluations. Equipped with our dataset and metrics, we perform in-depth analyses of seven state-of-the-art models in controllable image-to-video generation, showing how WYD provides novel insights about the capabilities of these models. We release our data and code to drive forward progress in human video generation modeling at https://github.com/google-deepmind/wyd-benchmark.
中文: 该摘要介绍了“你在做什么?”(WYD)基准,这是一个用于评估可控图像到视频生成的数据集,通过提供细粒度类别和自动指标来弥补现有数据集的不足,从而深入分析模型性能。
English: The abstract introduces the "What Are You Doing?" (WYD) benchmark, a comprehensive dataset designed to evaluate controllable image-to-video generation of humans, addressing the limitations of existing datasets by providing fine-grained categories and automatic metrics for in-depth model analysis.

Authors:Shengzhuang Chen, Yikai Liao, Xiaoxiao Sun, Kede Ma, Ying Wei
Title: CLDyB: Towards Dynamic Benchmarking for Continual Learning with Pre-trained Models
Abstract:
The advent of the foundation model era has sparked significant research interest in leveraging pre-trained representations for continual learning (CL), yielding a series of top-performing CL methods on standard evaluation benchmarks. Nonetheless, there are growing concerns regarding potential data contamination during the pre-training stage. Furthermore, standard evaluation benchmarks, which are typically static, fail to capture the complexities of real-world CL scenarios, resulting in saturated performance. To address these issues, we describe CL on dynamic benchmarks (CLDyB), a general computational framework based on Markov decision processes for evaluating CL methods reliably. CLDyB dynamically identifies inherently difficult and algorithm-dependent tasks for the given CL methods, and determines challenging task orders using Monte Carlo tree search. Leveraging CLDyB, we first conduct a joint evaluation of multiple state-of-the-art CL methods, leading to a set of commonly challenging and generalizable task sequences where existing CL methods tend to perform poorly. We then conduct separate evaluations of individual CL methods using CLDyB, discovering their respective strengths and weaknesses. The source code and generated task sequences are publicly accessible at https://github.com/szc12153/CLDyB.
中文: 本研究提出CLDyB动态基准框架,通过马尔可夫决策过程和蒙特卡洛树搜索动态识别困难任务序列,可靠评估持续学习方法,揭示其局限性和优势。
English: The study introduces CLDyB, a dynamic benchmark framework using Markov decision processes and Monte Carlo tree search to reliably evaluate continual learning methods by identifying challenging task sequences, revealing their limitations and strengths.

Authors:Jiang Li, Xiaoping Wang
Title: Joint Masked Reconstruction and Contrastive Learning for Mining Interactions Between Proteins
Abstract:
Protein-protein interaction (PPI) prediction is an instrumental means in elucidating the mechanisms underlying cellular operations, holding significant practical implications for the realms of pharmaceutical development and clinical treatment. Presently, the majority of research methods primarily concentrate on the analysis of amino acid sequences, while investigations predicated on protein structures remain in the nascent stages of exploration. Despite the emergence of several structure-based algorithms in recent years, these are still confronted with inherent challenges: (1) the extraction of intrinsic structural information of proteins typically necessitates the expenditure of substantial computational resources; (2) these models are overly reliant on seen protein data, struggling to effectively unearth interaction cues between unknown proteins. To further propel advancements in this domain, this paper introduces a novel PPI prediction method jointing masked reconstruction and contrastive learning, termed JmcPPI. This methodology dissects the PPI prediction task into two distinct phases: during the residue structure encoding phase, JmcPPI devises two feature reconstruction tasks and employs graph attention mechanism to capture structural information between residues; during the protein interaction inference phase, JmcPPI perturbs the original PPI graph and employs a multi-graph contrastive learning strategy to thoroughly mine extrinsic interaction information of novel proteins. Extensive experiments conducted on three widely utilized PPI datasets demonstrate that JmcPPI surpasses existing optimal baseline models across various data partition schemes. The associated code can be accessed via https://github.com/lijfrank-open/JmcPPI.
中文: 本文提出了一种结合掩码重建与对比学习的蛋白质相互作用预测方法JmcPPI,通过分阶段处理结构编码和相互作用推断,在多个数据集上展现出优于现有模型的性能表现。
English: This paper introduces JmcPPI, a novel protein-protein interaction prediction method combining masked reconstruction and contrastive learning, which effectively captures structural features and interaction patterns while demonstrating superior performance over existing models across multiple datasets.

Authors:Wen Yang, Junhong Wu, Chen Wang, Chengqing Zong, Jiajun Zhang
Title: Implicit Cross-Lingual Rewarding for Efficient Multilingual Preference Alignment
Abstract:
Direct Preference Optimization (DPO) has become a prominent method for aligning Large Language Models (LLMs) with human preferences. While DPO has enabled significant progress in aligning English LLMs, multilingual preference alignment is hampered by data scarcity. To address this, we propose a novel approach that $\textit{captures}$ learned preferences from well-aligned English models by implicit rewards and $\textit{transfers}$ them to other languages through iterative training. Specifically, we derive an implicit reward model from the logits of an English DPO-aligned model and its corresponding reference model. This reward model is then leveraged to annotate preference relations in cross-lingual instruction-following pairs, using English instructions to evaluate multilingual responses. The annotated data is subsequently used for multilingual DPO fine-tuning, facilitating preference knowledge transfer from English to other languages. Fine-tuning Llama3 for two iterations resulted in a 12.72% average improvement in Win Rate and a 5.97% increase in Length Control Win Rate across all training languages on the X-AlpacaEval leaderboard. Our findings demonstrate that leveraging existing English-aligned models can enable efficient and effective multilingual preference alignment, significantly reducing the need for extensive multilingual preference data. The code is available at https://github.com/ZNLP/Implicit-Cross-Lingual-Rewarding
中文摘要:直接偏好优化(DPO)通过从英语对齐模型中提取隐式奖励,将习得的偏好知识迁移至其他语言,实现了无需大量多语言数据的高效跨语言偏好对齐,显著提升了模型性能。
English Summary: Direct Preference Optimization (DPO) enables effective multilingual preference alignment by transferring learned preferences from English-aligned models to other languages through implicit rewards, achieving significant performance improvements without extensive multilingual data.

Authors:Hong Liu, Haosen Yang, Federica Eduati, Josien P. W. Pluim, Mitko Veta
Title: Adaptive Prototype Learning for Multimodal Cancer Survival Analysis
Abstract:
Leveraging multimodal data, particularly the integration of whole-slide histology images (WSIs) and transcriptomic profiles, holds great promise for improving cancer survival prediction. However, excessive redundancy in multimodal data can degrade model performance. In this paper, we propose Adaptive Prototype Learning (APL), a novel and effective approach for multimodal cancer survival analysis. APL adaptively learns representative prototypes in a data-driven manner, reducing redundancy while preserving critical information. Our method employs two sets of learnable query vectors that serve as a bridge between high-dimensional representations and survival prediction, capturing task-relevant features. Additionally, we introduce a multimodal mixed self-attention mechanism to enable cross-modal interactions, further enhancing information fusion. Extensive experiments on five benchmark cancer datasets demonstrate the superiority of our approach over existing methods. The code is available at https://github.com/HongLiuuuuu/APL.
中文: 本文提出的自适应原型学习(APL)方法通过数据驱动方式学习代表性原型并引入跨模态交互,有效降低多模态癌症数据冗余,在五个基准数据集上展现出优越性能。
English: The proposed Adaptive Prototype Learning (APL) method effectively reduces redundancy in multimodal cancer data by adaptively learning prototypes and employing cross-modal interactions, demonstrating superior performance across five benchmark datasets.

Authors:Yuqi Hu, Longguang Wang, Xian Liu, Ling-Hao Chen, Yuwei Guo, Yukai Shi, Ce Liu, Anyi Rao, Zeyu Wang, Hui Xiong
Title: Simulating the Real World: A Unified Survey of Multimodal Generative Models
Abstract:
Understanding and replicating the real world is a critical challenge in Artificial General Intelligence (AGI) research. To achieve this, many existing approaches, such as world models, aim to capture the fundamental principles governing the physical world, enabling more accurate simulations and meaningful interactions. However, current methods often treat different modalities, including 2D (images), videos, 3D, and 4D representations, as independent domains, overlooking their interdependencies. Additionally, these methods typically focus on isolated dimensions of reality without systematically integrating their connections. In this survey, we present a unified survey for multimodal generative models that investigate the progression of data dimensionality in real-world simulation. Specifically, this survey starts from 2D generation (appearance), then moves to video (appearance+dynamics) and 3D generation (appearance+geometry), and finally culminates in 4D generation that integrate all dimensions. To the best of our knowledge, this is the first attempt to systematically unify the study of 2D, video, 3D and 4D generation within a single framework. To guide future research, we provide a comprehensive review of datasets, evaluation metrics and future directions, and fostering insights for newcomers. This survey serves as a bridge to advance the study of multimodal generative models and real-world simulation within a unified framework.
中文摘要:本综述首次提出多模态生成模型的统一框架,系统梳理从二维到四维生成的演进路径,旨在克服现有方法在模态整合方面的局限,推动人工智能通用智能中的真实世界模拟研究。
English Summary: This survey presents a unified framework for multimodal generative models, systematically reviewing the progression from 2D to 4D generation to advance real-world simulation in AGI research while addressing current limitations in modality integration.

Authors:Hong Liu, Haosen Yang, Evi M. C. Huijben, Mark Schuiveling, Ruisheng Su, Josien P. W. Pluim, Mitko Veta
Title: PathoPainter: Augmenting Histopathology Segmentation via Tumor-aware Inpainting
Abstract:
Tumor segmentation plays a critical role in histopathology, but it requires costly, fine-grained image-mask pairs annotated by pathologists. Thus, synthesizing histopathology data to expand the dataset is highly desirable. Previous works suffer from inaccuracies and limited diversity in image-mask pairs, both of which affect training segmentation, particularly in small-scale datasets and the inherently complex nature of histopathology images. To address this challenge, we propose PathoPainter, which reformulates image-mask pair generation as a tumor inpainting task. Specifically, our approach preserves the background while inpainting the tumor region, ensuring precise alignment between the generated image and its corresponding mask. To enhance dataset diversity while maintaining biological plausibility, we incorporate a sampling mechanism that conditions tumor inpainting on regional embeddings from a different image. Additionally, we introduce a filtering strategy to exclude uncertain synthetic regions, further improving the quality of the generated data. Our comprehensive evaluation spans multiple datasets featuring diverse tumor types and various training data scales. As a result, segmentation improved significantly with our synthetic data, surpassing existing segmentation data synthesis approaches, e.g., 75.69% -> 77.69% on CAMELYON16. The code is available at https://github.com/HongLiuuuuu/PathoPainter.
Chinese: PathoPainter通过将图像-掩码对生成重构为肿瘤修复任务,在保留背景的同时生成多样且生物学可信的肿瘤区域,显著提升了多数据集上的分割性能。
English: PathoPainter enhances tumor segmentation by generating diverse and biologically plausible synthetic image-mask pairs through tumor inpainting, significantly improving segmentation accuracy across multiple datasets.

Authors:Xiangchao Yan, Shiyang Feng, Jiakang Yuan, Renqiu Xia, Bin Wang, Bo Zhang, Lei Bai
Title: SurveyForge: On the Outline Heuristics, Memory-Driven Generation, and Multi-dimensional Evaluation for Automated Survey Writing
Abstract:
Survey paper plays a crucial role in scientific research, especially given the rapid growth of research publications. Recently, researchers have begun using LLMs to automate survey generation for better efficiency. However, the quality gap between LLM-generated surveys and those written by human remains significant, particularly in terms of outline quality and citation accuracy. To close these gaps, we introduce SurveyForge, which first generates the outline by analyzing the logical structure of human-written outlines and referring to the retrieved domain-related articles. Subsequently, leveraging high-quality papers retrieved from memory by our scholar navigation agent, SurveyForge can automatically generate and refine the content of the generated article. Moreover, to achieve a comprehensive evaluation, we construct SurveyBench, which includes 100 human-written survey papers for win-rate comparison and assesses AI-generated survey papers across three dimensions: reference, outline, and content quality. Experiments demonstrate that SurveyForge can outperform previous works such as AutoSurvey.
中文摘要:SurveyForge通过分析人工撰写的大纲结构并参考检索的领域文献来生成大纲,利用学者导航代理检索高质量论文自动生成并优化内容,实验证明其性能优于AutoSurvey等先前工作。
English Summary: SurveyForge is introduced to bridge the quality gap in LLM-generated surveys by analyzing human-written outlines and leveraging retrieved scholarly articles, with experiments showing it outperforms prior methods like AutoSurvey.

Authors:Aoxiong Yin, Kai Shen, Yichong Leng, Xu Tan, Xinyu Zhou, Juncheng Li, Siliang Tang
Title: The Best of Both Worlds: Integrating Language Models and Diffusion Models for Video Generation
Abstract:
Recent advancements in text-to-video (T2V) generation have been driven by two competing paradigms: autoregressive language models and diffusion models. However, each paradigm has intrinsic limitations: language models struggle with visual quality and error accumulation, while diffusion models lack semantic understanding and causal modeling. In this work, we propose LanDiff, a hybrid framework that synergizes the strengths of both paradigms through coarse-to-fine generation. Our architecture introduces three key innovations: (1) a semantic tokenizer that compresses 3D visual features into compact 1D discrete representations through efficient semantic compression, achieving a $\sim$14,000$\times$ compression ratio; (2) a language model that generates semantic tokens with high-level semantic relationships; (3) a streaming diffusion model that refines coarse semantics into high-fidelity videos. Experiments show that LanDiff, a 5B model, achieves a score of 85.43 on the VBench T2V benchmark, surpassing the state-of-the-art open-source models Hunyuan Video (13B) and other commercial models such as Sora, Kling, and Hailuo. Furthermore, our model also achieves state-of-the-art performance in long video generation, surpassing other open-source models in this field. Our demo can be viewed at https://landiff.github.io/.
中文:LanDiff是一种结合自回归语言模型与扩散模型优势的混合文本到视频框架,通过从粗到细的生成方式克服了两者固有缺陷,在标准及长视频生成基准测试中均实现了最先进的性能表现。
English: LanDiff is a hybrid text-to-video framework that combines autoregressive language models and diffusion models to overcome their individual limitations, achieving state-of-the-art performance in both standard and long video generation benchmarks.

Authors:Zhijian Zhuo, Yutao Zeng, Ya Wang, Sijun Zhang, Jian Yang, Xiaoqing Li, Xun Zhou, Jinwen Ma
Title: HybridNorm: Towards Stable and Efficient Transformer Training via Hybrid Normalization
Abstract:
Transformers have become the de facto architecture for a wide range of machine learning tasks, particularly in large language models (LLMs). Despite their remarkable performance, challenges remain in training deep transformer networks, especially regarding the position of layer normalization. While Pre-Norm structures facilitate more stable training owing to their stronger identity path, they often lead to suboptimal performance compared to Post-Norm. In this paper, we propose $\textbf{HybridNorm}$, a simple yet effective hybrid normalization strategy that integrates the advantages of both Pre-Norm and Post-Norm. Specifically, HybridNorm employs QKV normalization within the attention mechanism and Post-Norm in the feed-forward network (FFN) of each transformer block. We provide both theoretical insights and empirical evidence demonstrating that HybridNorm improves gradient flow and model robustness. Extensive experiments on large-scale transformer models, including both dense and sparse variants, show that HybridNorm consistently outperforms both Pre-Norm and Post-Norm approaches across multiple benchmarks. These findings highlight the potential of HybridNorm as a more stable and effective technique for improving the training and performance of deep transformer models. Code is available at https://github.com/BryceZhuo/HybridNorm.
中文: 本文提出HybridNorm混合归一化策略,通过结合Pre-Norm和Post-Norm的优势来改善深度Transformer的梯度流动和鲁棒性,在多个基准测试中持续优于现有方法。
English: The paper introduces HybridNorm, a hybrid normalization strategy that combines Pre-Norm and Post-Norm to enhance gradient flow and robustness in deep transformers, consistently outperforming existing methods across benchmarks.

Authors:Qing Zhou, Tao Yang, Junyu Gao, Weiping Ni, Junzheng Wu, Qi Wang
Title: A Benchmark for Multi-Lingual Vision-Language Learning in Remote Sensing Image Captioning
Abstract:
Remote Sensing Image Captioning (RSIC) is a cross-modal field bridging vision and language, aimed at automatically generating natural language descriptions of features and scenes in remote sensing imagery. Despite significant advances in developing sophisticated methods and large-scale datasets for training vision-language models (VLMs), two critical challenges persist: the scarcity of non-English descriptive datasets and the lack of multilingual capability evaluation for models. These limitations fundamentally impede the progress and practical deployment of RSIC, particularly in the era of large VLMs. To address these challenges, this paper presents several significant contributions to the field. First, we introduce and analyze BRSIC (Bilingual Remote Sensing Image Captioning), a comprehensive bilingual dataset that enriches three established English RSIC datasets with Chinese descriptions, encompassing 13,634 images paired with 68,170 bilingual captions. Building upon this foundation, we develop a systematic evaluation framework that addresses the prevalent inconsistency in evaluation protocols, enabling rigorous assessment of model performance through standardized retraining procedures on BRSIC. Furthermore, we present an extensive empirical study of eight state-of-the-art large vision-language models (LVLMs), examining their capabilities across multiple paradigms including zero-shot inference, supervised fine-tuning, and multi-lingual training. This comprehensive evaluation provides crucial insights into the strengths and limitations of current LVLMs in handling multilingual remote sensing tasks. Additionally, our cross-dataset transfer experiments reveal interesting findings. The code and data will be available at https://github.com/mrazhou/BRSIC.
中文摘要:本文提出了BRSIC双语遥感图像描述数据集,构建了标准化评估框架系统评估大视觉语言模型的多语言能力,解决了该领域缺乏多语言数据和统一评估标准的关键瓶颈。
English Summary: This paper introduces BRSIC, a bilingual dataset with Chinese and English captions for remote sensing images, and establishes a standardized evaluation framework to assess multilingual capabilities of large vision-language models, addressing key limitations in the field.

Authors:Yibin Wu, Jian Kuang, Shahram Khorshidi, Xiaoji Niu, Lasse Klingbeil, Maren Bennewitz, Heiner Kuhlmann
Title: DogLegs: Robust Proprioceptive State Estimation for Legged Robots Using Multiple Leg-Mounted IMUs
Abstract:
Robust and accurate proprioceptive state estimation of the main body is crucial for legged robots to execute tasks in extreme environments where exteroceptive sensors, such as LiDARs and cameras, may become unreliable. In this paper, we propose DogLegs, a state estimation system for legged robots that fuses the measurements from a body-mounted inertial measurement unit (Body-IMU), joint encoders, and multiple leg-mounted IMUs (Leg-IMU) using an extended Kalman filter (EKF). The filter system contains the error states of all IMU frames. The Leg-IMUs are used to detect foot contact, thereby providing zero-velocity measurements to update the state of the Leg-IMU frames. Additionally, we compute the relative position constraints between the Body-IMU and Leg-IMUs by the leg kinematics and use them to update the main body state and reduce the error drift of the individual IMU frames. Field experimental results have shown that our proposed DogLegs system achieves better state estimation accuracy compared to the traditional leg odometry method (using only Body-IMU and joint encoders) across various terrains. We make our datasets publicly available to benefit the research community (https://github.com/YibinWu/leg-odometry).
中文:DogLegs是一种用于腿式机器人的鲁棒状态估计系统,通过扩展卡尔曼滤波器融合身体和腿部IMU与关节编码器数据,相比传统方法显著提升了多种地形下的精度。
English: DogLegs is a robust state estimation system for legged robots that integrates body and leg IMUs with joint encoders via an extended Kalman filter, significantly improving accuracy across diverse terrains compared to traditional methods.

Authors:Kai Luo, Hao Shi, Sheng Wu, Fei Teng, Mengfei Duan, Chang Huang, Yuhang Wang, Kaiwei Wang, Kailun Yang
Title: Omnidirectional Multi-Object Tracking
Abstract:
Panoramic imagery, with its 360° field of view, offers comprehensive information to support Multi-Object Tracking (MOT) in capturing spatial and temporal relationships of surrounding objects. However, most MOT algorithms are tailored for pinhole images with limited views, impairing their effectiveness in panoramic settings. Additionally, panoramic image distortions, such as resolution loss, geometric deformation, and uneven lighting, hinder direct adaptation of existing MOT methods, leading to significant performance degradation. To address these challenges, we propose OmniTrack, an omnidirectional MOT framework that incorporates Tracklet Management to introduce temporal cues, FlexiTrack Instances for object localization and association, and the CircularStatE Module to alleviate image and geometric distortions. This integration enables tracking in panoramic field-of-view scenarios, even under rapid sensor motion. To mitigate the lack of panoramic MOT datasets, we introduce the QuadTrack dataset--a comprehensive panoramic dataset collected by a quadruped robot, featuring diverse challenges such as panoramic fields of view, intense motion, and complex environments. Extensive experiments on the public JRDB dataset and the newly introduced QuadTrack benchmark demonstrate the state-of-the-art performance of the proposed framework. OmniTrack achieves a HOTA score of 26.92% on JRDB, representing an improvement of 3.43%, and further achieves 23.45% on QuadTrack, surpassing the baseline by 6.81%. The established dataset and source code are available at https://github.com/xifen523/OmniTrack.
中文摘要:OmniTrack是一种创新的全景多目标跟踪框架,通过整合轨迹管理、灵活跟踪实例和环形状态模块,有效解决了全景图像失真问题,并在JRDB和新型QuadTrack数据集上实现了最先进的性能表现。
English Summary: OmniTrack is a novel omnidirectional multi-object tracking framework designed to overcome panoramic image distortions and the limitations of traditional tracking methods, achieving state-of-the-art performance on both JRDB and the newly introduced QuadTrack dataset.

Authors:Armel Zebaze, Benoît Sagot, Rachel Bawden
Title: Compositional Translation: A Novel LLM-based Approach for Low-resource Machine Translation
Abstract:
The ability of generative large language models (LLMs) to perform in-context learning has given rise to a large body of research into how best to prompt models for various natural language processing tasks. Machine Translation (MT) has been shown to benefit from in-context examples, in particular when they are semantically similar to the sentence to translate. In this paper, we propose a new LLM-based translation paradigm, compositional translation, to replace naive few-shot MT with similarity-based demonstrations. An LLM is used to decompose a sentence into simpler phrases, and then to translate each phrase with the help of retrieved demonstrations. Finally, the LLM is prompted to translate the initial sentence with the help of the self-generated phrase-translation pairs. Our intuition is that this approach should improve translation because these shorter phrases should be intrinsically easier to translate and easier to match with relevant examples. This is especially beneficial in low-resource scenarios, and more generally whenever the selection pool is small or out of domain. We show that compositional translation boosts LLM translation performance on a wide range of popular MT benchmarks, including FLORES 200, NTREX 128 and TICO-19. Code and outputs are available at https://github.com/ArmelRandy/compositional-translation
中文:本文提出组合式翻译这一新型基于大语言模型的翻译范式,通过将句子分解为更简单的短语并借助检索示例进行翻译,显著提升了在低资源场景下多个机器翻译基准测试的性能表现。
English: This paper introduces compositional translation, a novel LLM-based approach that decomposes sentences into simpler phrases for translation using retrieved demonstrations, significantly enhancing performance across multiple machine translation benchmarks especially in low-resource settings.

Authors:Zhipeng Chen, Yingqian Min, Beichen Zhang, Jie Chen, Jinhao Jiang, Daixuan Cheng, Wayne Xin Zhao, Zheng Liu, Xu Miao, Yang Lu, Lei Fang, Zhongyuan Wang, Ji-Rong Wen
Title: An Empirical Study on Eliciting and Improving R1-like Reasoning Models
Abstract:
In this report, we present the third technical report on the development of slow-thinking models as part of the STILL project. As the technical pathway becomes clearer, scaling RL training has become a central technique for implementing such reasoning models. We systematically experiment with and document the effects of various factors influencing RL training, conducting experiments on both base models and fine-tuned models. Specifically, we demonstrate that our RL training approach consistently improves the Qwen2.5-32B base models, enhancing both response length and test accuracy. Furthermore, we show that even when a model like DeepSeek-R1-Distill-Qwen-1.5B has already achieved a high performance level, it can be further refined through RL training, reaching an accuracy of 39.33% on AIME 2024. Beyond RL training, we also explore the use of tool manipulation, finding that it significantly boosts the reasoning performance of large reasoning models. This approach achieves a remarkable accuracy of 86.67% with greedy search on AIME 2024, underscoring its effectiveness in enhancing model capabilities. We release our resources at the STILL project website: https://github.com/RUCAIBox/Slow_Thinking_with_LLMs.
中文: 本报告介绍了STILL项目中慢思考模型的进展,通过强化学习训练有效提升了Qwen2.5-32B模型的响应长度与测试准确率,并将DeepSeek-R1-Distill-Qwen-1.5B模型在AIME 2024上的准确率优化至39.33%,同时工具操作技术更使推理准确率达到86.67%,相关资源已发布于GitHub平台。
English: This report details the STILL project's progress in developing slow-thinking models through scaled RL training, which consistently enhances model performance, including boosting the Qwen2.5-32B's response length and accuracy and refining the DeepSeek-R1-Distill-Qwen-1.5B to 39.33% on AIME 2024, while tool manipulation further achieves 86.67% accuracy, with resources available on GitHub.

Authors:Wenke Huang, Jian Liang, Xianda Guo, Yiyang Fang, Guancheng Wan, Xuankun Rong, Chi Wen, Zekun Shi, Qingyun Li, Didi Zhu, Yanbiao Ma, Ke Liang, Bin Yang, He Li, Jiawei Shao, Mang Ye, Bo Du
Title: Keeping Yourself is Important in Downstream Tuning Multimodal Large Language Model
Abstract:
Multi-modal Large Language Models (MLLMs) integrate visual and linguistic reasoning to address complex tasks such as image captioning and visual question answering. While MLLMs demonstrate remarkable versatility, MLLMs appears limited performance on special applications. But tuning MLLMs for downstream tasks encounters two key challenges: Task-Expert Specialization, where distribution shifts between pre-training and target datasets constrain target performance, and Open-World Stabilization, where catastrophic forgetting erases the model general knowledge. In this work, we systematically review recent advancements in MLLM tuning methodologies, classifying them into three paradigms: (I) Selective Tuning, (II) Additive Tuning, and (III) Reparameterization Tuning. Furthermore, we benchmark these tuning strategies across popular MLLM architectures and diverse downstream tasks to establish standardized evaluation analysis and systematic tuning principles. Finally, we highlight several open challenges in this domain and propose future research directions. To facilitate ongoing progress in this rapidly evolving field, we provide a public repository that continuously tracks developments: https://github.com/WenkeHuang/Awesome-MLLM-Tuning.
中文: 本文系统综述了多模态大语言模型的微调方法,通过对比基准测试解决任务专精与知识保持等挑战,并提出了未来研究方向。
English: This paper systematically reviews multi-modal large language model tuning methodologies, addressing challenges like task specialization and knowledge retention through comparative benchmarking and proposing future research directions.

Authors:Matias Cosarinsky, Ramiro Billot, Lucas Mansilla, Gabriel Jimenez, Nicolas Gaggión, Guanghui Fu, Tom Tirer, Enzo Ferrante
Title: Conformal In-Context Reverse Classification Accuracy: Efficient Estimation of Segmentation Quality with Statistical Guarantees
Abstract:
Assessing the quality of automatic image segmentation is crucial in clinical practice, but often very challenging due to the limited availability of ground truth annotations. Reverse Classification Accuracy (RCA) is an approach that estimates the quality of new predictions on unseen samples by training a segmenter on those predictions, and then evaluating it against existing annotated images. In this work, we introduce Conformal In-Context RCA, a novel method for automatically estimating segmentation quality with statistical guarantees in the absence of ground-truth annotations, which consists of two main innovations. First, In-Context RCA, which leverages recent in-context learning models for image segmentation and incorporates retrieval-augmentation techniques to select the most relevant reference images. This approach enables efficient quality estimation with minimal reference data while avoiding the need of training additional models. Second, we introduce Conformal RCA, which extends both the original RCA framework and In-Context RCA to go beyond point estimation. Using tools from split conformal prediction, Conformal RCA produces prediction intervals for segmentation quality providing statistical guarantees that the true score lies within the estimated interval with a user-specified probability. Validated across 10 different medical imaging tasks in various organs and modalities, our methods demonstrate robust performance and computational efficiency, offering a promising solution for automated quality control in clinical workflows, where fast and reliable segmentation assessment is essential. The code is available at https://github.com/mcosarinsky/Conformal-In-Context-RCA.
中文: 本文提出Conformal In-Context RCA新方法,结合情境学习和保形预测技术,能够在缺乏真实标注的情况下自动评估医学图像分割质量并提供统计保证,经多临床任务验证具备高效可靠的自动化质量控制能力。
English: This paper introduces Conformal In-Context RCA, a novel method that leverages in-context learning and conformal prediction to automatically estimate medical image segmentation quality with statistical guarantees, validated across diverse clinical tasks for reliable automated quality control.

Authors:Benjamin Billot, Ramya Muthukrishnan, Esra Abaci-Turk, P. Ellen Grant, Nicholas Ayache, Hervé Delingette, Polina Golland
Title: Spatial regularisation for improved accuracy and interpretability in keypoint-based registration
Abstract:
Unsupervised registration strategies bypass requirements in ground truth transforms or segmentations by optimising similarity metrics between fixed and moved volumes. Among these methods, a recent subclass of approaches based on unsupervised keypoint detection stand out as very promising for interpretability. Specifically, these methods train a network to predict feature maps for fixed and moving images, from which explainable centres of mass are computed to obtain point clouds, that are then aligned in closed-form. However, the features returned by the network often yield spatially diffuse patterns that are hard to interpret, thus undermining the purpose of keypoint-based registration. Here, we propose a three-fold loss to regularise the spatial distribution of the features. First, we use the KL divergence to model features as point spread functions that we interpret as probabilistic keypoints. Then, we sharpen the spatial distributions of these features to increase the precision of the detected landmarks. Finally, we introduce a new repulsive loss across keypoints to encourage spatial diversity. Overall, our loss considerably improves the interpretability of the features, which now correspond to precise and anatomically meaningful landmarks. We demonstrate our three-fold loss in foetal rigid motion tracking and brain MRI affine registration tasks, where it not only outperforms state-of-the-art unsupervised strategies, but also bridges the gap with state-of-the-art supervised methods. Our code is available at https://github.com/BenBillot/spatial_regularisation.
中文摘要:本文提出了一种三重损失函数,通过锐化特征分布和促进空间多样性,显著提升了无监督关键点图像配准的可解释性,在医学影像任务中实现了与监督方法相媲美的性能。
English Summary: This paper introduces a three-fold loss function to enhance the interpretability of unsupervised keypoint-based image registration by sharpening feature distributions and promoting spatial diversity, achieving performance comparable to supervised methods in medical imaging tasks.

Authors:Dimitri von Rütte, Janis Fluri, Yuhui Ding, Antonio Orvieto, Bernhard Schölkopf, Thomas Hofmann
Title: Generalized Interpolating Discrete Diffusion
Abstract:
While state-of-the-art language models achieve impressive results through next-token prediction, they have inherent limitations such as the inability to revise already generated tokens. This has prompted exploration of alternative approaches such as discrete diffusion. However, masked diffusion, which has emerged as a popular choice due to its simplicity and effectiveness, reintroduces this inability to revise words. To overcome this, we generalize masked diffusion, deriving a new family of general interpolating discrete diffusion (GIDD) which offers greater flexibility in the design of the noising processes. Leveraging a novel diffusion ELBO, we achieve compute-matched state-of-the-art performance in diffusion language modeling. Exploiting GIDD's flexibility, we explore a hybrid approach combining masking and uniform noise, leading to improved sample quality and unlocking the ability for the model to correct its own mistakes, an area where autoregressive models notoriously have struggled. Code: https://github.com/dvruette/gidd/
中文: 本文提出了一种新的通用插值离散扩散(GIDD)模型系列,通过灵活的噪声处理设计克服了现有方法的修订限制,实现了最先进的性能并具备了自我纠错能力。
English: This paper introduces a new family of general interpolating discrete diffusion (GIDD) models that overcome the revision limitations of existing approaches by offering flexible noising processes, achieving state-of-the-art performance and enabling self-correction capabilities.

Authors:Yi Shen, Jian Zhang, Jieyun Huang, Shuming Shi, Wenjing Zhang, Jiangze Yan, Ning Wang, Kai Wang, Zhaoxiang Liu, Shiguo Lian
Title: DAST: Difficulty-Adaptive Slow-Thinking for Large Reasoning Models
Abstract:
Recent advancements in slow thinking reasoning models have shown exceptional performance in complex reasoning tasks. However, these models often exhibit overthinking (generating redundant reasoning steps for simple problems), leading to excessive computational resource usage. While current mitigation strategies uniformly reduce reasoning tokens, they risk degrading performance on challenging tasks that require extended reasoning. This paper introduces Difficulty-Adaptive Slow Thinking (DAST), a novel framework that enables models to autonomously adjust the length of Chain-of-Thought (CoT) based on problem difficulty. We first propose a Token Length Budget (TLB) metric to quantify difficulty, then leverage budget-aware reward shaping and budget preference optimization to implement DAST. DAST penalizes overlong responses for simple tasks while incentivizing sufficient reasoning for complex problems. Experiments on diverse datasets and model scales demonstrate that DAST effectively mitigates overthinking (reducing token usage by over 30\% on average) while preserving reasoning accuracy on complex problems. Our codes and models are available at https://github.com/AnonymousUser0520/AnonymousRepo01.
中文摘要:难度自适应慢思考(DAST)框架通过基于问题难度自主调整推理链长度,在保持复杂问题求解精度的同时,有效减少超过30%的冗余推理消耗。
English Summary: The Difficulty-Adaptive Slow Thinking (DAST) framework enables models to autonomously adjust reasoning length based on problem difficulty, effectively reducing overthinking by over 30% in token usage while maintaining accuracy on complex tasks.

Authors:Hongyeob Kim, Inyoung Jung, Dayoon Suh, Youjia Zhang, Sangmin Lee, Sungeun Hong
Title: Question-Aware Gaussian Experts for Audio-Visual Question Answering
Abstract:
Audio-Visual Question Answering (AVQA) requires not only question-based multimodal reasoning but also precise temporal grounding to capture subtle dynamics for accurate prediction. However, existing methods mainly use question information implicitly, limiting focus on question-specific details. Furthermore, most studies rely on uniform frame sampling, which can miss key question-relevant frames. Although recent Top-K frame selection methods aim to address this, their discrete nature still overlooks fine-grained temporal details. This paper proposes QA-TIGER, a novel framework that explicitly incorporates question information and models continuous temporal dynamics. Our key idea is to use Gaussian-based modeling to adaptively focus on both consecutive and non-consecutive frames based on the question, while explicitly injecting question information and applying progressive refinement. We leverage a Mixture of Experts (MoE) to flexibly implement multiple Gaussian models, activating temporal experts specifically tailored to the question. Extensive experiments on multiple AVQA benchmarks show that QA-TIGER consistently achieves state-of-the-art performance. Code is available at https://aim-skku.github.io/QA-TIGER/
中文摘要:本文提出QA-TIGER框架,通过显式结合问题信息、采用高斯建模和专家混合方法模拟连续时间动态,在多个视听问答基准测试中实现了最先进的性能。
English Summary: The paper introduces QA-TIGER, a framework that explicitly integrates question information and models continuous temporal dynamics using Gaussian-based modeling and Mixture of Experts to achieve state-of-the-art performance on AVQA benchmarks.

Authors:Yijie Xu, Bolun Zheng, Wei Zhu, Hangjia Pan, Yuchen Yao, Ning Xu, Anan Liu, Quan Zhang, Chenggang Yan
Title: SMTPD: A New Benchmark for Temporal Prediction of Social Media Popularity
Abstract:
Social media popularity prediction task aims to predict the popularity of posts on social media platforms, which has a positive driving effect on application scenarios such as content optimization, digital marketing and online advertising. Though many studies have made significant progress, few of them pay much attention to the integration between popularity prediction with temporal alignment. In this paper, with exploring YouTube's multilingual and multi-modal content, we construct a new social media temporal popularity prediction benchmark, namely SMTPD, and suggest a baseline framework for temporal popularity prediction. Through data analysis and experiments, we verify that temporal alignment and early popularity play crucial roles in social media popularity prediction for not only deepening the understanding of temporal dynamics of popularity in social media but also offering a suggestion about developing more effective prediction models in this field. Code is available at https://github.com/zhuwei321/SMTPD.
中文: 本研究提出了SMTPD这一社交媒体时效性流行度预测新基准,证实时间对齐和早期流行度对理解及优化预测模型至关重要。
English: This study introduces SMTPD, a new benchmark for temporal popularity prediction on social media, demonstrating that temporal alignment and early popularity are critical for understanding and improving prediction models.

Authors:Leonardo Kuffo, Elena Krippner, Peter Boncz
Title: PDX: A Data Layout for Vector Similarity Search
Abstract:
We propose Partition Dimensions Across (PDX), a data layout for vectors (e.g., embeddings) that, similar to PAX [6], stores multiple vectors in one block, using a vertical layout for the dimensions (Figure 1). PDX accelerates exact and approximate similarity search thanks to its dimension-by-dimension search strategy that operates on multiple-vectors-at-a-time in tight loops. It beats SIMD-optimized distance kernels on standard horizontal vector storage (avg 40% faster), only relying on scalar code that gets auto-vectorized. We combined the PDX layout with recent dimension-pruning algorithms ADSampling [19] and BSA [52] that accelerate approximate vector search. We found that these algorithms on the horizontal vector layout can lose to SIMD-optimized linear scans, even if they are SIMD-optimized. However, when used on PDX, their benefit is restored to 2-7x. We find that search on PDX is especially fast if a limited number of dimensions has to be scanned fully, which is what the dimension-pruning approaches do. We finally introduce PDX-BOND, an even more flexible dimension-pruning strategy, with good performance on exact search and reasonable performance on approximate search. Unlike previous pruning algorithms, it can work on vector data "as-is" without preprocessing; making it attractive for vector databases with frequent updates.
中文: PDX是一种向量数据布局,通过逐维度处理加速相似性搜索,并与剪枝算法协同实现2-7倍性能提升,而PDX-BOND无需预处理即可实现灵活剪枝。
English: PDX is a vector data layout that accelerates similarity search through dimension-by-dimension processing and synergizes with pruning algorithms to achieve 2-7x speedup, while PDX-BOND offers flexible pruning without preprocessing.

Authors:Hyunwoo Yoo
Title: Can Large Language Models Predict Antimicrobial Resistance Gene?
Abstract:
This study demonstrates that generative large language models can be utilized in a more flexible manner for DNA sequence analysis and classification tasks compared to traditional transformer encoder-based models. While recent encoder-based models such as DNABERT and Nucleotide Transformer have shown significant performance in DNA sequence classification, transformer decoder-based generative models have not yet been extensively explored in this field. This study evaluates how effectively generative Large Language Models handle DNA sequences with various labels and analyzes performance changes when additional textual information is provided. Experiments were conducted on antimicrobial resistance genes, and the results show that generative Large Language Models can offer comparable or potentially better predictions, demonstrating flexibility and accuracy when incorporating both sequence and textual information. The code and data used in this work are available at the following GitHub repository: https://github.com/biocomgit/llm4dna.
Chinese: 本研究表明,在DNA序列分析中,生成式大语言模型比传统的编码器模型更具灵活性,结合序列与文本信息时能提供相当甚至更优的预测准确性。
English: This study shows that generative large language models offer greater flexibility and comparable or better accuracy than traditional encoder-based models for DNA sequence analysis, especially when integrating both sequence and textual data.

Authors:Shahar Levy, Nir Mazor, Lihi Shalmon, Michael Hassid, Gabriel Stanovsky
Title: More Documents, Same Length: Isolating the Challenge of Multiple Documents in RAG
Abstract:
Retrieval-augmented generation (RAG) provides LLMs with relevant documents. Although previous studies noted that retrieving many documents can degrade performance, they did not isolate how the quantity of documents affects performance while controlling for context length. We evaluate various language models on custom datasets derived from a multi-hop QA task. We keep the context length and position of relevant information constant while varying the number of documents, and find that increasing the document count in RAG settings poses significant challenges for LLMs. Additionally, our results indicate that processing multiple documents is a separate challenge from handling long contexts. We also make the datasets and code available: https://github.com/shaharl6000/MoreDocsSameLen .
中文: 在检索增强生成中,即使上下文长度保持不变,增加文档数量仍对大型语言模型构成显著挑战,且处理多文档与应对长上下文属于不同的难题。
English: Increasing the number of documents in retrieval-augmented generation significantly challenges large language models, even with constant context length, and processing multiple documents presents a distinct difficulty from managing long contexts.

Authors:Cheng-Han Chiang, Hung-yi Lee, Michal Lukasik
Title: TRACT: Regression-Aware Fine-tuning Meets Chain-of-Thought Reasoning for LLM-as-a-Judge
Abstract:
The LLM-as-a-judge paradigm uses large language models (LLMs) for automated text evaluation, where a numerical assessment is assigned by an LLM to the input text following scoring rubrics. Existing methods for LLM-as-a-judge use cross-entropy (CE) loss for fine-tuning, which neglects the numeric nature of score prediction. Recent work addresses numerical prediction limitations of LLM fine-tuning through regression-aware fine-tuning, which, however, does not consider chain-of-thought (CoT) reasoning for score prediction. In this paper, we introduce TRACT (Two-stage Regression-Aware fine-tuning with CoT), a method combining CoT reasoning with regression-aware training. TRACT consists of two stages: first, seed LLM is fine-tuned to generate CoTs, which serve as supervision for the second stage fine-tuning. The training objective of TRACT combines the CE loss for learning the CoT reasoning capabilities, and the regression-aware loss for the score prediction. Experiments across four LLM-as-a-judge datasets and two LLMs show that TRACT significantly outperforms existing methods. Extensive ablation studies validate the importance of each component in TRACT.
中文: TRACT方法通过两阶段微调结合思维链推理与回归感知训练,在多项LLM作为评判者的数据集上显著超越了现有方法。
English: TRACT introduces a two-stage fine-tuning method combining chain-of-thought reasoning with regression-aware training, significantly outperforming existing approaches in LLM-as-a-judge evaluations across multiple datasets.

Authors:Antonio Guillén-Teruel, Marcos Caracena, Jose A. Pardo, Fernando de-la-Gándara, José Palma, Juan A. Botía
Title: FILM: Framework for Imbalanced Learning Machines based on a new unbiased performance measure and a new ensemble-based technique
Abstract:
This research addresses the challenges of handling unbalanced datasets for binary classification tasks. In such scenarios, standard evaluation metrics are often biased by the disproportionate representation of the minority class. Conducting experiments across seven datasets, we uncovered inconsistencies in evaluation metrics when determining the model that outperforms others for each binary classification problem. This justifies the need for a metric that provides a more consistent and unbiased evaluation across unbalanced datasets, thereby supporting robust model selection. To mitigate this problem, we propose a novel metric, the Unbiased Integration Coefficients (UIC), which exhibits significantly reduced bias ($p < 10^{-4}$) towards the minority class compared to conventional metrics. The UIC is constructed by aggregating existing metrics while penalising those more prone to imbalance. In addition, we introduce the Identical Partitions for Imbalance Problems (IPIP) algorithm for imbalanced ML problems, an ensemble-based approach. Our experimental results show that IPIP outperforms other baseline imbalance-aware approaches using Random Forest and Logistic Regression models in three out of seven datasets as assessed by the UIC metric, demonstrating its effectiveness in addressing imbalanced data challenges in binary classification tasks. This new framework for dealing with imbalanced datasets is materialized in the FILM (Framework for Imbalanced Learning Machines) R Package, accessible at https://github.com/antoniogt/FILM.
Chinese: 本研究针对不平衡二元分类中的评估偏差问题,提出了无偏积分系数(UIC)指标和针对不平衡问题的同质分区(IPIP)算法,通过实验验证了其有效性,并在FILM R软件包中实现了该框架。
English: This study introduces the Unbiased Integration Coefficients (UIC) metric and the Identical Partitions for Imbalance Problems (IPIP) algorithm to address evaluation inconsistencies in imbalanced binary classification, demonstrating their effectiveness through experiments and implementing them in the FILM R package.

Authors:Yafu Li, Ronghao Zhang, Zhilin Wang, Huajian Zhang, Leyang Cui, Yongjing Yin, Tong Xiao, Yue Zhang
Title: Lost in Literalism: How Supervised Training Shapes Translationese in LLMs
Abstract:
Large language models (LLMs) have achieved remarkable success in machine translation, demonstrating impressive performance across diverse languages. However, translationese, characterized by overly literal and unnatural translations, remains a persistent challenge in LLM-based translation systems. Despite their pre-training on vast corpora of natural utterances, LLMs exhibit translationese errors and generate unexpected unnatural translations, stemming from biases introduced during supervised fine-tuning (SFT). In this work, we systematically evaluate the prevalence of translationese in LLM-generated translations and investigate its roots during supervised training. We introduce methods to mitigate these biases, including polishing golden references and filtering unnatural training instances. Empirical evaluations demonstrate that these approaches significantly reduce translationese while improving translation naturalness, validated by human evaluations and automatic metrics. Our findings highlight the need for training-aware adjustments to optimize LLM translation outputs, paving the way for more fluent and target-language-consistent translations. We release the data and code at https://github.com/yafuly/LLM_Translationese.
中文: 大语言模型在监督微调中产生的偏差导致译文生硬不自然,但通过优化参考译文和筛选训练数据等方法,可显著减少翻译腔并提升译文的流畅度。
English: Large language models often produce unnatural, literal translations known as translationese due to biases from supervised fine-tuning, but proposed methods like polishing references and filtering training data effectively reduce these errors and enhance translation naturalness.

Authors:Chanda Grover Kamra, Indra Deep Mastan, Debayan Gupta
Title: ObjMST: An Object-Focused Multimodal Style Transfer Framework
Abstract:
We propose ObjMST, an object-focused multimodal style transfer framework that provides separate style supervision for salient objects and surrounding elements while addressing alignment issues in multimodal representation learning. Existing image-text multimodal style transfer methods face the following challenges: (1) generating non-aligned and inconsistent multimodal style representations; and (2) content mismatch, where identical style patterns are applied to both salient objects and their surrounding elements. Our approach mitigates these issues by: (1) introducing a Style-Specific Masked Directional CLIP Loss, which ensures consistent and aligned style representations for both salient objects and their surroundings; and (2) incorporating a salient-to-key mapping mechanism for stylizing salient objects, followed by image harmonization to seamlessly blend the stylized objects with their environment. We validate the effectiveness of ObjMST through experiments, using both quantitative metrics and qualitative visual evaluations of the stylized outputs. Our code is available at: https://github.com/chandagrover/ObjMST.
Chinese: ObjMST是一种多模态风格迁移框架,通过引入风格特定掩码定向CLIP损失和显著对象到关键映射机制,分别对显著物体和背景元素进行风格监督,有效解决了多模态表示学习中的对齐和内容不匹配问题。
English: ObjMST is a multimodal style transfer framework that addresses alignment and content mismatch issues by providing separate style supervision for salient objects and surroundings through a Style-Specific Masked Directional CLIP Loss and a salient-to-key mapping mechanism.

Authors:Cecilia Diana-Albelda, Roberto Alcover-Couso, Álvaro García-Martín, Jesus Bescos, Marcos Escudero-Viñolo
Title: GBT-SAM: Adapting a Foundational Deep Learning Model for Generalizable Brain Tumor Segmentation via Efficient Integration of Multi-Parametric MRI Data
Abstract:
Gliomas are aggressive brain tumors that require accurate imaging-based diagnosis, with segmentation playing a critical role in evaluating morphology and treatment decisions. Manual delineation of gliomas is time-consuming and prone to variability, motivating the use of deep learning to improve consistency and alleviate clinical workload. However, existing methods often fail to fully exploit the information available in multi-parametric MRI (mp-MRI), particularly inter-slice contextual features, and typically require considerable computational resources while lacking robustness across tumor type variations. We present GBT-SAM, a parameter-efficient deep learning framework that adapts the Segment Anything Model (SAM), a large-scale vision model, to volumetric mp-MRI data. GBT-SAM reduces input complexity by selecting fewer than 2.6\% of slices per scan while incorporating all four MRI modalities, preserving essential tumor-related information with minimal cost. Furthermore, our model is trained by a two-step fine-tuning strategy that incorporates a depth-aware module to capture inter-slice correlations and lightweight adaptation layers, resulting in just 6.5M trainable parameters, which is the lowest among SAM-based approaches. GBT-SAM achieves a Dice Score of 93.54 on the BraTS Adult Glioma dataset and demonstrates robust performance on Meningioma, Pediatric Glioma, and Sub-Saharan Glioma datasets. These results highlight GBT-SAM's potential as a computationally efficient and domain-robust framework for brain tumor segmentation using mp-MRI. Our code and models are available at https://github.com/vpulab/med-sam-brain .
中文: GBT-SAM是一种参数高效的深度学习框架,通过改进Segment Anything模型实现多参数MRI中的脑肿瘤分割,仅需少量计算资源即可获得高精度,并在多种肿瘤类型上表现出稳定的性能。
English: GBT-SAM is a parameter-efficient deep learning framework that adapts the Segment Anything Model for brain tumor segmentation in multi-parametric MRI, achieving high accuracy with minimal computational resources and demonstrating robust performance across various tumor types.

Authors:Lars Bredereke, Yale Hartmann, Tanja Schultz
Title: A Modular Pipeline for 3D Object Tracking Using RGB Cameras
Abstract:
Object tracking is a key challenge of computer vision with various applications that all require different architectures. Most tracking systems have limitations such as constraining all movement to a 2D plane and they often track only one object. In this paper, we present a new modular pipeline that calculates 3D trajectories of multiple objects. It is adaptable to various settings where multiple time-synced and stationary cameras record moving objects, using off the shelf webcams. Our pipeline was tested on the Table Setting Dataset, where participants are recorded with various sensors as they set a table with tableware objects. We need to track these manipulated objects, using 6 rgb webcams. Challenges include: Detecting small objects in 9.874.699 camera frames, determining camera poses, discriminating between nearby and overlapping objects, temporary occlusions, and finally calculating a 3D trajectory using the right subset of an average of 11.12.456 pixel coordinates per 3-minute trial. We implement a robust pipeline that results in accurate trajectories with covariance of x,y,z-position as a confidence metric. It deals dynamically with appearing and disappearing objects, instantiating new Extended Kalman Filters. It scales to hundreds of table-setting trials with very little human annotation input, even with the camera poses of each trial unknown. The code is available at https://github.com/LarsBredereke/object_tracking
中文: 本文提出了一种模块化流程,利用固定网络摄像头计算多个物体的三维轨迹,能有效应对小物体检测和遮挡等挑战,且所需人工标注极少。
English: This paper introduces a modular pipeline for calculating 3D trajectories of multiple objects using stationary webcams, effectively handling challenges like small object detection and occlusions with minimal human input.

Authors:Yifei Huang, Jilan Xu, Baoqi Pei, Yuping He, Guo Chen, Mingfang Zhang, Lijin Yang, Zheng Nie, Jinyao Liu, Guoshun Fan, Dechen Lin, Fang Fang, Kunpeng Li, Chang Yuan, Xinyuan Chen, Yaohui Wang, Yali Wang, Yu Qiao, Limin Wang
Title: An Egocentric Vision-Language Model based Portable Real-time Smart Assistant
Abstract:
We present Vinci, a vision-language system designed to provide real-time, comprehensive AI assistance on portable devices. At its core, Vinci leverages EgoVideo-VL, a novel model that integrates an egocentric vision foundation model with a large language model (LLM), enabling advanced functionalities such as scene understanding, temporal grounding, video summarization, and future planning. To enhance its utility, Vinci incorporates a memory module for processing long video streams in real time while retaining contextual history, a generation module for producing visual action demonstrations, and a retrieval module that bridges egocentric and third-person perspectives to provide relevant how-to videos for skill acquisition. Unlike existing systems that often depend on specialized hardware, Vinci is hardware-agnostic, supporting deployment across a wide range of devices, including smartphones and wearable cameras. In our experiments, we first demonstrate the superior performance of EgoVideo-VL on multiple public benchmarks, showcasing its vision-language reasoning and contextual understanding capabilities. We then conduct a series of user studies to evaluate the real-world effectiveness of Vinci, highlighting its adaptability and usability in diverse scenarios. We hope Vinci can establish a new framework for portable, real-time egocentric AI systems, empowering users with contextual and actionable insights. Including the frontend, backend, and models, all codes of Vinci are available at https://github.com/OpenGVLab/vinci.
中文: Vinci 是一款便携式视觉语言系统,通过结合 EgoVideo-VL 模型及记忆、生成和检索模块,可在多种设备上提供实时AI辅助,支持场景理解和技能学习等任务。
English: Vinci is a portable vision-language system that integrates the EgoVideo-VL model with memory, generation, and retrieval modules to deliver real-time AI assistance for tasks like scene understanding and skill acquisition across various devices.

Authors:Manh Cuong Dao, Phi Le Nguyen, Thao Nguyen Truong, Trong Nghia Hoang
Title: Incorporating Surrogate Gradient Norm to Improve Offline Optimization Techniques
Abstract:
Offline optimization has recently emerged as an increasingly popular approach to mitigate the prohibitively expensive cost of online experimentation. The key idea is to learn a surrogate of the black-box function that underlines the target experiment using a static (offline) dataset of its previous input-output queries. Such an approach is, however, fraught with an out-of-distribution issue where the learned surrogate becomes inaccurate outside the offline data regimes. To mitigate this, existing offline optimizers have proposed numerous conditioning techniques to prevent the learned surrogate from being too erratic. Nonetheless, such conditioning strategies are often specific to particular surrogate or search models, which might not generalize to a different model choice. This motivates us to develop a model-agnostic approach instead, which incorporates a notion of model sharpness into the training loss of the surrogate as a regularizer. Our approach is supported by a new theoretical analysis demonstrating that reducing surrogate sharpness on the offline dataset provably reduces its generalized sharpness on unseen data. Our analysis extends existing theories from bounding generalized prediction loss (on unseen data) with loss sharpness to bounding the worst-case generalized surrogate sharpness with its empirical estimate on training data, providing a new perspective on sharpness regularization. Our extensive experimentation on a diverse range of optimization tasks also shows that reducing surrogate sharpness often leads to significant improvement, marking (up to) a noticeable 9.6% performance boost. Our code is publicly available at https://github.com/cuong-dm/IGNITE
离线优化通过静态数据学习代理模型以降低在线实验成本,但存在分布外不准确性问题,我们提出的模型无关锐度正则化方法有效缓解了此问题,理论支持且性能提升高达9.6%。
Offline optimization reduces online experimentation costs by learning a surrogate model from static data, but it faces out-of-distribution inaccuracies, which our model-agnostic sharpness regularization method effectively mitigates, supported by theory and up to 9.6% performance gains.

Authors:Yi Xiao, Qiangqiang Yuan, Kui Jiang, Wenke Huang, Qiang Zhang, Tingting Zheng, Chia-Wen Lin, Liangpei Zhang
Title: Spiking Meets Attention: Efficient Remote Sensing Image Super-Resolution with Attention Spiking Neural Networks
Abstract:
Spiking neural networks (SNNs) are emerging as a promising alternative to traditional artificial neural networks (ANNs), offering biological plausibility and energy efficiency. Despite these merits, SNNs are frequently hampered by limited capacity and insufficient representation power, yet remain underexplored in remote sensing super-resolution (SR) tasks. In this paper, we first observe that spiking signals exhibit drastic intensity variations across diverse textures, highlighting an active learning state of the neurons. This observation motivates us to apply SNNs for efficient SR of RSIs. Inspired by the success of attention mechanisms in representing salient information, we devise the spiking attention block (SAB), a concise yet effective component that optimizes membrane potentials through inferred attention weights, which, in turn, regulates spiking activity for superior feature representation. Our key contributions include: 1) we bridge the independent modulation between temporal and channel dimensions, facilitating joint feature correlation learning, and 2) we access the global self-similar patterns in large-scale remote sensing imagery to infer spatial attention weights, incorporating effective priors for realistic and faithful reconstruction. Building upon SAB, we proposed SpikeSR, which achieves state-of-the-art performance across various remote sensing benchmarks such as AID, DOTA, and DIOR, while maintaining high computational efficiency. Code of SpikeSR will be available at https://github.com/XY-boy/SpikeSR.
Chinese Summary: 脉冲神经网络通过创新的脉冲注意力模块优化膜电位并整合时空-通道关联,在遥感图像超分辨率任务中实现了高效且领先的性能。
English Summary: Spiking neural networks (SNNs) are leveraged for remote sensing super-resolution through a novel spiking attention block that optimizes membrane potentials and integrates temporal-channel correlations, achieving state-of-the-art results with high efficiency.

Authors:Ziyi Yang, Fanqi Wan, Longguang Zhong, Canbin Huang, Guosheng Liang, Xiaojun Quan
Title: FuseChat-3.0: Preference Optimization Meets Heterogeneous Model Fusion
Abstract:
We introduce FuseChat-3.0, a suite of large language models (LLMs) developed by integrating the strengths of heterogeneous source LLMs into more compact target LLMs. Our source models include the powerful Gemma-2-27B-it, Mistral-Large-Instruct-2407, Qwen-2.5-72B-Instruct, and Llama-3.1-70B-Instruct. For target models, we focus on three widely-used smaller variants-Llama-3.1-8B-Instruct, Gemma-2-9B-it, and Qwen-2.5-7B-Instruct-along with two ultra-compact options, Llama-3.2-3B-Instruct and Llama-3.2-1B-Instruct. To leverage the diverse capabilities of these source models, we develop a specialized data construction protocol tailored to various tasks and domains. The FuseChat-3.0 training pipeline consists of two key stages: (1) supervised fine-tuning (SFT) to align the target and source model distributions, and (2) Direct Preference Optimization (DPO) to apply preferences from multiple source LLMs to fine-tune the target model. The resulting FuseChat-3.0 models exhibit significant performance gains across tasks such as instruction following, general knowledge, mathematics, and coding. As illustrated in Figure 1, using Llama-3.1-8B-Instruct as the target model, our fusion approach achieves an average improvement of 6.8 points across 14 benchmarks. Moreover, it demonstrates remarkable gains of 37.1 points and 30.1 points on the instruction-following benchmarks AlpacaEval-2 and Arena-Hard, respectively. Our code, models, and datasets are available at https://github.com/SLIT-AI/FuseChat-3.0.
中文:FuseChat-3.0通过两阶段训练流程将多个大型源模型的优势融合到更紧凑的目标模型中,在多项基准测试中实现了显著的性能提升。
English: FuseChat-3.0 integrates the strengths of multiple large source models into smaller target models through a two-stage training process, achieving significant performance improvements across various benchmarks.

Authors:Haitao Wu, Qing Li, Changqing Zhang, Zhen He, Xiaomin Ying
Title: Bridging the Vision-Brain Gap with an Uncertainty-Aware Blur Prior
Abstract:
Can our brain signals faithfully reflect the original visual stimuli, even including high-frequency details? Although human perceptual and cognitive capacities enable us to process and remember visual information, these abilities are constrained by several factors, such as limited attentional resources and the finite capacity of visual memory. When visual stimuli are processed by human visual system into brain signals, some information is inevitably lost, leading to a discrepancy known as the \textbf{System GAP}. Additionally, perceptual and cognitive dynamics, along with technical noise in signal acquisition, degrade the fidelity of brain signals relative to the visual stimuli, known as the \textbf{Random GAP}. When encoded brain representations are directly aligned with the corresponding pretrained image features, the System GAP and Random GAP between paired data challenge the model, requiring it to bridge these gaps. However, in the context of limited paired data, these gaps are difficult for the model to learn, leading to overfitting and poor generalization to new data. To address these GAPs, we propose a simple yet effective approach called the \textbf{Uncertainty-aware Blur Prior (UBP)}. It estimates the uncertainty within the paired data, reflecting the mismatch between brain signals and visual stimuli. Based on this uncertainty, UBP dynamically blurs the high-frequency details of the original images, reducing the impact of the mismatch and improving alignment. Our method achieves a top-1 accuracy of \textbf{50.9\%} and a top-5 accuracy of \textbf{79.7\%} on the zero-shot brain-to-image retrieval task, surpassing previous state-of-the-art methods by margins of \textbf{13.7\%} and \textbf{9.8\%}, respectively. Code is available at \href{https://github.com/HaitaoWuTJU/Uncertainty-aware-Blur-Prior}{GitHub}.
Chinese: 该研究提出了一种不确定性感知模糊先验方法,通过基于不确定性动态模糊原始图像的高频细节,弥合了脑信号与视觉刺激之间的系统性和随机性差距,在脑信号到图像的检索任务中取得了顶尖的准确率。
English: The study introduces an Uncertainty-aware Blur Prior method to bridge the System and Random Gaps between brain signals and visual stimuli, achieving state-of-the-art accuracy in brain-to-image retrieval by dynamically blurring high-frequency details based on uncertainty.

Authors:Yufang Liu, Yao Du, Tao Ji, Jianing Wang, Yang Liu, Yuanbin Wu, Aimin Zhou, Mengdi Zhang, Xunliang Cai
Title: The Role of Visual Modality in Multimodal Mathematical Reasoning: Challenges and Insights
Abstract:
Recent research has increasingly focused on multimodal mathematical reasoning, particularly emphasizing the creation of relevant datasets and benchmarks. Despite this, the role of visual information in reasoning has been underexplored. Our findings show that existing multimodal mathematical models minimally leverage visual information, and model performance remains largely unaffected by changes to or removal of images in the dataset. We attribute this to the dominance of textual information and answer options that inadvertently guide the model to correct answers. To improve evaluation methods, we introduce the HC-M3D dataset, specifically designed to require image reliance for problem-solving and to challenge models with similar, yet distinct, images that change the correct answer. In testing leading models, their failure to detect these subtle visual differences suggests limitations in current visual perception capabilities. Additionally, we observe that the common approach of improving general VQA capabilities by combining various types of image encoders does not contribute to math reasoning performance. This finding also presents a challenge to enhancing visual reliance during math reasoning. Our benchmark and code would be available at \href{https://github.com/Yufang-Liu/visual_modality_role}{https://github.com/Yufang-Liu/visual\_modality\_role}.
Chinese: 当前多模态数学模型未能充分利用视觉信息,为此引入HC-M3D数据集以强化图像依赖性并揭示视觉感知的局限性,同时通用的视觉问答能力提升对数学推理并无助益。
English: Current multimodal mathematical models underutilize visual information, prompting the introduction of the HC-M3D dataset to enforce image reliance and reveal limitations in visual perception, while general VQA enhancements fail to improve math reasoning performance.

Authors:Ziqiang Cui, Yunpeng Weng, Xing Tang, Xiaokun Zhang, Shiwei Li, Peiyang Liu, Bowei He, Dugang Liu, Weihong Luo, Xiuqiang He, Chen Ma
Title: SRA-CL: Semantic Retrieval Augmented Contrastive Learning for Sequential Recommendation
Abstract:
Contrastive learning has shown effectiveness in improving sequential recommendation models. However, existing methods still face challenges in generating high-quality contrastive pairs: they either rely on random perturbations that corrupt user preference patterns or depend on sparse collaborative data that generates unreliable contrastive pairs. Furthermore, existing approaches typically require predefined selection rules that impose strong assumptions, limiting the model's ability to autonomously learn optimal contrastive pairs. To address these limitations, we propose a novel approach named Semantic Retrieval Augmented Contrastive Learning (SRA-CL). SRA-CL leverages the semantic understanding and reasoning capabilities of LLMs to generate expressive embeddings that capture both user preferences and item characteristics. These semantic embeddings enable the construction of candidate pools for inter-user and intra-user contrastive learning through semantic-based retrieval. To further enhance the quality of the contrastive samples, we introduce a learnable sample synthesizer that optimizes the contrastive sample generation process during model training. SRA-CL adopts a plug-and-play design, enabling seamless integration with existing sequential recommendation architectures. Extensive experiments on four public datasets demonstrate the effectiveness and model-agnostic nature of our approach.
中文: 序列推荐面临数据稀疏性挑战,对比学习通过构建正样本对来缓解此问题,但现有方法难以保证样本可靠性,因此提出SRA-CL方法,利用大语言模型的语义信息,通过跨序列和序列内对比学习提升样本质量。
English: Sequential recommendation faces data sparsity challenges, which contrastive learning addresses by creating positive sample pairs, but current methods struggle with reliability, leading to the proposed SRA-CL approach that uses semantic information from LLMs to enhance sample quality through cross-sequence and intra-sequence contrastive learning.

Authors:Yuan Liao, Yuhong Zhang, Qiushi Han, Yuhang Yang, Weiwei Ding, Yuzhe Gu, Hengxin Yang, Liya Huang
Title: Frequency-Based Alignment of EEG and Audio Signals Using Contrastive Learning and SincNet for Auditory Attention Detection
Abstract:
Humans exhibit a remarkable ability to focus auditory attention in complex acoustic environments, such as cocktail parties. Auditory attention detection (AAD) aims to identify the attended speaker by analyzing brain signals, such as electroencephalography (EEG) data. Existing AAD algorithms often leverage deep learning's powerful nonlinear modeling capabilities, few consider the neural mechanisms underlying auditory processing in the brain. In this paper, we propose SincAlignNet, a novel network based on an improved SincNet and contrastive learning, designed to align audio and EEG features for auditory attention detection. The SincNet component simulates the brain's processing of audio during auditory attention, while contrastive learning guides the model to learn the relationship between EEG signals and attended speech. During inference, we calculate the cosine similarity between EEG and audio features and also explore direct inference of the attended speaker using EEG data. Cross-trial evaluations results demonstrate that SincAlignNet outperforms state-of-the-art AAD methods on two publicly available datasets, KUL and DTU, achieving average accuracies of 78.3% and 92.2%, respectively, with a 1-second decision window. The model exhibits strong interpretability, revealing that the left and right temporal lobes are more active during both male and female speaker scenarios. Furthermore, we found that using data from only six electrodes near the temporal lobes maintains similar or even better performance compared to using 64 electrodes. These findings indicate that efficient low-density EEG online decoding is achievable, marking an important step toward the practical implementation of neuro-guided hearing aids in real-world applications. Code is available at: https://github.com/LiaoEuan/SincAlignNet.
中文: SincAlignNet结合改进的SincNet与对比学习,通过对齐音频和脑电特征实现听觉注意检测,在公开数据集上超越现有最优方法,同时证明低密度脑电可实现高效解码,为神经引导助听器的实际应用迈出关键一步。
English: SincAlignNet, a novel network combining improved SincNet and contrastive learning, effectively aligns audio and EEG features to detect auditory attention, outperforming state-of-the-art methods on public datasets while enabling efficient low-density EEG decoding for practical hearing aid applications.

Authors:Jie Xu, Na Zhao, Gang Niu, Masashi Sugiyama, Xiaofeng Zhu
Title: Robust Multi-View Learning via Representation Fusion of Sample-Level Attention and Alignment of Simulated Perturbation
Abstract:
Recently, multi-view learning (MVL) has garnered significant attention due to its ability to fuse discriminative information from multiple views. However, real-world multi-view datasets are often heterogeneous and imperfect, which usually causes MVL methods designed for specific combinations of views to lack application potential and limits their effectiveness. To address this issue, we propose a novel robust MVL method (namely RML) with simultaneous representation fusion and alignment. Specifically, we introduce a simple yet effective multi-view transformer fusion network where we transform heterogeneous multi-view data into homogeneous word embeddings, and then integrate multiple views by the sample-level attention mechanism to obtain a fused representation. Furthermore, we propose a simulated perturbation based multi-view contrastive learning framework that dynamically generates the noise and unusable perturbations for simulating imperfect data conditions. The simulated noisy and unusable data obtain two distinct fused representations, and we utilize contrastive learning to align them for learning discriminative and robust representations. Our RML is self-supervised and can also be applied for downstream tasks as a regularization. In experiments, we employ it in multi-view unsupervised clustering, noise-label classification, and as a plug-and-play module for cross-modal hashing retrieval. Extensive comparison experiments and ablation studies validate RML's effectiveness. Code is available at https://github.com/SubmissionsIn/RML.
中文:提出的鲁棒多视角学习方法RML通过基于Transformer的融合网络整合异构视角,并利用模拟扰动对比学习增强表征鲁棒性,在多种无监督和噪声标签任务中验证了其有效性。
English: The proposed robust multi-view learning method, RML, integrates heterogeneous views through a transformer-based fusion network and enhances representation robustness via simulated perturbation contrastive learning, demonstrating effectiveness in various unsupervised and noisy-label tasks.

Authors:Senming Tan, Zhenyu Hou, Zhihao Zhang, Long Xu, Mengke Zhang, Zhaoqi He, Chao Xu, Fei Gao, Yanjun Cao
Title: Real-time Spatial-temporal Traversability Assessment via Feature-based Sparse Gaussian Process
Abstract:
Terrain analysis is critical for the practical application of ground mobile robots in real-world tasks, especially in outdoor unstructured environments. In this paper, we propose a novel spatial-temporal traversability assessment method, which aims to enable autonomous robots to effectively navigate through complex terrains. Our approach utilizes sparse Gaussian processes (SGP) to extract geometric features (curvature, gradient, elevation, etc.) directly from point cloud scans. These features are then used to construct a high-resolution local traversability map. Then, we design a spatial-temporal Bayesian Gaussian kernel (BGK) inference method to dynamically evaluate traversability scores, integrating historical and real-time data while considering factors such as slope, flatness, gradient, and uncertainty metrics. GPU acceleration is applied in the feature extraction step, and the system achieves real-time performance. Extensive simulation experiments across diverse terrain scenarios demonstrate that our method outperforms SOTA approaches in both accuracy and computational efficiency. Additionally, we develop an autonomous navigation framework integrated with the traversability map and validate it with a differential driven vehicle in complex outdoor environments. Our code will be open-source for further research and development by the community, https://github.com/ZJU-FAST-Lab/FSGP_BGK.
本文提出了一种新颖的时空可通行性评估方法,结合稀疏高斯过程和贝叶斯推理,实现了复杂户外地形中的实时自主导航,在精度和计算效率上均优于现有方法。
This paper introduces a novel spatial-temporal traversability assessment method using sparse Gaussian processes and Bayesian inference to enable real-time autonomous navigation in complex outdoor terrains, demonstrating superior accuracy and efficiency over existing approaches.

Authors:Haoran Wang, Lian Huai, Wenbin Li, Lei Qi, Xingqun Jiang, Yinghuan Shi
Title: WeakMedSAM: Weakly-Supervised Medical Image Segmentation via SAM with Sub-Class Exploration and Prompt Affinity Mining
Abstract:
We have witnessed remarkable progress in foundation models in vision tasks. Currently, several recent works have utilized the segmenting anything model (SAM) to boost the segmentation performance in medical images, where most of them focus on training an adaptor for fine-tuning a large amount of pixel-wise annotated medical images following a fully supervised manner. In this paper, to reduce the labeling cost, we investigate a novel weakly-supervised SAM-based segmentation model, namely WeakMedSAM. Specifically, our proposed WeakMedSAM contains two modules: 1) to mitigate severe co-occurrence in medical images, a sub-class exploration module is introduced to learn accurate feature representations. 2) to improve the quality of the class activation maps, our prompt affinity mining module utilizes the prompt capability of SAM to obtain an affinity map for random-walk refinement. Our method can be applied to any SAM-like backbone, and we conduct experiments with SAMUS and EfficientSAM. The experimental results on three popularly-used benchmark datasets, i.e., BraTS 2019, AbdomenCT-1K, and MSD Cardiac dataset, show the promising results of our proposed WeakMedSAM. Our code is available at https://github.com/wanghr64/WeakMedSAM.
中文摘要:本文提出WeakMedSAM这一新型弱监督分割模型,通过子类探索模块解决医学图像共现问题,并利用提示亲和力挖掘模块提升类别激活图质量,在三个主流医学数据集上验证了其降低标注成本的有效性。
English Summary: This paper introduces WeakMedSAM, a novel weakly-supervised segmentation model that reduces labeling costs in medical imaging by incorporating a sub-class exploration module to address co-occurrence issues and a prompt affinity mining module to enhance class activation maps using SAM's capabilities.

Authors:Runtao Zhou, Guangya Wan, Saadia Gabriel, Sheng Li, Alexander J Gates, Maarten Sap, Thomas Hartvigsen
Title: Disparities in LLM Reasoning Accuracy and Explanations: A Case Study on African American English
Abstract:
Large Language Models (LLMs) have demonstrated remarkable capabilities in reasoning tasks, leading to their widespread deployment. However, recent studies have highlighted concerning biases in these models, particularly in their handling of dialectal variations like African American English (AAE). In this work, we systematically investigate dialectal disparities in LLM reasoning tasks. We develop an experimental framework comparing LLM performance given Standard American English (SAE) and AAE prompts, combining LLM-based dialect conversion with established linguistic analyses. We find that LLMs consistently produce less accurate responses and simpler reasoning chains and explanations for AAE inputs compared to equivalent SAE questions, with disparities most pronounced in social science and humanities domains. These findings highlight systematic differences in how LLMs process and reason about different language varieties, raising important questions about the development and deployment of these systems in our multilingual and multidialectal world. Our code repository is publicly available at https://github.com/Runtaozhou/dialect_bias_eval.
中文: 大语言模型在处理方言时存在显著偏见,与非裔美国英语相比,标准美国英语的输入导致其推理准确性下降且逻辑链条更为简化,尤其在社会科学和人文领域表现明显。
English: Large Language Models exhibit significant dialectal bias, performing less accurately and with simpler reasoning for African American English compared to Standard American English, particularly in social sciences and humanities.

Authors:Beverley Gorry, Tobias Fischer, Michael Milford, Alejandro Fontan
Title: Image-Based Relocalization and Alignment for Long-Term Monitoring of Dynamic Underwater Environments
Abstract:
Effective monitoring of underwater ecosystems is crucial for tracking environmental changes, guiding conservation efforts, and ensuring long-term ecosystem health. However, automating underwater ecosystem management with robotic platforms remains challenging due to the complexities of underwater imagery, which pose significant difficulties for traditional visual localization methods. We propose an integrated pipeline that combines Visual Place Recognition (VPR), feature matching, and image segmentation on video-derived images. This method enables robust identification of revisited areas, estimation of rigid transformations, and downstream analysis of ecosystem changes. Furthermore, we introduce the SQUIDLE+ VPR Benchmark-the first large-scale underwater VPR benchmark designed to leverage an extensive collection of unstructured data from multiple robotic platforms, spanning time intervals from days to years. The dataset encompasses diverse trajectories, arbitrary overlap and diverse seafloor types captured under varying environmental conditions, including differences in depth, lighting, and turbidity. Our code is available at: https://github.com/bev-gorry/underloc
中文: 提出的集成流程结合了视觉地点识别、特征匹配和图像分割技术,能够可靠识别水下区域并分析生态系统变化,同时引入了利用多平台广泛数据的SQUIDLE+新型基准数据集。
English: The proposed integrated pipeline combining Visual Place Recognition, feature matching, and image segmentation enables robust underwater area identification and ecosystem change analysis, supported by the novel SQUIDLE+ benchmark leveraging extensive multi-platform data.

Authors:Idris O. Sunmola, Zhenjun Zhao, Samuel Schmidgall, Yumeng Wang, Paul Maria Scheikl, Viet Pham, Axel Krieger
Title: Surgical Gaussian Surfels: Highly Accurate Real-time Surgical Scene Rendering using Gaussian Surfels
Abstract:
Accurate geometric reconstruction of deformable tissues in monocular endoscopic video remains a fundamental challenge in robot-assisted minimally invasive surgery. Although recent volumetric and point primitive methods based on neural radiance fields (NeRF) and 3D Gaussian primitives have efficiently rendered surgical scenes, they still struggle with handling artifact-free tool occlusions and preserving fine anatomical details. These limitations stem from unrestricted Gaussian scaling and insufficient surface alignment constraints during reconstruction. To address these issues, we introduce Surgical Gaussian Surfels (SGS), which transform anisotropic point primitives into surface-aligned elliptical splats by constraining the scale component of the Gaussian covariance matrix along the view-aligned axis. We also introduce the Fully Fused Deformation Multilayer Perceptron (FFD-MLP), a lightweight Multi-Layer Perceptron (MLP) that predicts accurate surfel motion fields up to 5x faster than a standard MLP. This is coupled with locality constraints to handle complex tissue deformations. We use homodirectional view-space positional gradients to capture fine image details by splitting Gaussian Surfels in over-reconstructed regions. In addition, we define surface normals as the direction of the steepest density change within each Gaussian surfel primitive, enabling accurate normal estimation without requiring monocular normal priors. We evaluate our method on two in-vivo surgical datasets, where it outperforms current state-of-the-art methods in surface geometry, normal map quality, and rendering efficiency, while remaining competitive in real-time rendering performance. We make our code available at https://github.com/aloma85/SurgicalGaussianSurfels
中文: 本文提出的手术高斯表面元(SGS)方法通过将点基元转换为表面对齐的椭圆块并采用快速变形网络,显著提升了内窥镜手术中可变形组织的几何重建质量,在表面细节和效率方面优于现有技术。
English: This paper introduces Surgical Gaussian Surfels (SGS), a method that enhances geometric reconstruction of deformable tissues in endoscopic surgery by transforming point primitives into surface-aligned splats and incorporating a fast deformation network, outperforming existing techniques in surface detail and efficiency.

Authors:Feng Ni, Kui Huang, Yao Lu, Wenyu Lv, Guanzhong Wang, Zeyu Chen, Yi Liu
Title: PP-DocBee: Improving Multimodal Document Understanding Through a Bag of Tricks
Abstract:
With the rapid advancement of digitalization, various document images are being applied more extensively in production and daily life, and there is an increasingly urgent need for fast and accurate parsing of the content in document images. Therefore, this report presents PP-DocBee, a novel multimodal large language model designed for end-to-end document image understanding. First, we develop a data synthesis strategy tailored to document scenarios in which we build a diverse dataset to improve the model generalization. Then, we apply a few training techniques, including dynamic proportional sampling, data preprocessing, and OCR postprocessing strategies. Extensive evaluations demonstrate the superior performance of PP-DocBee, achieving state-of-the-art results on English document understanding benchmarks and even outperforming existing open source and commercial models in Chinese document understanding. The source code and pre-trained models are publicly available at \href{https://github.com/PaddlePaddle/PaddleMIX}{https://github.com/PaddlePaddle/PaddleMIX}.
中文: 本报告提出的PP-DocBee多模态大语言模型,通过针对性的数据合成与训练策略,在英文和中文文档图像理解任务中均实现了最优性能。
English: This report introduces PP-DocBee, a multimodal large language model that achieves state-of-the-art performance in both English and Chinese document image understanding through specialized data synthesis and training techniques.

Authors:Jie Zhou, Youshu Ji, Ning Wang, Yuchen Hu, Xinyao Jiao, Bingkun Yao, Xinwei Fang, Shuai Zhao, Nan Guan, Zhe Jiang
Title: Insights from Rights and Wrongs: A Large Language Model for Solving Assertion Failures in RTL Design
Abstract:
SystemVerilog Assertions (SVAs) are essential for verifying Register Transfer Level (RTL) designs, as they can be embedded into key functional paths to detect unintended behaviours. During simulation, assertion failures occur when the design's behaviour deviates from expectations. Solving these failures, i.e., identifying and fixing the issues causing the deviation, requires analysing complex logical and timing relationships between multiple signals. This process heavily relies on human expertise, and there is currently no automatic tool available to assist with it. Here, we present AssertSolver, an open-source Large Language Model (LLM) specifically designed for solving assertion failures. By leveraging synthetic training data and learning from error responses to challenging cases, AssertSolver achieves a bug-fixing pass@1 metric of 88.54% on our testbench, significantly outperforming OpenAI's o1-preview by up to 11.97%. We release our model and testbench for public access to encourage further research: https://github.com/SEU-ACAL/reproduce-AssertSolver-DAC-25.
Chinese: AssertSolver 是一款专为解决 SystemVerilog 断言失败而设计的开源大语言模型,通过利用合成训练数据,在测试平台上实现了 88.54% 的错误修复通过率,显著优于 OpenAI 的 o1-preview 模型达 11.97%。
English: AssertSolver is an open-source Large Language Model designed to solve SystemVerilog assertion failures by leveraging synthetic training data, achieving an 88.54% bug-fixing pass rate and outperforming OpenAI's o1-preview by up to 11.97%.

Authors:Amin Karimi, Charalambos Poullis
Title: DSV-LFS: Unifying LLM-Driven Semantic Cues with Visual Features for Robust Few-Shot Segmentation
Abstract:
Few-shot semantic segmentation (FSS) aims to enable models to segment novel/unseen object classes using only a limited number of labeled examples. However, current FSS methods frequently struggle with generalization due to incomplete and biased feature representations, especially when support images do not capture the full appearance variability of the target class. To improve the FSS pipeline, we propose a novel framework that utilizes large language models (LLMs) to adapt general class semantic information to the query image. Furthermore, the framework employs dense pixel-wise matching to identify similarities between query and support images, resulting in enhanced FSS performance. Inspired by reasoning-based segmentation frameworks, our method, named DSV-LFS, introduces an additional token into the LLM vocabulary, allowing a multimodal LLM to generate a "semantic prompt" from class descriptions. In parallel, a dense matching module identifies visual similarities between the query and support images, generating a "visual prompt". These prompts are then jointly employed to guide the prompt-based decoder for accurate segmentation of the query image. Comprehensive experiments on the benchmark datasets Pascal-$5^{i}$ and COCO-$20^{i}$ demonstrate that our framework achieves state-of-the-art performance-by a significant margin-demonstrating superior generalization to novel classes and robustness across diverse scenarios. The source code is available at \href{https://github.com/aminpdik/DSV-LFS}{https://github.com/aminpdik/DSV-LFS}
中文: 本文提出DSV-LFS框架,通过大语言模型生成语义提示并结合密集像素匹配的视觉提示,显著提升了小样本语义分割在新类别上的泛化能力和性能,在基准测试中达到最优效果。
English: This paper introduces DSV-LFS, a novel few-shot semantic segmentation framework that leverages large language models and dense pixel-wise matching to generate semantic and visual prompts, achieving state-of-the-art performance and superior generalization on benchmark datasets.

Authors:Sungwon Kim, Yoonho Lee, Yunhak Oh, Namkyeong Lee, Sukwon Yun, Junseok Lee, Sein Kim, Carl Yang, Chanyoung Park
Title: Subgraph Federated Learning for Local Generalization
Abstract:
Federated Learning (FL) on graphs enables collaborative model training to enhance performance without compromising the privacy of each client. However, existing methods often overlook the mutable nature of graph data, which frequently introduces new nodes and leads to shifts in label distribution. Since they focus solely on performing well on each client's local data, they are prone to overfitting to their local distributions (i.e., local overfitting), which hinders their ability to generalize to unseen data with diverse label distributions. In contrast, our proposed method, FedLoG, effectively tackles this issue by mitigating local overfitting. Our model generates global synthetic data by condensing the reliable information from each class representation and its structural information across clients. Using these synthetic data as a training set, we alleviate the local overfitting problem by adaptively generalizing the absent knowledge within each local dataset. This enhances the generalization capabilities of local models, enabling them to handle unseen data effectively. Our model outperforms baselines in our proposed experimental settings, which are designed to measure generalization power to unseen data in practical scenarios. Our code is available at https://github.com/sung-won-kim/FedLoG
Chinese: FedLoG通过整合各客户端类别表示和结构信息生成全局合成数据,有效缓解联邦图学习中的局部过拟合问题,从而提升模型对未见数据及多样化标签分布的泛化能力。
English: FedLoG addresses local overfitting in federated graph learning by generating global synthetic data from class representations and structural information across clients, thereby enhancing model generalization to unseen data with diverse label distributions.

Authors:Wenhui Zhu, Xin Li, Xiwen Chen, Peijie Qiu, Vamsi Krishna Vasa, Xuanzhao Dong, Yanxi Chen, Natasha Lepore, Oana Dumitrascu, Yi Su, Yalin Wang
Title: RetinalGPT: A Retinal Clinical Preference Conversational Assistant Powered by Large Vision-Language Models
Abstract:
Recently, Multimodal Large Language Models (MLLMs) have gained significant attention for their remarkable ability to process and analyze non-textual data, such as images, videos, and audio. Notably, several adaptations of general-domain MLLMs to the medical field have been explored, including LLaVA-Med. However, these medical adaptations remain insufficiently advanced in understanding and interpreting retinal images. In contrast, medical experts emphasize the importance of quantitative analyses for disease detection and interpretation. This underscores a gap between general-domain and medical-domain MLLMs: while general-domain MLLMs excel in broad applications, they lack the specialized knowledge necessary for precise diagnostic and interpretative tasks in the medical field. To address these challenges, we introduce \textit{RetinalGPT}, a multimodal conversational assistant for clinically preferred quantitative analysis of retinal images. Specifically, we achieve this by compiling a large retinal image dataset, developing a novel data pipeline, and employing customized visual instruction tuning to enhance both retinal analysis and enrich medical knowledge. In particular, RetinalGPT outperforms MLLM in the generic domain by a large margin in the diagnosis of retinal diseases in 8 benchmark retinal datasets. Beyond disease diagnosis, RetinalGPT features quantitative analyses and lesion localization, representing a pioneering step in leveraging LLMs for an interpretable and end-to-end clinical research framework. The code is available at https://github.com/Retinal-Research/RetinalGPT
中文: 多模态大语言模型在通用领域表现出色,但在医学领域,尤其是视网膜图像解读方面存在不足;为此开发的RetinalGPT通过定量分析、疾病诊断和病灶定位,显著提升了临床应用的准确性和可解释性。
English: Multimodal Large Language Models (MLLMs) are advancing but lack specialized capabilities for precise medical diagnostics, particularly in interpreting retinal images, leading to the development of RetinalGPT, which excels in quantitative analysis, disease diagnosis, and lesion localization for enhanced clinical applications.

Authors:Qianzhong Chen, Jiankai Sun, Naixiang Gao, JunEn Low, Timothy Chen, Mac Schwager
Title: GRaD-Nav: Efficiently Learning Visual Drone Navigation with Gaussian Radiance Fields and Differentiable Dynamics
Abstract:
Autonomous visual navigation is an essential element in robot autonomy. Reinforcement learning (RL) offers a promising policy training paradigm. However existing RL methods suffer from high sample complexity, poor sim-to-real transfer, and limited runtime adaptability to navigation scenarios not seen during training. These problems are particularly challenging for drones, with complex nonlinear and unstable dynamics, and strong dynamic coupling between control and perception. In this paper, we propose a novel framework that integrates 3D Gaussian Splatting (3DGS) with differentiable deep reinforcement learning (DDRL) to train vision-based drone navigation policies. By leveraging high-fidelity 3D scene representations and differentiable simulation, our method improves sample efficiency and sim-to-real transfer. Additionally, we incorporate a Context-aided Estimator Network (CENet) to adapt to environmental variations at runtime. Moreover, by curriculum training in a mixture of different surrounding environments, we achieve in-task generalization, the ability to solve new instances of a task not seen during training. Drone hardware experiments demonstrate our method's high training efficiency compared to state-of-the-art RL methods, zero shot sim-to-real transfer for real robot deployment without fine tuning, and ability to adapt to new instances within the same task class (e.g. to fly through a gate at different locations with different distractors in the environment). Our simulator and training framework are open-sourced at: https://github.com/Qianzhong-Chen/grad_nav.
中文摘要:本文提出了一种融合3D高斯溅射与可微分强化学习的新框架,通过高保真场景建模和课程训练,显著提升了无人机视觉导航的样本效率、仿真迁移能力及对未见过场景的泛化适应性。
English Summary: This paper introduces a novel framework combining 3D Gaussian Splatting with differentiable reinforcement learning to enhance drone navigation through improved sample efficiency, sim-to-real transfer, and runtime adaptability to new scenarios.

Authors:Chaitanya K. Joshi, Xiang Fu, Yi-Lun Liao, Vahe Gharakhanyan, Benjamin Kurt Miller, Anuroop Sriram, Zachary W. Ulissi
Title: All-atom Diffusion Transformers: Unified generative modelling of molecules and materials
Abstract:
Diffusion models are the standard toolkit for generative modelling of 3D atomic systems. However, for different types of atomic systems -- such as molecules and materials -- the generative processes are usually highly specific to the target system despite the underlying physics being the same. We introduce the All-atom Diffusion Transformer (ADiT), a unified latent diffusion framework for jointly generating both periodic materials and non-periodic molecular systems using the same model: (1) An autoencoder maps a unified, all-atom representations of molecules and materials to a shared latent embedding space; and (2) A diffusion model is trained to generate new latent embeddings that the autoencoder can decode to sample new molecules or materials. Experiments on MP20, QM9 and GEOM-DRUGS datasets demonstrate that jointly trained ADiT generates realistic and valid molecules as well as materials, obtaining state-of-the-art results on par with molecule and crystal-specific models. ADiT uses standard Transformers with minimal inductive biases for both the autoencoder and diffusion model, resulting in significant speedups during training and inference compared to equivariant diffusion models. Scaling ADiT up to half a billion parameters predictably improves performance, representing a step towards broadly generalizable foundation models for generative chemistry. Open source code: https://github.com/facebookresearch/all-atom-diffusion-transformer
中文:ADiT是一种统一的潜在扩散框架,能够使用同一模型联合生成分子和材料,通过可扩展的Transformer架构实现了最先进的性能。
English: ADiT is a unified latent diffusion framework that jointly generates both molecules and materials using the same model, achieving state-of-the-art performance with scalable transformer architecture.

Authors:Abdullah Mamun, Asiful Arefeen, Susan B. Racette, Dorothy D. Sears, Corrie M. Whisner, Matthew P. Buman, Hassan Ghasemzadeh
Title: LLM-Powered Prediction of Hyperglycemia and Discovery of Behavioral Treatment Pathways from Wearables and Diet
Abstract:
Postprandial hyperglycemia, marked by the blood glucose level exceeding the normal range after consuming a meal, is a critical indicator of progression toward type 2 diabetes in people with prediabetes and in healthy individuals. A key metric for understanding blood glucose dynamics after eating is the postprandial area under the curve (AUC). Predicting postprandial AUC in advance based on a person's lifestyle factors, such as diet and physical activity level, and explaining the factors that affect postprandial blood glucose could allow an individual to adjust their lifestyle accordingly to maintain normal glucose levels. In this study, we developed an explainable machine learning solution, GlucoLens, that takes sensor-driven inputs and uses advanced data processing, large language models, and trainable machine learning models to predict postprandial AUC and hyperglycemia from diet, physical activity, and recent glucose patterns. We used data obtained from wearables in a five-week clinical trial of 10 adults who worked full-time to develop and evaluate the proposed computational model that integrates wearable sensing, multimodal data, and machine learning. Our machine learning model takes multimodal data from wearable activity and glucose monitoring sensors, along with food and work logs, and provides an interpretable prediction of the postprandial glucose pattern. Our GlucoLens system achieves a normalized root mean squared error (NRMSE) of 0.123 in its best configuration. On average, the proposed technology provides a 16% better performance level compared to the comparison models. Additionally, our technique predicts hyperglycemia with an accuracy of 73.3% and an F1 score of 0.716 and recommends different treatment options to help avoid hyperglycemia through diverse counterfactual explanations. Code available: https://github.com/ab9mamun/GlucoLens.
中文: 本研究开发了可解释的机器学习系统GlucoLens,通过可穿戴传感器数据和生活方式输入预测餐后血糖水平及高血糖,在提升预测精度的同时提供可操作建议以帮助维持正常血糖水平。
English: This study introduces GlucoLens, an explainable machine learning system that predicts postprandial blood glucose levels and hyperglycemia using wearable sensor data and lifestyle inputs, achieving improved accuracy and providing actionable recommendations to help maintain normal glucose levels.

Authors:Jingyun Chen, Yading Yuan
Title: Decentralized Personalization for Federated Medical Image Segmentation via Gossip Contrastive Mutual Learning
Abstract:
Federated Learning (FL) presents a promising avenue for collaborative model training among medical centers, facilitating knowledge exchange without compromising data privacy. However, vanilla FL is prone to server failures and rarely achieves optimal performance on all participating sites due to heterogeneous data distributions among them. To overcome these challenges, we propose Gossip Contrastive Mutual Learning (GCML), a unified framework to optimize personalized models in a decentralized environment, where Gossip Protocol is employed for flexible and robust peer-to-peer communication. To make efficient and reliable knowledge exchange in each communication without the global knowledge across all the sites, we introduce deep contrast mutual learning (DCML), a simple yet effective scheme to encourage knowledge transfer between the incoming and local models through collaborative training on local data. By integrating DCML with other efforts to optimize site-specific models by leveraging useful information from peers, we evaluated the performance and efficiency of the proposed method on three publicly available datasets with different segmentation tasks. Our extensive experimental results show that the proposed GCML framework outperformed both centralized and decentralized FL methods with significantly reduced communication overhead, indicating its potential for real-world deployment. Upon the acceptance of manuscript, the code will be available at: https://github.com/CUMC-Yuan-Lab/GCML.
Chinese Summary: GCML框架通过去中心化的点对点通信和深度对比互学习优化联邦学习,在多种医疗数据集上以更低通信成本实现更优性能。
English Summary: The GCML framework enhances federated learning by enabling decentralized, peer-to-peer communication and deep contrast mutual learning, achieving superior performance with lower communication costs across diverse medical datasets.

Authors:Jiangtong Zhu, Zhao Yang, Yinan Shi, Jianwu Fang, Jianru Xue
Title: IC-Mapper: Instance-Centric Spatio-Temporal Modeling for Online Vectorized Map Construction
Abstract:
Online vector map construction based on visual data can bypass the processes of data collection, post-processing, and manual annotation required by traditional map construction, which significantly enhances map-building efficiency. However, existing work treats the online mapping task as a local range perception task, overlooking the spatial scalability required for map construction. We propose IC-Mapper, an instance-centric online mapping framework, which comprises two primary components: 1) Instance-centric temporal association module: For the detection queries of adjacent frames, we measure them in both feature and geometric dimensions to obtain the matching correspondence between instances across frames. 2) Instance-centric spatial fusion module: We perform point sampling on the historical global map from a spatial dimension and integrate it with the detection results of instances corresponding to the current frame to achieve real-time expansion and update of the map. Based on the nuScenes dataset, we evaluate our approach on detection, tracking, and global mapping metrics. Experimental results demonstrate the superiority of IC-Mapper against other state-of-the-art methods. Code will be released on https://github.com/Brickzhuantou/IC-Mapper.
中文:IC-Mapper是一种以实例为中心的在线地图构建框架,通过时序关联和空间融合模块提升空间扩展能力,在检测、跟踪和全局地图指标上优于现有方法。
English: IC-Mapper is an instance-centric online mapping framework that enhances spatial scalability through temporal association and spatial fusion modules, outperforming existing methods on detection, tracking, and global mapping metrics.

Authors:Shuhui Zhu, Baoxiang Wang, Sriram Ganapathi Subramanian, Pascal Poupart
Title: Learning to Negotiate via Voluntary Commitment
Abstract:
The partial alignment and conflict of autonomous agents lead to mixed-motive scenarios in many real-world applications. However, agents may fail to cooperate in practice even when cooperation yields a better outcome. One well known reason for this failure comes from non-credible commitments. To facilitate commitments among agents for better cooperation, we define Markov Commitment Games (MCGs), a variant of commitment games, where agents can voluntarily commit to their proposed future plans. Based on MCGs, we propose a learnable commitment protocol via policy gradients. We further propose incentive-compatible learning to accelerate convergence to equilibria with better social welfare. Experimental results in challenging mixed-motive tasks demonstrate faster empirical convergence and higher returns for our method compared with its counterparts. Our code is available at https://github.com/shuhui-zhu/DCL.
中文: 本研究提出了马尔可夫承诺博弈和可学习的承诺协议,通过可信承诺促进自主智能体间的合作,在混合动机任务中实现了更快的收敛速度和更高的回报。
English: This study introduces Markov Commitment Games and a learnable commitment protocol to enhance cooperation among autonomous agents by enabling credible commitments, achieving faster convergence and higher returns in mixed-motive scenarios.

Authors:Raunaq Suri, Ilan Gofman, Guangwei Yu, Jesse C. Cresswell
Title: Zero-Execution Retrieval-Augmented Configuration Tuning of Spark Applications
Abstract:
Large-scale data processing is increasingly done using distributed computing frameworks like Apache Spark, which have a considerable number of configurable parameters that affect runtime performance. For optimal performance, these parameters must be tuned to the specific job being run. Tuning commonly requires multiple executions to collect runtime information for updating parameters. This is infeasible for ad hoc queries that are run once or infrequently. Zero-execution tuning, where parameters are automatically set before a job's first run, can provide significant savings for all types of applications, but is more challenging since runtime information is not available. In this work, we propose a novel method for zero-execution tuning of Spark configurations based on retrieval. Our method achieves 93.3% of the runtime improvement of state-of-the-art one-execution optimization, entirely avoiding the slow initial execution using default settings. The shift to zero-execution tuning results in a lower cumulative runtime over the first 140 runs, and provides the largest benefit for ad hoc and analytical queries which only need to be executed once. We release the largest and most comprehensive suite of Spark query datasets, optimal configurations, and runtime information, which will promote future development of zero-execution tuning methods.
中文: 本文提出了一种基于检索的Spark配置零执行调优新方法,实现了先进单次执行优化93.3%的性能提升,完全避免了初始运行,特别适用于临时查询和分析型查询。
English: This paper introduces a novel retrieval-based method for zero-execution tuning of Spark configurations, achieving 93.3% of the performance improvement of state-of-the-art one-execution optimization while eliminating the need for initial runs, particularly benefiting ad hoc and analytical queries.

Authors:Jingkang Yang, Shuai Liu, Hongming Guo, Yuhao Dong, Xiamengwei Zhang, Sicheng Zhang, Pengyun Wang, Zitang Zhou, Binzhu Xie, Ziyue Wang, Bei Ouyang, Zhengyu Lin, Marco Cominelli, Zhongang Cai, Yuanhan Zhang, Peiyuan Zhang, Fangzhou Hong, Joerg Widmer, Francesco Gringoli, Lei Yang, Bo Li, Ziwei Liu
Title: EgoLife: Towards Egocentric Life Assistant
Abstract:
We introduce EgoLife, a project to develop an egocentric life assistant that accompanies and enhances personal efficiency through AI-powered wearable glasses. To lay the foundation for this assistant, we conducted a comprehensive data collection study where six participants lived together for one week, continuously recording their daily activities - including discussions, shopping, cooking, socializing, and entertainment - using AI glasses for multimodal egocentric video capture, along with synchronized third-person-view video references. This effort resulted in the EgoLife Dataset, a comprehensive 300-hour egocentric, interpersonal, multiview, and multimodal daily life dataset with intensive annotation. Leveraging this dataset, we introduce EgoLifeQA, a suite of long-context, life-oriented question-answering tasks designed to provide meaningful assistance in daily life by addressing practical questions such as recalling past relevant events, monitoring health habits, and offering personalized recommendations. To address the key technical challenges of (1) developing robust visual-audio models for egocentric data, (2) enabling identity recognition, and (3) facilitating long-context question answering over extensive temporal information, we introduce EgoButler, an integrated system comprising EgoGPT and EgoRAG. EgoGPT is an omni-modal model trained on egocentric datasets, achieving state-of-the-art performance on egocentric video understanding. EgoRAG is a retrieval-based component that supports answering ultra-long-context questions. Our experimental studies verify their working mechanisms and reveal critical factors and bottlenecks, guiding future improvements. By releasing our datasets, models, and benchmarks, we aim to stimulate further research in egocentric AI assistants.
中文: EgoLife项目通过开发AI智能眼镜和构建全面的EgoLife数据集,推出了集成EgoButler框架的自我中心生活助手系统,在自我中心视频理解方面达到领先水平,为日常生活提供智能化支持。
English: The EgoLife project introduces an AI-powered wearable glasses system and the comprehensive EgoLife Dataset, enabling advanced daily life assistance through the integrated EgoButler framework with state-of-the-art egocentric understanding capabilities.

Authors:Fenglin Liu, Jinge Wu, Hongjian Zhou, Xiao Gu, Soheila Molaei, Anshul Thakur, Lei Clifton, Honghan Wu, David A. Clifton
Title: RiskAgent: Autonomous Medical AI Copilot for Generalist Risk Prediction
Abstract:
The application of Large Language Models (LLMs) to various clinical applications has attracted growing research attention. However, real-world clinical decision-making differs significantly from the standardized, exam-style scenarios commonly used in current efforts. In this paper, we present the RiskAgent system to perform a broad range of medical risk predictions, covering over 387 risk scenarios across diverse complex diseases, e.g., cardiovascular disease and cancer. RiskAgent is designed to collaborate with hundreds of clinical decision tools, i.e., risk calculators and scoring systems that are supported by evidence-based medicine. To evaluate our method, we have built the first benchmark MedRisk specialized for risk prediction, including 12,352 questions spanning 154 diseases, 86 symptoms, 50 specialties, and 24 organ systems. The results show that our RiskAgent, with 8 billion model parameters, achieves 76.33% accuracy, outperforming the most recent commercial LLMs, o1, o3-mini, and GPT-4.5, and doubling the 38.39% accuracy of GPT-4o. On rare diseases, e.g., Idiopathic Pulmonary Fibrosis (IPF), RiskAgent outperforms o1 and GPT-4.5 by 27.27% and 45.46% accuracy, respectively. Finally, we further conduct a generalization evaluation on an external evidence-based diagnosis benchmark and show that our RiskAgent achieves the best results. These encouraging results demonstrate the great potential of our solution for diverse diagnosis domains. To improve the adaptability of our model in different scenarios, we have built and open-sourced a family of models ranging from 1 billion to 70 billion parameters. Our code, data, and models are all available at https://github.com/AI-in-Health/RiskAgent.
中文: RiskAgent系统在医疗风险预测中展现出卓越性能,其准确率达76.33%,显著超越主流商业大语言模型,并通过开源不同参数规模的模型系列提升医疗场景的适应性。
English: The RiskAgent system demonstrates superior performance in medical risk prediction, achieving 76.33% accuracy across diverse clinical scenarios and significantly outperforming leading commercial LLMs, with its open-source models enhancing adaptability for various healthcare applications.

Authors:Cristian Jimenez-Romero, Alper Yegenoglu, Christian Blum
Title: Multi-Agent Systems Powered by Large Language Models: Applications in Swarm Intelligence
Abstract:
This work examines the integration of large language models (LLMs) into multi-agent simulations by replacing the hard-coded programs of agents with LLM-driven prompts. The proposed approach is showcased in the context of two examples of complex systems from the field of swarm intelligence: ant colony foraging and bird flocking. Central to this study is a toolchain that integrates LLMs with the NetLogo simulation platform, leveraging its Python extension to enable communication with GPT-4o via the OpenAI API. This toolchain facilitates prompt-driven behavior generation, allowing agents to respond adaptively to environmental data. For both example applications mentioned above, we employ both structured, rule-based prompts and autonomous, knowledge-driven prompts. Our work demonstrates how this toolchain enables LLMs to study self-organizing processes and induce emergent behaviors within multi-agent environments, paving the way for new approaches to exploring intelligent systems and modeling swarm intelligence inspired by natural phenomena. We provide the code, including simulation files and data at https://github.com/crjimene/swarm_gpt.
本研究通过将传统智能体程序替换为LLM驱动提示,将大语言模型集成到多智能体模拟中,以蚁群觅食和鸟群聚集为例,利用NetLogo-GPT-4工具链生成适应性行为并诱导涌现智能。
This study integrates large language models into multi-agent simulations by replacing traditional agent programming with LLM-driven prompts, demonstrated through ant foraging and bird flocking examples using a NetLogo-GPT-4 toolchain to generate adaptive behaviors and emergent intelligence.

Authors:Jianqi Yan, Alex P. Leung, Zhiyuan Pei, David C. Y. Hui, Sangin Kim
Title: DeepGrav: Anomalous Gravitational-Wave Detection Through Deep Latent Features
Abstract:
This work introduces a novel deep learning-based approach for gravitational wave anomaly detection, aiming to overcome the limitations of traditional matched filtering techniques in identifying unknown waveform gravitational wave signals. We introduce a modified convolutional neural network architecture inspired by ResNet that leverages residual blocks to extract high-dimensional features, effectively capturing subtle differences between background noise and gravitational wave signals. This network architecture learns a high-dimensional projection while preserving discrepancies with the original input, facilitating precise identification of gravitational wave signals. In our experiments, we implement an innovative data augmentation strategy that generates new data by computing the arithmetic mean of multiple signal samples while retaining the key features of the original signals. In the NSF HDR A3D3: Detecting Anomalous Gravitational Wave Signals competition, it is honorable for us (group name: easonyan123) to get to the first place at the end with our model achieving a true negative rate (TNR) of 0.9708 during development/validation phase and 0.9832 on an unseen challenge dataset during final/testing phase, the highest among all competitors. These results demonstrate that our method not only achieves excellent generalization performance but also maintains robust adaptability in addressing the complex uncertainties inherent in gravitational wave anomaly detection.
中文摘要:本研究提出了一种基于改进残差网络的新型深度学习模型,通过高维特征提取和创新的数据增强策略,在国际引力波异常检测竞赛中以98.32%的准确率获得最高性能,展现了卓越的泛化能力和适应性。
English Summary: This study presents a novel ResNet-inspired deep learning model that effectively identifies gravitational wave anomalies through advanced feature extraction and data augmentation, achieving record-breaking 98.32% detection accuracy in an international competition.

Authors:Enkhtogtokh Togootogtokh, Christian Klasen
Title: VoiceGRPO: Modern MoE Transformers with Group Relative Policy Optimization GRPO for AI Voice Health Care Applications on Voice Pathology Detection
Abstract:
This research introduces a novel AI techniques as Mixture-of-Experts Transformers with Group Relative Policy Optimization (GRPO) for voice health care applications on voice pathology detection. With the architectural innovations, we adopt advanced training paradigms inspired by reinforcement learning, namely Proximal Policy Optimization (PPO) and Group-wise Regularized Policy Optimization (GRPO), to enhance model stability and performance. Experiments conducted on a synthetically generated voice pathology dataset demonstrate that our proposed models significantly improve diagnostic accuracy, F1 score, and ROC-AUC compared to conventional approaches. These findings underscore the potential of integrating transformer architectures with novel training strategies to advance automated voice pathology detection and ultimately contribute to more effective healthcare delivery. The code we used to train and evaluate our models is available at https://github.com/enkhtogtokh/voicegrpo
中文: 本研究提出了一种采用群组相对策略优化的新型专家混合Transformer模型,通过强化学习启发的训练方法显著提升了嗓音病理检测的诊断准确率和综合性能指标。
English: This study presents a novel Mixture-of-Experts Transformer with Group Relative Policy Optimization (GRPO) for voice pathology detection, demonstrating significant improvements in diagnostic accuracy and performance metrics through reinforcement learning-inspired training paradigms.

Authors:Hiroshi Takahashi, Tomoharu Iwata, Atsutoshi Kumagai, Yuuki Yamanaka, Tomoya Yamashita
Title: Positive-Unlabeled Diffusion Models for Preventing Sensitive Data Generation
Abstract:
Diffusion models are powerful generative models but often generate sensitive data that are unwanted by users, mainly because the unlabeled training data frequently contain such sensitive data. Since labeling all sensitive data in the large-scale unlabeled training data is impractical, we address this problem by using a small amount of labeled sensitive data. In this paper, we propose positive-unlabeled diffusion models, which prevent the generation of sensitive data using unlabeled and sensitive data. Our approach can approximate the evidence lower bound (ELBO) for normal (negative) data using only unlabeled and sensitive (positive) data. Therefore, even without labeled normal data, we can maximize the ELBO for normal data and minimize it for labeled sensitive data, ensuring the generation of only normal data. Through experiments across various datasets and settings, we demonstrated that our approach can prevent the generation of sensitive images without compromising image quality.
中文: 本文提出的正未标记扩散模型通过少量标记敏感数据和未标记数据,有效防止生成敏感内容,在保持图像质量的同时确保仅生成正常数据,并在多种数据集上验证了其有效性。
English: The proposed positive-unlabeled diffusion model prevents the generation of sensitive data by utilizing minimal labeled sensitive data alongside unlabeled data, effectively maximizing the evidence lower bound for normal data while maintaining image quality across diverse datasets.

Authors:Zanting Ye, Xiaolong Niu, Xu Han, Xuanbin Wu, Wantong Lu, Yijun Lu, Hao Sun, Yanchao Huang, Hubing Wu, Lijun Lu
Title: Self is the Best Learner: CT-free Ultra-Low-Dose PET Organ Segmentation via Collaborating Denoising and Segmentation Learning
Abstract:
Organ segmentation in Positron Emission Tomography (PET) plays a vital role in cancer quantification. Low-dose PET (LDPET) provides a safer alternative by reducing radiation exposure. However, the inherent noise and blurred boundaries make organ segmentation more challenging. Additionally, existing PET organ segmentation methods rely on coregistered Computed Tomography (CT) annotations, overlooking the problem of modality mismatch. In this study, we propose LDOS, a novel CT-free ultra-LDPET organ segmentation pipeline. Inspired by Masked Autoencoders (MAE), we reinterpret LDPET as a naturally masked version of Full-Dose PET (FDPET). LDOS adopts a simple yet effective architecture: a shared encoder extracts generalized features, while task-specific decoders independently refine outputs for denoising and segmentation. By integrating CT-derived organ annotations into the denoising process, LDOS improves anatomical boundary recognition and alleviates the PET/CT misalignments. Experiments demonstrate that LDOS achieves state-of-the-art performance with mean Dice scores of 73.11% (18F-FDG) and 73.97% (68Ga-FAPI) across 18 organs in 5% dose PET. Our code will be available at https://github.com/yezanting/LDOS.
中文: LDOS提出了一种无需CT的超低剂量PET器官分割方法,通过共享编码器和任务特定解码器增强去噪与分割效果,并结合CT标注改善边界识别与模态不匹配问题,实现了领先性能。
English: LDOS introduces a CT-free pipeline for ultra-low-dose PET organ segmentation, using a shared encoder and task-specific decoders to enhance denoising and segmentation while integrating CT-derived annotations to improve boundary recognition and address modality mismatches, achieving state-of-the-art results.

Authors:Yuqi Zhou, Shuai Wang, Sunhao Dai, Qinglin Jia, Zhaocheng Du, Zhenhua Dong, Jun Xu
Title: CHOP: Mobile Operating Assistant with Constrained High-frequency Optimized Subtask Planning
Abstract:
The advancement of visual language models (VLMs) has enhanced mobile device operations, allowing simulated human-like actions to address user requirements. Current VLM-based mobile operating assistants can be structured into three levels: task, subtask, and action. The subtask level, linking high-level goals with low-level executable actions, is crucial for task completion but faces two challenges: ineffective subtasks that lower-level agent cannot execute and inefficient subtasks that fail to contribute to the completion of the higher-level task. These challenges stem from VLM's lack of experience in decomposing subtasks within GUI scenarios in multi-agent architecture. To address these, we propose a new mobile assistant architecture with constrained high-frequency o}ptimized planning (CHOP). Our approach overcomes the VLM's deficiency in GUI scenarios planning by using human-planned subtasks as the basis vector. We evaluate our architecture in both English and Chinese contexts across 20 Apps, demonstrating significant improvements in both effectiveness and efficiency. Our dataset and code is available at https://github.com/Yuqi-Zhou/CHOP
中文:提出的CHOP架构通过以人工规划的子任务为基础向量,克服了视觉语言模型在图形界面场景中的规划缺陷,显著提升了跨多个应用场景的移动助手效能与效率。
English: The proposed CHOP architecture enhances mobile visual language model assistants by using human-planned subtasks as basis vectors to overcome planning deficiencies in GUI scenarios, significantly improving effectiveness and efficiency across multiple applications.

Authors:Xuandong Zhao, Will Cai, Tianneng Shi, David Huang, Licong Lin, Song Mei, Dawn Song
Title: Improving LLM Safety Alignment with Dual-Objective Optimization
Abstract:
Existing training-time safety alignment techniques for large language models (LLMs) remain vulnerable to jailbreak attacks. Direct preference optimization (DPO), a widely deployed alignment method, exhibits limitations in both experimental and theoretical contexts as its loss function proves suboptimal for refusal learning. Through gradient-based analysis, we identify these shortcomings and propose an improved safety alignment that disentangles DPO objectives into two components: (1) robust refusal training, which encourages refusal even when partial unsafe generations are produced, and (2) targeted unlearning of harmful knowledge. This approach significantly increases LLM robustness against a wide range of jailbreak attacks, including prefilling, suffix, and multi-turn attacks across both in-distribution and out-of-distribution scenarios. Furthermore, we introduce a method to emphasize critical refusal tokens by incorporating a reward-based token-level weighting mechanism for refusal learning, which further improves the robustness against adversarial exploits. Our research also suggests that robustness to jailbreak attacks is correlated with token distribution shifts in the training process and internal representations of refusal and harmful tokens, offering valuable directions for future research in LLM safety alignment. The code is available at https://github.com/wicai24/DOOR-Alignment
中文: 本研究揭示了直接偏好优化(DPO)在大语言模型安全对齐中的缺陷,提出通过分离拒绝训练与有害知识遗忘的改进方法,借助词级加权和分布分析显著提升了针对各类越狱攻击的鲁棒性。
English: This study identifies vulnerabilities in Direct Preference Optimization (DPO) for LLM safety alignment and proposes an enhanced method that separates refusal training from harmful knowledge unlearning, significantly boosting robustness against diverse jailbreak attacks through token-level weighting and distribution analysis.

Authors:Nianzu Yang, Pandeng Li, Liming Zhao, Yang Li, Chen-Wei Xie, Yehui Tang, Xudong Lu, Zhihang Liu, Yun Zheng, Yu Liu, Junchi Yan
Title: Rethinking Video Tokenization: A Conditioned Diffusion-based Approach
Abstract:
Existing video tokenizers typically use the traditional Variational Autoencoder (VAE) architecture for video compression and reconstruction. However, to achieve good performance, its training process often relies on complex multi-stage training tricks that go beyond basic reconstruction loss and KL regularization. Among these tricks, the most challenging is the precise tuning of adversarial training with additional Generative Adversarial Networks (GANs) in the final stage, which can hinder stable convergence. In contrast to GANs, diffusion models offer more stable training processes and can generate higher-quality results. Inspired by these advantages, we propose CDT, a novel Conditioned Diffusion-based video Tokenizer, that replaces the GAN-based decoder with a conditional causal diffusion model. The encoder compresses spatio-temporal information into compact latents, while the decoder reconstructs videos through a reverse diffusion process conditioned on these latents. During inference, we incorporate a feature cache mechanism to generate videos of arbitrary length while maintaining temporal continuity and adopt sampling acceleration technique to enhance efficiency. Trained using only a basic MSE diffusion loss for reconstruction, along with KL term and LPIPS perceptual loss from scratch, extensive experiments demonstrate that CDT achieves state-of-the-art performance in video reconstruction tasks with just a single-step sampling. Even a scaled-down version of CDT (3$\times$ inference speedup) still performs comparably with top baselines. Moreover, the latent video generation model trained with CDT also exhibits superior performance. The source code and pretrained weights are available at https://github.com/ali-vilab/CDT.
中文: 提出的CDT视频分词器采用条件扩散模型替代传统基于GAN的解码器,通过稳定训练和高效推理机制,无需复杂对抗性调优即可实现最先进的视频重建效果。
English: The proposed CDT video tokenizer replaces the traditional GAN-based decoder with a conditional diffusion model, achieving state-of-the-art video reconstruction through stable training and efficient inference mechanisms without complex adversarial tuning.

Authors:Zhao Yang, Zezhong Qian, Xiaofan Li, Weixiang Xu, Gongpeng Zhao, Ruohong Yu, Lingsi Zhu, Longjun Liu
Title: DualDiff+: Dual-Branch Diffusion for High-Fidelity Video Generation with Reward Guidance
Abstract:
Accurate and high-fidelity driving scene reconstruction demands the effective utilization of comprehensive scene information as conditional inputs. Existing methods predominantly rely on 3D bounding boxes and BEV road maps for foreground and background control, which fail to capture the full complexity of driving scenes and adequately integrate multimodal information. In this work, we present DualDiff, a dual-branch conditional diffusion model designed to enhance driving scene generation across multiple views and video sequences. Specifically, we introduce Occupancy Ray-shape Sampling (ORS) as a conditional input, offering rich foreground and background semantics alongside 3D spatial geometry to precisely control the generation of both elements. To improve the synthesis of fine-grained foreground objects, particularly complex and distant ones, we propose a Foreground-Aware Mask (FGM) denoising loss function. Additionally, we develop the Semantic Fusion Attention (SFA) mechanism to dynamically prioritize relevant information and suppress noise, enabling more effective multimodal fusion. Finally, to ensure high-quality image-to-video generation, we introduce the Reward-Guided Diffusion (RGD) framework, which maintains global consistency and semantic coherence in generated videos. Extensive experiments demonstrate that DualDiff achieves state-of-the-art (SOTA) performance across multiple datasets. On the NuScenes dataset, DualDiff reduces the FID score by 4.09% compared to the best baseline. In downstream tasks, such as BEV segmentation, our method improves vehicle mIoU by 4.50% and road mIoU by 1.70%, while in BEV 3D object detection, the foreground mAP increases by 1.46%. Code will be made available at https://github.com/yangzhaojason/DualDiff.
中文: DualDiff采用双分支扩散模型,通过占用射线采样和语义融合注意力等创新技术提升多模态驾驶场景生成质量,在多项任务中实现最优性能。
English: DualDiff introduces a dual-branch diffusion model with novel techniques like Occupancy Ray-shape Sampling and Semantic Fusion Attention to enhance multimodal driving scene generation, achieving state-of-the-art performance in reconstruction and downstream tasks.

Authors:Rui Ye, Shuo Tang, Rui Ge, Yaxin Du, Zhenfei Yin, Siheng Chen, Jing Shao
Title: MAS-GPT: Training LLMs to Build LLM-based Multi-Agent Systems
Abstract:
LLM-based multi-agent systems (MAS) have shown significant potential in tackling diverse tasks. However, to design effective MAS, existing approaches heavily rely on manual configurations or multiple calls of advanced LLMs, resulting in inadaptability and high inference costs. In this paper, we simplify the process of building an MAS by reframing it as a generative language task, where the input is a user query and the output is a corresponding MAS. To address this novel task, we unify the representation of MAS as executable code and propose a consistency-oriented data construction pipeline to create a high-quality dataset comprising coherent and consistent query-MAS pairs. Using this dataset, we train MAS-GPT, an open-source medium-sized LLM that is capable of generating query-adaptive MAS within a single LLM inference. The generated MAS can be seamlessly applied to process user queries and deliver high-quality responses. Extensive experiments on 9 benchmarks and 5 LLMs show that the proposed MAS-GPT consistently outperforms 10+ baseline MAS methods on diverse settings, indicating MAS-GPT's high effectiveness, efficiency and strong generalization ability. Code will be available at https://github.com/rui-ye/MAS-GPT.
中文: 本文提出MAS-GPT方法,通过将多智能体系统构建重构为生成式任务并采用可执行代码表示,实现了高效生成查询自适应系统,在多个基准测试中展现出优越性能与泛化能力。
English: This paper introduces MAS-GPT, a streamlined method that reframes multi-agent system design as a generative task using executable code representations, enabling efficient and adaptive generation of query-specific systems with superior performance across benchmarks.

Authors:Bar Karov, Dor Zohar, Yam Marcovitz
Title: Attentive Reasoning Queries: A Systematic Method for Optimizing Instruction-Following in Large Language Models
Abstract:
We present Attentive Reasoning Queries (ARQs), a novel structured reasoning approach that significantly improves instruction-following in Large Language Models through domain-specialized reasoning blueprints. While LLMs demonstrate remarkable capabilities across diverse tasks, they often fail to maintain adherence to complex, use-case-specific instructions during multi-turn conversations, presenting challenges for business-critical applications. ARQs address this limitation by guiding LLMs through systematic reasoning steps with targeted queries that reinstate critical instructions and facilitate intermediate reasoning throughout the completion process. In extensive testing within Parlant, our framework for reliable customer-facing agents in which ARQs were born out of necessity, they achieved a 90.2% success rate across 87 test scenarios, outperforming both Chain-of-Thought reasoning (86.1%) and direct response generation (81.5%). ARQs showed particular strength in addressing persistent failure modes like guideline re-application and hallucination prevention. Our analysis also revealed that ARQs can potentially be more computationally efficient than free-form reasoning when carefully designed. These findings demonstrate that structured reasoning approaches provide effective mechanisms for controlling how LLMs process information and make decisions in complex scenarios.
中文摘要:ARQs是一种新颖的结构化推理方法,通过专业推理蓝图指导大语言模型执行系统性推理步骤,在87个测试场景中达成90.2%的成功率,显著提升了复杂指令遵循能力并有效遏制幻觉现象。
English Summary: ARQs are a structured reasoning method that enhances LLMs' instruction adherence in multi-turn conversations by using specialized reasoning blueprints, achieving a 90.2% success rate in tests and outperforming traditional approaches.

Authors:Wei Li, Bing Hu, Rui Shao, Leyang Shen, Liqiang Nie
Title: LION-FS: Fast & Slow Video-Language Thinker as Online Video Assistant
Abstract:
First-person video assistants are highly anticipated to enhance our daily lives through online video dialogue. However, existing online video assistants often sacrifice assistant efficacy for real-time efficiency by processing low-frame-rate videos with coarse-grained visual features.To overcome the trade-off between efficacy and efficiency, we propose "Fast & Slow Video-Language Thinker" as an onLIne videO assistaNt, LION-FS, achieving real-time, proactive, temporally accurate, and contextually precise responses. LION-FS adopts a two-stage optimization strategy: 1)Fast Path: Routing-Based Response Determination evaluates frame-by-frame whether an immediate response is necessary. To enhance response determination accuracy and handle higher frame-rate inputs efficiently, we employ Token Aggregation Routing to dynamically fuse spatiotemporal features without increasing token numbers, while utilizing Token Dropping Routing to eliminate redundant features. 2)Slow Path: Multi-granularity Keyframe Augmentation optimizes keyframes during response generation. To provide comprehensive and detailed responses beyond atomic actions constrained by training data, fine-grained spatial features and human-environment interaction features are extracted through multi-granular pooling. These features are further integrated into a meticulously designed multimodal Thinking Template to guide more precise response generation. Comprehensive evaluations on online video tasks demonstrate that LION-FS achieves state-of-the-art efficacy and efficiency.
中文摘要:提出的LION-FS系统通过双路径优化策略,在快速路径中实现动态特征融合与冗余消除,在慢速路径中通过多粒度特征增强与思维模板,成功解决了在线视频助手效能与效率的权衡问题。
English Summary: The proposed LION-FS system overcomes the trade-off between efficacy and efficiency in online video assistants through a dual-path optimization strategy, achieving real-time performance with temporally and contextually precise responses.

Authors:Rui Zhao, Weijia Mao, Mike Zheng Shou
Title: DoraCycle: Domain-Oriented Adaptation of Unified Generative Model in Multimodal Cycles
Abstract:
Adapting generative models to specific domains presents an effective solution for satisfying specialized requirements. However, adapting to some complex domains remains challenging, especially when these domains require substantial paired data to capture the targeted distributions. Since unpaired data from a single modality, such as vision or language, is more readily available, we utilize the bidirectional mappings between vision and language learned by the unified generative model to enable training on unpaired data for domain adaptation. Specifically, we propose DoraCycle, which integrates two multimodal cycles: text-to-image-to-text and image-to-text-to-image. The model is optimized through cross-entropy loss computed at the cycle endpoints, where both endpoints share the same modality. This facilitates self-evolution of the model without reliance on annotated text-image pairs. Experimental results demonstrate that for tasks independent of paired knowledge, such as stylization, DoraCycle can effectively adapt the unified model using only unpaired data. For tasks involving new paired knowledge, such as specific identities, a combination of a small set of paired image-text examples and larger-scale unpaired data is sufficient for effective domain-oriented adaptation. The code will be released at https://github.com/showlab/DoraCycle.
中文: DoraCycle通过双向多模态循环利用非配对数据实现生成模型的领域适应,仅用非配对数据即可有效完成风格化任务,而结合少量配对数据还能处理特定身份任务。
English: DoraCycle enables domain adaptation of generative models using unpaired data through bidirectional multimodal cycles, achieving effective stylization with unpaired data alone and handling identity-specific tasks with minimal paired examples.

Authors:Xiaojun Bi, Shuo Li, Junyao Xing, Ziyue Wang, Fuwen Luo, Weizheng Qiao, Lu Han, Ziwei Sun, Peng Li, Yang Liu
Title: DongbaMIE: A Multimodal Information Extraction Dataset for Evaluating Semantic Understanding of Dongba Pictograms
Abstract:
Dongba pictographic is the only pictographic script still in use in the world. Its pictorial ideographic features carry rich cultural and contextual information. However, due to the lack of relevant datasets, research on semantic understanding of Dongba hieroglyphs has progressed slowly. To this end, we constructed \textbf{DongbaMIE} - the first dataset focusing on multimodal information extraction of Dongba pictographs. The dataset consists of images of Dongba hieroglyphic characters and their corresponding semantic annotations in Chinese. It contains 23,530 sentence-level and 2,539 paragraph-level high-quality text-image pairs. The annotations cover four semantic dimensions: object, action, relation and attribute. Systematic evaluation of mainstream multimodal large language models shows that the models are difficult to perform information extraction of Dongba hieroglyphs efficiently under zero-shot and few-shot learning. Although supervised fine-tuning can improve the performance, accurate extraction of complex semantics is still a great challenge at present.
中文: 为解决东巴象形文字语义理解研究的数据匮乏问题,我们构建了首个多模态数据集DongbaMIE,包含26,069个高质量图文对,但实验表明现有模型在零样本和微调场景下仍难以实现精确的语义信息抽取。
English: The DongbaMIE dataset, the first multimodal resource for Dongba pictographs, was created to advance semantic understanding by providing 26,069 annotated text-image pairs, yet current models struggle with accurate information extraction even after fine-tuning.

Authors:Woo-Jin Jung, Dong-Hee Paek, Seung-Hyun Kong
Title: L2RDaS: Synthesizing 4D Radar Tensors for Model Generalization via Dataset Expansion
Abstract:
4-dimensional (4D) radar is increasingly adopted in autonomous driving for perception tasks, owing to its robustness under adverse weather conditions. To better utilize the spatial information inherent in 4D radar data, recent deep learning methods have transitioned from using sparse point cloud to 4D radar tensors. However, the scarcity of publicly available 4D radar tensor datasets limits model generalization across diverse driving scenarios. Previous methods addressed this by synthesizing radar data, but the outputs did not fully exploit the spatial information characteristic of 4D radar. To overcome these limitations, we propose LiDAR-to-4D radar data synthesis (L2RDaS), a framework that synthesizes spatially informative 4D radar tensors from LiDAR data available in existing autonomous driving datasets. L2RDaS integrates a modified U-Net architecture to effectively capture spatial information and an object information supplement (OBIS) module to enhance reflection fidelity. This framework enables the synthesis of radar tensors across diverse driving scenarios without additional sensor deployment or data collection. L2RDaS improves model generalization by expanding real datasets with synthetic radar tensors, achieving an average increase of 4.25\% in ${{AP}_{BEV}}$ and 2.87\% in ${{AP}_{3D}}$ across three detection models. Additionally, L2RDaS supports ground-truth augmentation (GT-Aug) by embedding annotated objects into LiDAR data and synthesizing them into radar tensors, resulting in further average increases of 3.75\% in ${{AP}_{BEV}}$ and 4.03\% in ${{AP}_{3D}}$. The implementation will be available at https://github.com/kaist-avelab/K-Radar.
Chinese: L2RDaS框架通过从现有LiDAR数据合成具有丰富空间信息的4D雷达张量,无需额外采集数据即可提升自动驾驶感知模型的泛化能力和检测精度。
English: The L2RDaS framework synthesizes spatially rich 4D radar tensors from LiDAR data to enhance autonomous driving perception, improving model generalization and detection accuracy across diverse scenarios without additional data collection.

Authors:Haowei Sun, Xintao Yan, Zhijie Qiao, Haojie Zhu, Yihao Sun, Jiawei Wang, Shengyin Shen, Darian Hogue, Rajanikant Ananta, Derek Johnson, Greg Stevens, Greg McGuire, Yifan Wei, Wei Zheng, Yong Sun, Yasuo Fukai, Henry X. Liu
Title: TeraSim: Uncovering Unknown Unsafe Events for Autonomous Vehicles through Generative Simulation
Abstract:
Traffic simulation is essential for autonomous vehicle (AV) development, enabling comprehensive safety evaluation across diverse driving conditions. However, traditional rule-based simulators struggle to capture complex human interactions, while data-driven approaches often fail to maintain long-term behavioral realism or generate diverse safety-critical events. To address these challenges, we propose TeraSim, an open-source, high-fidelity traffic simulation platform designed to uncover unknown unsafe events and efficiently estimate AV statistical performance metrics, such as crash rates. TeraSim is designed for seamless integration with third-party physics simulators and standalone AV stacks, to construct a complete AV simulation system. Experimental results demonstrate its effectiveness in generating diverse safety-critical events involving both static and dynamic agents, identifying hidden deficiencies in AV systems, and enabling statistical performance evaluation. These findings highlight TeraSim's potential as a practical tool for AV safety assessment, benefiting researchers, developers, and policymakers. The code is available at https://github.com/mcity/TeraSim.
中文: TeraSim是一个开源交通仿真平台,能生成多样化安全关键事件并精确评估自动驾驶车辆性能指标,有效识别系统缺陷以提升安全评估效果。
English: TeraSim is an open-source traffic simulation platform that generates diverse safety-critical events and accurately estimates autonomous vehicle performance metrics, effectively identifying system deficiencies for enhanced safety assessment.

Authors:Songlong Xing, Zhengyu Zhao, Nicu Sebe
Title: CLIP is Strong Enough to Fight Back: Test-time Counterattacks towards Zero-shot Adversarial Robustness of CLIP
Abstract:
Despite its prevalent use in image-text matching tasks in a zero-shot manner, CLIP has been shown to be highly vulnerable to adversarial perturbations added onto images. Recent studies propose to finetune the vision encoder of CLIP with adversarial samples generated on the fly, and show improved robustness against adversarial attacks on a spectrum of downstream datasets, a property termed as zero-shot robustness. In this paper, we show that malicious perturbations that seek to maximise the classification loss lead to `falsely stable' images, and propose to leverage the pre-trained vision encoder of CLIP to counterattack such adversarial images during inference to achieve robustness. Our paradigm is simple and training-free, providing the first method to defend CLIP from adversarial attacks at test time, which is orthogonal to existing methods aiming to boost zero-shot adversarial robustness of CLIP. We conduct experiments across 16 classification datasets, and demonstrate stable and consistent gains compared to test-time defence methods adapted from existing adversarial robustness studies that do not rely on external networks, without noticeably impairing performance on clean images. We also show that our paradigm can be employed on CLIP models that have been adversarially finetuned to further enhance their robustness at test time. Our code is available \href{https://github.com/Sxing2/CLIP-Test-time-Counterattacks}{here}.
中文: 本文提出一种无需训练的防御方法,利用CLIP预训练视觉编码器在推理阶段对对抗图像进行反制,在多个数据集上实现稳健性能且不影响正常图像的识别精度。
English: This paper introduces a training-free defense method that uses CLIP's pre-trained vision encoder to counterattack adversarial images during inference, achieving robust performance across multiple datasets without compromising clean image accuracy.

Authors:Haoran Fan, Bin Li, Yixuan Weng, Shoujun Zhou
Title: Small but Mighty: Enhancing Time Series Forecasting with Lightweight LLMs
Abstract:
While LLMs have demonstrated remarkable potential in time series forecasting, their practical deployment remains constrained by excessive computational demands and memory footprints. Existing LLM-based approaches typically suffer from three critical limitations: Inefficient parameter utilization in handling numerical time series patterns; Modality misalignment between continuous temporal signals and discrete text embeddings; and Inflexibility for real-time expert knowledge integration. We present SMETimes, the first systematic investigation of sub-3B parameter SLMs for efficient and accurate time series forecasting. Our approach centers on three key innovations: A statistically-enhanced prompting mechanism that bridges numerical time series with textual semantics through descriptive statistical features; A adaptive fusion embedding architecture that aligns temporal patterns with language model token spaces through learnable parameters; And a dynamic mixture-of-experts framework enabled by SLMs' computational efficiency, adaptively combining base predictions with domain-specific models. Extensive evaluations across seven benchmark datasets demonstrate that our 3B-parameter SLM achieves state-of-the-art performance on five primary datasets while maintaining 3.8x faster training and 5.2x lower memory consumption compared to 7B-parameter LLM baselines. Notably, the proposed model exhibits better learning capabilities, achieving 12.3% lower MSE than conventional LLM. Ablation studies validate that our statistical prompting and cross-modal fusion modules respectively contribute 15.7% and 18.2% error reduction in long-horizon forecasting tasks. By redefining the efficiency-accuracy trade-off landscape, this work establishes SLMs as viable alternatives to resource-intensive LLMs for practical time series forecasting. Code and models are available at https://github.com/xiyan1234567/SMETimes.
中文: 该摘要介绍了SMETimes,一种参数少于30亿的小型语言模型,通过统计提示和自适应融合等创新技术,解决了大语言模型在时间序列预测中的计算和内存限制问题,在显著降低资源消耗的同时实现了更优的性能。
English: This abstract introduces SMETimes, a sub-3B parameter small language model that overcomes the computational and memory limitations of large language models in time series forecasting through innovations like statistical prompting and adaptive fusion, achieving superior performance with significantly reduced resource usage.

Authors:Lida Chen, Dong Xu, Chenxin An, Xintao Wang, Yikai Zhang, Jiangjie Chen, Zujie Liang, Feng Wei, Jiaqing Liang, Yanghua Xiao, Wei Wang
Title: PowerAttention: Exponentially Scaling of Receptive Fields for Effective Sparse Attention
Abstract:
Large Language Models (LLMs) face efficiency bottlenecks due to the quadratic complexity of the attention mechanism when processing long contexts. Sparse attention methods offer a promising solution, but existing approaches often suffer from incomplete effective context and/or require complex implementation of pipeline. We present a comprehensive analysis of sparse attention for autoregressive LLMs from the respective of receptive field, recognize the suboptimal nature of existing methods for expanding the receptive field, and introduce PowerAttention, a novel sparse attention design that facilitates effective and complete context extension through the theoretical analysis. PowerAttention achieves exponential receptive field growth in $d$-layer LLMs, allowing each output token to attend to $2^d$ tokens, ensuring completeness and continuity of the receptive field. Experiments demonstrate that PowerAttention outperforms existing static sparse attention methods by $5\sim 40\%$, especially on tasks demanding long-range dependencies like Passkey Retrieval and RULER, while maintaining a comparable time complexity to sliding window attention. Efficiency evaluations further highlight PowerAttention's superior speedup in both prefilling and decoding phases compared with dynamic sparse attentions and full attention ($3.0\times$ faster on 128K context), making it a highly effective and user-friendly solution for processing long sequences in LLMs.
中文总结:PowerAttention是一种新颖的稀疏注意力设计,能够在自回归大语言模型中实现指数级感受野扩展,在保持高计算效率的同时,显著提升了长上下文任务的处理性能。
English Summary: PowerAttention is a novel sparse attention design that enables exponential receptive field growth in autoregressive LLMs, achieving superior performance on long-context tasks while maintaining high computational efficiency.

Authors:Po-Chien Luan, Yang Gao, Celine Demonsant, Alexandre Alahi
Title: Unified Human Localization and Trajectory Prediction with Monocular Vision
Abstract:
Conventional human trajectory prediction models rely on clean curated data, requiring specialized equipment or manual labeling, which is often impractical for robotic applications. The existing predictors tend to overfit to clean observation affecting their robustness when used with noisy inputs. In this work, we propose MonoTransmotion (MT), a Transformer-based framework that uses only a monocular camera to jointly solve localization and prediction tasks. Our framework has two main modules: Bird's Eye View (BEV) localization and trajectory prediction. The BEV localization module estimates the position of a person using 2D human poses, enhanced by a novel directional loss for smoother sequential localizations. The trajectory prediction module predicts future motion from these estimates. We show that by jointly training both tasks with our unified framework, our method is more robust in real-world scenarios made of noisy inputs. We validate our MT network on both curated and non-curated datasets. On the curated dataset, MT achieves around 12% improvement over baseline models on BEV localization and trajectory prediction. On real-world non-curated dataset, experimental results indicate that MT maintains similar performance levels, highlighting its robustness and generalization capability. The code is available at https://github.com/vita-epfl/MonoTransmotion.
中文:提出的MonoTransmotion框架通过单目摄像头联合实现定位与轨迹预测,在干净数据和真实噪声场景中均展现出优于基准模型的鲁棒性和泛化能力。
English: The proposed MonoTransmotion framework jointly performs localization and trajectory prediction using a monocular camera, demonstrating enhanced robustness and generalization with significant improvements over baselines in both curated and noisy real-world datasets.

Authors:Canaan Yung, Hanxun Huang, Sarah Monazam Erfani, Christopher Leckie
Title: CURVALID: Geometrically-guided Adversarial Prompt Detection
Abstract:
Adversarial prompts capable of jailbreaking large language models (LLMs) and inducing undesirable behaviours pose a significant obstacle to their safe deployment. Current mitigation strategies rely on activating built-in defence mechanisms or fine-tuning the LLMs, but the fundamental distinctions between adversarial and benign prompts are yet to be understood. In this work, we introduce CurvaLID, a novel defense framework that efficiently detects adversarial prompts by leveraging their geometric properties. It is agnostic to the type of LLM, offering a unified detection framework across diverse adversarial prompts and LLM architectures. CurvaLID builds on the geometric analysis of text prompts to uncover their underlying differences. We theoretically extend the concept of curvature via the Whewell equation into an $n$-dimensional word embedding space, enabling us to quantify local geometric properties, including semantic shifts and curvature in the underlying manifolds. Additionally, we employ Local Intrinsic Dimensionality (LID) to capture geometric features of text prompts within adversarial subspaces. Our findings reveal that adversarial prompts differ fundamentally from benign prompts in terms of their geometric characteristics. Our results demonstrate that CurvaLID delivers superior detection and rejection of adversarial queries, paving the way for safer LLM deployment. The source code can be found at https://github.com/Cancanxxx/CurvaLID
中文: CurvaLID是一种新型防御框架,通过分析对抗性提示在词嵌入空间中的独特几何特性来检测它们,为不同模型和攻击类型提供了统一的解决方案,以实现更安全的大型语言模型部署。
English: CurvaLID is a novel defense framework that detects adversarial prompts by analyzing their unique geometric properties in word embedding spaces, providing a unified solution for safer LLM deployment across various models and attack types.

Authors:Wonjun Kang, Kevin Galim, Yuchen Zeng, Minjae Lee, Hyung Il Koo, Nam Ik Cho
Title: State-offset Tuning: State-based Parameter-Efficient Fine-Tuning for State Space Models
Abstract:
State Space Models (SSMs) have emerged as efficient alternatives to Transformers, mitigating their quadratic computational cost. However, the application of Parameter-Efficient Fine-Tuning (PEFT) methods to SSMs remains largely unexplored. In particular, prompt-based methods like Prompt Tuning and Prefix-Tuning, which are widely used in Transformers, do not perform well on SSMs. To address this, we propose state-based methods as a superior alternative to prompt-based methods. This new family of methods naturally stems from the architectural characteristics of SSMs. State-based methods adjust state-related features directly instead of depending on external prompts. Furthermore, we introduce a novel state-based PEFT method: State-offset Tuning. At every timestep, our method directly affects the state at the current step, leading to more effective adaptation. Through extensive experiments across diverse datasets, we demonstrate the effectiveness of our method. Code is available at https://github.com/furiosa-ai/ssm-state-tuning.
中文: 状态空间模型(SSMs)虽比Transformer计算高效,但现有参数高效微调方法效果不佳,因此我们提出基于状态的优化方法,如状态偏移调优,直接调整状态特征以提升性能。
English: State Space Models (SSMs) offer computational efficiency over Transformers, but current Parameter-Efficient Fine-Tuning methods underperform, prompting the introduction of state-based techniques like State-offset Tuning that directly modify state features for better adaptation.

Authors:Linyu Fan, Che Wang, Ming Ye, Qizhi Yang, Zejun Wu, Xinghao Ding, Yue Huang, Jianfeng Bao, Shuhui Cai, Congbo Cai
Title: Bridging Synthetic-to-Real Gaps: Frequency-Aware Perturbation and Selection for Single-shot Multi-Parametric Mapping Reconstruction
Abstract:
Data-centric artificial intelligence (AI) has remarkably advanced medical imaging, with emerging methods using synthetic data to address data scarcity while introducing synthetic-to-real gaps. Unsupervised domain adaptation (UDA) shows promise in ground truth-scarce tasks, but its application in reconstruction remains underexplored. Although multiple overlapping-echo detachment (MOLED) achieves ultra-fast multi-parametric reconstruction, extending its application to various clinical scenarios, the quality suffers from deficiency in mitigating the domain gap, difficulty in maintaining structural integrity, and inadequacy in ensuring mapping accuracy. To resolve these issues, we proposed frequency-aware perturbation and selection (FPS), comprising Wasserstein distance-modulated frequency-aware perturbation (WDFP) and hierarchical frequency-aware selection network (HFSNet), which integrates frequency-aware adaptive selection (FAS), compact FAS (cFAS) and feature-aware architecture integration (FAI). Specifically, perturbation activates domain-invariant feature learning within uncertainty, while selection refines optimal solutions within perturbation, establishing a robust and closed-loop learning pathway. Extensive experiments on synthetic data, along with diverse real clinical cases from 5 healthy volunteers, 94 ischemic stroke patients, and 46 meningioma patients, demonstrate the superiority and clinical applicability of FPS. Furthermore, FPS is applied to diffusion tensor imaging (DTI), underscoring its versatility and potential for broader medical applications. The code is available at https://github.com/flyannie/FPS.
中文摘要:提出的频率感知扰动与选择(FPS)方法通过结合扰动和选择机制,有效解决了医学影像重建中的领域适应问题,在多种临床数据集中展现出卓越性能。
English Summary: The proposed Frequency-aware Perturbation and Selection (FPS) method effectively addresses domain adaptation challenges in medical imaging reconstruction by combining perturbation and selection mechanisms, demonstrating superior performance across diverse clinical datasets.

Authors:Kun Zhang, Peng Yun, Jun Cen, Junhao Cai, Didi Zhu, Hangjie Yuan, Chao Zhao, Tao Feng, Michael Yu Wang, Qifeng Chen, Jia Pan, Wei Zhang, Bo Yang, Hua Chen
Title: Generative Artificial Intelligence in Robotic Manipulation: A Survey
Abstract:
This survey provides a comprehensive review on recent advancements of generative learning models in robotic manipulation, addressing key challenges in the field. Robotic manipulation faces critical bottlenecks, including significant challenges in insufficient data and inefficient data acquisition, long-horizon and complex task planning, and the multi-modality reasoning ability for robust policy learning performance across diverse environments. To tackle these challenges, this survey introduces several generative model paradigms, including Generative Adversarial Networks (GANs), Variational Autoencoders (VAEs), diffusion models, probabilistic flow models, and autoregressive models, highlighting their strengths and limitations. The applications of these models are categorized into three hierarchical layers: the Foundation Layer, focusing on data generation and reward generation; the Intermediate Layer, covering language, code, visual, and state generation; and the Policy Layer, emphasizing grasp generation and trajectory generation. Each layer is explored in detail, along with notable works that have advanced the state of the art. Finally, the survey outlines future research directions and challenges, emphasizing the need for improved efficiency in data utilization, better handling of long-horizon tasks, and enhanced generalization across diverse robotic scenarios. All the related resources, including research papers, open-source data, and projects, are collected for the community in https://github.com/GAI4Manipulation/AwesomeGAIManipulation
本综述全面回顾了生成学习模型在机器人操作中的最新进展,通过多种生成范式应对数据稀缺和任务复杂性等关键挑战,并展望了未来研究方向。
This survey comprehensively reviews recent advancements in generative learning models for robotic manipulation, addressing key challenges like data scarcity and task complexity through various generative paradigms and outlining future research directions.

Authors:Alessio Galatolo, Zhenbang Dai, Katie Winkle, Meriem Beloucif
Title: Visualising Policy-Reward Interplay to Inform Zeroth-Order Preference Optimisation of Large Language Models
Abstract:
Fine-tuning Large Language Models (LLMs) with first-order methods like back-propagation is computationally intensive. Zeroth-Order (ZO) optimisation uses function evaluations instead of gradients, reducing memory usage, but suffers from slow convergence in high-dimensional models. As a result, ZO research in LLMs has mostly focused on classification, overlooking more complex generative tasks. In this paper, we introduce ZOPrO, a novel ZO algorithm designed for Preference Optimisation in LLMs. We begin by analysing the interplay between policy and reward models during traditional (first-order) Preference Optimisation, uncovering patterns in their relative updates. Guided by these insights, we adapt Simultaneous Perturbation Stochastic Approximation (SPSA) with a targeted sampling strategy to accelerate convergence. Through experiments on summarisation, machine translation, and conversational assistants, we demonstrate that our method consistently enhances reward signals while achieving convergence times comparable to first-order methods. While it falls short of some state-of-the-art methods, our work is the first to apply Zeroth-Order methods to Preference Optimisation in LLMs, going beyond classification tasks and paving the way for a largely unexplored research direction. Code and visualisations are available at https://github.com/alessioGalatolo/VisZOPrO
Chinese Summary: 本文提出了ZOPrO,一种用于大型语言模型偏好优化的新型零阶优化算法,在增强奖励信号的同时实现了与一阶方法相当的收敛速度,并开创了零阶方法在分类任务之外的应用。
English Summary: This paper introduces ZOPrO, a novel zeroth-order optimization algorithm for preference optimization in large language models, which enhances reward signals and achieves competitive convergence times while pioneering the application of zeroth-order methods beyond classification tasks.

Authors:Junhao Xu, Yanan Zhang, Zhi Cai, Di Huang
Title: CoSDH: Communication-Efficient Collaborative Perception via Supply-Demand Awareness and Intermediate-Late Hybridization
Abstract:
Multi-agent collaborative perception enhances perceptual capabilities by utilizing information from multiple agents and is considered a fundamental solution to the problem of weak single-vehicle perception in autonomous driving. However, existing collaborative perception methods face a dilemma between communication efficiency and perception accuracy. To address this issue, we propose a novel communication-efficient collaborative perception framework based on supply-demand awareness and intermediate-late hybridization, dubbed as \mymethodname. By modeling the supply-demand relationship between agents, the framework refines the selection of collaboration regions, reducing unnecessary communication cost while maintaining accuracy. In addition, we innovatively introduce the intermediate-late hybrid collaboration mode, where late-stage collaboration compensates for the performance degradation in collaborative perception under low communication bandwidth. Extensive experiments on multiple datasets, including both simulated and real-world scenarios, demonstrate that \mymethodname~ achieves state-of-the-art detection accuracy and optimal bandwidth trade-offs, delivering superior detection precision under real communication bandwidths, thus proving its effectiveness and practical applicability. The code will be released at https://github.com/Xu2729/CoSDH.
中文: 提出的\mymethodname框架通过供需感知优化协作区域,并采用中晚期混合协作模式,在降低通信成本的同时保持感知精度,在真实带宽条件下实现了最优性能表现。
English: The proposed \mymethodname framework enhances multi-agent collaborative perception by optimizing collaboration regions through supply-demand awareness and employing an intermediate-late hybrid approach to maintain accuracy while reducing communication costs, achieving state-of-the-art performance in real bandwidth conditions.

Authors:Jabez Magomere, Emanuele La Malfa, Manuel Tonneau, Ashkan Kazemi, Scott Hale
Title: When Claims Evolve: Evaluating and Enhancing the Robustness of Embedding Models Against Misinformation Edits
Abstract:
Online misinformation remains a critical challenge, and fact-checkers increasingly rely on claim matching systems that use sentence embedding models to retrieve relevant fact-checks. However, as users interact with claims online, they often introduce edits, and it remains unclear whether current embedding models used in retrieval are robust to such edits. To investigate this, we introduce a perturbation framework that generates valid and natural claim variations, enabling us to assess the robustness of a wide-range of sentence embedding models in a multi-stage retrieval pipeline and evaluate the effectiveness of various mitigation approaches. Our evaluation reveals that standard embedding models exhibit notable performance drops on edited claims, while LLM-distilled embedding models offer improved robustness at a higher computational cost. Although a strong reranker helps to reduce the performance drop, it cannot fully compensate for first-stage retrieval gaps. To address these retrieval gaps, we evaluate train- and inference-time mitigation approaches, demonstrating that they can improve in-domain robustness by up to 17 percentage points and boost out-of-domain generalization by 10 percentage points. Overall, our findings provide practical improvements to claim-matching systems, enabling more reliable fact-checking of evolving misinformation. Code and data are available at https://github.com/JabezNzomo99/claim-matching-robustness.
中文: 当前用于事实核查的声明匹配系统中的句子嵌入模型对编辑后的声明表现出显著脆弱性,但基于大语言模型提炼的嵌入模型及针对性改进策略能显著提升其鲁棒性和泛化能力。
English: Current sentence embedding models used in claim-matching systems for fact-checking show significant vulnerability to edited claims, but LLM-distilled models and targeted mitigation strategies can substantially enhance robustness and generalization.

Authors:Chiyue Wei, Cong Guo, Feng Cheng, Shiyu Li, Hao "Frank" Yang, Hai "Helen" Li, Yiran Chen
Title: Prosperity: Accelerating Spiking Neural Networks via Product Sparsity
Abstract:
Spiking Neural Networks (SNNs) are highly efficient due to their spike-based activation, which inherently produces bit-sparse computation patterns. Existing hardware implementations of SNNs leverage this sparsity pattern to avoid wasteful zero-value computations, yet this approach fails to fully capitalize on the potential efficiency of SNNs. This study introduces a novel sparsity paradigm called Product Sparsity, which leverages combinatorial similarities within matrix multiplication operations to reuse the inner product result and reduce redundant computations. Product Sparsity significantly enhances sparsity in SNNs without compromising the original computation results compared to traditional bit sparsity methods. For instance, in the SpikeBERT SNN model, Product Sparsity achieves a density of only $1.23\%$ and reduces computation by $11\times$, compared to bit sparsity, which has a density of $13.19\%$. To efficiently implement Product Sparsity, we propose Prosperity, an architecture that addresses the challenges of identifying and eliminating redundant computations in real-time. Compared to prior SNN accelerator PTB and the A100 GPU, Prosperity achieves an average speedup of $7.4\times$ and $1.8\times$, respectively, along with energy efficiency improvements of $8.0\times$ and $193\times$, respectively. The code for Prosperity is available at https://github.com/dubcyfor3/Prosperity.
中文摘要:本研究提出乘积稀疏性这一新型计算范式,通过复用内积结果显著提升脉冲神经网络效率,并设计Prosperity架构实现计算量大幅削减和能效提升。
English Summary: This study introduces Product Sparsity, a novel computational paradigm that enhances efficiency in Spiking Neural Networks by reusing inner product results, achieving significant computation reduction and energy improvements through the Prosperity architecture.

Authors:Yiming Wang, Jianbin Ma, Junda Wu, Huizhe Li, Zhexuan Zhou, Youmin Gong, Jie Mei, Guangfu Ma
Title: SEAL: Safety Enhanced Trajectory Planning and Control Framework for Quadrotor Flight in Complex Environments
Abstract:
For quadrotors, achieving safe and autonomous flight in complex environments with wind disturbances and dynamic obstacles still faces significant challenges. Most existing methods address wind disturbances in either trajectory planning or control, which may lead to hazardous situations during flight. The emergence of dynamic obstacles would further worsen the situation. Therefore, we propose an efficient and reliable framework for quadrotors that incorporates wind disturbance estimations during both the planning and control phases via a generalized proportional integral observer. First, we develop a real-time adaptive spatial-temporal trajectory planner that utilizes Hamilton-Jacobi (HJ) reachability analysis for error dynamics resulting from wind disturbances. By considering the forward reachability sets propagation on an Euclidean Signed Distance Field (ESDF) map, safety is guaranteed. Additionally, a Nonlinear Model Predictive Control (NMPC) controller considering wind disturbance compensation is implemented for robust trajectory tracking. Simulation and real-world experiments verify the effectiveness of our framework. The video and supplementary material will be available at https://github.com/Ma29-HIT/SEAL/.
中文: 本研究提出了一种四旋翼飞行器的综合框架,通过广义比例积分观测器将风扰估计同时纳入轨迹规划和控制阶段,利用自适应时空规划和非线性模型预测控制实现安全飞行与鲁棒跟踪。
English: This study introduces a comprehensive framework for quadrotors that integrates wind disturbance estimation into both trajectory planning and control phases using a generalized proportional integral observer, ensuring safety through adaptive spatial-temporal planning and robust tracking with Nonlinear Model Predictive Control.

Authors:Ahmed E. Samy, Zekarias T. Kefato, Sarunas Girdzijauskas
Title: Leap: Inductive Link Prediction via Learnable TopologyAugmentation
Abstract:
Link prediction is a crucial task in many downstream applications of graph machine learning. To this end, Graph Neural Network (GNN) is a widely used technique for link prediction, mainly in transductive settings, where the goal is to predict missing links between existing nodes. However, many real-life applications require an inductive setting that accommodates for new nodes, coming into an existing graph. Thus, recently inductive link prediction has attracted considerable attention, and a multi-layer perceptron (MLP) is the popular choice of most studies to learn node representations. However, these approaches have limited expressivity and do not fully capture the graph's structural signal. Therefore, in this work we propose LEAP, an inductive link prediction method based on LEArnable toPology augmentation. Unlike previous methods, LEAP models the inductive bias from both the structure and node features, and hence is more expressive. To the best of our knowledge, this is the first attempt to provide structural contexts for new nodes via learnable augmentation in inductive settings. Extensive experiments on seven real-world homogeneous and heterogeneous graphs demonstrates that LEAP significantly surpasses SOTA methods. The improvements are up to 22\% and 17\% in terms of AUC and average precision, respectively. The code and datasets are available on GitHub (https://github.com/AhmedESamy/LEAP/)
Chinese: LEAP是一种创新的归纳式链接预测方法,通过结合结构和节点特征的可学习拓扑增强来提高表达能力,在AUC指标上显著超越现有最优方法,提升幅度高达22%。
English: LEAP is a novel inductive link prediction method that enhances expressiveness by incorporating learnable topology augmentation from both structural and feature perspectives, significantly outperforming state-of-the-art techniques with up to 22% improvement in AUC.

Authors:Guoyu Yang, Yuan Wang, Daming Shi, Yanzhong Wang
Title: Golden Cudgel Network for Real-Time Semantic Segmentation
Abstract:
Recent real-time semantic segmentation models, whether single-branch or multi-branch, achieve good performance and speed. However, their speed is limited by multi-path blocks, and some depend on high-performance teacher models for training. To overcome these issues, we propose Golden Cudgel Network (GCNet). Specifically, GCNet uses vertical multi-convolutions and horizontal multi-paths for training, which are reparameterized into a single convolution for inference, optimizing both performance and speed. This design allows GCNet to self-enlarge during training and self-contract during inference, effectively becoming a "teacher model" without needing external ones. Experimental results show that GCNet outperforms existing state-of-the-art models in terms of performance and speed on the Cityscapes, CamVid, and Pascal VOC 2012 datasets. The code is available at https://github.com/gyyang23/GCNet.
Chinese: 金箍棒网络(GCNet)提出了一种创新架构,在训练时采用垂直多卷积和水平多路径设计,并在推理时重参数化为单一卷积,无需外部教师模型即可自我增强,在多个基准数据集上实现了卓越的性能和速度。
English: The Golden Cudgel Network (GCNet) introduces a novel architecture that utilizes vertical multi-convolutions and horizontal multi-paths during training, which are reparameterized into a single convolution for inference, enabling self-enhancement without external teacher models and achieving superior performance and speed on benchmark datasets.

Authors:Xi Zhu, Haochen Xue, Ziwei Zhao, Wujiang Xu, Jingyuan Huang, Minghao Guo, Qifan Wang, Kaixiong Zhou, Yongfeng Zhang
Title: LLM as GNN: Graph Vocabulary Learning for Text-Attributed Graph Foundation Models
Abstract:
Text-Attributed Graphs (TAGs), where each node is associated with text descriptions, are ubiquitous in real-world scenarios. They typically exhibit distinctive structure and domain-specific knowledge, motivating the development of a Graph Foundation Model (GFM) that generalizes across diverse graphs and tasks. Despite large efforts to integrate Large Language Models (LLMs) and Graph Neural Networks (GNNs) for TAGs, existing approaches suffer from decoupled architectures with two-stage alignment, limiting their synergistic potential. Even worse, existing methods assign out-of-vocabulary (OOV) tokens to graph nodes, leading to graph-specific semantics, token explosion, and incompatibility with task-oriented prompt templates, which hinders cross-graph and cross-task transferability. To address these challenges, we propose PromptGFM, a versatile GFM for TAGs grounded in graph vocabulary learning. PromptGFM comprises two key components: (1) Graph Understanding Module, which explicitly prompts LLMs to replicate the finest GNN workflow within the text space, facilitating seamless GNN-LLM integration and elegant graph-text alignment; (2) Graph Inference Module, which establishes a language-based graph vocabulary ensuring expressiveness, transferability, and scalability, enabling readable instructions for LLM fine-tuning. Extensive experiments demonstrate our superiority and transferability across diverse graphs and tasks. The code is available at this: https://github.com/agiresearch/PromptGFM.
中文: PromptGFM是一种基于图词汇学习的通用图基础模型,通过图理解模块和图推理模块无缝融合大语言模型与图神经网络,显著提升了跨图和跨任务的迁移性能。
English: PromptGFM is a versatile Graph Foundation Model that integrates Large Language Models and Graph Neural Networks through graph vocabulary learning to enhance cross-graph and cross-task transferability.

Authors:Jie He, Tao Wang, Deyi Xiong, Qun Liu
Title: The Box is in the Pen: Evaluating Commonsense Reasoning in Neural Machine Translation
Abstract:
Does neural machine translation yield translations that are congenial with common sense? In this paper, we present a test suite to evaluate the commonsense reasoning capability of neural machine translation. The test suite consists of three test sets, covering lexical and contextless/contextual syntactic ambiguity that requires commonsense knowledge to resolve. We manually create 1,200 triples, each of which contain a source sentence and two contrastive translations, involving 7 different common sense types. Language models pretrained on large-scale corpora, such as BERT, GPT-2, achieve a commonsense reasoning accuracy of lower than 72% on target translations of this test suite. We conduct extensive experiments on the test suite to evaluate commonsense reasoning in neural machine translation and investigate factors that have impact on this capability. Our experiments and analyses demonstrate that neural machine translation performs poorly on commonsense reasoning of the three ambiguity types in terms of both reasoning accuracy (60.1%) and reasoning consistency (31%). The built commonsense test suite is available at https://github.com/tjunlp-lab/CommonMT.
Chinese: 本文提出了一套评估神经机器翻译常识推理能力的测试集,结果显示其在三种歧义类型上的表现较差,准确率仅为60.1%,一致性为31%。
English: This paper introduces a test suite to assess neural machine translation's commonsense reasoning, revealing its poor performance with only 60.1% accuracy and 31% consistency across three ambiguity types.

Authors:Ji Zhao, Banglei Guan, Zibin Liu, Laurent Kneip
Title: Full-DoF Egomotion Estimation for Event Cameras Using Geometric Solvers
Abstract:
For event cameras, current sparse geometric solvers for egomotion estimation assume that the rotational displacements are known, such as those provided by an IMU. Thus, they can only recover the translational motion parameters. Recovering full-DoF motion parameters using a sparse geometric solver is a more challenging task, and has not yet been investigated. In this paper, we propose several solvers to estimate both rotational and translational velocities within a unified framework. Our method leverages event manifolds induced by line segments. The problem formulations are based on either an incidence relation for lines or a novel coplanarity relation for normal vectors. We demonstrate the possibility of recovering full-DoF egomotion parameters for both angular and linear velocities without requiring extra sensor measurements or motion priors. To achieve efficient optimization, we exploit the Adam framework with a first-order approximation of rotations for quick initialization. Experiments on both synthetic and real-world data demonstrate the effectiveness of our method. The code is available at https://github.com/jizhaox/relpose-event.
中文: 本文提出多种求解器,利用线段产生的事件流形,在统一框架下估计事件相机的旋转和平移速度,无需额外传感器或运动先验即可实现全自由度运动估计。
English: This paper introduces novel solvers that estimate both rotational and translational velocities for event cameras using event manifolds from line segments, enabling full-DoF motion recovery without external sensors or motion priors.

Authors:Li Lun, Kunyu Feng, Qinglong Ni, Ling Liang, Yuan Wang, Ying Li, Dunshan Yu, Xiaoxin Cui
Title: Towards Effective and Sparse Adversarial Attack on Spiking Neural Networks via Breaking Invisible Surrogate Gradients
Abstract:
Spiking neural networks (SNNs) have shown their competence in handling spatial-temporal event-based data with low energy consumption. Similar to conventional artificial neural networks (ANNs), SNNs are also vulnerable to gradient-based adversarial attacks, wherein gradients are calculated by spatial-temporal back-propagation (STBP) and surrogate gradients (SGs). However, the SGs may be invisible for an inference-only model as they do not influence the inference results, and current gradient-based attacks are ineffective for binary dynamic images captured by the dynamic vision sensor (DVS). While some approaches addressed the issue of invisible SGs through universal SGs, their SGs lack a correlation with the victim model, resulting in sub-optimal performance. Moreover, the imperceptibility of existing SNN-based binary attacks is still insufficient. In this paper, we introduce an innovative potential-dependent surrogate gradient (PDSG) method to establish a robust connection between the SG and the model, thereby enhancing the adaptability of adversarial attacks across various models with invisible SGs. Additionally, we propose the sparse dynamic attack (SDA) to effectively attack binary dynamic images. Utilizing a generation-reduction paradigm, SDA can fully optimize the sparsity of adversarial perturbations. Experimental results demonstrate that our PDSG and SDA outperform state-of-the-art SNN-based attacks across various models and datasets. Specifically, our PDSG achieves 100% attack success rate on ImageNet, and our SDA obtains 82% attack success rate by modifying only 0.24% of the pixels on CIFAR10DVS. The code is available at https://github.com/ryime/PDSG-SDA .
中文: 本文提出了一种基于电位依赖的替代梯度方法和稀疏动态攻击,通过解决不可见替代梯度问题并优化扰动稀疏性,显著提升了针对脉冲神经网络的对抗攻击效果,在多种模型和数据集上实现了最优性能。
English: This paper introduces a potential-dependent surrogate gradient (PDSG) method and sparse dynamic attack (SDA) to enhance adversarial attacks on spiking neural networks, achieving superior performance across models and datasets by addressing invisible surrogate gradients and optimizing perturbation sparsity.

Authors:Ping Chen, Xingpeng Zhang, Zhaoxiang Liu, Huan Hu, Xiang Liu, Kai Wang, Min Wang, Yanlin Qian, Shiguo Lian
Title: Optimizing for the Shortest Path in Denoising Diffusion Model
Abstract:
In this research, we propose a novel denoising diffusion model based on shortest-path modeling that optimizes residual propagation to enhance both denoising efficiency and quality. Drawing on Denoising Diffusion Implicit Models (DDIM) and insights from graph theory, our model, termed the Shortest Path Diffusion Model (ShortDF), treats the denoising process as a shortest-path problem aimed at minimizing reconstruction error. By optimizing the initial residuals, we improve the efficiency of the reverse diffusion process and the quality of the generated samples. Extensive experiments on multiple standard benchmarks demonstrate that ShortDF significantly reduces diffusion time (or steps) while enhancing the visual fidelity of generated samples compared to prior arts. This work, we suppose, paves the way for interactive diffusion-based applications and establishes a foundation for rapid data generation. Code is available at https://github.com/UnicomAI/ShortDF.
中文: 本研究提出了一种基于最短路径建模的ShortDF模型,通过优化残差传播提高了去噪效率和样本质量,显著减少了扩散步骤并增强了生成样本的视觉保真度。
English: This study introduces the Shortest Path Diffusion Model (ShortDF), which optimizes residual propagation to enhance denoising efficiency and quality, significantly reducing diffusion steps while improving sample fidelity.

Authors:Gangwei Xu, Jiaxin Liu, Xianqi Wang, Junda Cheng, Yong Deng, Jinliang Zang, Yurui Chen, Xin Yang
Title: BANet: Bilateral Aggregation Network for Mobile Stereo Matching
Abstract:
State-of-the-art stereo matching methods typically use costly 3D convolutions to aggregate a full cost volume, but their computational demands make mobile deployment challenging. Directly applying 2D convolutions for cost aggregation often results in edge blurring, detail loss, and mismatches in textureless regions. Some complex operations, like deformable convolutions and iterative warping, can partially alleviate this issue; however, they are not mobile-friendly, limiting their deployment on mobile devices. In this paper, we present a novel bilateral aggregation network (BANet) for mobile stereo matching that produces high-quality results with sharp edges and fine details using only 2D convolutions. Specifically, we first separate the full cost volume into detailed and smooth volumes using a spatial attention map, then perform detailed and smooth aggregations accordingly, ultimately fusing both to obtain the final disparity map. Experimental results demonstrate that our BANet-2D significantly outperforms other mobile-friendly methods, achieving 35.3\% higher accuracy on the KITTI 2015 leaderboard than MobileStereoNet-2D, with faster runtime on mobile devices. Code: \textcolor{magenta}{https://github.com/gangweix/BANet}.
中文: 本文提出BANet,一种仅使用二维卷积的移动端立体匹配网络,通过分别聚合细节和平滑代价体积实现清晰边缘与精细细节,在KITTI 2015数据集上以35.3%更高精度超越其他方法,且在移动设备上运行更快。
English: This paper introduces BANet, a mobile-friendly stereo matching network that uses only 2D convolutions to achieve sharp edges and fine details by separately aggregating detailed and smooth cost volumes, outperforming other methods with 35.3% higher accuracy on KITTI 2015 while running faster on mobile devices.

Authors:Gangwei Xu, Haotong Lin, Zhaoxing Zhang, Hongcheng Luo, Haiyang Sun, Xin Yang
Title: BAT: Learning Event-based Optical Flow with Bidirectional Adaptive Temporal Correlation
Abstract:
Event cameras deliver visual information characterized by a high dynamic range and high temporal resolution, offering significant advantages in estimating optical flow for complex lighting conditions and fast-moving objects. Current advanced optical flow methods for event cameras largely adopt established image-based frameworks. However, the spatial sparsity of event data limits their performance. In this paper, we present BAT, an innovative framework that estimates event-based optical flow using bidirectional adaptive temporal correlation. BAT includes three novel designs: 1) a bidirectional temporal correlation that transforms bidirectional temporally dense motion cues into spatially dense ones, enabling accurate and spatially dense optical flow estimation; 2) an adaptive temporal sampling strategy for maintaining temporal consistency in correlation; 3) spatially adaptive temporal motion aggregation to efficiently and adaptively aggregate consistent target motion features into adjacent motion features while suppressing inconsistent ones. Our results rank $1^{st}$ on the DSEC-Flow benchmark, outperforming existing state-of-the-art methods by a large margin while also exhibiting sharp edges and high-quality details. Notably, our BAT can accurately predict future optical flow using only past events, significantly outperforming E-RAFT's warm-start approach. Code: \textcolor{magenta}{https://github.com/gangweiX/BAT}.
中文摘要:BAT框架通过双向自适应时间关联技术解决了事件相机数据空间稀疏性的限制,在实现最先进光流估计精度的同时,仅使用过去事件即可准确预测未来光流,在DSEC-Flow基准测试中大幅领先现有方法。
English Summary: The BAT framework introduces bidirectional adaptive temporal correlation to overcome the spatial sparsity limitations of event cameras, achieving state-of-the-art optical flow estimation with superior accuracy and detail while enabling future flow prediction using only past events.

Authors:Jingzhou Luo, Yang Liu, Weixing Chen, Zhen Li, Yaowei Wang, Guanbin Li, Liang Lin
Title: DSPNet: Dual-vision Scene Perception for Robust 3D Question Answering
Abstract:
3D Question Answering (3D QA) requires the model to comprehensively understand its situated 3D scene described by the text, then reason about its surrounding environment and answer a question under that situation. However, existing methods usually rely on global scene perception from pure 3D point clouds and overlook the importance of rich local texture details from multi-view images. Moreover, due to the inherent noise in camera poses and complex occlusions, there exists significant feature degradation and reduced feature robustness problems when aligning 3D point cloud with multi-view images. In this paper, we propose a Dual-vision Scene Perception Network (DSPNet), to comprehensively integrate multi-view and point cloud features to improve robustness in 3D QA. Our Text-guided Multi-view Fusion (TGMF) module prioritizes image views that closely match the semantic content of the text. To adaptively fuse back-projected multi-view images with point cloud features, we design the Adaptive Dual-vision Perception (ADVP) module, enhancing 3D scene comprehension. Additionally, our Multimodal Context-guided Reasoning (MCGR) module facilitates robust reasoning by integrating contextual information across visual and linguistic modalities. Experimental results on SQA3D and ScanQA datasets demonstrate the superiority of our DSPNet. Codes will be available at https://github.com/LZ-CH/DSPNet.
中文: 本文提出的DSPNet通过文本引导的多视角融合和自适应双视觉感知模块,有效整合点云与多视角图像特征,在3D问答任务中实现了更鲁棒的场景理解与推理能力。
English: The proposed DSPNet enhances 3D Question Answering by integrating multi-view and point cloud features through modules that prioritize text-relevant views and adaptively fuse visual data, achieving superior performance on benchmark datasets.

Authors:Javier Yong, Haokai Ma, Yunshan Ma, Anis Yusof, Zhenkai Liang, Ee-Chien Chang
Title: AttackSeqBench: Benchmarking Large Language Models' Understanding of Sequential Patterns in Cyber Attacks
Abstract:
The observations documented in Cyber Threat Intelligence (CTI) reports play a critical role in describing adversarial behaviors, providing valuable insights for security practitioners to respond to evolving threats. Recent advancements of Large Language Models (LLMs) have demonstrated significant potential in various cybersecurity applications, including CTI report understanding and attack knowledge graph construction. While previous works have proposed benchmarks that focus on the CTI extraction ability of LLMs, the sequential characteristic of adversarial behaviors within CTI reports remains largely unexplored, which holds considerable significance in developing a comprehensive understanding of how adversaries operate. To address this gap, we introduce AttackSeqBench, a benchmark tailored to systematically evaluate LLMs' capability to understand and reason attack sequences in CTI reports. Our benchmark encompasses three distinct Question Answering (QA) tasks, each task focuses on the varying granularity in adversarial behavior. To alleviate the laborious effort of QA construction, we carefully design an automated dataset construction pipeline to create scalable and well-formulated QA datasets based on real-world CTI reports. To ensure the quality of our dataset, we adopt a hybrid approach of combining human evaluation and systematic evaluation metrics. We conduct extensive experiments and analysis with both fast-thinking and slow-thinking LLMs, while highlighting their strengths and limitations in analyzing the sequential patterns in cyber attacks. The overarching goal of this work is to provide a benchmark that advances LLM-driven CTI report understanding and fosters its application in real-world cybersecurity operations. Our dataset and code are available at https://github.com/Javiery3889/AttackSeqBench .
中文摘要:AttackSeqBench是一个专为评估大语言模型在网络威胁情报报告中理解和推理攻击序列能力而设计的新基准,旨在推动其在真实网络安全运营中的实际应用。
English Summary: AttackSeqBench is a new benchmark designed to evaluate how well large language models understand and reason about sequential attack behaviors in cyber threat intelligence reports, aiming to enhance their practical application in cybersecurity.

Authors:Alexander Kolpakov, Igor Rivin
Title: Dimensionality reduction for homological stability and global structure preservation
Abstract:
We propose a new dimensionality reduction toolkit designed to address some of the challenges faced by traditional methods like UMAP and tSNE such as loss of global structure and computational efficiency. Built on the JAX framework, DiRe leverages modern hardware acceleration to provide an efficient, scalable, and interpretable solution for visualizing complex data structures, and for quantitative analysis of lower-dimensional embeddings. The toolkit shows considerable promise in preserving both local and global structures within the data as compared to state-of-the-art UMAP and tSNE implementations. This makes it suitable for a wide range of applications in machine learning, bio-informatics, and data science.
中文: DiRe是一种基于JAX的新型降维工具包,能高效、可扩展且可解释地实现数据可视化和分析,在保留数据局部与全局结构方面优于UMAP和tSNE。
English: DiRe is a new JAX-based dimensionality reduction toolkit that offers efficient, scalable, and interpretable solutions for data visualization and analysis, outperforming UMAP and tSNE in preserving both local and global data structures.

Authors:Shiyuan Zhou, Bingxuan Li, Xiyuan Chen, Zhi Tu, Yifeng Wang, Yiwen Xiang, Tianyi Zhang
Title: HEPHA: A Mixed-Initiative Image Labeling Tool for Specialized Domains
Abstract:
Image labeling is an important task for training computer vision models. In specialized domains, such as healthcare, it is expensive and challenging to recruit specialists for image labeling. We propose HEPHA, a mixed-initiative image labeling tool that elicits human expertise via inductive logic learning to infer and refine labeling rules. Each rule comprises visual predicates that describe the image. HEPHA enables users to iteratively refine the rules by either direct manipulation through a visual programming interface or by labeling more images. To facilitate rule refinement, HEPHA recommends which rule to edit and which predicate to update. For users unfamiliar with visual programming, HEPHA suggests diverse and informative images to users for further labeling. We conducted a within-subjects user study with 16 participants and compared HEPHA with a variant of HEPHA and a deep learning-based approach. We found that HEPHA outperforms the two baselines in both specialized-domain and general-domain image labeling tasks. Our code is available at https://github.com/Neural-Symbolic-Image-Labeling/NSILWeb.
Chinese: HEPHA是一种混合主动的图像标注工具,通过归纳逻辑学习推断和优化标注规则,用户可通过可视化编程或标注更多图像迭代改进规则,在专业和通用领域的图像标注任务中均优于基线方法。
English: HEPHA is a mixed-initiative image labeling tool that uses inductive logic learning to infer and refine labeling rules, enabling users to iteratively improve them through visual programming or additional image labeling, and it outperforms baseline methods in both specialized and general domains.

Authors:Xihan Qin, Li Liao
Title: Graph Transformer with Disease Subgraph Positional Encoding for Improved Comorbidity Prediction
Abstract:
Comorbidity, the co-occurrence of multiple medical conditions in a single patient, profoundly impacts disease management and outcomes. Understanding these complex interconnections is crucial, especially in contexts where comorbidities exacerbate outcomes. Leveraging insights from the human interactome (HI) and advancements in graph-based methodologies, this study introduces Transformer with Subgraph Positional Encoding (TSPE) for disease comorbidity prediction. Inspired by Biologically Supervised Embedding (BSE), TSPE employs Transformer's attention mechanisms and Subgraph Positional Encoding (SPE) to capture interactions between nodes and disease associations. Our proposed SPE proves more effective than LPE, as used in Dwivedi et al.'s Graph Transformer, underscoring the importance of integrating clustering and disease-specific information for improved predictive accuracy. Evaluated on real clinical benchmark datasets (RR0 and RR1), TSPE demonstrates substantial performance enhancements over the state-of-the-art method, achieving up to 28.24% higher ROC AUC and 4.93% higher accuracy. This method shows promise for adaptation to other complex graph-based tasks and applications. The source code is available in the GitHub repository at: https://github.com/xihan-qin/TSPE-GraphTransformer.
中文: 本研究提出TSPE图变换器模型,通过子图位置编码和生物监督嵌入机制,在疾病共病预测中实现了显著性能提升,准确率提高达4.93%,ROC AUC提升28.24%。
English: This study introduces TSPE, a graph-based Transformer model with subgraph positional encoding that significantly improves disease comorbidity prediction accuracy by capturing complex disease interactions through biological insights.

Authors:Gabriele Sarti, Vilém Zouhar, Grzegorz Chrupała, Ana Guerberof-Arenas, Malvina Nissim, Arianna Bisazza
Title: QE4PE: Word-level Quality Estimation for Human Post-Editing
Abstract:
Word-level quality estimation (QE) methods aim to detect erroneous spans in machine translations, which can direct and facilitate human post-editing. While the accuracy of word-level QE systems has been assessed extensively, their usability and downstream influence on the speed, quality and editing choices of human post-editing remain understudied. In this study, we investigate the impact of word-level QE on machine translation (MT) post-editing in a realistic setting involving 42 professional post-editors across two translation directions. We compare four error-span highlight modalities, including supervised and uncertainty-based word-level QE methods, for identifying potential errors in the outputs of a state-of-the-art neural MT model. Post-editing effort and productivity are estimated from behavioral logs, while quality improvements are assessed by word- and segment-level human annotation. We find that domain, language and editors' speed are critical factors in determining highlights' effectiveness, with modest differences between human-made and automated QE highlights underlining a gap between accuracy and usability in professional workflows.
中文: 词级质量评估旨在识别机器翻译中的错误以辅助人工后编辑,但其对编辑效率和质量的实用影响尚待深入研究,其中领域和编辑速度等因素比高亮来源更能决定其有效性。
English: Word-level quality estimation aids in identifying machine translation errors to assist human post-editing, yet its practical impact on editing efficiency and quality remains underexplored, with factors like domain and editor speed influencing highlight effectiveness more than the source of the highlights themselves.

Authors:Yizhe Zhang, Navdeep Jaitly
Title: SAGE: Steering Dialog Generation with Future-Aware State-Action Augmentation
Abstract:
Recent advances in large language models have demonstrated impressive capabilities in task-oriented applications, yet building emotionally intelligent chatbots that can engage in natural, strategic conversations remains a challenge. We present a novel approach called SAGE that uses latent variables to control long-horizon behavior in dialogue generation. At the core of our method is the State-Action Chain (SAC), which augments standard language model fine-tuning by introducing latent variables that encapsulate emotional states and conversational strategies between dialogue turns. During inference, these variables are generated before each response, enabling coarse-grained control over dialogue progression while maintaining natural interaction patterns. We also introduce a self-improvement pipeline that leverages dialogue tree search, LLM-based reward modeling, and targeted fine-tuning to optimize conversational trajectories. Our experimental results show that models trained with this approach demonstrate improved performance in emotional intelligence metrics while maintaining strong capabilities on LLM benchmarks. The discrete nature of our latent variables facilitates search-based strategies and provides a foundation for future applications of reinforcement learning to dialogue systems, where learning can occur at the state level rather than the token level. https://github.com/apple/ml-sage-dialog-gen
中文摘要:SAGE框架通过引入潜在变量来控制对话生成中的情感状态和会话策略,在保持自然交互和语言模型基准性能的同时,显著提升了聊天机器人的情感智能表现。
English Summary: The SAGE framework introduces latent variables to control emotional states and conversational strategies in dialogue generation, enabling improved emotional intelligence in chatbots while maintaining natural interaction and strong performance on language model benchmarks.

Authors:Siqi Ouyang, Xi Xu, Lei Li
Title: InfiniSST: Simultaneous Translation of Unbounded Speech with Large Language Model
Abstract:
Simultaneous translation of unbounded streaming speech remains a challenging problem due to the need for effectively processing the history speech context and past translations so that quality and latency, including computation overhead, can be balanced. Most prior works assume pre-segmented speech, limiting their real-world applicability. In this paper, we propose InfiniSST, a novel approach that formulates SST as a multi-turn dialogue task, enabling seamless translation of unbounded speech. We construct translation trajectories and robust segments from MuST-C with multi-latency augmentation during training and develop a key-value (KV) cache management strategy to facilitate efficient inference. Experiments on MuST-C En-Es, En-De, and En-Zh demonstrate that InfiniSST reduces computation-aware latency by 0.5 to 1 second while maintaining the same translation quality compared to baselines. Ablation studies further validate the contributions of our data construction and cache management strategy. We release the code and demo at https://github.com/LeiLiLab/InfiniSST
中文:InfiniSST提出了一种新颖的多轮对话方法用于无边界流式语音同传,通过优化的数据构建和缓存管理策略,在保持翻译质量的同时将延迟降低了0.5-1秒。
English: InfiniSST introduces a novel multi-turn dialogue approach for simultaneous streaming speech translation, reducing latency by 0.5-1 seconds while maintaining quality through optimized data construction and cache management.

Authors:Danqing Zhang, Balaji Rama, Jingyi Ni, Shiying He, Fu Zhao, Kunyu Chen, Arnold Chen, Junyu Cao
Title: LiteWebAgent: The Open-Source Suite for VLM-Based Web-Agent Applications
Abstract:
We introduce LiteWebAgent, an open-source suite for VLM-based web agent applications. Our framework addresses a critical gap in the web agent ecosystem with a production-ready solution that combines minimal serverless backend configuration, intuitive user and browser interfaces, and extensible research capabilities in agent planning, memory, and tree search. For the core LiteWebAgent agent framework, we implemented a simple yet effective baseline using recursive function calling, providing with decoupled action generation and action grounding. In addition, we integrate advanced research components such as agent planning, agent workflow memory, and tree search in a modular and extensible manner. We then integrate the LiteWebAgent agent framework with frontend and backend as deployed systems in two formats: (1) a production Vercel-based web application, which provides users with an agent-controlled remote browser, (2) a Chrome extension leveraging LiteWebAgent's API to control an existing Chrome browser via CDP (Chrome DevTools Protocol). The LiteWebAgent framework is available at https://github.com/PathOnAI/LiteWebAgent, with deployed frontend at https://lite-web-agent.vercel.app/.
中文:LiteWebAgent是一个基于VLM的开源网页代理框架,提供生产就绪的解决方案,具备简化的服务器配置、直观的用户界面以及可扩展的研究功能,如代理规划和树搜索。
English: LiteWebAgent is an open-source framework for VLM-based web agents that offers a production-ready solution with minimal serversetups, user-friendly interfaces, and extensible research features like planning and tree search.

Authors:Yue Meng, Chuchu fan
Title: Diverse Controllable Diffusion Policy with Signal Temporal Logic
Abstract:
Generating realistic simulations is critical for autonomous system applications such as self-driving and human-robot interactions. However, driving simulators nowadays still have difficulty in generating controllable, diverse, and rule-compliant behaviors for road participants: Rule-based models cannot produce diverse behaviors and require careful tuning, whereas learning-based methods imitate the policy from data but are not designed to follow the rules explicitly. Besides, the real-world datasets are by nature "single-outcome", making the learning method hard to generate diverse behaviors. In this paper, we leverage Signal Temporal Logic (STL) and Diffusion Models to learn controllable, diverse, and rule-aware policy. We first calibrate the STL on the real-world data, then generate diverse synthetic data using trajectory optimization, and finally learn the rectified diffusion policy on the augmented dataset. We test on the NuScenes dataset and our approach can achieve the most diverse rule-compliant trajectories compared to other baselines, with a runtime 1/17X to the second-best approach. In the closed-loop testing, our approach reaches the highest diversity, rule satisfaction rate, and the least collision rate. Our method can generate varied characteristics conditional on different STL parameters in testing. A case study on human-robot encounter scenarios shows our approach can generate diverse and closed-to-oracle trajectories. The annotation tool, augmented dataset, and code are available at https://github.com/mengyuest/pSTL-diffusion-policy.
中文: 本文提出了一种结合信号时序逻辑和扩散模型的方法,用于生成可控、多样且符合规则的自动驾驶行为,在多样性、规则遵循和防撞性能上均优于现有基准方法。
English: This paper introduces a method combining Signal Temporal Logic and Diffusion Models to generate controllable, diverse, and rule-compliant behaviors for autonomous systems, achieving superior performance in diversity, rule adherence, and collision avoidance compared to existing approaches.

Authors:Wenqi Guo, Yiyang Du, Shan Du
Title: LangGas: Introducing Language in Selective Zero-Shot Background Subtraction for Semi-Transparent Gas Leak Detection with a New Dataset
Abstract:
Gas leakage poses a significant hazard that requires prevention. Traditionally, human inspection has been used for detection, a slow and labour-intensive process. Recent research has applied machine learning techniques to this problem, yet there remains a shortage of high-quality, publicly available datasets. This paper introduces a synthetic dataset, SimGas, featuring diverse backgrounds, interfering foreground objects, diverse leak locations, and precise segmentation ground truth. We propose a zero-shot method that combines background subtraction, zero-shot object detection, filtering, and segmentation to leverage this dataset. Experimental results indicate that our approach significantly outperforms baseline methods based solely on background subtraction and zero-shot object detection with segmentation, reaching an IoU of 69%. We also present an analysis of various prompt configurations and threshold settings to provide deeper insights into the performance of our method. Finally, we qualitatively (because of the lack of ground truth) tested our performance on GasVid and reached decent results on the real-world dataset. The dataset, code, and full qualitative results are available at https://github.com/weathon/Lang-Gas.
中文: 本文提出了用于气体泄漏检测的合成数据集SimGas和一种零样本方法,该方法以69%的交并比显著优于基线方法,并在真实数据上取得了良好效果。
English: This paper introduces SimGas, a synthetic dataset for gas leak detection, and proposes a zero-shot method that outperforms baseline approaches with a 69% IoU, also demonstrating decent performance on real-world data.

Authors:Saurabh Koju, Saurav Bastola, Prashant Shrestha, Sanskar Amgain, Yash Raj Shrestha, Rudra P. K. Poudel, Binod Bhattarai
Title: Surgical Vision World Model
Abstract:
Realistic and interactive surgical simulation has the potential to facilitate crucial applications, such as medical professional training and autonomous surgical agent training. In the natural visual domain, world models have enabled action-controlled data generation, demonstrating the potential to train autonomous agents in interactive simulated environments when large-scale real data acquisition is infeasible. However, such works in the surgical domain have been limited to simplified computer simulations, and lack realism. Furthermore, existing literature in world models has predominantly dealt with action-labeled data, limiting their applicability to real-world surgical data, where obtaining action annotation is prohibitively expensive. Inspired by the recent success of Genie in leveraging unlabeled video game data to infer latent actions and enable action-controlled data generation, we propose the first surgical vision world model. The proposed model can generate action-controllable surgical data and the architecture design is verified with extensive experiments on the unlabeled SurgToolLoc-2022 dataset. Codes and implementation details are available at https://github.com/bhattarailab/Surgical-Vision-World-Model
Chinese: 本文提出了首个手术视觉世界模型,通过利用未标记视频数据生成可控制动作的手术数据,并在SurgToolLoc-2022数据集上进行了验证。
English: This paper introduces the first surgical vision world model, which generates action-controllable surgical data by leveraging unlabeled video data and has been validated on the SurgToolLoc-2022 dataset.

Authors:Saurabh Koju, Saurav Bastola, Prashant Shrestha, Sanskar Amgain, Yash Raj Shrestha, Rudra P. K. Poudel, Binod Bhattarai
Title: Surgical Vision World Model
Abstract:
Realistic and interactive surgical simulation has the potential to facilitate crucial applications, such as medical professional training and autonomous surgical agent training. In the natural visual domain, world models have enabled action-controlled data generation, demonstrating the potential to train autonomous agents in interactive simulated environments when large-scale real data acquisition is infeasible. However, such works in the surgical domain have been limited to simplified computer simulations, and lack realism. Furthermore, existing literature in world models has predominantly dealt with action-labeled data, limiting their applicability to real-world surgical data, where obtaining action annotation is prohibitively expensive. Inspired by the recent success of Genie in leveraging unlabeled video game data to infer latent actions and enable action-controlled data generation, we propose the first surgical vision world model. The proposed model can generate action-controllable surgical data and the architecture design is verified with extensive experiments on the unlabeled SurgToolLoc-2022 dataset. Codes and implementation details are available at https://github.com/bhattarailab/Surgical-Vision-World-Model
Chinese: 本文提出了首个手术视觉世界模型,通过利用未标记视频数据生成可控制动作的手术数据,并在SurgToolLoc-2022数据集上进行了验证。
English: This paper introduces the first surgical vision world model, which generates action-controllable surgical data by leveraging unlabeled video data and has been validated on the SurgToolLoc-2022 dataset.

Authors:Yinzhou Tang, Jinghua Piao, Huandong Wang, Shaw Rajib, Yong Li
Title: Predicting Cascade Failures in Interdependent Urban Infrastructure Networks
Abstract:
Cascading failures (CF) entail component breakdowns spreading through infrastructure networks, causing system-wide collapse. Predicting CFs is of great importance for infrastructure stability and urban function. Despite extensive research on CFs in single networks such as electricity and road networks, interdependencies among diverse infrastructures remain overlooked, and capturing intra-infrastructure CF dynamics amid complex evolutions poses challenges. To address these gaps, we introduce the \textbf{I}ntegrated \textbf{I}nterdependent \textbf{I}nfrastructure CF model ($I^3$), designed to capture CF dynamics both within and across infrastructures. $I^3$ employs a dual GAE with global pooling for intra-infrastructure dynamics and a heterogeneous graph for inter-infrastructure interactions. An initial node enhancement pre-training strategy mitigates GCN-induced over-smoothing. Experiments demonstrate $I^3$ achieves a 31.94\% in terms of AUC, 18.03\% in terms of Precision, 29.17\% in terms of Recall, 22.73\% in terms of F1-score boost in predicting infrastructure failures, and a 28.52\% reduction in terms of RMSE for cascade volume forecasts compared to leading models. It accurately pinpoints phase transitions in interconnected and singular networks, rectifying biases in models tailored for singular networks. Access the code at https://github.com/tsinghua-fib-lab/Icube.
中文:提出的I³模型能有效预测相互依赖基础设施网络内外的级联故障,相比现有模型在故障预测准确性和级联规模预测方面均实现显著性能提升。
English: The proposed I³ model effectively predicts cascading failures within and across interdependent infrastructure networks, achieving significant performance improvements in failure prediction accuracy and cascade volume forecasting compared to existing models.

Authors:Qinyu Zhao, Stephen Gould, Liang Zheng
Title: ARINAR: Bi-Level Autoregressive Feature-by-Feature Generative Models
Abstract:
Existing autoregressive (AR) image generative models use a token-by-token generation schema. That is, they predict a per-token probability distribution and sample the next token from that distribution. The main challenge is how to model the complex distribution of high-dimensional tokens. Previous methods either are too simplistic to fit the distribution or result in slow generation speed. Instead of fitting the distribution of the whole tokens, we explore using a AR model to generate each token in a feature-by-feature way, i.e., taking the generated features as input and generating the next feature. Based on that, we propose ARINAR (AR-in-AR), a bi-level AR model. The outer AR layer take previous tokens as input, predicts a condition vector z for the next token. The inner layer, conditional on z, generates features of the next token autoregressively. In this way, the inner layer only needs to model the distribution of a single feature, for example, using a simple Gaussian Mixture Model. On the ImageNet 256x256 image generation task, ARINAR-B with 213M parameters achieves an FID of 2.75, which is comparable to the state-of-the-art MAR-B model (FID=2.31), while five times faster than the latter.
中文摘要:提出的ARINAR模型采用双层自回归方法,先预测下一标记的条件向量,再逐步生成标记特征,在保持与先进模型相当图像生成质量的同时,大幅提升了生成速度。
English Summary: The proposed ARINAR model introduces a bi-level autoregressive approach that first predicts a condition vector for the next token and then generates token features sequentially, achieving comparable image generation quality to state-of-the-art models while significantly improving speed.

Authors:Siming Huang, Yuliang Xu, Mingmeng Geng, Yao Wan, Dongping Chen
Title: Wikipedia in the Era of LLMs: Evolution and Risks
Abstract:
In this paper, we present a thorough analysis of the impact of Large Language Models (LLMs) on Wikipedia, examining the evolution of Wikipedia through existing data and using simulations to explore potential risks. We begin by analyzing page views and article content to study Wikipedia's recent changes and assess the impact of LLMs. Subsequently, we evaluate how LLMs affect various Natural Language Processing (NLP) tasks related to Wikipedia, including machine translation and retrieval-augmented generation (RAG). Our findings and simulation results reveal that Wikipedia articles have been influenced by LLMs, with an impact of approximately 1%-2% in certain categories. If the machine translation benchmark based on Wikipedia is influenced by LLMs, the scores of the models may become inflated, and the comparative results among models might shift as well. Moreover, the effectiveness of RAG might decrease if the knowledge base becomes polluted by LLM-generated content. While LLMs have not yet fully changed Wikipedia's language and knowledge structures, we believe that our empirical findings signal the need for careful consideration of potential future risks.
本研究分析了大语言模型对维基百科的影响,显示某些类别受到约1%-2%的影响,并指出因内容污染可能导致基准测试分数虚高及检索增强生成效果下降等风险。
This study analyzes how Large Language Models are influencing Wikipedia, showing a 1%-2% impact in some areas and highlighting risks like inflated benchmark scores and reduced effectiveness of retrieval-augmented generation due to potential content pollution.

Authors:Dmitry Nechaev, Alexey Pchelnikov, Ekaterina Ivanova
Title: SPIDER: A Comprehensive Multi-Organ Supervised Pathology Dataset and Baseline Models
Abstract:
Advancing AI in computational pathology requires large, high-quality, and diverse datasets, yet existing public datasets are often limited in organ diversity, class coverage, or annotation quality. To bridge this gap, we introduce SPIDER (Supervised Pathology Image-DEscription Repository), the largest publicly available patch-level dataset covering multiple organ types, including Skin, Colorectal, Thorax, and Breast with comprehensive class coverage for each organ. SPIDER provides high-quality annotations verified by expert pathologists and includes surrounding context patches, which enhance classification performance by providing spatial context. Alongside the dataset, we present baseline models trained on SPIDER using the Hibou-L foundation model as a feature extractor combined with an attention-based classification head. The models achieve state-of-the-art performance across multiple tissue categories and serve as strong benchmarks for future digital pathology research. Beyond patch classification, the model enables rapid identification of significant areas, quantitative tissue metrics, and establishes a foundation for multimodal approaches. Both the dataset and trained models are publicly available to advance research, reproducibility, and AI-driven pathology development. Access them at: https://github.com/HistAI/SPIDER
中文摘要:SPIDER作为最大的公开病理图像数据集,提供多器官覆盖和专家验证标注,其配套的基准模型实现了最先进的性能,为计算病理学AI研究设立了新标准并支持多维组织分析。
English Summary: SPIDER is introduced as the largest public patch-level pathology dataset with expert-verified annotations and multi-organ coverage, accompanied by state-of-the-art baseline models that enable advanced tissue analysis and set new benchmarks for AI in computational pathology.

Authors:Nuria Alina Chandra, Ryan Murtfeldt, Lin Qiu, Arnab Karmakar, Hannah Lee, Emmanuel Tanumihardja, Kevin Farhat, Ben Caffee, Sejin Paik, Changyeon Lee, Jongwook Choi, Aerin Kim, Oren Etzioni
Title: Deepfake-Eval-2024: A Multi-Modal In-the-Wild Benchmark of Deepfakes Circulated in 2024
Abstract:
In the age of increasingly realistic generative AI, robust deepfake detection is essential for mitigating fraud and disinformation. While many deepfake detectors report high accuracy on academic datasets, we show that these academic benchmarks are out of date and not representative of real-world deepfakes. We introduce Deepfake-Eval-2024, a new deepfake detection benchmark consisting of in-the-wild deepfakes collected from social media and deepfake detection platform users in 2024. Deepfake-Eval-2024 consists of 45 hours of videos, 56.5 hours of audio, and 1,975 images, encompassing the latest manipulation technologies. The benchmark contains diverse media content from 88 different websites in 52 different languages. We find that the performance of open-source state-of-the-art deepfake detection models drops precipitously when evaluated on Deepfake-Eval-2024, with AUC decreasing by 50% for video, 48% for audio, and 45% for image models compared to previous benchmarks. We also evaluate commercial deepfake detection models and models finetuned on Deepfake-Eval-2024, and find that they have superior performance to off-the-shelf open-source models, but do not yet reach the accuracy of deepfake forensic analysts. The dataset is available at https://github.com/nuriachandra/Deepfake-Eval-2024.
中文: Deepfake-Eval-2024基准测试表明,现有深度伪造检测模型在真实场景中表现不佳,相比过时的学术数据集性能下降高达50%,尽管商业模型和微调模型相比开源方案有所改进。
English: The Deepfake-Eval-2024 benchmark reveals that current deepfake detection models perform poorly on real-world content, with performance dropping up to 50% compared to outdated academic datasets, though commercial and fine-tuned models show some improvement over open-source alternatives.

Authors:Belinda Z. Li, Zifan Carl Guo, Jacob Andreas
Title: (How) Do Language Models Track State?
Abstract:
Transformer language models (LMs) exhibit behaviors -- from storytelling to code generation -- that appear to require tracking the unobserved state of an evolving world. How do they do so? We study state tracking in LMs trained or fine-tuned to compose permutations (i.e., to compute the order of a set of objects after a sequence of swaps). Despite the simple algebraic structure of this problem, many other tasks (e.g., simulation of finite automata and evaluation of boolean expressions) can be reduced to permutation composition, making it a natural model for state tracking in general. We show that LMs consistently learn one of two state tracking mechanisms for this task. The first closely resembles the "associative scan" construction used in recent theoretical work by Liu et al. (2023) and Merrill et al. (2024). The second uses an easy-to-compute feature (permutation parity) to partially prune the space of outputs, then refines this with an associative scan. The two mechanisms exhibit markedly different robustness properties, and we show how to steer LMs toward one or the other with intermediate training tasks that encourage or suppress the heuristics. Our results demonstrate that transformer LMs, whether pretrained or fine-tuned, can learn to implement efficient and interpretable state tracking mechanisms, and the emergence of these mechanisms can be predicted and controlled.
中文: 变换器语言模型能够学习可解释的状态跟踪机制,如关联扫描和启发式剪枝,用于排列组合等任务,且这些机制的出现可通过训练进行预测和控制。
English: Transformer language models learn interpretable state tracking mechanisms, such as associative scans and heuristic-based pruning, for tasks like permutation composition, and their emergence can be predicted and controlled through training.

Authors:Zicong He, Boxuan Zhang, Lu Cheng
Title: Shakespearean Sparks: The Dance of Hallucination and Creativity in LLMs' Decoding Layers
Abstract:
Large language models (LLMs) are known to hallucinate, a phenomenon often linked to creativity. While previous research has primarily explored this connection through theoretical or qualitative lenses, our work takes a quantitative approach to systematically examine the relationship between hallucination and creativity in LLMs. Given the complex nature of creativity, we propose a narrow definition tailored to LLMs and introduce an evaluation framework, HCL, which quantifies Hallucination and Creativity across different Layers of LLMs during decoding. Our empirical analysis reveals a tradeoff between hallucination and creativity that is consistent across layer depth, model type, and model size. Notably, across different model architectures, we identify a specific layer at each model size that optimally balances this tradeoff. Additionally, the optimal layer tends to appear in the early layers of larger models, and the confidence of the model is also significantly higher at this layer. These findings provide a quantitative perspective that offers new insights into the interplay between LLM creativity and hallucination. The code and data for our experiments are available at https://github.com/ZicongHe2002/HCL-Spark.
中文摘要:本研究通过HCL评估框架对大型语言模型的幻觉与创造力进行定量分析,发现在不同架构和规模的模型中存在特定层能最优平衡二者关系,且较大模型的早期层表现更佳。
English Summary: This study quantitatively analyzes the tradeoff between hallucination and creativity in large language models using the HCL evaluation framework, identifying optimal model layers that balance this relationship across different architectures and sizes.

Authors:Yuzhe Gu, Wenwei Zhang, Chengqi Lyu, Dahua Lin, Kai Chen
Title: Mask-DPO: Generalizable Fine-grained Factuality Alignment of LLMs
Abstract:
Large language models (LLMs) exhibit hallucinations (i.e., unfaithful or nonsensical information) when serving as AI assistants in various domains. Since hallucinations always come with truthful content in the LLM responses, previous factuality alignment methods that conduct response-level preference learning inevitably introduced noises during training. Therefore, this paper proposes a fine-grained factuality alignment method based on Direct Preference Optimization (DPO), called Mask-DPO. Incorporating sentence-level factuality as mask signals, Mask-DPO only learns from factually correct sentences in the preferred samples and prevents the penalty on factual contents in the not preferred samples, which resolves the ambiguity in the preference learning. Extensive experimental results demonstrate that Mask-DPO can significantly improve the factuality of LLMs responses to questions from both in-domain and out-of-domain datasets, although these questions and their corresponding topics are unseen during training. Only trained on the ANAH train set, the score of Llama3.1-8B-Instruct on the ANAH test set is improved from 49.19% to 77.53%, even surpassing the score of Llama3.1-70B-Instruct (53.44%), while its FactScore on the out-of-domain Biography dataset is also improved from 30.29% to 39.39%. We further study the generalization property of Mask-DPO using different training sample scaling strategies and find that scaling the number of topics in the dataset is more effective than the number of questions. We provide a hypothesis of what factual alignment is doing with LLMs, on the implication of this phenomenon, and conduct proof-of-concept experiments to verify it. We hope the method and the findings pave the way for future research on scaling factuality alignment.
中文: 本文提出了一种名为Mask-DPO的细粒度事实对齐方法,通过在训练中专注于句子层面的真实性,有效提升大语言模型回答的准确性,显著改善了其在领域内和领域外数据集上的表现。
English: This paper introduces Mask-DPO, a fine-grained factuality alignment method that enhances the accuracy of large language models by focusing on sentence-level factual correctness during training, significantly improving performance on both in-domain and out-of-domain datasets.

Authors:Michal Nazarczuk, Karla Stepanova, Jan Kristof Behrens, Matej Hoffmann, Krystian Mikolajczyk
Title: MuBlE: MuJoCo and Blender simulation Environment and Benchmark for Task Planning in Robot Manipulation
Abstract:
Current embodied reasoning agents struggle to plan for long-horizon tasks that require to physically interact with the world to obtain the necessary information (e.g. 'sort the objects from lightest to heaviest'). The improvement of the capabilities of such an agent is highly dependent on the availability of relevant training environments. In order to facilitate the development of such systems, we introduce a novel simulation environment (built on top of robosuite) that makes use of the MuJoCo physics engine and high-quality renderer Blender to provide realistic visual observations that are also accurate to the physical state of the scene. It is the first simulator focusing on long-horizon robot manipulation tasks preserving accurate physics modeling. MuBlE can generate mutlimodal data for training and enable design of closed-loop methods through environment interaction on two levels: visual - action loop, and control - physics loop. Together with the simulator, we propose SHOP-VRB2, a new benchmark composed of 10 classes of multi-step reasoning scenarios that require simultaneous visual and physical measurements.
中文: 当前具身推理智能体难以规划需要物理交互的长时程任务,为此我们开发了MuBlE模拟环境,它结合MuJoCo物理引擎和Blender渲染器提供真实的多模态数据,并推出SHOP-VRB2基准测试来评估需要视觉与物理测量的多步骤推理场景。
English: Current embodied reasoning agents face challenges in planning long-horizon tasks requiring physical interaction, prompting the development of MuBlE—a novel simulation environment using MuJoCo physics and Blender rendering to provide realistic multimodal data and support closed-loop methods, alongside the SHOP-VRB2 benchmark for multi-step reasoning scenarios.

Authors:Songming Zhang, Xue Zhang, Tong Zhang, Bojie Hu, Yufeng Chen, Jinan Xu
Title: AlignDistil: Token-Level Language Model Alignment as Adaptive Policy Distillation
Abstract:
In modern large language models (LLMs), LLM alignment is of crucial importance and is typically achieved through methods such as reinforcement learning from human feedback (RLHF) and direct preference optimization (DPO). However, in most existing methods for LLM alignment, all tokens in the response are optimized using a sparse, response-level reward or preference annotation. The ignorance of token-level rewards may erroneously punish high-quality tokens or encourage low-quality tokens, resulting in suboptimal performance and slow convergence speed. To address this issue, we propose AlignDistil, an RLHF-equivalent distillation method for token-level reward optimization. Specifically, we introduce the reward learned by DPO into the RLHF objective and theoretically prove the equivalence between this objective and a token-level distillation process, where the teacher distribution linearly combines the logits from the DPO model and a reference model. On this basis, we further bridge the accuracy gap between the reward from the DPO model and the pure reward model, by building a contrastive DPO reward with a normal and a reverse DPO model. Moreover, to avoid under- and over-optimization on different tokens, we design a token adaptive logit extrapolation mechanism to construct an appropriate teacher distribution for each token. Experimental results demonstrate the superiority of our AlignDistil over existing methods and showcase fast convergence due to its token-level distributional reward optimization.
中文摘要:AlignDistil是一种基于令牌级奖励优化的对齐方法,通过将DPO奖励融入RLHF目标并设计自适应对数外推机制,有效提升大语言模型对齐性能并加快收敛速度。
English Summary: AlignDistil is a token-level reward optimization method that enhances LLM alignment by integrating DPO rewards into RLHF objectives, using adaptive token-level distillation to improve performance and accelerate convergence.

Authors:Marta Skreta, Tara Akhound-Sadegh, Viktor Ohanesian, Roberto Bondesan, Alán Aspuru-Guzik, Arnaud Doucet, Rob Brekelmans, Alexander Tong, Kirill Neklyudov
Title: Feynman-Kac Correctors in Diffusion: Annealing, Guidance, and Product of Experts
Abstract:
While score-based generative models are the model of choice across diverse domains, there are limited tools available for controlling inference-time behavior in a principled manner, e.g. for composing multiple pretrained models. Existing classifier-free guidance methods use a simple heuristic to mix conditional and unconditional scores to approximately sample from conditional distributions. However, such methods do not approximate the intermediate distributions, necessitating additional `corrector' steps. In this work, we provide an efficient and principled method for sampling from a sequence of annealed, geometric-averaged, or product distributions derived from pretrained score-based models. We derive a weighted simulation scheme which we call Feynman-Kac Correctors (FKCs) based on the celebrated Feynman-Kac formula by carefully accounting for terms in the appropriate partial differential equations (PDEs). To simulate these PDEs, we propose Sequential Monte Carlo (SMC) resampling algorithms that leverage inference-time scaling to improve sampling quality. We empirically demonstrate the utility of our methods by proposing amortized sampling via inference-time temperature annealing, improving multi-objective molecule generation using pretrained models, and improving classifier-free guidance for text-to-image generation. Our code is available at https://github.com/martaskrt/fkc-diffusion.
中文: 本文提出了Feynman-Kac校正器(FKCs),这是一种基于Feynman-Kac公式的高效原理性采样方法,通过序列蒙特卡洛算法改进了预训练得分模型的受控采样,有效解决了现有无分类器引导方法的局限性。
English: This paper introduces Feynman-Kac Correctors (FKCs), an efficient and principled method for controlled sampling from pretrained score-based models, addressing limitations in existing classifier-free guidance by leveraging Sequential Monte Carlo algorithms to improve sampling quality across various applications.

Authors:Weihang Wang, Duolin Sun, Jielei Zhang, Longwen Gao
Title: MX-Font++: Mixture of Heterogeneous Aggregation Experts for Few-shot Font Generation
Abstract:
Few-shot Font Generation (FFG) aims to create new font libraries using limited reference glyphs, with crucial applications in digital accessibility and equity for low-resource languages, especially in multilingual artificial intelligence systems. Although existing methods have shown promising performance, transitioning to unseen characters in low-resource languages remains a significant challenge, especially when font glyphs vary considerably across training sets. MX-Font considers the content of a character from the perspective of a local component, employing a Mixture of Experts (MoE) approach to adaptively extract the component for better transition. However, the lack of a robust feature extractor prevents them from adequately decoupling content and style, leading to sub-optimal generation results. To alleviate these problems, we propose Heterogeneous Aggregation Experts (HAE), a powerful feature extraction expert that helps decouple content and style downstream from being able to aggregate information in channel and spatial dimensions. Additionally, we propose a novel content-style homogeneity loss to enhance the untangling. Extensive experiments on several datasets demonstrate that our MX-Font++ yields superior visual results in FFG and effectively outperforms state-of-the-art methods. Code and data are available at https://github.com/stephensun11/MXFontpp.
中文:MX-Font++ 提出了异构聚合专家和内容风格同质性损失函数,能有效解耦内容与风格,在少样本字体生成任务中实现了优于现有方法的视觉效果。
English: MX-Font++ introduces Heterogeneous Aggregation Experts and a content-style homogeneity loss to effectively decouple content and style, achieving superior few-shot font generation results compared to existing methods.

Authors:Jie Wu, Haoling Li, Xin Zhang, Jianwen Luo, Yangyu Huang, Ruihang Chu, Yujiu Yang, Scarlett Li
Title: Teaching Your Models to Understand Code via Focal Preference Alignment
Abstract:
Preference learning extends the performance of Code LLMs beyond traditional supervised fine-tuning by leveraging relative quality comparisons. In existing approaches, a set of n candidate solutions is evaluated based on test case success rates, with the candidate demonstrating a higher pass rate being labeled as positive and its counterpart with a lower pass rate as negative. However, because this approach aligns entire failing code blocks rather than pinpointing specific errors, it lacks the granularity necessary to capture meaningful error-correction relationships. As a result, the model is unable to learn more informative error-correction patterns. To address these issues, we propose Target-DPO, a new preference alignment framework that mimics human iterative debugging to refine Code LLMs. Target-DPO explicitly locates error regions and aligns the corresponding tokens via a tailored DPO algorithm. To facilitate it, we introduce the CodeFlow dataset, where samples are iteratively refined until passing tests, with modifications capturing error corrections. Extensive experiments show that a diverse suite of Code LLMs equipped with Target-DPO achieves significant performance gains in code generation and improves on challenging tasks like BigCodeBench. In-depth analysis reveals that Target-DPO yields fewer errors. Code, model and datasets are in: https://github.com/JieWu02/Target-DPO.
Chinese: Target-DPO提出了一种新的代码大模型偏好对齐框架,通过精确定位错误区域并对其相应标记进行对齐来模拟人类调试过程,从而在代码生成任务中实现显著性能提升并减少错误。
English: Target-DPO introduces a novel preference alignment framework for Code LLMs that mimics human debugging by explicitly locating error regions and aligning corresponding tokens, leading to significant performance gains and fewer errors in code generation tasks.

Authors:Jie Wu, Haoling Li, Xin Zhang, Xiao Liu, Yangyu Huang, Jianwen Luo, Yizhen Zhang, Zuchao Li, Ruihang Chu, Yujiu Yang, Scarlett Li
Title: Teaching Your Models to Understand Code via Focal Preference Alignment
Abstract:
Preference learning extends the performance of Code LLMs beyond traditional supervised fine-tuning by leveraging relative quality comparisons. In existing approaches, a set of n candidate solutions is evaluated based on test case success rates, with the candidate demonstrating a higher pass rate being labeled as positive and its counterpart with a lower pass rate as negative. However, because this approach aligns entire failing code blocks rather than pinpointing specific errors, it lacks the granularity necessary to capture meaningful error-correction relationships. As a result, the model is unable to learn more informative error-correction patterns. To address these issues, we propose Target-DPO, a new preference alignment framework that mimics human iterative debugging to refine Code LLMs. Target-DPO explicitly locates error regions and aligns the corresponding tokens via a tailored DPO algorithm. To facilitate it, we introduce the CodeFlow dataset, where samples are iteratively refined until passing tests, with modifications capturing error corrections. Extensive experiments show that a diverse suite of Code LLMs equipped with Target-DPO achieves significant performance gains in code generation and improves on challenging tasks like BigCodeBench. In-depth analysis reveals that Target-DPO yields fewer errors. Code, model and datasets are in: https://github.com/JieWu02/Target-DPO.
Chinese: Target-DPO提出了一种新的代码大模型偏好对齐框架,通过精确定位错误区域并对其相应标记进行对齐来模拟人类调试过程,从而在代码生成任务中实现显著性能提升并减少错误。
English: Target-DPO introduces a novel preference alignment framework for Code LLMs that mimics human debugging by explicitly locating error regions and aligning corresponding tokens, leading to significant performance gains and fewer errors in code generation tasks.

Authors:Daniil Larionov, Steffen Eger
Title: BatchGEMBA: Token-Efficient Machine Translation Evaluation with Batched Prompting and Prompt Compression
Abstract:
Recent advancements in Large Language Model (LLM)-based Natural Language Generation evaluation have largely focused on single-example prompting, resulting in significant token overhead and computational inefficiencies. In this work, we introduce BatchGEMBA-MQM, a framework that integrates batched prompting with the GEMBA-MQM metric for machine translation evaluation. Our approach aggregates multiple translation examples into a single prompt, reducing token usage by 2-4 times (depending on the batch size) relative to single-example prompting. Furthermore, we propose a batching-aware prompt compression model that achieves an additional token reduction of 13-15% on average while also showing ability to help mitigate batching-induced quality degradation. Evaluations across several LLMs (GPT-4o, GPT-4o-mini, Mistral Small, Phi4, and CommandR7B) and varying batch sizes reveal that while batching generally negatively affects quality (but sometimes not substantially), prompt compression does not degrade further, and in some cases, recovers quality loss. For instance, GPT-4o retains over 90% of its baseline performance at a batch size of 4 when compression is applied, compared to a 44.6% drop without compression. We plan to release our code and trained models at https://github.com/NL2G/batchgemba to support future research in this domain.
Chinese: BatchGEMBA-MQM框架通过批量提示和提示压缩技术,将机器翻译评估的令牌使用量减少2-4倍,在GPT-4o模型批量大小为4时仍保持90%以上基准性能,有效缓解了批量处理导致的质量下降问题。
English: BatchGEMBA-MQM introduces batched prompting and prompt compression to significantly reduce token usage and computational costs in LLM-based machine translation evaluation, maintaining over 90% baseline performance with GPT-4o at batch size 4 despite minor quality impacts from batching.

Authors:Shuang Chen, Yifeng He, Barry Lennox, Farshad Arvin, Amir Atapour-Abarghouei
Title: Deep Learning-Enhanced Visual Monitoring in Hazardous Underwater Environments with a Swarm of Micro-Robots
Abstract:
Long-term monitoring and exploration of extreme environments, such as underwater storage facilities, is costly, labor-intensive, and hazardous. Automating this process with low-cost, collaborative robots can greatly improve efficiency. These robots capture images from different positions, which must be processed simultaneously to create a spatio-temporal model of the facility. In this paper, we propose a novel approach that integrates data simulation, a multi-modal deep learning network for coordinate prediction, and image reassembly to address the challenges posed by environmental disturbances causing drift and rotation in the robots' positions and orientations. Our approach enhances the precision of alignment in noisy environments by integrating visual information from snapshots, global positional context from masks, and noisy coordinates. We validate our method through extensive experiments using synthetic data that simulate real-world robotic operations in underwater settings. The results demonstrate very high coordinate prediction accuracy and plausible image assembly, indicating the real-world applicability of our approach. The assembled images provide clear and coherent views of the underwater environment for effective monitoring and inspection, showcasing the potential for broader use in extreme settings, further contributing to improved safety, efficiency, and cost reduction in hazardous field monitoring. Code is available on https://github.com/ChrisChen1023/Micro-Robot-Swarm.
中文摘要:本文提出一种利用协作机器人和多模态深度学习的新方法,通过集成数据模拟和图像重组技术解决水下机器人位置漂移和旋转问题,从而提高极端环境监测的精度与效率。
English Summary: This paper presents a novel method using collaborative robots and multi-modal deep learning to enhance the precision of underwater environmental monitoring by addressing positional drift and rotation through integrated data simulation and image reassembly.

Authors:Shuaike Li, Kai Zhang, Qi Liu, Enhong Chen
Title: MindBridge: Scalable and Cross-Model Knowledge Editing via Memory-Augmented Modality
Abstract:
Knowledge editing is a technique for efficiently and accurately updating the knowledge of large language models (LLMs) to alleviate obsolescence and correct errors. However, most existing methods overfit to specific models, causing edited knowledge to be discarded during each LLM update and requiring frequent re-editing, which is particularly burdensome in today's rapidly evolving open-source community. To address this issue, we propose the problem of cross-model knowledge editing and introduce MindBridge, a scalable solution inspired by the low coupling between modality processing and LLMs in multi-modal models. MindBridge introduces the novel concept of memory modality, which encodes edited knowledge as an independent modality. It first performs LLM-agnostic pre-training of the memory modality and then integrates it with various LLMs. Extensive experiments on multiple LLMs and popular knowledge editing datasets demonstrate that MindBridge achieves superior performance even in editing tens of thousands of knowledge entries and can flexibly adapt to different LLMs. Our code is available at https://github.com/CrashBugger/MindBridge.
中文摘要:知识编辑旨在高效更新大语言模型的知识以修正错误,但现有方法常因模型更新而失效,需反复编辑;MindBridge提出跨模型解决方案,通过独立预训练的记忆模态与多种模型集成,在大量知识条目编辑中表现优异且灵活适配。
English Summary: Knowledge editing updates LLMs to fix outdated or incorrect information, but current methods often fail when models are updated, requiring repeated edits; MindBridge introduces a cross-model solution using a memory modality that pre-trains independently and integrates with various LLMs, proving effective and adaptable in extensive testing.

Authors:Yifei Wang, Jacky Keung, Haohan Xu, Yuchen Cao, Zhenyu Mao
Title: Multi-Strategy Enhanced COA for Path Planning in Autonomous Navigation
Abstract:
Autonomous navigation is reshaping various domains in people's life by enabling efficient and safe movement in complex environments. Reliable navigation requires algorithmic approaches that compute optimal or near-optimal trajectories while satisfying task-specific constraints and ensuring obstacle avoidance. However, existing methods struggle with slow convergence and suboptimal solutions, particularly in complex environments, limiting their real-world applicability. To address these limitations, this paper presents the Multi-Strategy Enhanced Crayfish Optimization Algorithm (MCOA), a novel approach integrating three key strategies: 1) Refractive Opposition Learning, enhancing population diversity and global exploration, 2) Stochastic Centroid-Guided Exploration, balancing global and local search to prevent premature convergence, and 3) Adaptive Competition-Based Selection, dynamically adjusting selection pressure for faster convergence and improved solution quality. Empirical evaluations underscore the remarkable planning speed and the amazing solution quality of MCOA in both 3D Unmanned Aerial Vehicle (UAV) and 2D mobile robot path planning. Against 11 baseline algorithms, MCOA achieved a 69.2% reduction in computational time and a 16.7% improvement in minimizing overall path cost in 3D UAV scenarios. Furthermore, in 2D path planning, MCOA outperformed baseline approaches by 44% on average, with an impressive 75.6% advantage in the largest 60*60 grid setting. These findings validate MCOA as a powerful tool for optimizing autonomous navigation in complex environments. The source code is available at: https://github.com/coedv-hub/MCOA.
Chinese Summary: 本文提出的多策略增强小龙虾优化算法(MCOA)通过集成三种创新策略,在三维无人机路径规划中实现了计算时间减少69.2%和路径成本优化16.7%的显著突破,大幅提升了复杂环境下自主导航的性能。
English Summary: This paper introduces the Multi-Strategy Enhanced Crayfish Optimization Algorithm (MCOA), which significantly improves autonomous navigation by reducing computational time by 69.2% and enhancing path quality by 16.7% in 3D UAV scenarios, outperforming 11 baseline algorithms.

Authors:Vincent Emonet, Ana-Claudia Sima, Tarcisio Mendes de Farias
Title: A user-friendly SPARQL query editor powered by lightweight metadata
Abstract:
SPARQL query editors often lack intuitive interfaces to aid SPARQL-savvy users to write queries. To address this issue, we propose an easy-to-deploy, triple store-agnostic and open-source query editor that offers three main features: (i) automatic query example rendering, (ii) precise autocomplete based on existing triple patterns including within SERVICE clauses, and (iii) a data-aware schema visualization. It can be easily set up with a custom HTML element. The tool has been successfully tested on various public endpoints, and is deployed online at https://sib-swiss.github.io/sparql-editor with open-source code available at https://github.com/sib-swiss/sparql-editor.
中文: 本文提出了一种开源、易于部署的SPARQL查询编辑器,通过自动示例渲染、精确自动补全和数据感知模式可视化提升可用性,已在公共端点上成功测试并在线提供。
English: This paper introduces an open-source, easy-to-deploy SPARQL query editor that enhances usability with automatic example rendering, precise autocomplete, and data-aware schema visualization, successfully tested on public endpoints and available online.

Authors:Pengwei Tang, Yong Liu, Dongjie Zhang, Xing Wu, Debing Zhang
Title: LoRA-Null: Low-Rank Adaptation via Null Space for Large Language Models
Abstract:
Low-Rank Adaptation (LoRA) is the leading parameter-efficient fine-tuning method for Large Language Models (LLMs). However, the fine-tuned LLMs encounter the issue of catastrophic forgetting of the pre-trained world knowledge. To address this issue, inspired by theoretical insights of null space, we propose LoRA-Null, i.e., Low-Rank Adaptation via null space, which builds adapters initialized from the null space of the pre-trained knowledge activation. Concretely, we randomly collect a few data samples and capture their activations after passing through the LLM layer. We perform Singular Value Decomposition on the input activations to obtain their null space. We use the projection of the pre-trained weights onto the null space as the initialization for adapters. Experimental results demonstrate that this initialization approach can effectively preserve the original pre-trained world knowledge of the LLMs during fine-tuning. Additionally, if we freeze the values of the down-projection matrices during fine-tuning, it achieves even better preservation of the pre-trained world knowledge. LoRA-Null effectively preserves pre-trained world knowledge while maintaining strong fine-tuning performance, as validated by extensive experiments on LLaMA series (LLaMA2, LLaMA3, LLaMA3.1, and LLaMA3.2) across Code, Math, and Instruction Following tasks. We also provide a theoretical guarantee for the capacity of LoRA-Null to retain pre-trained knowledge. Code is in https://github.com/HungerPWAY/LoRA-Null.
中文: LoRA-Null是一种新颖的参数高效微调方法,通过从预训练知识激活的零空间初始化适配器,有效保留原始世界知识,同时在多项任务中保持强大性能。
English: LoRA-Null is a novel parameter-efficient fine-tuning method that initializes adapters from the null space of pre-trained knowledge activations, effectively preserving original world knowledge while maintaining strong performance across various tasks.

Authors:Xiaoyu Zheng, Xu Chen, Shaogang Gong, Xavier Griffin, Greg Slabaugh
Title: XFMamba: Cross-Fusion Mamba for Multi-View Medical Image Classification
Abstract:
Compared to single view medical image classification, using multiple views can significantly enhance predictive accuracy as it can account for the complementarity of each view while leveraging correlations between views. Existing multi-view approaches typically employ separate convolutional or transformer branches combined with simplistic feature fusion strategies. However, these approaches inadvertently disregard essential cross-view correlations, leading to suboptimal classification performance, and suffer from challenges with limited receptive field (CNNs) or quadratic computational complexity (transformers). Inspired by state space sequence models, we propose XFMamba, a pure Mamba-based cross-fusion architecture to address the challenge of multi-view medical image classification. XFMamba introduces a novel two-stage fusion strategy, facilitating the learning of single-view features and their cross-view disparity. This mechanism captures spatially long-range dependencies in each view while enhancing seamless information transfer between views. Results on three public datasets, MURA, CheXpert and DDSM, illustrate the effectiveness of our approach across diverse multi-view medical image classification tasks, showing that it outperforms existing convolution-based and transformer-based multi-view methods. Code is available at https://github.com/XZheng0427/XFMamba.
中文摘要:提出的XFMamba模型采用基于状态空间序列的交叉融合架构,能有效学习单视图特征及其跨视图差异,在多个医学图像数据集上超越了现有卷积和Transformer方法。
English Summary: The proposed XFMamba model introduces a novel cross-fusion architecture using state space sequence models to effectively capture both single-view features and cross-view correlations, outperforming existing methods on multiple medical image datasets.

Authors:Zirun Guo, Tao Jin
Title: Smoothing the Shift: Towards Stable Test-Time Adaptation under Complex Multimodal Noises
Abstract:
Test-Time Adaptation (TTA) aims to tackle distribution shifts using unlabeled test data without access to the source data. In the context of multimodal data, there are more complex noise patterns than unimodal data such as simultaneous corruptions for multiple modalities and missing modalities. Besides, in real-world applications, corruptions from different distribution shifts are always mixed. Existing TTA methods always fail in such multimodal scenario because the abrupt distribution shifts will destroy the prior knowledge from the source model, thus leading to performance degradation. To this end, we reveal a new challenge named multimodal wild TTA. To address this challenging problem, we propose two novel strategies: sample identification with interquartile range Smoothing and unimodal assistance, and Mutual information sharing (SuMi). SuMi smooths the adaptation process by interquartile range which avoids the abrupt distribution shifts. Then, SuMi fully utilizes the unimodal features to select low-entropy samples with rich multimodal information for optimization. Furthermore, mutual information sharing is introduced to align the information, reduce the discrepancies and enhance the information utilization across different modalities. Extensive experiments on two public datasets show the effectiveness and superiority over existing methods under the complex noise patterns in multimodal data. Code is available at https://github.com/zrguo/SuMi.
中文摘要:测试时自适应(TTA)旨在利用未标记测试数据应对分布偏移,但在复杂多模态噪声场景中表现不佳;提出的SuMi方法通过平滑适应过程和增强跨模态信息共享,有效解决了这一难题。
English Summary: Test-Time Adaptation (TTA) addresses distribution shifts using unlabeled test data, but struggles with complex multimodal noise patterns; the proposed SuMi method overcomes this by smoothing adaptation processes and enhancing cross-modal information sharing.

Authors:Yizhou Huang, Fan Yang, Guoliang Zhu, Gen Li, Hao Shi, Yukun Zuo, Wenrui Chen, Zhiyong Li, Kailun Yang
Title: Resource-Efficient Affordance Grounding with Complementary Depth and Semantic Prompts
Abstract:
Affordance refers to the functional properties that an agent perceives and utilizes from its environment, and is key perceptual information required for robots to perform actions. This information is rich and multimodal in nature. Existing multimodal affordance methods face limitations in extracting useful information, mainly due to simple structural designs, basic fusion methods, and large model parameters, making it difficult to meet the performance requirements for practical deployment. To address these issues, this paper proposes the BiT-Align image-depth-text affordance mapping framework. The framework includes a Bypass Prompt Module (BPM) and a Text Feature Guidance (TFG) attention selection mechanism. BPM integrates the auxiliary modality depth image directly as a prompt to the primary modality RGB image, embedding it into the primary modality encoder without introducing additional encoders. This reduces the model's parameter count and effectively improves functional region localization accuracy. The TFG mechanism guides the selection and enhancement of attention heads in the image encoder using textual features, improving the understanding of affordance characteristics. Experimental results demonstrate that the proposed method achieves significant performance improvements on public AGD20K and HICO-IIF datasets. On the AGD20K dataset, compared with the current state-of-the-art method, we achieve a 6.0% improvement in the KLD metric, while reducing model parameters by 88.8%, demonstrating practical application values. The source code will be made publicly available at https://github.com/DAWDSE/BiT-Align.
中文: 本文提出的BiT-Align框架通过深度图像提示和文本引导注意力机制改进功能映射,在基准数据集上以更少参数实现了更高精度。
English: This paper introduces the BiT-Align framework, which enhances affordance mapping by integrating depth images as prompts and using text-guided attention, achieving higher accuracy with fewer parameters on benchmark datasets.

Authors:Wei-Yao Wang, Zhao Wang, Helen Suzuki, Yoshiyuki Kobayashi
Title: Seeing is Understanding: Unlocking Causal Attention into Modality-Mutual Attention for Multimodal LLMs
Abstract:
Recent Multimodal Large Language Models (MLLMs) have demonstrated significant progress in perceiving and reasoning over multimodal inquiries, ushering in a new research era for foundation models. However, vision-language misalignment in MLLMs has emerged as a critical challenge, where the textual responses generated by these models are not factually aligned with the given text-image inputs. Existing efforts to address vision-language misalignment have focused on developing specialized vision-language connectors or leveraging visual instruction tuning from diverse domains. In this paper, we tackle this issue from a fundamental yet unexplored perspective by revisiting the core architecture of MLLMs. Most MLLMs are typically built on decoder-only LLMs consisting of a causal attention mechanism, which limits the ability of the earlier modalities (e.g., images) to incorporate information from the latter modalities (e.g., text). To address this problem, we propose \MapleLeaf AKI, a novel MLLM that unlocks causal attention into modality-mutual attention (MMA) to enable image tokens to attend to text tokens. This simple yet effective design allows AKI to achieve superior performance in 12 multimodal understanding benchmarks (+7.2% on average) without introducing additional parameters and increasing training time. Our MMA design is intended to be generic, allowing for application across various modalities, and scalable to accommodate diverse multimodal scenarios. The code and model are publicly available at https://github.com/sony/aki to encourage further advancements in MLLMs across various directions.
中文:针对多模态大语言模型中视觉与语言不对齐的问题,提出的MapleLeaf AKI模型通过将因果注意力机制替换为模态互注意力机制,使图像标记能够关注文本标记,从而在12个多模态理解基准上实现卓越性能,且无需额外参数或增加训练时间。
English: Recent Multimodal Large Language Models (MLLMs) face vision-language misalignment issues, which the proposed MapleLeaf AKI model addresses by replacing causal attention with modality-mutual attention, enabling image tokens to interact with text tokens and achieving superior performance across 12 benchmarks without extra parameters or training time.

Authors:Yanlong Xu, Haoxuan Qu, Jun Liu, Wenxiao Zhang, Xun Yang
Title: CMMLoc: Advancing Text-to-PointCloud Localization with Cauchy-Mixture-Model Based Framework
Abstract:
The goal of point cloud localization based on linguistic description is to identify a 3D position using textual description in large urban environments, which has potential applications in various fields, such as determining the location for vehicle pickup or goods delivery. Ideally, for a textual description and its corresponding 3D location, the objects around the 3D location should be fully described in the text description. However, in practical scenarios, e.g., vehicle pickup, passengers usually describe only the part of the most significant and nearby surroundings instead of the entire environment. In response to this $\textbf{partially relevant}$ challenge, we propose $\textbf{CMMLoc}$, an uncertainty-aware $\textbf{C}$auchy-$\textbf{M}$ixture-$\textbf{M}$odel ($\textbf{CMM}$) based framework for text-to-point-cloud $\textbf{Loc}$alization. To model the uncertain semantic relations between text and point cloud, we integrate CMM constraints as a prior during the interaction between the two modalities. We further design a spatial consolidation scheme to enable adaptive aggregation of different 3D objects with varying receptive fields. To achieve precise localization, we propose a cardinal direction integration module alongside a modality pre-alignment strategy, helping capture the spatial relationships among objects and bringing the 3D objects closer to the text modality. Comprehensive experiments validate that CMMLoc outperforms existing methods, achieving state-of-the-art results on the KITTI360Pose dataset. Codes are available in this GitHub repository https://github.com/kevin301342/CMMLoc.
Chinese: 针对点云定位中文本描述部分相关性的挑战,我们提出了CMMLoc框架,通过整合柯西混合模型约束和空间整合方案,实现了精确的文本到点云定位,在KITTI360Pose数据集上取得了最先进的性能。
English: To address the challenge of partially relevant text descriptions in point cloud localization, we propose CMMLoc, an uncertainty-aware framework that integrates Cauchy-Mixture-Model constraints and spatial consolidation for precise text-to-point-cloud localization, achieving state-of-the-art performance on the KITTI360Pose dataset.

Authors:Patryk Marszałek, Maciej Rut, Piotr Kawa, Przemysław Spurek, Piotr Syga
Title: A Hypernetwork-Based Approach to KAN Representation of Audio Signals
Abstract:
Implicit neural representations (INR) have gained prominence for efficiently encoding multimedia data, yet their applications in audio signals remain limited. This study introduces the Kolmogorov-Arnold Network (KAN), a novel architecture using learnable activation functions, as an effective INR model for audio representation. KAN demonstrates superior perceptual performance over previous INRs, achieving the lowest Log-SpectralDistance of 1.29 and the highest Perceptual Evaluation of Speech Quality of 3.57 for 1.5 s audio. To extend KAN's utility, we propose FewSound, a hypernetwork-based architecture that enhances INR parameter updates. FewSound outperforms the state-of-the-art HyperSound, with a 33.3% improvement in MSE and 60.87% in SI-SNR. These results show KAN as a robust and adaptable audio representation with the potential for scalability and integration into various hypernetwork frameworks. The source code can be accessed at https://github.com/gmum/fewsound.git.
Chinese: 本研究提出Kolmogorov-Arnold网络(KAN)作为音频的有效隐式神经表示,通过其FewSound扩展展示了卓越的性能和适应性,显著超越了现有方法。
English: This study introduces the Kolmogorov-Arnold Network (KAN) as an effective implicit neural representation for audio, demonstrating superior performance and adaptability through its FewSound extension, which significantly outperforms existing methods.

Authors:Jiayi Zhao, Fei Teng, Kai Luo, Guoqiang Zhao, Zhiyong Li, Xu Zheng, Kailun Yang
Title: Unveiling the Potential of Segment Anything Model 2 for RGB-Thermal Semantic Segmentation with Language Guidance
Abstract:
The perception capability of robotic systems relies on the richness of the dataset. Although Segment Anything Model 2 (SAM2), trained on large datasets, demonstrates strong perception potential in perception tasks, its inherent training paradigm prevents it from being suitable for RGB-T tasks. To address these challenges, we propose SHIFNet, a novel SAM2-driven Hybrid Interaction Paradigm that unlocks the potential of SAM2 with linguistic guidance for efficient RGB-Thermal perception. Our framework consists of two key components: (1) Semantic-Aware Cross-modal Fusion (SACF) module that dynamically balances modality contributions through text-guided affinity learning, overcoming SAM2's inherent RGB bias; (2) Heterogeneous Prompting Decoder (HPD) that enhances global semantic information through a semantic enhancement module and then combined with category embeddings to amplify cross-modal semantic consistency. With 32.27M trainable parameters, SHIFNet achieves state-of-the-art segmentation performance on public benchmarks, reaching 89.8% on PST900 and 67.8% on FMB, respectively. The framework facilitates the adaptation of pre-trained large models to RGB-T segmentation tasks, effectively mitigating the high costs associated with data collection while endowing robotic systems with comprehensive perception capabilities. The source code will be made publicly available at https://github.com/iAsakiT3T/SHIFNet.
中文: 提出的SHIFNet框架通过文本引导的跨模态融合和语义增强机制,成功将Segment Anything Model 2适配于RGB-热成像任务,以少量可训练参数实现了最先进的分割性能。
English: The proposed SHIFNet framework enhances robotic perception by adapting the Segment Anything Model 2 for RGB-Thermal tasks through text-guided cross-modal fusion and semantic prompting, achieving state-of-the-art segmentation performance with minimal trainable parameters.

Authors:Ege Özsoy, Chantal Pellegrini, Tobias Czempiel, Felix Tristram, Kun Yuan, David Bani-Harouni, Ulrich Eck, Benjamin Busam, Matthias Keicher, Nassir Navab
Title: MM-OR: A Large Multimodal Operating Room Dataset for Semantic Understanding of High-Intensity Surgical Environments
Abstract:
Operating rooms (ORs) are complex, high-stakes environments requiring precise understanding of interactions among medical staff, tools, and equipment for enhancing surgical assistance, situational awareness, and patient safety. Current datasets fall short in scale, realism and do not capture the multimodal nature of OR scenes, limiting progress in OR modeling. To this end, we introduce MM-OR, a realistic and large-scale multimodal spatiotemporal OR dataset, and the first dataset to enable multimodal scene graph generation. MM-OR captures comprehensive OR scenes containing RGB-D data, detail views, audio, speech transcripts, robotic logs, and tracking data and is annotated with panoptic segmentations, semantic scene graphs, and downstream task labels. Further, we propose MM2SG, the first multimodal large vision-language model for scene graph generation, and through extensive experiments, demonstrate its ability to effectively leverage multimodal inputs. Together, MM-OR and MM2SG establish a new benchmark for holistic OR understanding, and open the path towards multimodal scene analysis in complex, high-stakes environments. Our code, and data is available at https://github.com/egeozsoy/MM-OR.
中文摘要:本研究推出首个多模态手术室数据集MM-OR及场景图生成模型MM2SG,通过整合多元数据与先进算法,为复杂医疗场景的全面感知建立了新标准。
English Summary: The study introduces MM-OR, a comprehensive multimodal dataset for operating rooms, and MM2SG, a novel vision-language model, establishing new benchmarks for holistic surgical environment understanding through enhanced scene graph generation.

Authors:Xinying Hong, Siyu Li, Kang Zeng, Hao Shi, Bomin Peng, Kailun Yang, Zhiyong Li
Title: TS-CGNet: Temporal-Spatial Fusion Meets Centerline-Guided Diffusion for BEV Mapping
Abstract:
Bird's Eye View (BEV) perception technology is crucial for autonomous driving, as it generates top-down 2D maps for environment perception, navigation, and decision-making. Nevertheless, the majority of current BEV map generation studies focusing on visual map generation lack depth-aware reasoning capabilities. They exhibit limited efficacy in managing occlusions and handling complex environments, with a notable decline in perceptual performance under adverse weather conditions or low-light scenarios. Therefore, this paper proposes TS-CGNet, which leverages Temporal-Spatial fusion with Centerline-Guided diffusion. This visual framework, grounded in prior knowledge, is designed for integration into any existing network for building BEV maps. Specifically, this framework is decoupled into three parts: Local mapping system involves the initial generation of semantic maps using purely visual information; The Temporal-Spatial Aligner Module (TSAM) integrates historical information into mapping generation by applying transformation matrices; The Centerline-Guided Diffusion Model (CGDM) is a prediction module based on the diffusion model. CGDM incorporates centerline information through spatial-attention mechanisms to enhance semantic segmentation reconstruction. We construct BEV semantic segmentation maps by our methods on the public nuScenes and the robustness benchmarks under various corruptions. Our method improves 1.90%, 1.73%, and 2.87% for perceived ranges of 60x30m, 120x60m, and 240x60m in the task of BEV HD mapping. TS-CGNet attains an improvement of 1.92% for perceived ranges of 100x100m in the task of BEV semantic mapping. Moreover, TS-CGNet achieves an average improvement of 2.92% in detection accuracy under varying weather conditions and sensor interferences in the perception range of 240x60m. The source code will be publicly available at https://github.com/krabs-H/TS-CGNet.
中文摘要:本文提出TS-CGNet框架,通过时空融合与中心线引导扩散技术,有效提升鸟瞰图感知的深度推理能力,在不同环境和感知范围内均实现性能提升。
English Summary: This paper introduces TS-CGNet, a temporal-spatial fusion framework with centerline-guided diffusion that enhances Bird's Eye View perception by addressing depth reasoning limitations and improving performance across various conditions and ranges.

Authors:Grzegorz Skorupko, Fotios Avgoustidis, Carlos Martín-Isla, Lidia Garrucho, Dimitri A. Kessler, Esmeralda Ruiz Pujadas, Oliver Díaz, Maciej Bobowicz, Katarzyna Gwoździewicz, Xavier Bargalló, Paulius Jaruševičius, Richard Osuala, Kaisar Kushibar, Karim Lekadir
Title: Federated nnU-Net for Privacy-Preserving Medical Image Segmentation
Abstract:
The nnU-Net framework has played a crucial role in medical image segmentation and has become the gold standard in multitudes of applications targeting different diseases, organs, and modalities. However, so far it has been used primarily in a centralized approach where the collected data is stored in the same location where nnU-Net is trained. This centralized approach has various limitations, such as potential leakage of sensitive patient information and violation of patient privacy. Federated learning has emerged as a key approach for training segmentation models in a decentralized manner, enabling collaborative development while prioritising patient privacy. In this paper, we propose FednnU-Net, a plug-and-play, federated learning extension of the nnU-Net framework. To this end, we contribute two federated methodologies to unlock decentralized training of nnU-Net, namely, Federated Fingerprint Extraction (FFE) and Asymmetric Federated Averaging (AsymFedAvg). We conduct a comprehensive set of experiments demonstrating high and consistent performance of our methods for breast, cardiac and fetal segmentation based on a multi-modal collection of 6 datasets representing samples from 18 different institutions. To democratize research as well as real-world deployments of decentralized training in clinical centres, we publicly share our framework at https://github.com/faildeny/FednnUNet .
中文:FednnU-Net框架通过联邦指纹提取和异步联邦平均方法扩展了nnU-Net,实现了跨医疗机构的去中心化医学图像分割,在保护数据隐私的同时展现出卓越性能。
English: The FednnU-Net framework extends nnU-Net with federated learning capabilities through Federated Fingerprint Extraction and Asymmetric Federated Averaging, enabling decentralized medical image segmentation while maintaining data privacy across multiple institutions.

Authors:Xin Ding, Xin Li, Haotong Qin, Zhibo Chen
Title: Q&C: When Quantization Meets Cache in Efficient Image Generation
Abstract:
Quantization and cache mechanisms are typically applied individually for efficient Diffusion Transformers (DiTs), each demonstrating notable potential for acceleration. However, the promoting effect of combining the two mechanisms on efficient generation remains under-explored. Through empirical investigation, we find that the combination of quantization and cache mechanisms for DiT is not straightforward, and two key challenges lead to severe catastrophic performance degradation: (i) the sample efficacy of calibration datasets in post-training quantization (PTQ) is significantly eliminated by cache operation; (ii) the combination of the above mechanisms introduces more severe exposure bias within sampling distribution, resulting in amplified error accumulation in the image generation process. In this work, we take advantage of these two acceleration mechanisms and propose a hybrid acceleration method by tackling the above challenges, aiming to further improve the efficiency of DiTs while maintaining excellent generation capability. Concretely, a temporal-aware parallel clustering (TAP) is designed to dynamically improve the sample selection efficacy for the calibration within PTQ for different diffusion steps. A variance compensation (VC) strategy is derived to correct the sampling distribution. It mitigates exposure bias through an adaptive correction factor generation. Extensive experiments have shown that our method has accelerated DiTs by 12.7x while preserving competitive generation capability. The code will be available at https://github.com/xinding-sys/Quant-Cache.
中文摘要:本研究提出了一种结合量化和缓存机制的混合加速方法,通过时序感知并行聚类和方差补偿策略解决关键挑战,在保持生成质量的同时实现了Diffusion Transformers 12.7倍的加速效果。
English Summary: This study introduces a hybrid acceleration method for Diffusion Transformers that combines quantization and cache mechanisms, addressing key challenges through temporal-aware parallel clustering and variance compensation to achieve a 12.7x speedup while maintaining generation quality.

Authors:Jianghao Chen, Junhong Wu, Yangyifan Xu, Jiajun Zhang
Title: LADM: Long-context Training Data Selection with Attention-based Dependency Measurement for LLMs
Abstract:
Long-context modeling has drawn more and more attention in the area of Large Language Models (LLMs). Continual training with long-context data becomes the de-facto method to equip LLMs with the ability to process long inputs. However, it still remains an open challenge to measure the quality of long-context training data. To address this issue, we propose a Long-context data selection framework with Attention-based Dependency Measurement (LADM), which can efficiently identify high-quality long-context data from a large-scale, multi-domain pre-training corpus. LADM leverages the retrieval capabilities of the attention mechanism to capture contextual dependencies, ensuring a comprehensive quality measurement of long-context data. Experimental results show that our LADM framework significantly boosts the performance of LLMs on multiple long-context tasks with only 1B tokens for continual training.
中文: LADM框架通过基于注意力的依赖关系测量高效筛选高质量长文本数据,仅用少量训练标记即可显著提升大语言模型在长文本任务中的表现。
English: The LADM framework efficiently selects high-quality long-context data using attention-based dependency measurement, significantly enhancing LLM performance on long-context tasks with minimal training tokens.

Authors:Yujiao Yang, Jing Lian, Linhui Li
Title: Union of Experts: Adapting Hierarchical Routing to Equivalently Decomposed Transformer
Abstract:
Mixture-of-Experts (MoE) enhances model performance while maintaining computational efficiency, making it well-suited for large-scale applications. Conventional mixture-of-experts (MoE) architectures suffer from suboptimal coordination dynamics, where isolated expert operations expose the model to overfitting risks. Moreover, they have not been effectively extended to attention blocks, which limits further efficiency improvements. To tackle these issues, we propose Union-of-Experts (UoE), which decomposes the transformer model into an equivalent group of experts and applies a hierarchical routing mechanism to allocate input subspaces to specialized experts. Our approach advances MoE design with four key innovations: (1) Constructing expert groups by partitioning non-MoE models into functionally equivalent specialists (2) Developing a hierarchical routing paradigm that integrates patch-wise data selection and expert selection strategies. (3) Extending the MoE design to attention blocks. (4) Proposing a hardware-optimized parallelization scheme that exploits batched matrix multiplications for efficient expert computation. The experiments demonstrate that our UoE model surpasses Full Attention, state-of-the-art MoEs and efficient transformers in several tasks across image and natural language domains. In language modeling tasks, UoE achieves an average reduction of 2.38 in perplexity compared to the best-performing MoE method with only 76% of its FLOPs. In the Long Range Arena benchmark, it demonstrates an average score at least 0.68% higher than all comparison models, with only 50% of the FLOPs of the best MoE method. In image classification, it yields an average accuracy improvement of 1.75% over the best model while maintaining comparable FLOPs. The source codes are available at https://github.com/YujiaoYang-work/UoE.
中文: 提出的专家联盟(UoE)模型通过引入分层路由机制并将专家设计扩展至注意力模块,解决了传统MoE架构的局限性,在语言和图像任务中以更低计算成本实现了更优性能。
English: The proposed Union-of-Experts (UoE) model addresses limitations in conventional MoE architectures by introducing hierarchical routing and extending expert design to attention blocks, achieving superior performance with reduced computational costs across language and image tasks.

Authors:Sohan Patnaik, Milan Aggarwal, Sumit Bhatia, Balaji Krishnamurthy
Title: It Helps to Take a Second Opinion: Teaching Smaller LLMs to Deliberate Mutually via Selective Rationale Optimisation
Abstract:
Very large language models (LLMs) such as GPT-4 have shown the ability to handle complex tasks by generating and self-refining step-by-step rationales. Smaller language models (SLMs), typically with < 13B parameters, have been improved by using the data generated from very-large LMs through knowledge distillation. However, various practical constraints such as API costs, copyright, legal and ethical policies restrict using large (often opaque) models to train smaller models for commercial use. Limited success has been achieved at improving the ability of an SLM to explore the space of possible rationales and evaluate them by itself through self-deliberation. To address this, we propose COALITION, a trainable framework that facilitates interaction between two variants of the same SLM and trains them to generate and refine rationales optimized for the end-task. The variants exhibit different behaviors to produce a set of diverse candidate rationales during the generation and refinement steps. The model is then trained via Selective Rationale Optimization (SRO) to prefer generating rationale candidates that maximize the likelihood of producing the ground-truth answer. During inference, COALITION employs a controller to select the suitable variant for generating and refining the rationales. On five different datasets covering mathematical problems, commonsense reasoning, and natural language inference, COALITION outperforms several baselines by up to 5%. Our ablation studies reveal that cross-communication between the two variants performs better than using the single model to self-refine the rationales. We also demonstrate the applicability of COALITION for LMs of varying scales (4B to 14B parameters) and model families (Mistral, Llama, Qwen, Phi). We release the code for this work at https://github.com/Sohanpatnaik106/coalition.
Chinese: COALITION是一种可训练框架,通过让同一小型语言模型的两个变体交互生成并优化多样化的推理路径,在五个数据集上比基线模型性能提升高达5%。
English: COALITION is a trainable framework that enables two variants of the same small language model to interact, generating and refining diverse rationales optimized for end-tasks, outperforming baselines by up to 5% across five datasets.

Authors:Yilun Qiu, Xiaoyan Zhao, Yang Zhang, Yimeng Bai, Wenjie Wang, Hong Cheng, Fuli Feng, Tat-Seng Chua
Title: Measuring What Makes You Unique: Difference-Aware User Modeling for Enhancing LLM Personalization
Abstract:
Personalizing Large Language Models (LLMs) has become a critical step in facilitating their widespread application to enhance individual life experiences. In pursuit of personalization, distilling key preference information from an individual's historical data as instructional preference context to customize LLM generation has emerged as a promising direction. However, these methods face a fundamental limitation by overlooking the inter-user comparative analysis, which is essential for identifying the inter-user differences that truly shape preferences. To address this limitation, we propose Difference-aware Personalization Learning (DPL), a novel approach that emphasizes extracting inter-user differences to enhance LLM personalization. DPL strategically selects representative users for comparison and establishes a structured standard to extract meaningful, task-relevant differences for customizing LLM generation. Extensive experiments on real-world datasets demonstrate that DPL significantly enhances LLM personalization. We release our code at https://github.com/SnowCharmQ/DPL.
中文摘要:本研究提出差异感知个性化学习(DPL)方法,通过分析用户间差异来增强大语言模型的个性化定制,有效解决了现有方法忽略用户对比分析的局限性。
English Summary: The study introduces Difference-aware Personalization Learning (DPL), a method that improves LLM personalization by analyzing inter-user differences, which overcomes the limitation of prior approaches that neglect comparative user analysis.

Authors:Wei Luo, Yunkang Cao, Haiming Yao, Xiaotian Zhang, Jianan Lou, Yuqi Cheng, Weiming Shen, Wenyong Yu
Title: Exploring Intrinsic Normal Prototypes within a Single Image for Universal Anomaly Detection
Abstract:
Anomaly detection (AD) is essential for industrial inspection, yet existing methods typically rely on ``comparing'' test images to normal references from a training set. However, variations in appearance and positioning often complicate the alignment of these references with the test image, limiting detection accuracy. We observe that most anomalies manifest as local variations, meaning that even within anomalous images, valuable normal information remains. We argue that this information is useful and may be more aligned with the anomalies since both the anomalies and the normal information originate from the same image. Therefore, rather than relying on external normality from the training set, we propose INP-Former, a novel method that extracts Intrinsic Normal Prototypes (INPs) directly from the test image. Specifically, we introduce the INP Extractor, which linearly combines normal tokens to represent INPs. We further propose an INP Coherence Loss to ensure INPs can faithfully represent normality for the testing image. These INPs then guide the INP-Guided Decoder to reconstruct only normal tokens, with reconstruction errors serving as anomaly scores. Additionally, we propose a Soft Mining Loss to prioritize hard-to-optimize samples during training. INP-Former achieves state-of-the-art performance in single-class, multi-class, and few-shot AD tasks across MVTec-AD, VisA, and Real-IAD, positioning it as a versatile and universal solution for AD. Remarkably, INP-Former also demonstrates some zero-shot AD capability. Code is available at:https://github.com/luow23/INP-Former.
Chinese: INP-Former提出了一种新颖的异常检测方法,直接从测试图像中提取内在正常原型来重建正常区域,在多个基准测试中实现了最先进的性能,并展现出零样本检测能力。
English: INP-Former introduces a novel anomaly detection method that extracts intrinsic normal prototypes directly from test images to reconstruct normal regions, achieving state-of-the-art performance across various benchmarks and demonstrating zero-shot capability.

Authors:Xiaoying Li, Long Xu, Xiaolin Huang, Donglai Xue, Zhihao Zhang, Zhichao Han, Chao Xu, Yanjun Cao, Fei Gao
Title: SEB-Naver: A SE(2)-based Local Navigation Framework for Car-like Robots on Uneven Terrain
Abstract:
Autonomous navigation of car-like robots on uneven terrain poses unique challenges compared to flat terrain, particularly in traversability assessment and terrain-associated kinematic modelling for motion planning. This paper introduces SEB-Naver, a novel SE(2)-based local navigation framework designed to overcome these challenges. First, we propose an efficient traversability assessment method for SE(2) grids, leveraging GPU parallel computing to enable real-time updates and maintenance of local maps. Second, inspired by differential flatness, we present an optimization-based trajectory planning method that integrates terrain-associated kinematic models, significantly improving both planning efficiency and trajectory quality. Finally, we unify these components into SEB-Naver, achieving real-time terrain assessment and trajectory optimization. Extensive simulations and real-world experiments demonstrate the effectiveness and efficiency of our approach. The code is at https://github.com/ZJU-FAST-Lab/seb_naver.
Chinese: 本文提出SEB-Naver框架,通过GPU并行计算实现实时可通行性评估,并结合地形关联运动学模型进行轨迹优化,有效提升了轮式机器人在不平坦地形中的导航性能。
English: This paper presents SEB-Naver, a real-time navigation framework for car-like robots on uneven terrain that combines GPU-accelerated traversability assessment with terrain-aware kinematic planning to enhance efficiency and trajectory quality.

Authors:Jiesi Hu, Chenfei Ye, Yanwu Yang, Xutao Guo, Yang Shang, Pengcheng Shi, Hanyang Peng, Ting Ma
Title: Neuroverse3D: Developing In-Context Learning Universal Model for Neuroimaging in 3D
Abstract:
In-context learning (ICL), a type of universal model, demonstrates exceptional generalization across a wide range of tasks without retraining by leveraging task-specific guidance from context, making it particularly effective for the intricate demands of neuroimaging. However, current ICL models, limited to 2D inputs and thus exhibiting suboptimal performance, struggle to extend to 3D inputs due to the high memory demands of ICL. In this regard, we introduce Neuroverse3D, an ICL model capable of performing multiple neuroimaging tasks in 3D (e.g., segmentation, denoising, inpainting). Neuroverse3D overcomes the large memory consumption associated with 3D inputs through adaptive parallel-sequential context processing and a U-shaped fusion strategy, allowing it to handle an unlimited number of context images. Additionally, we propose an optimized loss function to balance multi-task training and enhance focus on anatomical boundaries. Our study incorporates 43,674 3D multi-modal scans from 19 neuroimaging datasets and evaluates Neuroverse3D on 14 diverse tasks using held-out test sets. The results demonstrate that Neuroverse3D significantly outperforms existing ICL models and closely matches task-specific models, enabling flexible adaptation to medical center variations without retraining. The code and model weights are publicly available at https://github.com/jiesihu/Neuroverse3D.
中文: Neuroverse3D是一种创新的上下文学习模型,通过自适应并行-顺序上下文处理技术突破内存限制,能在无需重新训练的情况下处理3D神经影像任务,性能显著优于现有模型并接近专用模型水平。
English: Neuroverse3D is a novel in-context learning model that overcomes memory limitations to handle 3D neuroimaging tasks, outperforming existing models and matching task-specific performance without retraining.

Authors:Nikita Kazeev, Wei Nong, Ignat Romanov, Ruiming Zhu, Andrey Ustyuzhanin, Shuya Yamazaki, Kedar Hippalgaonkar
Title: Wyckoff Transformer: Generation of Symmetric Crystals
Abstract:
Crystal symmetry plays a fundamental role in determining its physical, chemical, and electronic properties such as electrical and thermal conductivity, optical and polarization behavior, and mechanical strength. Almost all known crystalline materials have internal symmetry. However, this is often inadequately addressed by existing generative models, making the consistent generation of stable and symmetrically valid crystal structures a significant challenge. We introduce WyFormer, a generative model that directly tackles this by formally conditioning on space group symmetry. It achieves this by using Wyckoff positions as the basis for an elegant, compressed, and discrete structure representation. To model the distribution, we develop a permutation-invariant autoregressive model based on the Transformer encoder and an absence of positional encoding. Extensive experimentation demonstrates WyFormer's compelling combination of attributes: it achieves best-in-class symmetry-conditioned generation, incorporates a physics-motivated inductive bias, produces structures with competitive stability, predicts material properties with competitive accuracy even without atomic coordinates, and exhibits unparalleled inference speed.
中文: WyFormer是一种生成模型,通过基于空间群对称性并使用Wyckoff位置进行结构表示,解决了生成稳定且对称有效晶体结构的难题,实现了卓越的对称性遵循、竞争性稳定性和快速推理能力。
English: WyFormer is a generative model that overcomes the challenge of producing stable, symmetry-valid crystal structures by conditioning on space group symmetry and using Wyckoff positions for representation, achieving superior symmetry compliance, competitive stability, and fast inference.

Authors:Nico Sutter, Valentin N. Hartmann, Stelian Coros
Title: A comparison of visual representations for real-world reinforcement learning in the context of vacuum gripping
Abstract:
When manipulating objects in the real world, we need reactive feedback policies that take into account sensor information to inform decisions. This study aims to determine how different encoders can be used in a reinforcement learning (RL) framework to interpret the spatial environment in the local surroundings of a robot arm. Our investigation focuses on comparing real-world vision with 3D scene inputs, exploring new architectures in the process. We built on the SERL framework, providing us with a sample efficient and stable RL foundation we could build upon, while keeping training times minimal. The results of this study indicate that spatial information helps to significantly outperform the visual counterpart, tested on a box picking task with a vacuum gripper. The code and videos of the evaluations are available at https://github.com/nisutte/voxel-serl.
中文: 本研究证明,在强化学习框架中采用空间编码器能使机械臂通过有效解析局部环境信息,在箱子拾取任务中显著优于基于视觉的方法。
English: This research demonstrates that using spatial encoders within a reinforcement learning framework enables robotic arms to outperform vision-based methods in box-picking tasks by effectively interpreting local environmental data.

Authors:Wei Sun, Qianlong Du, Fuwei Cui, Jiajun Zhang
Title: An Efficient and Precise Training Data Construction Framework for Process-supervised Reward Model in Mathematical Reasoning
Abstract:
Enhancing the mathematical reasoning capabilities of Large Language Models (LLMs) is of great scientific and practical significance. Researchers typically employ process-supervised reward models (PRMs) to guide the reasoning process, effectively improving the models' reasoning abilities. However, existing methods for constructing process supervision training data, such as manual annotation and per-step Monte Carlo estimation, are often costly or suffer from poor quality. To address these challenges, this paper introduces a framework called EpicPRM, which annotates each intermediate reasoning step based on its quantified contribution and uses an adaptive binary search algorithm to enhance both annotation precision and efficiency. Using this approach, we efficiently construct a high-quality process supervision training dataset named Epic50k, consisting of 50k annotated intermediate steps. Compared to other publicly available datasets, the PRM trained on Epic50k demonstrates significantly superior performance. Getting Epic50k at https://github.com/xiaolizh1/EpicPRM.
中文: EpicPRM提出了一种创新框架,通过量化每个推理步骤的贡献并采用自适应二分搜索算法,高效构建了高质量的Epic50k过程监督训练数据集,相比现有方法显著提升了PRM模型的性能表现。
English: EpicPRM introduces a novel framework that efficiently constructs high-quality process supervision training data by quantifying each reasoning step's contribution and using adaptive binary search, resulting in the Epic50k dataset which significantly enhances PRM performance compared to existing methods.

Authors:Xinyu Wang, Bohan Zhuang, Qi Wu
Title: Are Large Vision Language Models Good Game Players?
Abstract:
Large Vision Language Models (LVLMs) have demonstrated remarkable abilities in understanding and reasoning about both visual and textual information. However, existing evaluation methods for LVLMs, primarily based on benchmarks like Visual Question Answering and image captioning, often fail to capture the full scope of LVLMs' capabilities. These benchmarks are limited by issues such as inadequate assessment of detailed visual perception, data contamination, and a lack of focus on multi-turn reasoning. To address these challenges, we propose \method{}, a game-based evaluation framework designed to provide a comprehensive assessment of LVLMs' cognitive and reasoning skills in structured environments. \method{} uses a set of games to evaluate LVLMs on four core tasks: Perceiving, Question Answering, Rule Following, and End-to-End Playing, with each target task designed to assess specific abilities, including visual perception, reasoning, decision-making, etc. Based on this framework, we conduct extensive experiments that explore the limitations of current LVLMs, such as handling long structured outputs and perceiving detailed and dense elements. Code and data are publicly available at https://github.com/xinke-wang/LVLM-Playground.
中文: 作者提出了\method{},一种基于游戏的评估框架,通过感知、问答、规则遵循和端到端游戏四个核心任务全面评估大型视觉语言模型的认知与推理能力,解决了现有基准在详细视觉感知评估和多轮推理关注不足等局限性。
English: The authors introduce \method{}, a game-based evaluation framework that comprehensively assesses Large Vision Language Models' cognitive and reasoning abilities through four core tasks, addressing limitations in current benchmarks like inadequate visual perception assessment and lack of multi-turn reasoning focus.

Authors:Zicheng Zhang, Tengchuan Kou, Shushi Wang, Chunyi Li, Wei Sun, Wei Wang, Xiaoyu Li, Zongyu Wang, Xuezhi Cao, Xiongkuo Min, Xiaohong Liu, Guangtao Zhai
Title: Q-Eval-100K: Evaluating Visual Quality and Alignment Level for Text-to-Vision Content
Abstract:
Evaluating text-to-vision content hinges on two crucial aspects: visual quality and alignment. While significant progress has been made in developing objective models to assess these dimensions, the performance of such models heavily relies on the scale and quality of human annotations. According to Scaling Law, increasing the number of human-labeled instances follows a predictable pattern that enhances the performance of evaluation models. Therefore, we introduce a comprehensive dataset designed to Evaluate Visual quality and Alignment Level for text-to-vision content (Q-EVAL-100K), featuring the largest collection of human-labeled Mean Opinion Scores (MOS) for the mentioned two aspects. The Q-EVAL-100K dataset encompasses both text-to-image and text-to-video models, with 960K human annotations specifically focused on visual quality and alignment for 100K instances (60K images and 40K videos). Leveraging this dataset with context prompt, we propose Q-Eval-Score, a unified model capable of evaluating both visual quality and alignment with special improvements for handling long-text prompt alignment. Experimental results indicate that the proposed Q-Eval-Score achieves superior performance on both visual quality and alignment, with strong generalization capabilities across other benchmarks. These findings highlight the significant value of the Q-EVAL-100K dataset. Data and codes will be available at https://github.com/zzc-1998/Q-Eval.
中文: Q-EVAL-100K数据集通过大规模人工标注,使Q-Eval-Score模型在评估文本到视觉内容的质量和对齐度方面表现出卓越性能和泛化能力。
English: The Q-EVAL-100K dataset, featuring extensive human annotations, enables the Q-Eval-Score model to excel in evaluating text-to-vision content's visual quality and alignment with superior performance and generalization.

Authors:Yunzhen He, Yusuke Takase, Yoichi Ishibashi, Hidetoshi Shimodaira
Title: DeLTa: A Decoding Strategy based on Logit Trajectory Prediction Improves Factuality and Reasoning Ability
Abstract:
Large Language Models (LLMs) are increasingly being used in real-world applications. However, concerns about the reliability of the content they generate persist, as it frequently deviates from factual correctness or exhibits deficiencies in logical reasoning. This paper proposes a novel decoding strategy aimed at enhancing both factual accuracy and inferential reasoning without requiring any modifications to the architecture or pre-trained parameters of LLMs. Our approach adjusts next-token probabilities by analyzing the trajectory of logits from lower to higher layers in Transformers and applying linear regression. We find that this Decoding by Logit Trajectory-based approach (DeLTa) effectively reinforces factuality and reasoning while mitigating incorrect generation. Experiments on TruthfulQA demonstrate that DeLTa attains up to a 4.9% improvement over the baseline. Furthermore, it enhances performance by up to 8.1% on StrategyQA and 7.3% on GSM8K, both of which demand strong reasoning capabilities.
Chinese: 本文提出DeLTa解码策略,通过基于对数轨迹分析调整令牌概率来增强大语言模型的事实准确性和推理能力,在不改变模型架构的情况下,在推理任务上实现了高达8.1%的性能提升。
English: This paper introduces DeLTa, a novel decoding strategy that enhances the factual accuracy and reasoning capabilities of Large Language Models by adjusting token probabilities based on logit trajectory analysis, achieving improvements of up to 8.1% on reasoning tasks without modifying model architecture.

Authors:Gen Shi, Hui Zhang, Jie Tian
Title: COMMA: Coordinate-aware Modulated Mamba Network for 3D Dispersed Vessel Segmentation
Abstract:
Accurate segmentation of 3D vascular structures is essential for various medical imaging applications. The dispersed nature of vascular structures leads to inherent spatial uncertainty and necessitates location awareness, yet most current 3D medical segmentation models rely on the patch-wise training strategy that usually loses this spatial context. In this study, we introduce the Coordinate-aware Modulated Mamba Network (COMMA) and contribute a manually labeled dataset of 570 cases, the largest publicly available 3D vessel dataset to date. COMMA leverages both entire and cropped patch data through global and local branches, ensuring robust and efficient spatial location awareness. Specifically, COMMA employs a channel-compressed Mamba (ccMamba) block to encode entire image data, capturing long-range dependencies while optimizing computational costs. Additionally, we propose a coordinate-aware modulated (CaM) block to enhance interactions between the global and local branches, allowing the local branch to better perceive spatial information. We evaluate COMMA on six datasets, covering two imaging modalities and five types of vascular tissues. The results demonstrate COMMA's superior performance compared to state-of-the-art methods with computational efficiency, especially in segmenting small vessels. Ablation studies further highlight the importance of our proposed modules and spatial information. The code and data will be open source at https://github.com/shigen-StoneRoot/COMMA.
中文摘要:本研究提出坐标感知调制Mamba网络(COMMA),通过全局和局部分支增强空间定位能力,在多个数据集的血管分割任务中表现出卓越性能,尤其在小血管分割方面优势明显。
English Summary: This study introduces the Coordinate-aware Modulated Mamba Network (COMMA) for 3D vascular segmentation, which enhances spatial location awareness through global and local branches and demonstrates superior performance in segmenting small vessels across multiple datasets.

Authors:Guotao Shen, Ziheng Yan, Xin Jin, Longhai Wu, Jie Chen, Ilhyun Cho, Cheul-Hee Hahm
Title: Exploring Simple Siamese Network for High-Resolution Video Quality Assessment
Abstract:
In the research of video quality assessment (VQA), two-branch network has emerged as a promising solution. It decouples VQA with separate technical and aesthetic branches to measure the perception of low-level distortions and high-level semantics respectively. However, we argue that while technical and aesthetic perspectives are complementary, the technical perspective itself should be measured in semantic-aware manner. We hypothesize that existing technical branch struggles to perceive the semantics of high-resolution videos, as it is trained on local mini-patches sampled from videos. This issue can be hidden by apparently good results on low-resolution videos, but indeed becomes critical for high-resolution VQA. This work introduces SiamVQA, a simple but effective Siamese network for highre-solution VQA. SiamVQA shares weights between technical and aesthetic branches, enhancing the semantic perception ability of technical branch to facilitate technical-quality representation learning. Furthermore, it integrates a dual cross-attention layer for fusing technical and aesthetic features. SiamVQA achieves state-of-the-art accuracy on high-resolution benchmarks, and competitive results on lower-resolution benchmarks. Codes will be available at: https://github.com/srcn-ivl/SiamVQA
SiamVQA introduces a Siamese network with shared weights between technical and aesthetic branches to enhance semantic perception in technical quality assessment, achieving state-of-the-art performance on high-resolution video benchmarks.
English Summary:

Authors:Xueliang Zhao, Wei Wu, Jian Guan, Lingpeng Kong
Title: PromptCoT: Synthesizing Olympiad-level Problems for Mathematical Reasoning in Large Language Models
Abstract:
The ability of large language models to solve complex mathematical problems has progressed significantly, particularly for tasks requiring advanced reasoning. However, the scarcity of sufficiently challenging problems, particularly at the Olympiad level, hinders further advancements. In this work, we introduce PromptCoT, a novel approach for automatically generating high-quality Olympiad-level math problems. The proposed method synthesizes complex problems based on mathematical concepts and the rationale behind problem construction, emulating the thought processes of experienced problem designers. We provide a theoretical analysis demonstrating that an optimal rationale should maximize both the likelihood of rationale generation given the associated concepts and the likelihood of problem generation conditioned on both the rationale and the concepts. Our method is evaluated on standard benchmarks including GSM8K, MATH-500, and AIME2024, where it consistently outperforms existing problem generation methods. Furthermore, we demonstrate that PromptCoT exhibits superior data scalability, consistently maintaining high performance as the dataset size increases, outperforming the baselines. The implementation is available at https://github.com/zhaoxlpku/PromptCoT.
Chinese: PromptCoT提出了一种模拟专家推理自动生成高质量奥林匹克数学题的新方法,在基准测试中优于现有方法,并展现出随数据增长更优的扩展性。
English: PromptCoT introduces a novel method for automatically generating high-quality Olympiad-level math problems by emulating expert reasoning, which outperforms existing approaches on benchmarks and demonstrates superior scalability with increasing data.

Authors:Xin Jin, Longhai Wu, Jie Chen, Ilhyun Cho, Cheul-Hee Hahm
Title: Unified Arbitrary-Time Video Frame Interpolation and Prediction
Abstract:
Video frame interpolation and prediction aim to synthesize frames in-between and subsequent to existing frames, respectively. Despite being closely-related, these two tasks are traditionally studied with different model architectures, or same architecture but individually trained weights. Furthermore, while arbitrary-time interpolation has been extensively studied, the value of arbitrary-time prediction has been largely overlooked. In this work, we present uniVIP - unified arbitrary-time Video Interpolation and Prediction. Technically, we firstly extend an interpolation-only network for arbitrary-time interpolation and prediction, with a special input channel for task (interpolation or prediction) encoding. Then, we show how to train a unified model on common triplet frames. Our uniVIP provides competitive results for video interpolation, and outperforms existing state-of-the-arts for video prediction. Codes will be available at: https://github.com/srcn-ivl/uniVIP
中文: 本文提出了uniVIP模型,统一处理任意时间点的视频插帧和预测任务,在插帧方面表现优异,在预测方面超越了现有最优方法。
English: This paper introduces uniVIP, a unified model for arbitrary-time video interpolation and prediction that achieves competitive results in interpolation and outperforms state-of-the-art methods in prediction.

Authors:Tongkun Guan, Zining Wang, Pei Fu, Zhengtao Guo, Wei Shen, Kai Zhou, Tiezhu Yue, Chen Duan, Hao Sun, Qianyi Jiang, Junfeng Luo, Xiaokang Yang
Title: A Token-level Text Image Foundation Model for Document Understanding
Abstract:
In recent years, general visual foundation models (VFMs) have witnessed increasing adoption, particularly as image encoders for popular multi-modal large language models (MLLMs). However, without semantically fine-grained supervision, these models still encounter fundamental prediction errors in the context of downstream text-image-related tasks, i.e., perception, understanding and reasoning with images containing small and dense texts. To bridge this gap, we develop TokenOCR, the first token-level visual foundation model specifically tailored for text-image-related tasks, designed to support a variety of traditional downstream applications. To facilitate the pretraining of TokenOCR, we also devise a high-quality data production pipeline that constructs the first token-level image text dataset, TokenIT, comprising 20 million images and 1.8 billion token-mask pairs. Furthermore, leveraging this foundation with exceptional image-as-text capability, we seamlessly replace previous VFMs with TokenOCR to construct a document-level MLLM, TokenVL, for VQA-based document understanding tasks. Finally, extensive experiments demonstrate the effectiveness of TokenOCR and TokenVL. Code, datasets, and weights will be available at https://github.com/Token-family/TokenFD.
中文:TokenOCR是首个专为文本图像任务设计的令牌级视觉基础模型,通过创新的数据集和流程解决了预测错误问题,显著提升了文档理解等下游应用的性能。
English: TokenOCR is the first token-level visual foundation model developed to address fundamental prediction errors in text-image tasks by leveraging a novel dataset and pipeline, enhancing performance in downstream applications like document understanding.

Authors:Tong Liang, Jim Davis
Title: Making Better Mistakes in CLIP-Based Zero-Shot Classification with Hierarchy-Aware Language Prompts
Abstract:
Recent studies are leveraging advancements in large language models (LLMs) trained on extensive internet-crawled text data to generate textual descriptions of downstream classes in CLIP-based zero-shot image classification. While most of these approaches aim at improving accuracy, our work focuses on ``making better mistakes", of which the mistakes' severities are derived from the given label hierarchy of downstream tasks. Since CLIP's image encoder is trained with language supervising signals, it implicitly captures the hierarchical semantic relationships between different classes. This motivates our goal of making better mistakes in zero-shot classification, a task for which CLIP is naturally well-suited. Our approach (HAPrompts) queries the language model to produce textual representations for given classes as zero-shot classifiers of CLIP to perform image classification on downstream tasks. To our knowledge, this is the first work to introduce making better mistakes in CLIP-based zero-shot classification. Our approach outperforms the related methods in a holistic comparison across five datasets of varying scales with label hierarchies of different heights in our experiments. Our code and LLM-generated image prompts: \href{https://github.com/ltong1130ztr/HAPrompts}{https://github.com/ltong1130ztr/HAPrompts}.
Chinese: 本研究提出HAPrompts方法,利用大语言模型为基于CLIP的零样本图像分类生成层次感知文本提示,通过依据标签层级结构产生语义合理的分类错误,在多个数据集上展现出优越性能。
English: This study introduces HAPrompts, a novel approach that leverages large language models to generate hierarchical-aware textual prompts for CLIP-based zero-shot image classification, focusing on making semantically meaningful mistakes according to label hierarchies and demonstrating superior performance across multiple datasets.

Authors:Yixuan Huang, Jie Yang, Chao-Kai Wen, Shi Jin
Title: Integrated Communication and Learned Recognizer with Customized RIS Phases and Sensing Durations
Abstract:
Future wireless communication networks are expected to be smarter and more aware of their surroundings, enabling a wide range of context-aware applications. Reconfigurable intelligent surfaces (RISs) are set to play a critical role in supporting various sensing tasks, such as target recognition. However, current methods typically use RIS configurations optimized once and applied over fixed sensing durations, limiting their ability to adapt to different targets and reducing sensing accuracy. To overcome these limitations, this study proposes an advanced wireless communication system that multiplexes downlink signals for environmental sensing and introduces an intelligent recognizer powered by deep learning techniques. Specifically, we design a novel neural network based on the long short-term memory architecture and the physical channel model. This network iteratively captures and fuses information from previous measurements, adaptively customizing RIS phases to gather the most relevant information for the recognition task at subsequent moments. These configurations are dynamically adjusted according to scene, task, target, and quantization priors. Furthermore, the recognizer includes a decision-making module that dynamically allocates different sensing durations, determining whether to continue or terminate the sensing process based on the collected measurements. This approach maximizes resource utilization efficiency. Simulation results demonstrate that the proposed method significantly outperforms state-of-the-art techniques while minimizing the impact on communication performance, even when sensing and communication occur simultaneously. Part of the source code for this paper can be accessed at https://github.com/kiwi1944/CRISense.
中文: 本研究提出一种基于深度学习的先进无线系统,通过动态调整可重构智能表面和感知时长,在保障通信性能的同时显著提升了环境识别精度。
English: This study introduces an advanced wireless system that uses deep learning to dynamically adjust reconfigurable intelligent surfaces (RIS) and sensing durations, significantly improving environmental recognition accuracy while maintaining communication performance.

Authors:Zirui Wu, Xiao Liu, Jiayi Li, Lingpeng Kong, Yansong Feng
Title: Haste Makes Waste: Evaluating Planning Abilities of LLMs for Efficient and Feasible Multitasking with Time Constraints Between Actions
Abstract:
While Large Language Model-based agents have demonstrated substantial progress in task completion, existing evaluation benchmarks tend to overemphasize single-task performance, with insufficient attention given to the crucial aspects of multitask planning and execution efficiency required in real-world scenarios. To bridge this gap, we present Recipe2Plan, a novel benchmark framework based on real-world cooking scenarios. Unlike conventional benchmarks, Recipe2Plan challenges agents to optimize cooking time through parallel task execution while respecting temporal constraints i.e. specific actions need to be performed within a particular time intervals following the preceding steps. Overly aggressive local parallelization may disrupt this constraint, potentially compromising the entire cooking process. This strict time constraint between actions raises a unique challenge for agents to balance between maximizing concurrent operations and adhering to critical timing constraints. Extensive experiments with state-of-the-art models reveal challenges in maintaining this balance between efficiency and feasibility. The results highlight the need for improved temporal awareness and global multitasking capabilities in large language models. We open-source our benchmark and code at https://github.com/WilliamZR/Recipe2Plan.
中文: Recipe2Plan基准通过真实烹饪场景提出创新框架,评估智能体的多任务规划与执行效率,揭示了在并行优化与时间约束间保持平衡的挑战,并强调了大语言模型需提升时间感知能力。
English: The Recipe2Plan benchmark introduces a novel framework based on cooking scenarios to evaluate agents' multitask planning and execution efficiency, revealing challenges in balancing parallelization with temporal constraints and highlighting the need for improved temporal awareness in large language models.

Authors:Bo Cheng, Jueqing Lu, Yuan Tian, Haifeng Zhao, Yi Chang, Lan Du
Title: CGMatch: A Different Perspective of Semi-supervised Learning
Abstract:
Semi-supervised learning (SSL) has garnered significant attention due to its ability to leverage limited labeled data and a large amount of unlabeled data to improve model generalization performance. Recent approaches achieve impressive successes by combining ideas from both consistency regularization and pseudo-labeling. However, these methods tend to underperform in the more realistic situations with relatively scarce labeled data. We argue that this issue arises because existing methods rely solely on the model's confidence, making them challenging to accurately assess the model's state and identify unlabeled examples contributing to the training phase when supervision information is limited, especially during the early stages of model training. In this paper, we propose a novel SSL model called CGMatch, which, for the first time, incorporates a new metric known as Count-Gap (CG). We demonstrate that CG is effective in discovering unlabeled examples beneficial for model training. Along with confidence, a commonly used metric in SSL, we propose a fine-grained dynamic selection (FDS) strategy. This strategy dynamically divides the unlabeled dataset into three subsets with different characteristics: easy-to-learn set, ambiguous set, and hard-to-learn set. By selective filtering subsets, and applying corresponding regularization with selected subsets, we mitigate the negative impact of incorrect pseudo-labels on model optimization and generalization. Extensive experimental results on several common SSL benchmarks indicate the effectiveness of CGMatch especially when the labeled data are particularly limited. Source code is available at https://github.com/BoCheng-96/CGMatch.
Chinese: 半监督学习在标记数据稀缺时表现不佳,而提出的CGMatch模型通过引入计数间隙指标和细粒度动态选择策略,能有效识别有益的无标记样本并提升性能。
English: Semi-supervised learning faces challenges with scarce labeled data, but the proposed CGMatch model introduces a Count-Gap metric and fine-grained dynamic selection strategy to effectively identify beneficial unlabeled examples and improve performance.

Authors:Haoyuan Li, Ziqin Ye, Yue Hao, Weiyang Lin, Chao Ye
Title: DQO-MAP: Dual Quadrics Multi-Object mapping with Gaussian Splatting
Abstract:
Accurate object perception is essential for robotic applications such as object navigation. In this paper, we propose DQO-MAP, a novel object-SLAM system that seamlessly integrates object pose estimation and reconstruction. We employ 3D Gaussian Splatting for high-fidelity object reconstruction and leverage quadrics for precise object pose estimation. Both of them management is handled on the CPU, while optimization is performed on the GPU, significantly improving system efficiency. By associating objects with unique IDs, our system enables rapid object extraction from the scene. Extensive experimental results on object reconstruction and pose estimation demonstrate that DQO-MAP achieves outstanding performance in terms of precision, reconstruction quality, and computational efficiency. The code and dataset are available at: https://github.com/LiHaoy-ux/DQO-MAP.
中文: DQO-MAP是一种新型物体SLAM系统,通过结合3D高斯泼溅实现高保真物体重建和二次曲面精确定位,在精度、重建质量和计算效率方面均表现出卓越性能。
English: DQO-MAP is a novel object-SLAM system that integrates 3D Gaussian Splatting for high-fidelity object reconstruction and quadrics for precise pose estimation, achieving superior performance in precision, reconstruction quality, and computational efficiency.

Authors:Zhihua Shen, Siyang Chen, Han Wang, Tongsu Zhang, Xiaohu Zhang, Xiangpeng Xu, Xia Yang
Title: Low-Level Matters: An Efficient Hybrid Architecture for Robust Multi-frame Infrared Small Target Detection
Abstract:
Multi-frame infrared small target detection (IRSTD) plays a crucial role in low-altitude and maritime surveillance. The hybrid architecture combining CNNs and Transformers shows great promise for enhancing multi-frame IRSTD performance. In this paper, we propose LVNet, a simple yet powerful hybrid architecture that redefines low-level feature learning in hybrid frameworks for multi-frame IRSTD. Our key insight is that the standard linear patch embeddings in Vision Transformers are insufficient for capturing the scale-sensitive local features critical to infrared small targets. To address this limitation, we introduce a multi-scale CNN frontend that explicitly models local features by leveraging the local spatial bias of convolution. Additionally, we design a U-shaped video Transformer for multi-frame spatiotemporal context modeling, effectively capturing the motion characteristics of targets. Experiments on the publicly available datasets IRDST and NUDT-MIRSDT demonstrate that LVNet outperforms existing state-of-the-art methods. Notably, compared to the current best-performing method, LMAFormer, LVNet achieves an improvement of 5.63\% / 18.36\% in nIoU, while using only 1/221 of the parameters and 1/92 / 1/21 of the computational cost. Ablation studies further validate the importance of low-level representation learning in hybrid architectures. Our code and trained models are available at https://github.com/ZhihuaShen/LVNet.
中文: LVNet提出了一种结合多尺度CNN前端和U型视频Transformer的混合架构,通过优化底层特征学习和时空上下文建模,在红外小目标多帧检测中实现了更高精度,同时大幅降低了参数量和计算开销。
English: LVNet introduces a hybrid CNN-Transformer architecture with a multi-scale CNN frontend for enhanced low-level feature extraction and a U-shaped video Transformer for spatiotemporal modeling, achieving superior performance in multi-frame infrared small target detection with significantly reduced parameters and computational costs.

Authors:Shuo Wang, Tong Ren, Nan Cheng, Rong Wang, Li Zhang
Title: Time-Varying Coronary Artery Deformation: A Dynamic Skinning Framework for Surgical Training
Abstract:
Purpose: This study proposes a novel anatomically-driven dynamic modeling framework for coronary arteries using skeletal skinning weights computation, aiming to achieve precise control over vessel deformation while maintaining real-time performance for surgical simulation applications. Methods: We developed a computational framework based on biharmonic energy minimization for skinning weight calculation, incorporating volumetric discretization through tetrahedral mesh generation. The method implements temporal sampling and interpolation for continuous vessel deformation throughout the cardiac cycle, with mechanical constraints and volume conservation enforcement. The framework was validated using clinical datasets from 5 patients, comparing interpolated deformation results against ground truth data obtained from frame-by-frame segmentation across cardiac phases. Results: The proposed framework effectively handled interactive vessel manipulation. Geometric accuracy evaluation showed mean Hausdorff distance of 4.96 +- 1.78 mm and mean surface distance of 1.78 +- 0.75 mm between interpolated meshes and ground truth models. The Branch Completeness Ratio achieved 1.82 +- 0.46, while Branch Continuity Score maintained 0.84 +- 0.06 (scale 0-1) across all datasets. The system demonstrated capability in supporting real-time guidewire-vessel collision detection and contrast medium flow simulation throughout the complete coronary tree structure. Conclusion: Our skinning weight-based methodology enhances model interactivity and applicability while maintaining geometric accuracy. The framework provides a more flexible technical foundation for virtual surgical training systems, demonstrating promising potential for both clinical practice and medical education applications. The code is available at https://github.com/ipoirot/DynamicArtery.
本研究提出一种基于骨骼蒙皮权重的动态冠状动脉建模框架,通过临床数据验证表明该方法在保持几何精度的同时,能够实现实时交互式血管变形控制,适用于手术模拟应用。
This study introduces a dynamic coronary artery modeling framework using skeletal skinning weights to enable real-time deformation control for surgical simulations, validated with clinical data showing high geometric accuracy and interactive performance.

Authors:Toan Nguyen, Kien Do, Duc Kieu, Thin Nguyen
Title: h-Edit: Effective and Flexible Diffusion-Based Editing via Doob's h-Transform
Abstract:
We introduce a theoretical framework for diffusion-based image editing by formulating it as a reverse-time bridge modeling problem. This approach modifies the backward process of a pretrained diffusion model to construct a bridge that converges to an implicit distribution associated with the editing target at time 0. Building on this framework, we propose h-Edit, a novel editing method that utilizes Doob's h-transform and Langevin Monte Carlo to decompose the update of an intermediate edited sample into two components: a "reconstruction" term and an "editing" term. This decomposition provides flexibility, allowing the reconstruction term to be computed via existing inversion techniques and enabling the combination of multiple editing terms to handle complex editing tasks. To our knowledge, h-Edit is the first training-free method capable of performing simultaneous text-guided and reward-model-based editing. Extensive experiments, both quantitative and qualitative, show that h-Edit outperforms state-of-the-art baselines in terms of editing effectiveness and faithfulness. Our source code is available at https://github.com/nktoan/h-edit.
中文摘要:我们提出h-Edit这一无需训练的扩散图像编辑方法,通过Doob's h变换将编辑过程分解为重建和编辑两个组件,在文本引导与奖励模型协同编辑方面实现了最优性能。
English Summary: We propose h-Edit, a training-free diffusion-based image editing method that employs Doob's h-transform to decompose edits into reconstruction and editing components, achieving superior performance in simultaneous text-guided and reward-model-based editing.

Authors:Saeed Ranjbar Alvar, Gursimran Singh, Mohammad Akbari, Yong Zhang
Title: DivPrune: Diversity-based Visual Token Pruning for Large Multimodal Models
Abstract:
Large Multimodal Models (LMMs) have emerged as powerful models capable of understanding various data modalities, including text, images, and videos. LMMs encode both text and visual data into tokens that are then combined and processed by an integrated Large Language Model (LLM). Including visual tokens substantially increases the total token count, often by thousands. The increased input length for LLM significantly raises the complexity of inference, resulting in high latency in LMMs. To address this issue, token pruning methods, which remove part of the visual tokens, are proposed. The existing token pruning methods either require extensive calibration and fine-tuning or rely on suboptimal importance metrics which results in increased redundancy among the retained tokens. In this paper, we first formulate token pruning as Max-Min Diversity Problem (MMDP) where the goal is to select a subset such that the diversity among the selected {tokens} is maximized. Then, we solve the MMDP to obtain the selected subset and prune the rest. The proposed method, DivPrune, reduces redundancy and achieves the highest diversity of the selected tokens. By ensuring high diversity, the selected tokens better represent the original tokens, enabling effective performance even at high pruning ratios without requiring fine-tuning. Extensive experiments with various LMMs show that DivPrune achieves state-of-the-art accuracy over 16 image- and video-language datasets. Additionally, DivPrune reduces both the end-to-end latency and GPU memory usage for the tested models. The code is available $\href{https://github.com/vbdi/divprune}{\text{here}}$.
中文: 提出的DivPrune方法将令牌剪枝构建为最大最小多样性问题,通过最大化所选视觉令牌的多样性,在无需微调的情况下实现了跨多个数据集的最优精度,同时降低了延迟和内存使用。
English: The proposed DivPrune method formulates token pruning as a Max-Min Diversity Problem to maximize diversity among selected visual tokens, achieving state-of-the-art accuracy across multiple datasets while reducing latency and memory usage without requiring fine-tuning.

Authors:Jiacheng Zhang, Benjamin I. P. Rubinstein, Jingfeng Zhang, Feng Liu
Title: One Stone, Two Birds: Enhancing Adversarial Defense Through the Lens of Distributional Discrepancy
Abstract:
Statistical adversarial data detection (SADD) detects whether an upcoming batch contains adversarial examples (AEs) by measuring the distributional discrepancies between clean examples (CEs) and AEs. In this paper, we explore the strength of SADD-based methods by theoretically showing that minimizing distributional discrepancy can help reduce the expected loss on AEs. Despite these advantages, SADD-based methods have a potential limitation: they discard inputs that are detected as AEs, leading to the loss of useful information within those inputs. To address this limitation, we propose a two-pronged adversarial defense method, named Distributional-discrepancy-based Adversarial Defense (DAD). In the training phase, DAD first optimizes the test power of the maximum mean discrepancy (MMD) to derive MMD-OPT, which is a stone that kills two birds. MMD-OPT first serves as a guiding signal to minimize the distributional discrepancy between CEs and AEs to train a denoiser. Then, it serves as a discriminator to differentiate CEs and AEs during inference. Overall, in the inference stage, DAD consists of a two-pronged process: (1) directly feeding the detected CEs into the classifier, and (2) removing noise from the detected AEs by the distributional-discrepancy-based denoiser. Extensive experiments show that DAD outperforms current state-of-the-art (SOTA) defense methods by simultaneously improving clean and robust accuracy on CIFAR-10 and ImageNet-1K against adaptive white-box attacks. Codes are publicly available at: https://github.com/tmlr-group/DAD.
中文: 提出的基于分布差异的对抗防御方法(DAD)通过训练去噪器最小化干净样本与对抗样本的分布差异,在推理阶段对检测出的对抗样本进行净化处理,在CIFAR-10和ImageNet-1K数据集上实现了超越现有方法的清洁准确率与鲁棒准确率。
English: The proposed Distributional-discrepancy-based Adversarial Defense (DAD) method enhances security by training a denoiser to minimize distributional gaps between clean and adversarial examples, then using it to purify detected adversarial inputs during inference, achieving superior clean and robust accuracy on benchmark datasets.

Authors:Chia-Wei Hsu, Nien-Ti Tsou, Yu-Cheng Chen, Yang Jeong Park, Ju Li
Title: Frankenstein Optimizer: Harnessing the Potential by Revisiting Optimization Tricks
Abstract:
Gradient-based optimization drives the unprecedented performance of modern deep neural network models across diverse applications. Adaptive algorithms have accelerated neural network training due to their rapid convergence rates; however, they struggle to find ``flat minima" reliably, resulting in suboptimal generalization compared to stochastic gradient descent (SGD). By revisiting various adaptive algorithms' mechanisms, we propose the Frankenstein optimizer, which combines their advantages. The proposed Frankenstein dynamically adjusts first- and second-momentum coefficients according to the optimizer's current state to directly maintain consistent learning dynamics and immediately reflect sudden gradient changes. Extensive experiments across several research domains such as computer vision, natural language processing, few-shot learning, and scientific simulations show that Frankenstein surpasses existing adaptive algorithms and SGD empirically regarding convergence speed and generalization performance. Furthermore, this research deepens our understanding of adaptive algorithms through centered kernel alignment analysis and loss landscape visualization during the learning process. Code is available at https://github.com/acctouhou/Frankenstein_optimizer
中文:弗兰肯斯坦优化器通过动态调整动量系数来保持稳定的学习动态并快速响应梯度变化,在多个领域中其收敛速度和泛化能力均超越了现有自适应算法及随机梯度下降法。
English: The Frankenstein optimizer dynamically adjusts momentum coefficients to maintain consistent learning dynamics and rapidly respond to gradient changes, outperforming both adaptive algorithms and SGD in convergence speed and generalization across multiple domains.

Authors:Zhixuan Lin, Evgenii Nikishin, Xu Owen He, Aaron Courville
Title: Forgetting Transformer: Softmax Attention with a Forget Gate
Abstract:
An essential component of modern recurrent sequence models is the forget gate. While Transformers do not have an explicit recurrent form, we show that a forget gate can be naturally incorporated into Transformers by down-weighting the unnormalized attention scores in a data-dependent way. We name this attention mechanism Forgetting Attention and the resulting model the Forgetting Transformer (FoX). We show that FoX outperforms the Transformer on long-context language modeling, length extrapolation, and short-context downstream tasks, while performing on par with the Transformer on long-context downstream tasks. Moreover, it is compatible with the FlashAttention algorithm and does not require any positional embeddings. Several analyses, including the needle-in-the-haystack test, show that FoX also retains the Transformer's superior long-context capabilities over recurrent sequence models such as Mamba-2, HGRN2, and DeltaNet. We also introduce a "Pro" block design that incorporates some common architectural components in recurrent sequence models and find it significantly improves the performance of both FoX and the Transformer. Our code is available at https://github.com/zhixuan-lin/forgetting-transformer.
Chinese: 遗忘变换器(FoX)通过数据依赖性地降低未归一化注意力分数,将遗忘门自然融入变换器中,在长上下文语言建模和短上下文任务中表现卓越,同时与FlashAttention算法兼容且无需位置嵌入。
English: The Forgetting Transformer (FoX) incorporates a forget gate into Transformers by adaptively down-weighting attention scores, achieving superior performance in long-context language modeling and short-context tasks while maintaining compatibility with FlashAttention and eliminating the need for positional embeddings.

Authors:Boyong He, Yuxiang Ji, Qianwen Ye, Zhuoyue Tan, Liaoni Wu
Title: Generalized Diffusion Detector: Mining Robust Features from Diffusion Models for Domain-Generalized Detection
Abstract:
Domain generalization (DG) for object detection aims to enhance detectors' performance in unseen scenarios. This task remains challenging due to complex variations in real-world applications. Recently, diffusion models have demonstrated remarkable capabilities in diverse scene generation, which inspires us to explore their potential for improving DG tasks. Instead of generating images, our method extracts multi-step intermediate features during the diffusion process to obtain domain-invariant features for generalized detection. Furthermore, we propose an efficient knowledge transfer framework that enables detectors to inherit the generalization capabilities of diffusion models through feature and object-level alignment, without increasing inference time. We conduct extensive experiments on six challenging DG benchmarks. The results demonstrate that our method achieves substantial improvements of 14.0% mAP over existing DG approaches across different domains and corruption types. Notably, our method even outperforms most domain adaptation methods without accessing any target domain data. Moreover, the diffusion-guided detectors show consistent improvements of 15.9% mAP on average compared to the baseline. Our work aims to present an effective approach for domain-generalized detection and provide potential insights for robust visual recognition in real-world scenarios. The code is available at https://github.com/heboyong/Generalized-Diffusion-Detector.
中文: 本研究提出了一种新颖的物体检测领域泛化方法,通过提取扩散模型的多步中间特征获取领域不变特征,并采用知识转移框架实现特征对齐,在不增加推理时间的情况下比现有方法显著提升14.0% mAP。
English: This study introduces a novel domain generalization method for object detection that leverages intermediate features from diffusion models to achieve domain-invariant representations and employs a knowledge transfer framework for feature alignment, resulting in significant performance improvements of up to 14.0% mAP over existing approaches without increasing inference time.

Authors:Rustin Soraki, Huayu Wang, Joann G. Elmore, Linda Shapiro
Title: CrossFusion: A Multi-Scale Cross-Attention Convolutional Fusion Model for Cancer Survival Prediction
Abstract:
Cancer survival prediction from whole slide images (WSIs) is a challenging task in computational pathology due to the large size, irregular shape, and high granularity of the WSIs. These characteristics make it difficult to capture the full spectrum of patterns, from subtle cellular abnormalities to complex tissue interactions, which are crucial for accurate prognosis. To address this, we propose CrossFusion, a novel multi-scale feature integration framework that extracts and fuses information from patches across different magnification levels. By effectively modeling both scale-specific patterns and their interactions, CrossFusion generates a rich feature set that enhances survival prediction accuracy. We validate our approach across six cancer types from public datasets, demonstrating significant improvements over existing state-of-the-art methods. Moreover, when coupled with domain-specific feature extraction backbones, our method shows further gains in prognostic performance compared to general-purpose backbones. The source code is available at: https://github.com/RustinS/CrossFusion
Chinese: CrossFusion是一种新颖的多尺度特征融合框架,通过有效整合不同放大倍数下的全切片图像信息,显著提升了六种癌症类型的生存预测准确性。
English: CrossFusion is a novel multi-scale feature integration framework that enhances cancer survival prediction accuracy by effectively fusing information from whole slide images at different magnification levels, demonstrating significant improvements across six cancer types.

Authors:Ruth Crasto
Title: Robustness to Geographic Distribution Shift Using Location Encoders
Abstract:
Geographic distribution shift arises when the distribution of locations on Earth in a training dataset is different from what is seen at test time. The most common approaches to tackling geographic distribution shift treat regions delimited by administrative boundaries such as countries or continents as separate domains and apply standard domain adaptation methods, ignoring geographic coordinates that are often available as metadata. This paper proposes the use of location encoders for modeling continuous, learnable domain assignment. We show how both non-parametric sine-cosine encoders and pre-trained location encoders can be used in conjunction with standard domain adaptation methods for improved robustness to geographic distribution shift. Our proposed methods achieve new state-of-the-art results on two geo-tagged remote sensing datasets from the WILDS benchmark. We have made our code publicly available at: https://github.com/crastoru/wilds-geoshift.
中文摘要:本文提出使用位置编码器来建模连续可学习的域分配,通过将其与标准域适应方法结合,在两个地理标记遥感数据集上实现了最先进的性能,有效应对地理分布偏移问题。
English Summary: This paper introduces location encoders to model continuous domain assignments for addressing geographic distribution shift, achieving state-of-the-art results on geo-tagged datasets by integrating them with standard domain adaptation methods.

Authors:Zhusi Zhong, Yuli Wang, Lulu Bi, Zhuoqi Ma, Sun Ho Ahn, Christopher J. Mullin, Colin F. Greineder, Michael K. Atalay, Scott Collins, Grayson L. Baird, Cheng Ting Lin, Webster Stayman, Todd M. Kolb, Ihab Kamel, Harrison X. Bai, Zhicheng Jiao
Title: Abn-BLIP: Abnormality-aligned Bootstrapping Language-Image Pre-training for Pulmonary Embolism Diagnosis and Report Generation from CTPA
Abstract:
Medical imaging plays a pivotal role in modern healthcare, with computed tomography pulmonary angiography (CTPA) being a critical tool for diagnosing pulmonary embolism and other thoracic conditions. However, the complexity of interpreting CTPA scans and generating accurate radiology reports remains a significant challenge. This paper introduces Abn-BLIP (Abnormality-aligned Bootstrapping Language-Image Pretraining), an advanced diagnosis model designed to align abnormal findings to generate the accuracy and comprehensiveness of radiology reports. By leveraging learnable queries and cross-modal attention mechanisms, our model demonstrates superior performance in detecting abnormalities, reducing missed findings, and generating structured reports compared to existing methods. Our experiments show that Abn-BLIP outperforms state-of-the-art medical vision-language models and 3D report generation methods in both accuracy and clinical relevance. These results highlight the potential of integrating multimodal learning strategies for improving radiology reporting. The source code is available at https://github.com/zzs95/abn-blip.
中文: 本文提出的Abn-BLIP模型通过多模态学习对齐异常发现,在CTPA影像解读中展现出优于现有方法的准确性和临床实用性,显著提升了放射学报告的质量。
English: This paper presents Abn-BLIP, an advanced diagnostic model that enhances radiology reporting by aligning abnormal findings through multimodal learning, demonstrating superior accuracy and clinical relevance in CTPA interpretation compared to existing methods.

Authors:Davide Caffagni, Sara Sarto, Marcella Cornia, Lorenzo Baraldi, Rita Cucchiara
Title: Recurrence-Enhanced Vision-and-Language Transformers for Robust Multimodal Document Retrieval
Abstract:
Cross-modal retrieval is gaining increasing efficacy and interest from the research community, thanks to large-scale training, novel architectural and learning designs, and its application in LLMs and multimodal LLMs. In this paper, we move a step forward and design an approach that allows for multimodal queries, composed of both an image and a text, and can search within collections of multimodal documents, where images and text are interleaved. Our model, ReT, employs multi-level representations extracted from different layers of both visual and textual backbones, both at the query and document side. To allow for multi-level and cross-modal understanding and feature extraction, ReT employs a novel Transformer-based recurrent cell that integrates both textual and visual features at different layers, and leverages sigmoidal gates inspired by the classical design of LSTMs. Extensive experiments on M2KR and M-BEIR benchmarks show that ReT achieves state-of-the-art performance across diverse settings. Our source code and trained models are publicly available at https://github.com/aimagelab/ReT.
Chinese: 本文提出ReT模型,采用多模态查询和文档,通过基于Transformer的循环单元实现多层次特征融合,在多个基准测试中取得了领先性能。
English: The paper introduces ReT, a novel cross-modal retrieval model that uses multimodal queries and documents with a Transformer-based recurrent cell for enhanced feature integration, achieving state-of-the-art results on benchmarks.

Authors:Kunlun Zhu, Hongyi Du, Zhaochen Hong, Xiaocheng Yang, Shuyi Guo, Zhe Wang, Zhenhailong Wang, Cheng Qian, Xiangru Tang, Heng Ji, Jiaxuan You
Title: MultiAgentBench: Evaluating the Collaboration and Competition of LLM agents
Abstract:
Large Language Models (LLMs) have shown remarkable capabilities as autonomous agents, yet existing benchmarks either focus on single-agent tasks or are confined to narrow domains, failing to capture the dynamics of multi-agent coordination and competition. In this paper, we introduce MultiAgentBench, a comprehensive benchmark designed to evaluate LLM-based multi-agent systems across diverse, interactive scenarios. Our framework measures not only task completion but also the quality of collaboration and competition using novel, milestone-based key performance indicators. Moreover, we evaluate various coordination protocols (including star, chain, tree, and graph topologies) and innovative strategies such as group discussion and cognitive planning. Notably, gpt-4o-mini reaches the average highest task score, graph structure performs the best among coordination protocols in the research scenario, and cognitive planning improves milestone achievement rates by 3%. Code and datasets are public available at https://github.com/MultiagentBench/MARBLE.
中文摘要:该摘要介绍了MultiAgentBench这一新基准,用于评估多智能体大语言模型在多样化交互场景中的表现,衡量协作、竞争及多种协调协议,研究发现GPT-4o-mini获得最高任务分且认知规划使里程碑达成率提升3%。
English Summary: The abstract introduces MultiAgentBench, a new benchmark for evaluating multi-agent LLM systems across diverse interactive scenarios, measuring collaboration, competition, and various coordination protocols, with findings showing GPT-4o-mini achieving the highest task scores and cognitive planning boosting milestone completion by 3%.

Authors:Mingjie Wen, Jiahe Han, Wenjuan Li, Xiaoya Chang, Qingzhao Chu, Dongping Chen
Title: A General Neural Network Potential for Energetic Materials with C, H, N, and O elements
Abstract:
The discovery and optimization of high-energy materials (HEMs) are constrained by the prohibitive computational expense and prolonged development cycles inherent in conventional approaches. In this work, we develop a general neural network potential (NNP) that efficiently predicts the structural, mechanical, and decomposition properties of HEMs composed of C, H, N, and O. Our framework leverages pre-trained NNP models, fine-tuned using transfer learning on energy and force data derived from density functional theory (DFT) calculations. This strategy enables rapid adaptation across 20 different HEM systems while maintaining DFT-level accuracy, significantly reducing computational costs. A key aspect of this work is the ability of NNP model to capture the chemical activity space of HEMs, accurately describe the key atomic interactions and reaction mechanisms during thermal decomposition. The general NNP model has been applied in molecular dynamics (MD) simulations and validated with experimental data for various HEM structures. Results show that the NNP model accurately predicts the structural, mechanical, and decomposition properties of HEMs by effectively describing their chemical activity space. Compared to traditional force fields, it offers superior DFT-level accuracy and generalization across both microscopic and macroscopic properties, reducing the computational and experimental costs. This work provides an efficient strategy for the design and development of HEMs and proposes a promising framework for integrating DFT, machine learning, and experimental methods in materials research. (To facilitate further research and practical applications, we open-source our NNP model on GitHub: https://github.com/MingjieWen/General-NNP-model-for-C-H-N-O-Energetic-Materials.)
Chinese: 本研究开发了一种通用神经网络势,能够准确预测高能材料的结构、力学和分解特性,在保持密度泛函理论精度的同时显著降低了计算成本。
English: This study develops a general neural network potential that accurately predicts the structural, mechanical, and decomposition properties of high-energy materials, significantly reducing computational costs while maintaining DFT-level accuracy.

Authors:Wang YuHang, Junkang Guo, Aolei Liu, Kaihao Wang, Zaitong Wu, Zhenyu Liu, Wenfei Yin, Jian Liu
Title: TAET: Two-Stage Adversarial Equalization Training on Long-Tailed Distributions
Abstract:
Adversarial robustness is a critical challenge in deploying deep neural networks for real-world applications. While adversarial training is a widely recognized defense strategy, most existing studies focus on balanced datasets, overlooking the prevalence of long-tailed distributions in real-world data, which significantly complicates robustness. This paper provides a comprehensive analysis of adversarial training under long-tailed distributions and identifies limitations in the current state-of-the-art method, AT-BSL, in achieving robust performance under such conditions. To address these challenges, we propose a novel training framework, TAET, which integrates an initial stabilization phase followed by a stratified equalization adversarial training phase. Additionally, prior work on long-tailed robustness has largely ignored the crucial evaluation metric of balanced accuracy. To bridge this gap, we introduce the concept of balanced robustness, a comprehensive metric tailored for assessing robustness under long-tailed distributions. Extensive experiments demonstrate that our method surpasses existing advanced defenses, achieving significant improvements in both memory and computational efficiency. This work represents a substantial advancement in addressing robustness challenges in real-world applications. Our code is available at: https://github.com/BuhuiOK/TAET-Two-Stage-Adversarial-Equalization-Training-on-Long-Tailed-Distributions.
中文: 本文提出TAET这一新型两阶段对抗训练框架,通过稳定化与分层均衡化训练显著提升长尾数据分布下的模型鲁棒性与效率,并创新性地提出平衡鲁棒性指标以更准确评估现实场景中的防御性能。
English: This paper introduces TAET, a novel two-stage adversarial training framework that enhances robustness and efficiency for deep neural networks on long-tailed data distributions, while proposing the balanced robustness metric to better evaluate performance under such real-world conditions.

Authors:Lily Xu, Bryan Wilder, Elias B. Khalil, Milind Tambe
Title: Reinforcement learning with combinatorial actions for coupled restless bandits
Abstract:
Reinforcement learning (RL) has increasingly been applied to solve real-world planning problems, with progress in handling large state spaces and time horizons. However, a key bottleneck in many domains is that RL methods cannot accommodate large, combinatorially structured action spaces. In such settings, even representing the set of feasible actions at a single step may require a complex discrete optimization formulation. We leverage recent advances in embedding trained neural networks into optimization problems to propose SEQUOIA, an RL algorithm that directly optimizes for long-term reward over the feasible action space. Our approach embeds a Q-network into a mixed-integer program to select a combinatorial action in each timestep. Here, we focus on planning over restless bandits, a class of planning problems which capture many real-world examples of sequential decision making. We introduce coRMAB, a broader class of restless bandits with combinatorial actions that cannot be decoupled across the arms of the restless bandit, requiring direct solving over the joint, exponentially large action space. We empirically validate SEQUOIA on four novel restless bandit problems with combinatorial constraints: multiple interventions, path constraints, bipartite matching, and capacity constraints. Our approach significantly outperforms existing methods -- which cannot address sequential planning and combinatorial selection simultaneously -- by an average of 24.8\% on these difficult instances.
中文摘要:强化学习在处理大规模组合动作空间时存在瓶颈,而SEQUOIA算法通过将Q网络嵌入混合整数规划直接优化长期奖励,在组合约束的 restless bandit 问题上比现有方法平均提升24.8%的性能。
English Summary: Reinforcement learning faces challenges with large combinatorial action spaces, but SEQUOIA overcomes this by embedding Q-networks into optimization programs to directly maximize long-term rewards, achieving 24.8% better performance on complex restless bandit problems.

Authors:Michal Spiegel, Michal Štefánik, Marek Kadlčík, Josef Kuchař
Title: Attend or Perish: Benchmarking Attention in Algorithmic Reasoning
Abstract:
Can transformers learn to perform algorithmic tasks reliably across previously unseen input/output domains? While pre-trained language models show solid accuracy on benchmarks incorporating algorithmic reasoning, assessing the reliability of these results necessitates an ability to distinguish genuine algorithmic understanding from memorization. In this paper, we propose AttentionSpan, an algorithmic benchmark comprising five tasks of infinite input domains where we can disentangle and trace the correct, robust algorithm necessary for the task. This allows us to assess (i) models' ability to extrapolate to unseen types of inputs, including new lengths, value ranges or input domains, but also (ii)to assess the robustness of their learned mechanisms. By analyzing attention maps and performing targeted interventions, we show that attention mechanism directly causes failures in extrapolation. We make the implementation of all our tasks and interpretability methods publicly available at https://github.com/michalspiegel/AttentionSpan .
Chinese: 本研究提出AttentionSpan基准,通过测试模型对未见输入的泛化能力及机制稳健性,评估Transformer是否真正学习算法而非仅记忆数据,揭示注意力机制是导致泛化失败的直接原因。
English: The study introduces AttentionSpan, a benchmark to evaluate if transformers genuinely learn algorithms or merely memorize data by testing their ability to extrapolate to unseen inputs and assessing the robustness of their mechanisms, revealing that attention causes extrapolation failures.

Authors:Jiawei Zhang, Shuang Yang, Bo Li
Title: UDora: A Unified Red Teaming Framework against LLM Agents by Dynamically Hijacking Their Own Reasoning
Abstract:
Large Language Model (LLM) agents equipped with external tools have become increasingly powerful for complex tasks such as web shopping, automated email replies, and financial trading. However, these advancements amplify the risks of adversarial attacks, especially when agents can access sensitive external functionalities. Nevertheless, manipulating LLM agents into performing targeted malicious actions or invoking specific tools remains challenging, as these agents extensively reason or plan before executing final actions. In this work, we present UDora, a unified red teaming framework designed for LLM agents that dynamically hijacks the agent's reasoning processes to compel malicious behavior. Specifically, UDora first generates the model's reasoning trace for the given task, then automatically identifies optimal points within this trace to insert targeted perturbations. The resulting perturbed reasoning is then used as a surrogate response for optimization. By iteratively applying this process, the LLM agent will then be induced to undertake designated malicious actions or to invoke specific malicious tools. Our approach demonstrates superior effectiveness compared to existing methods across three LLM agent datasets. The code is available at https://github.com/AI-secure/UDora.
中文: UDora是一个统一红队测试框架,通过针对性扰动劫持大语言模型智能体的推理过程,迫使其执行恶意操作,在多个数据集上表现优于现有方法。
English: UDora is a unified red teaming framework that hijacks LLM agents' reasoning processes through targeted perturbations to induce malicious actions, outperforming existing methods across multiple datasets.

Authors:Sunghyeon Woo, Sol Namkung, Sunwoo Lee, Inho Jeong, Beomseok Kim, Dongsuk Jeon
Title: PaCA: Partial Connection Adaptation for Efficient Fine-Tuning
Abstract:
Prior parameter-efficient fine-tuning (PEFT) algorithms reduce memory usage and computational costs of fine-tuning large neural network models by training only a few additional adapter parameters, rather than the entire model. However, the reduction in computational costs due to PEFT does not necessarily translate to a reduction in training time; although the computational costs of the adapter layers are much smaller than the pretrained layers, it is well known that those two types of layers are processed sequentially on GPUs, resulting in significant latency overhead. LoRA and its variants merge low-rank adapter matrices with pretrained weights during inference to avoid latency overhead, but during training, the pretrained weights remain frozen while the adapter matrices are continuously updated, preventing such merging. To mitigate this issue, we propose Partial Connection Adaptation (PaCA), which fine-tunes randomly selected partial connections within the pretrained weights instead of introducing adapter layers in the model. PaCA not only enhances training speed by eliminating the time overhead due to the sequential processing of the adapter and pretrained layers but also reduces activation memory since only partial activations, rather than full activations, need to be stored for gradient computation. Compared to LoRA, PaCA reduces training time by 22% and total memory usage by 16%, while maintaining comparable accuracy across various fine-tuning scenarios, such as fine-tuning on the MMLU dataset and instruction tuning on the Oasst1 dataset. PaCA can also be combined with quantization, enabling the fine-tuning of large models such as LLaMA3.1-70B. In addition, PaCA enables training with 23% longer sequence and improves throughput by 16% on both NVIDIA A100 GPU and INTEL Gaudi2 HPU compared to LoRA. The code is available at https://github.com/WooSunghyeon/paca.
中文: PaCA通过微调预训练权重中的部分连接,消除了适配器带来的延迟和内存开销,在保持精度的同时显著提升了训练效率。
English: PaCA enhances training efficiency by fine-tuning partial connections within pretrained weights, eliminating adapter-related latency and memory overhead while maintaining accuracy.

Authors:Christian Gapp, Elias Tappeiner, Martin Welk, Karl Fritscher, Elke Ruth Gizewski, Rainer Schubert
Title: What are You Looking at? Modality Contribution in Multimodal Medical Deep Learning Methods
Abstract:
Purpose High dimensional, multimodal data can nowadays be analyzed by huge deep neural networks with little effort. Several fusion methods for bringing together different modalities have been developed. Particularly, in the field of medicine with its presence of high dimensional multimodal patient data, multimodal models characterize the next step. However, what is yet very underexplored is how these models process the source information in detail. Methods To this end, we implemented an occlusion-based both model and performance agnostic modality contribution method that quantitatively measures the importance of each modality in the dataset for the model to fulfill its task. We applied our method to three different multimodal medical problems for experimental purposes. Results Herein we found that some networks have modality preferences that tend to unimodal collapses, while some datasets are imbalanced from the ground up. Moreover, we could determine a link between our metric and the performance of single modality trained nets. Conclusion The information gain through our metric holds remarkable potential to improve the development of multimodal models and the creation of datasets in the future. With our method we make a crucial contribution to the field of interpretability in deep learning based multimodal research and thereby notably push the integrability of multimodal AI into clinical practice. Our code is publicly available at https://github.com/ChristianGappGit/MC_MMD.
中文: 本研究提出了一种基于遮挡的方法,用于定量评估多模态深度学习模型中各模态的重要性,揭示了模态偏好和数据集不平衡问题,并为临床AI整合提供了可解释性支持。
English: This study introduces an occlusion-based method to quantitatively assess the importance of individual modalities in multimodal deep learning models, revealing modality preferences and dataset imbalances while providing interpretability for clinical AI integration.

Authors:Christian Gapp, Elias Tappeiner, Martin Welk, Karl Fritscher, Elke Ruth Gizewski, Rainer Schubert
Title: What are You Looking at? Modality Contribution in Multimodal Medical Deep Learning
Abstract:
Purpose High dimensional, multimodal data can nowadays be analyzed by huge deep neural networks with little effort. Several fusion methods for bringing together different modalities have been developed. Given the prevalence of high-dimensional, multimodal patient data in medicine, the development of multimodal models marks a significant advancement. However, how these models process information from individual sources in detail is still underexplored. Methods To this end, we implemented an occlusion-based modality contribution method that is both model- and performance-agnostic. This method quantitatively measures the importance of each modality in the dataset for the model to fulfill its task. We applied our method to three different multimodal medical problems for experimental purposes. Results Herein we found that some networks have modality preferences that tend to unimodal collapses, while some datasets are imbalanced from the ground up. Moreover, we provide fine-grained quantitative and visual attribute importance for each modality. Conclusion Our metric offers valuable insights that can support the advancement of multimodal model development and dataset creation. By introducing this method, we contribute to the growing field of interpretability in deep learning for multimodal research. This approach helps to facilitate the integration of multimodal AI into clinical practice. Our code is publicly available at https://github.com/ChristianGappGit/MC_MMD.
中文: 本研究提出了一种基于遮挡的方法,用于定量评估多模态深度学习模型中各模态的重要性,揭示了模态偏好和数据集不平衡问题,并为临床AI整合提供了可解释性支持。
English: This study introduces an occlusion-based method to quantitatively assess the importance of individual modalities in multimodal deep learning models, revealing modality preferences and dataset imbalances while providing interpretability for clinical AI integration.

Authors:Chenxu Dang, Zaipeng Duan, Pei An, Xinmin Zhang, Xuzhong Hu, Jie Ma
Title: FASTer: Focal Token Acquiring-and-Scaling Transformer for Long-term 3D Object Detection
Abstract:
Recent top-performing temporal 3D detectors based on Lidars have increasingly adopted region-based paradigms. They first generate coarse proposals, followed by encoding and fusing regional features. However, indiscriminate sampling and fusion often overlook the varying contributions of individual points and lead to exponentially increased complexity as the number of input frames grows. Moreover, arbitrary result-level concatenation limits the global information extraction. In this paper, we propose a Focal Token Acquring-and-Scaling Transformer (FASTer), which dynamically selects focal tokens and condenses token sequences in an adaptive and lightweight manner. Emphasizing the contribution of individual tokens, we propose a simple but effective Adaptive Scaling mechanism to capture geometric contexts while sifting out focal points. Adaptively storing and processing only focal points in historical frames dramatically reduces the overall complexity. Furthermore, a novel Grouped Hierarchical Fusion strategy is proposed, progressively performing sequence scaling and Intra-Group Fusion operations to facilitate the exchange of global spatial and temporal information. Experiments on the Waymo Open Dataset demonstrate that our FASTer significantly outperforms other state-of-the-art detectors in both performance and efficiency while also exhibiting improved flexibility and robustness. The code is available at https://github.com/MSunDYY/FASTer.git.
中文: 提出的FASTer变换器通过动态选择并缩放焦点令牌,有效降低复杂度并优化特征融合,在Waymo开放数据集上以更优的性能和效率超越了现有最先进的检测器。
English: The proposed FASTer transformer dynamically selects and scales focal tokens to efficiently reduce complexity and enhance feature fusion, outperforming state-of-the-art detectors in both performance and efficiency on the Waymo Open Dataset.

Authors:Haoxin Liu, Zhiyuan Zhao, Shiduo Li, B. Aditya Prakash
Title: Evaluating System 1 vs. 2 Reasoning Approaches for Zero-Shot Time Series Forecasting: A Benchmark and Insights
Abstract:
Reasoning ability is crucial for solving challenging tasks. With the advancement of foundation models, such as the emergence of large language models (LLMs), a wide range of reasoning strategies has been proposed, including test-time enhancements, such as Chain-ofThought, and post-training optimizations, as used in DeepSeek-R1. While these reasoning strategies have demonstrated effectiveness across various challenging language or vision tasks, their applicability and impact on time-series forecasting (TSF), particularly the challenging zero-shot TSF, remain largely unexplored. In particular, it is unclear whether zero-shot TSF benefits from reasoning and, if so, what types of reasoning strategies are most effective. To bridge this gap, we propose ReC4TS, the first benchmark that systematically evaluates the effectiveness of popular reasoning strategies when applied to zero-shot TSF tasks. ReC4TS conducts comprehensive evaluations across datasets spanning eight domains, covering both unimodal and multimodal with short-term and longterm forecasting tasks. More importantly, ReC4TS provides key insights: (1) Self-consistency emerges as the most effective test-time reasoning strategy; (2) Group-relative policy optimization emerges as a more suitable approach for incentivizing reasoning ability during post-training; (3) Multimodal TSF benefits more from reasoning strategies compared to unimodal TSF. Beyond these insights, ReC4TS establishes two pioneering starting blocks to support future zero-shot TSF reasoning research: (1) A novel dataset, TimeThinking, containing forecasting samples annotated with reasoning trajectories from multiple advanced LLMs, and (2) A new and simple test-time scaling-law validated on foundational TSF models enabled by self-consistency reasoning strategy. All data and code are publicly accessible at: https://github.com/AdityaLab/OpenTimeR
中文摘要:ReC4TS基准首次系统评估了零样本时间序列预测中的推理策略,发现自我一致性是最有效的测试时方法,并为该领域研究提供了开创性数据集与测试框架。
English Summary: The ReC4TS benchmark evaluates reasoning strategies for zero-shot time-series forecasting, revealing self-consistency as the most effective test-time method and providing key resources for future research.

Authors:Luise Ge, Michael Lanier, Anindya Sarkar, Bengisu Guresti, Chongjie Zhang, Yevgeniy Vorobeychik
Title: Learning Policy Committees for Effective Personalization in MDPs with Diverse Tasks
Abstract:
Many dynamic decision problems, such as robotic control, involve a series of tasks, many of which are unknown at training time. Typical approaches for these problems, such as multi-task and meta reinforcement learning, do not generalize well when the tasks are diverse. On the other hand, approaches that aim to tackle task diversity, such as using task embedding as policy context and task clustering, typically lack performance guarantees and require a large number of training tasks. To address these challenges, we propose a novel approach for learning a policy committee that includes at least one near-optimal policy with high probability for tasks encountered during execution. While we show that this problem is in general inapproximable, we present two practical algorithmic solutions. The first yields provable approximation and task sample complexity guarantees when tasks are low-dimensional (the best we can do due to inapproximability), whereas the second is a general and practical gradient-based approach. In addition, we provide a provable sample complexity bound for few-shot learning. Our experiments on MuJoCo and Meta-World show that the proposed approach outperforms state-of-the-art multi-task, meta-, and task clustering baselines in training, generalization, and few-shot learning, often by a large margin. Our code is available at https://github.com/CERL-WUSTL/PACMAN.
中文摘要:该研究提出了一种学习策略委员会的新方法,确保在执行任务时以高概率获得近似最优性能,既为低维任务提供理论保证又提供实用的梯度解决方案,实验表明其性能显著优于现有方法。
English Summary: The proposed approach learns a policy committee to ensure near-optimal performance with high probability for diverse tasks, offering both theoretical guarantees for low-dimensional tasks and a practical gradient-based solution, outperforming existing methods in experiments.

Authors:Dayal Singh Kalra, John Kirchenbauer, Maissam Barkeshli, Tom Goldstein
Title: When Can You Get Away with Low Memory Adam?
Abstract:
Adam is the go-to optimizer for training modern machine learning models, but it requires additional memory to maintain the moving averages of the gradients and their squares. While various low-memory optimizers have been proposed that sometimes match the performance of Adam, their lack of reliability has left Adam as the default choice. In this work, we apply a simple layer-wise Signal-to-Noise Ratio (SNR) analysis to quantify when second-moment tensors can be effectively replaced by their means across different dimensions. Our SNR analysis reveals how architecture, training hyperparameters, and dataset properties impact compressibility along Adam's trajectory, naturally leading to $\textit{SlimAdam}$, a memory-efficient Adam variant. $\textit{SlimAdam}$ compresses the second moments along dimensions with high SNR when feasible, and leaves when compression would be detrimental. Through experiments across a diverse set of architectures and training scenarios, we show that $\textit{SlimAdam}$ matches Adam's performance and stability while saving up to $98\%$ of total second moments. Code for $\textit{SlimAdam}$ is available at https://github.com/dayal-kalra/low-memory-adam.
中文:SlimAdam是一种基于逐层信噪比分析选择性压缩二阶矩的内存优化版Adam优化器,在保持与Adam同等性能的同时,最高可减少98%的内存占用。
English: SlimAdam is a memory-efficient variant of Adam that selectively compresses second moments based on layer-wise SNR analysis, matching Adam's performance while reducing memory usage by up to 98%.

Authors:Yuhui Li, Fangyun Wei, Chao Zhang, Hongyang Zhang
Title: EAGLE-3: Scaling up Inference Acceleration of Large Language Models via Training-Time Test
Abstract:
The sequential nature of modern LLMs makes them expensive and slow, and speculative sampling has proven to be an effective solution to this problem. Methods like EAGLE perform autoregression at the feature level, reusing top-layer features from the target model to achieve better results than vanilla speculative sampling. A growing trend in the LLM community is scaling up training data to improve model intelligence without increasing inference costs. However, we observe that scaling up data provides limited improvements for EAGLE. We identify that this limitation arises from EAGLE's feature prediction constraints. In this paper, we introduce EAGLE-3, which abandons feature prediction in favor of direct token prediction and replaces reliance on top-layer features with multi-layer feature fusion via a technique named training-time test. These improvements significantly enhance performance and enable the draft model to fully benefit from scaling up training data. Our experiments include both chat models and reasoning models, evaluated on five tasks. The results show that EAGLE-3 achieves a speedup ratio up to 6.5x, with about 1.4x improvement over EAGLE-2. In the SGLang framework, EAGLE-3 achieves a 1.38x throughput improvement at a batch size of 64. The code is available at https://github.com/SafeAILab/EAGLE.
中文: EAGLE-3通过用直接标记预测取代特征预测,并采用多层特征融合技术,实现了最高6.5倍的加速比和吞吐量提升,同时能充分利用扩展训练数据的优势。
English: EAGLE-3 enhances speculative sampling by replacing feature prediction with direct token prediction and multi-layer feature fusion, achieving up to 6.5x speedup and improved throughput while benefiting fully from scaled training data.

Authors:Yisen Li, Lingfeng Yang, Wenxuan Shen, Pan Zhou, Yao Wan, Weiwei Lin, Dongping Chen
Title: CrowdSelect: Synthetic Instruction Data Selection with Multi-LLM Wisdom
Abstract:
Distilling advanced Large Language Models' instruction-following capabilities into smaller models using a selected subset has become a mainstream approach in model training. While existing synthetic instruction data selection strategies rely mainly on single-dimensional signals (i.e., reward scores, model perplexity), they fail to capture the complexity of instruction-following across diverse fields. Therefore, we investigate more diverse signals to capture comprehensive instruction-response pair characteristics and propose three foundational metrics that leverage Multi-LLM wisdom, informed by (1) diverse LLM responses and (2) reward model assessment. Building upon base metrics, we propose CrowdSelect, an integrated metric incorporating a clustering-based approach to maintain response diversity. Our comprehensive experiments demonstrate that our foundation metrics consistently improve performance across 4 base models on MT-bench and Arena-Hard. CrowdSelect, efficiently incorporating all metrics, achieves state-of-the-art performance in both Full and LoRA fine-tuning, showing improvements of 4.81% on Arena-Hard and 11.1% on MT-bench with Llama-3.2-3b-instruct. We hope our findings will bring valuable insights for future research in this direction. Code are available at https://github.com/listentm/crowdselect.
Chinese: 本研究提出CrowdSelect方法,通过整合多维指标和聚类技术来优化小型语言模型的指令跟随能力,在MT-bench和Arena-Hard基准测试中实现了最先进的性能表现。
English: This study introduces CrowdSelect, an innovative data selection method that enhances smaller language models by integrating multi-dimensional metrics and clustering to improve instruction-following performance, achieving state-of-the-art results on benchmarks like MT-bench and Arena-Hard.

Authors:Yi-Lin Sung, Prateek Yadav, Jialu Li, Jaehong Yoon, Mohit Bansal
Title: RSQ: Learning from Important Tokens Leads to Better Quantized LLMs
Abstract:
Layer-wise quantization is a key technique for efficiently compressing large models without expensive retraining. Previous methods typically quantize the weights of each layer by "uniformly" optimizing the layer reconstruction loss across all output tokens. However, in this paper, we demonstrate that better-quantized models can be obtained by prioritizing learning from important tokens (e.g. which have large attention scores). Building on this finding, we propose RSQ (Rotate, Scale, then Quantize), which (1) applies rotations (orthogonal transformation) to the model to mitigate outliers (those with exceptionally large magnitude), (2) scales the token feature based on its importance, and (3) quantizes the model using the GPTQ framework with the second-order statistics computed by scaled tokens. To compute token importance, we explore both heuristic and dynamic strategies. Based on a thorough analysis of all approaches, we adopt attention concentration, which uses attention scores of each token as its importance, as the best approach. We demonstrate that RSQ consistently outperforms baseline methods across multiple downstream tasks and three model families: LLaMA3, Mistral, and Qwen2.5. Additionally, models quantized with RSQ achieve superior performance on long-context tasks, further highlighting its effectiveness. Lastly, RSQ demonstrates generalizability across various setups, including different model sizes, calibration datasets, bit precisions, and quantization methods.
中文: RSQ通过旋转和缩放优先处理重要标记来优化模型量化,在多项任务和模型中表现优于基线方法,具备卓越的长上下文处理能力和广泛的适用性。
English: RSQ enhances model quantization by prioritizing important tokens through rotation and scaling, outperforming baselines across multiple tasks and models with superior long-context performance and broad generalizability.

Authors:Nicholas Carlini, Javier Rando, Edoardo Debenedetti, Milad Nasr, Florian Tramèr
Title: AutoAdvExBench: Benchmarking autonomous exploitation of adversarial example defenses
Abstract:
We introduce AutoAdvExBench, a benchmark to evaluate if large language models (LLMs) can autonomously exploit defenses to adversarial examples. Unlike existing security benchmarks that often serve as proxies for real-world tasks, bench directly measures LLMs' success on tasks regularly performed by machine learning security experts. This approach offers a significant advantage: if a LLM could solve the challenges presented in bench, it would immediately present practical utility for adversarial machine learning researchers. We then design a strong agent that is capable of breaking 75% of CTF-like ("homework exercise") adversarial example defenses. However, we show that this agent is only able to succeed on 13% of the real-world defenses in our benchmark, indicating the large gap between difficulty in attacking "real" code, and CTF-like code. In contrast, a stronger LLM that can attack 21% of real defenses only succeeds on 54% of CTF-like defenses. We make this benchmark available at https://github.com/ethz-spylab/AutoAdvExBench.
中文:AutoAdvExBench是一个旨在评估大型语言模型能否自主利用对抗性示例防御的基准,揭示了攻击简化CTF类挑战与现实世界防御之间的显著性能差距。
English: AutoAdvExBench is a benchmark designed to assess whether large language models can autonomously exploit adversarial example defenses, revealing a significant performance gap between attacking simplified CTF-like challenges and real-world defenses.

Authors:Hamish Ivison, Muru Zhang, Faeze Brahman, Pang Wei Koh, Pradeep Dasigi
Title: Large-Scale Data Selection for Instruction Tuning
Abstract:
Selecting high-quality training data from a larger pool is a crucial step when instruction-tuning language models, as carefully curated datasets often produce models that outperform those trained on much larger, noisier datasets. Automated data selection approaches for instruction-tuning are typically tested by selecting small datasets (roughly 10k samples) from small pools (100-200k samples). However, popular deployed instruction-tuned models often train on hundreds of thousands to millions of samples, subsampled from even larger data pools. We present a systematic study of how well data selection methods scale to these settings, selecting up to 2.5M samples from pools of up to 5.8M samples and evaluating across 7 diverse tasks. We show that many recently proposed methods fall short of random selection in this setting (while using more compute), and even decline in performance when given access to larger pools of data to select over. However, we find that a variant of representation-based data selection (RDS+), which uses weighted mean pooling of pretrained LM hidden states, consistently outperforms more complex methods across all settings tested -- all whilst being more compute-efficient. Our findings highlight that the scaling properties of proposed automated selection methods should be more closely examined. We release our code, data, and models at https://github.com/hamishivi/automated-instruction-selection.
Chinese: 研究表明,尽管许多自动数据选择方法在扩展到数百万样本时表现不如随机选择甚至性能下降,但一种基于表征的高效计算方法(RDS+)在多样化任务中始终表现优异。
English: This study demonstrates that while many automated data selection methods fail to outperform random selection and even degrade with larger data pools, a compute-efficient representation-based approach (RDS+) consistently excels across diverse tasks when scaling to millions of samples.

Authors:Ziyu Liu, Zeyi Sun, Yuhang Zang, Xiaoyi Dong, Yuhang Cao, Haodong Duan, Dahua Lin, Jiaqi Wang
Title: Visual-RFT: Visual Reinforcement Fine-Tuning
Abstract:
Reinforcement Fine-Tuning (RFT) in Large Reasoning Models like OpenAI o1 learns from feedback on its answers, which is especially useful in applications when fine-tuning data is scarce. Recent open-source work like DeepSeek-R1 demonstrates that reinforcement learning with verifiable reward is one key direction in reproducing o1. While the R1-style model has demonstrated success in language models, its application in multi-modal domains remains under-explored. This work introduces Visual Reinforcement Fine-Tuning (Visual-RFT), which further extends the application areas of RFT on visual tasks. Specifically, Visual-RFT first uses Large Vision-Language Models (LVLMs) to generate multiple responses containing reasoning tokens and final answers for each input, and then uses our proposed visual perception verifiable reward functions to update the model via the policy optimization algorithm such as Group Relative Policy Optimization (GRPO). We design different verifiable reward functions for different perception tasks, such as the Intersection over Union (IoU) reward for object detection. Experimental results on fine-grained image classification, few-shot object detection, reasoning grounding, as well as open-vocabulary object detection benchmarks show the competitive performance and advanced generalization ability of Visual-RFT compared with Supervised Fine-tuning (SFT). For example, Visual-RFT improves accuracy by $24.3\%$ over the baseline in one-shot fine-grained image classification with around 100 samples. In few-shot object detection, Visual-RFT also exceeds the baseline by $21.9$ on COCO's two-shot setting and $15.4$ on LVIS. Our Visual-RFT represents a paradigm shift in fine-tuning LVLMs, offering a data-efficient, reward-driven approach that enhances reasoning and adaptability for domain-specific tasks.
中文摘要:Visual-RFT通过可验证的视觉奖励函数优化大型视觉语言模型,将强化微调扩展至多模态领域,在数据稀缺的视觉任务中展现出卓越性能和泛化能力。
English Summary: Visual-RFT extends reinforcement fine-tuning to multi-modal tasks by using verifiable visual reward functions to optimize large vision-language models, achieving superior performance and generalization in visual tasks with limited data.

Authors:Tiansheng Wen, Yifei Wang, Zequn Zeng, Zhong Peng, Yudi Su, Xinyang Liu, Bo Chen, Hongwei Liu, Stefanie Jegelka, Chenyu You
Title: Beyond Matryoshka: Revisiting Sparse Coding for Adaptive Representation
Abstract:
Many large-scale systems rely on high-quality deep representations (embeddings) to facilitate tasks like retrieval, search, and generative modeling. Matryoshka Representation Learning (MRL) recently emerged as a solution for adaptive embedding lengths, but it requires full model retraining and suffers from noticeable performance degradations at short lengths. In this paper, we show that sparse coding offers a compelling alternative for achieving adaptive representation with minimal overhead and higher fidelity. We propose Contrastive Sparse Representation (CSR), a method that sparsifies pre-trained embeddings into a high-dimensional but selectively activated feature space. By leveraging lightweight autoencoding and task-aware contrastive objectives, CSR preserves semantic quality while allowing flexible, cost-effective inference at different sparsity levels. Extensive experiments on image, text, and multimodal benchmarks demonstrate that CSR consistently outperforms MRL in terms of both accuracy and retrieval speed-often by large margins-while also cutting training time to a fraction of that required by MRL. Our results establish sparse coding as a powerful paradigm for adaptive representation learning in real-world applications where efficiency and fidelity are both paramount. Code is available at https://github.com/neilwen987/CSR_Adaptive_Rep
Chinese: 对比稀疏表示(CSR)提出了一种稀疏编码方法,将预训练嵌入转换为高维稀疏特征,相比嵌套表示学习,能以更高精度、更快检索速度和大幅缩短的训练时间实现自适应表示。
English: Contrastive Sparse Representation (CSR) introduces a sparse coding method that transforms pre-trained embeddings into high-dimensional sparse features, enabling adaptive representation with superior accuracy, faster retrieval, and significantly reduced training time compared to Matryoshka Representation Learning.

Authors:Shiqi Chen, Tongyao Zhu, Ruochen Zhou, Jinghan Zhang, Siyang Gao, Juan Carlos Niebles, Mor Geva, Junxian He, Jiajun Wu, Manling Li
Title: Why Is Spatial Reasoning Hard for VLMs? An Attention Mechanism Perspective on Focus Areas
Abstract:
Large Vision Language Models (VLMs) have long struggled with spatial reasoning tasks. Surprisingly, even simple spatial reasoning tasks, such as recognizing "under" or "behind" relationships between only two objects, pose significant challenges for current VLMs. In this work, we study the spatial reasoning challenge from the lens of mechanistic interpretability, diving into the model's internal states to examine the interactions between image and text tokens. By tracing attention distribution over the image through out intermediate layers, we observe that successful spatial reasoning correlates strongly with the model's ability to align its attention distribution with actual object locations, particularly differing between familiar and unfamiliar spatial relationships. Motivated by these findings, we propose ADAPTVIS based on inference-time confidence scores to sharpen the attention on highly relevant regions when confident, while smoothing and broadening the attention window to consider a wider context when confidence is lower. This training-free decoding method shows significant improvement (e.g., up to a 50 absolute point improvement) on spatial reasoning benchmarks such as WhatsUp and VSR with negligible cost. We make code and data publicly available for research purposes at https://github.com/shiqichen17/AdaptVis.
中文摘要:大型视觉语言模型在空间推理任务中表现不佳,而无需训练的ADAPTVIS方法通过动态调整注意力机制,在空间推理基准测试中实现了高达50%的性能提升。
English Summary: Large Vision Language Models face significant challenges in spatial reasoning, but a new training-free method called ADAPTVIS dynamically adjusts attention based on confidence, achieving up to 50% improvement on benchmarks.

Authors:Zhengliang Shi, Yuhan Wang, Lingyong Yan, Pengjie Ren, Shuaiqiang Wang, Dawei Yin, Zhaochun Ren
Title: Retrieval Models Aren't Tool-Savvy: Benchmarking Tool Retrieval for Large Language Models
Abstract:
Tool learning aims to augment large language models (LLMs) with diverse tools, enabling them to act as agents for solving practical tasks. Due to the limited context length of tool-using LLMs, adopting information retrieval (IR) models to select useful tools from large toolsets is a critical initial step. However, the performance of IR models in tool retrieval tasks remains underexplored and unclear. Most tool-use benchmarks simplify this step by manually pre-annotating a small set of relevant tools for each task, which is far from the real-world scenarios. In this paper, we propose ToolRet, a heterogeneous tool retrieval benchmark comprising 7.6k diverse retrieval tasks, and a corpus of 43k tools, collected from existing datasets. We benchmark six types of models on ToolRet. Surprisingly, even the models with strong performance in conventional IR benchmarks, exhibit poor performance on ToolRet. This low retrieval quality degrades the task pass rate of tool-use LLMs. As a further step, we contribute a large-scale training dataset with over 200k instances, which substantially optimizes the tool retrieval ability of IR models.
中文: 工具学习旨在增强大语言模型使用工具解决实际任务的能力,但现有基准简化了工具检索步骤,因此提出ToolRet基准以评估并提升检索模型在真实场景中的性能。
English: Tool learning enhances large language models with tools for practical tasks, but current benchmarks overlook realistic tool retrieval challenges, prompting the creation of ToolRet to evaluate and improve retrieval models' performance.

Authors:Sam Bowyer, Laurence Aitchison, Desi R. Ivanova
Title: Position: Don't Use the CLT in LLM Evals With Fewer Than a Few Hundred Datapoints
Abstract:
Rigorous statistical evaluations of large language models (LLMs), including valid error bars and significance testing, are essential for meaningful and reliable performance assessment. Currently, when such statistical measures are reported, they typically rely on the Central Limit Theorem (CLT). In this position paper, we argue that while CLT-based methods for uncertainty quantification are appropriate when benchmarks consist of thousands of examples, they fail to provide adequate uncertainty estimates for LLM evaluations that rely on smaller, highly specialized benchmarks. In these small-data settings, we demonstrate that CLT-based methods perform very poorly, usually dramatically underestimating uncertainty (i.e. producing error bars that are too small). We give recommendations for alternative frequentist and Bayesian methods that are both easy to implement and more appropriate in these increasingly common scenarios. We provide a simple Python library for these Bayesian methods at https://github.com/sambowyer/bayes_evals .
中文: 对大语言模型的严谨统计评估需要精确的不确定性估计,但当前基于中心极限定理的方法在小数据场景下常低估不确定性,因此建议采用替代的频率学派和贝叶斯方法。
English: Rigorous statistical evaluations of large language models require accurate uncertainty estimates, but current methods relying on the Central Limit Theorem often underestimate uncertainty in small-data scenarios, prompting the recommendation of alternative frequentist and Bayesian approaches.

Authors:Wenhao Wang, Yi Yang
Title: VideoUFO: A Million-Scale User-Focused Dataset for Text-to-Video Generation
Abstract:
Text-to-video generative models convert textual prompts into dynamic visual content, offering wide-ranging applications in film production, gaming, and education. However, their real-world performance often falls short of user expectations. One key reason is that these models have not been trained on videos related to some topics users want to create. In this paper, we propose VideoUFO, the first Video dataset specifically curated to align with Users' FOcus in real-world scenarios. Beyond this, our VideoUFO also features: (1) minimal (0.29%) overlap with existing video datasets, and (2) videos searched exclusively via YouTube's official API under the Creative Commons license. These two attributes provide future researchers with greater freedom to broaden their training sources. The VideoUFO comprises over 1.09 million video clips, each paired with both a brief and a detailed caption (description). Specifically, through clustering, we first identify 1,291 user-focused topics from the million-scale real text-to-video prompt dataset, VidProM. Then, we use these topics to retrieve videos from YouTube, split the retrieved videos into clips, and generate both brief and detailed captions for each clip. After verifying the clips with specified topics, we are left with about 1.09 million video clips. Our experiments reveal that (1) current 16 text-to-video models do not achieve consistent performance across all user-focused topics; and (2) a simple model trained on VideoUFO outperforms others on worst-performing topics. The dataset and code are publicly available at https://huggingface.co/datasets/WenhaoWang/VideoUFO and https://github.com/WangWenhao0716/BenchUFO under the CC BY 4.0 License.
中文: 本文提出了VideoUFO数据集,专门针对文本到视频模型在现实场景中的不足,通过收集109万条与用户关注点匹配的视频片段,有效提升了模型在薄弱主题上的生成效果。
English: This paper introduces VideoUFO, a novel video dataset designed to address the limitations of text-to-video models by aligning with real-world user interests, featuring over 1.09 million clips with minimal overlap from existing sources and demonstrating improved model performance on challenging topics.

Authors:Stergios Koutsioumpas, Hasan Sayginel, Mark Webster, Dan E Browne
Title: Automorphism Ensemble Decoding of Quantum LDPC Codes
Abstract:
We introduce AutDEC, a fast and accurate decoder for quantum error-correcting codes with large automorphism groups. Our decoder employs a set of automorphisms of the quantum code and an ensemble of belief propagation (BP) decoders. Each BP decoder is given a syndrome which is transformed by one of the automorphisms, and is run in parallel. For quantum codes, the accuracy of BP decoders is limited because short cycles occur in the Tanner graph and our approach mitigates this effect. We demonstrate decoding accuracy comparable to BP-OSD-0 with a lower time overhead for Quantum Reed-Muller (QRM) codes in the code capacity setting, and Bivariate Bicycle (BB) codes under circuit level noise. We provide a Python repository for use by the community and the results of our simulations.
中文:AutDEC是一种快速且准确的量子纠错解码器,它利用码的自同构和并行置信传播来克服精度限制,在QRM和BB码上实现了与BP-OSD-0相当的性能,同时减少了时间开销。
English: AutDEC is a fast and accurate quantum error-correcting decoder that uses code automorphisms and parallel belief propagation to overcome accuracy limitations, achieving performance comparable to BP-OSD-0 with reduced time for QRM and BB codes.

Authors:Ryien Hosseini, Filippo Simini, Venkatram Vishwanath, Rebecca Willett, Henry Hoffmann
Title: Quality Measures for Dynamic Graph Generative Models
Abstract:
Deep generative models have recently achieved significant success in modeling graph data, including dynamic graphs, where topology and features evolve over time. However, unlike in vision and natural language domains, evaluating generative models for dynamic graphs is challenging due to the difficulty of visualizing their output, making quantitative metrics essential. In this work, we develop a new quality metric for evaluating generative models of dynamic graphs. Current metrics for dynamic graphs typically involve discretizing the continuous-evolution of graphs into static snapshots and then applying conventional graph similarity measures. This approach has several limitations: (a) it models temporally related events as i.i.d. samples, failing to capture the non-uniform evolution of dynamic graphs; (b) it lacks a unified measure that is sensitive to both features and topology; (c) it fails to provide a scalar metric, requiring multiple metrics without clear superiority; and (d) it requires explicitly instantiating each static snapshot, leading to impractical runtime demands that hinder evaluation at scale. We propose a novel metric based on the \textit{Johnson-Lindenstrauss} lemma, applying random projections directly to dynamic graph data. This results in an expressive, scalar, and application-agnostic measure of dynamic graph similarity that overcomes the limitations of traditional methods. We also provide a comprehensive empirical evaluation of metrics for continuous-time dynamic graphs, demonstrating the effectiveness of our approach compared to existing methods. Our implementation is available at https://github.com/ryienh/jl-metric.
中文摘要:本文提出了一种基于随机投影的新型质量评估指标,用于评估动态图的生成模型,通过提供统一、标量的度量方法克服了传统方法的局限性,无需离散化即可同时捕捉拓扑和特征的动态演化。
English Summary: This paper introduces a novel quality metric based on random projections to evaluate generative models for dynamic graphs, addressing limitations of current methods by providing a unified, scalar measure sensitive to both topology and features without requiring discretization.

Authors:Chenxi Wang, Tianle Gu, Zhongyu Wei, Lang Gao, Zirui Song, Xiuying Chen
Title: Word Form Matters: LLMs' Semantic Reconstruction under Typoglycemia
Abstract:
Human readers can efficiently comprehend scrambled words, a phenomenon known as Typoglycemia, primarily by relying on word form; if word form alone is insufficient, they further utilize contextual cues for interpretation. While advanced large language models (LLMs) exhibit similar abilities, the underlying mechanisms remain unclear. To investigate this, we conduct controlled experiments to analyze the roles of word form and contextual information in semantic reconstruction and examine LLM attention patterns. Specifically, we first propose SemRecScore, a reliable metric to quantify the degree of semantic reconstruction, and validate its effectiveness. Using this metric, we study how word form and contextual information influence LLMs' semantic reconstruction ability, identifying word form as the core factor in this process. Furthermore, we analyze how LLMs utilize word form and find that they rely on specialized attention heads to extract and process word form information, with this mechanism remaining stable across varying levels of word scrambling. This distinction between LLMs' fixed attention patterns primarily focused on word form and human readers' adaptive strategy in balancing word form and contextual information provides insights into enhancing LLM performance by incorporating human-like, context-aware mechanisms.
Chinese: 研究表明,大型语言模型主要依赖固定的注意力模式处理词形进行语义重建,而人类则能灵活结合词形与上下文;通过融入类人的语境感知机制,有望提升模型的性能。
English: This study reveals that while large language models (LLMs) primarily rely on fixed attention patterns focused on word form for semantic reconstruction—unlike humans who adaptively balance word form and context—integrating human-like contextual mechanisms could enhance LLM performance.

Authors:Xinsheng Wang, Mingqi Jiang, Ziyang Ma, Ziyu Zhang, Songxiang Liu, Linqin Li, Zheng Liang, Qixi Zheng, Rui Wang, Xiaoqin Feng, Weizhen Bian, Zhen Ye, Sitong Cheng, Ruibin Yuan, Zhixian Zhao, Xinfa Zhu, Jiahao Pan, Liumeng Xue, Pengcheng Zhu, Yunlin Chen, Zhifei Li, Xie Chen, Lei Xie, Yike Guo, Wei Xue
Title: Spark-TTS: An Efficient LLM-Based Text-to-Speech Model with Single-Stream Decoupled Speech Tokens
Abstract:
Recent advancements in large language models (LLMs) have driven significant progress in zero-shot text-to-speech (TTS) synthesis. However, existing foundation models rely on multi-stage processing or complex architectures for predicting multiple codebooks, limiting efficiency and integration flexibility. To overcome these challenges, we introduce Spark-TTS, a novel system powered by BiCodec, a single-stream speech codec that decomposes speech into two complementary token types: low-bitrate semantic tokens for linguistic content and fixed-length global tokens for speaker attributes. This disentangled representation, combined with the Qwen2.5 LLM and a chain-of-thought (CoT) generation approach, enables both coarse-grained control (e.g., gender, speaking style) and fine-grained adjustments (e.g., precise pitch values, speaking rate). To facilitate research in controllable TTS, we introduce VoxBox, a meticulously curated 100,000-hour dataset with comprehensive attribute annotations. Extensive experiments demonstrate that Spark-TTS not only achieves state-of-the-art zero-shot voice cloning but also generates highly customizable voices that surpass the limitations of reference-based synthesis. Source code, pre-trained models, and audio samples are available at https://github.com/SparkAudio/Spark-TTS.
中文:Spark-TTS采用BiCodec单流语音编解码器,将语音分解为语义和说话人属性两种互补标记,结合Qwen2.5大语言模型和思维链生成方法,实现了粗细粒度可控的零样本语音合成,并通过VoxBox数据集提升了定制化能力。
English: Spark-TTS introduces BiCodec, a single-stream speech codec that separates linguistic and speaker attributes, enabling highly customizable zero-shot text-to-speech synthesis with both coarse and fine-grained control through the Qwen2.5 LLM and a chain-of-thought approach, supported by the VoxBox dataset.

Authors:Linhao Huang, Jing Yu
Title: ToLo: A Two-Stage, Training-Free Layout-To-Image Generation Framework For High-Overlap Layouts
Abstract:
Recent training-free layout-to-image diffusion models have demonstrated remarkable performance in generating high-quality images with controllable layouts. These models follow a one-stage framework: Encouraging the model to focus the attention map of each concept on its corresponding region by defining attention map-based losses. However, these models still struggle to accurately follow layouts with significant overlap, often leading to issues like attribute leakage and missing entities. In this paper, we propose ToLo, a two-stage, training-free layout-to-image generation framework for high-overlap layouts. Our framework consists of two stages: the aggregation stage and the separation stage, each with its own loss function based on the attention map. To provide a more effective evaluation, we partition the HRS dataset based on the Intersection over Union (IoU) of the input layouts, creating a new dataset for layout-to-image generation with varying levels of overlap. Through extensive experiments on this dataset, we demonstrate that ToLo significantly enhances the performance of existing methods when dealing with high-overlap layouts. Our code and dataset are available here: https://github.com/misaka12435/ToLo.
中文: 本文提出了ToLo,一种无需训练的双阶段框架,通过聚合和分离阶段的注意力损失改进高重叠布局的图像生成,并在按IoU分级的新数据集上验证了其有效性。
English: The paper introduces ToLo, a two-stage training-free framework that improves layout-to-image generation for high-overlap scenarios by using attention-based losses in aggregation and separation stages, validated on a new dataset partitioned by IoU levels.

Authors:Youngbin Choi, Seunghyuk Cho, Minjong Lee, MoonJeong Park, Yesong Ko, Jungseul Ok, Dongwoo Kim
Title: CoPL: Collaborative Preference Learning for Personalizing LLMs
Abstract:
Personalizing large language models (LLMs) is important for aligning outputs with diverse user preferences, yet existing methods struggle with flexibility and generalization. We propose CoPL (Collaborative Preference Learning), a graph-based collaborative filtering framework that models user-response relationships to enhance preference estimation, particularly in sparse annotation settings. By integrating a mixture of LoRA experts, CoPL efficiently fine-tunes LLMs while dynamically balancing shared and user-specific preferences. Additionally, an optimization-free adaptation strategy enables generalization to unseen users without fine-tuning. Experiments on UltraFeedback-P demonstrate that CoPL outperforms existing personalized reward models, effectively capturing both common and controversial preferences, making it a scalable solution for personalized LLM alignment. The code is available at https://github.com/ml-postech/CoPL.
中文: CoPL提出了一种基于图的协同过滤框架,通过建模用户-响应关系并利用LoRA专家动态平衡共享与个体偏好,在稀疏标注场景下显著提升了大型语言模型的个性化性能。
English: CoPL introduces a graph-based collaborative filtering framework that enhances LLM personalization by modeling user-response relationships and dynamically balancing shared and individual preferences through LoRA experts, achieving superior performance in sparse data scenarios.

Authors:Shuvendu Roy, Franklin Ogidi, Ali Etemad, Elham Dolatabadi, Arash Afkanpour
Title: A Shared Encoder Approach to Multimodal Representation Learning
Abstract:
Multimodal representation learning has demonstrated remarkable potential in enabling models to process and integrate diverse data modalities, such as text and images, for improved understanding and performance. While the medical domain can benefit significantly from this paradigm, the scarcity of paired multimodal data and reliance on proprietary or pretrained encoders pose significant challenges. In this work, we present a shared encoder framework for multimodal representation learning tailored to the medical domain. Our approach employs a single set of encoder parameters shared across modalities, augmented with learnable modality features. Empirical results demonstrate that our shared encoder idea achieves superior performance compared to separate modality-specific encoders, demonstrating improved generalization in data-constrained settings. Notably, the performance gains are more pronounced with fewer training examples, underscoring the efficiency of our shared encoder framework for real-world medical applications with limited data. Our code and experiment setup are available at https://github.com/VectorInstitute/shared_encoder.
Chinese: 本文提出了一种适用于医疗领域的多模态表示学习共享编码器框架,通过跨模态共享编码器参数,在数据有限的情况下实现了更优的性能和泛化能力。
English: This paper introduces a shared encoder framework for multimodal representation learning in the medical domain, which uses a single set of encoder parameters across different data modalities and achieves superior performance and generalization, especially with limited training data.

Authors:Yingxue Xu, Fengtao Zhou, Chenyu Zhao, Yihui Wang, Can Yang, Hao Chen
Title: Distilled Prompt Learning for Incomplete Multimodal Survival Prediction
Abstract:
The integration of multimodal data including pathology images and gene profiles is widely applied in precise survival prediction. Despite recent advances in multimodal survival models, collecting complete modalities for multimodal fusion still poses a significant challenge, hindering their application in clinical settings. Current approaches tackling incomplete modalities often fall short, as they typically compensate for only a limited part of the knowledge of missing modalities. To address this issue, we propose a Distilled Prompt Learning framework (DisPro) to utilize the strong robustness of Large Language Models (LLMs) to missing modalities, which employs two-stage prompting for compensation of comprehensive information for missing modalities. In the first stage, Unimodal Prompting (UniPro) distills the knowledge distribution of each modality, preparing for supplementing modality-specific knowledge of the missing modality in the subsequent stage. In the second stage, Multimodal Prompting (MultiPro) leverages available modalities as prompts for LLMs to infer the missing modality, which provides modality-common information. Simultaneously, the unimodal knowledge acquired in the first stage is injected into multimodal inference to compensate for the modality-specific knowledge of the missing modality. Extensive experiments covering various missing scenarios demonstrated the superiority of the proposed method. The code is available at https://github.com/Innse/DisPro.
中文摘要:提出的蒸馏提示学习框架(DisPro)通过两阶段提示方法,利用大型语言模型的强大鲁棒性,有效补偿缺失模态的特定信息和通用信息,解决了生存预测中多模态数据不完整的难题。
English Summary: The proposed Distilled Prompt Learning framework (DisPro) effectively addresses incomplete multimodal data in survival prediction by leveraging Large Language Models through two-stage prompting to compensate for both modality-specific and modality-common information of missing modalities.

Authors:Luyi Qiu, Tristan Till, Xiaobao Guo, Adams Wai-Kin Kong
Title: SparseMamba-PCL: Scribble-Supervised Medical Image Segmentation via SAM-Guided Progressive Collaborative Learning
Abstract:
Scribble annotations significantly reduce the cost and labor required for dense labeling in large medical datasets with complex anatomical structures. However, current scribble-supervised learning methods are limited in their ability to effectively propagate sparse annotation labels to dense segmentation masks and accurately segment object boundaries. To address these issues, we propose a Progressive Collaborative Learning framework that leverages novel algorithms and the Med-SAM foundation model to enhance information quality during training. (1) We enrich ground truth scribble segmentation labels through a new algorithm, propagating scribbles to estimate object boundaries. (2) We enhance feature representation by optimizing Med-SAM-guided training through the fusion of feature embeddings from Med-SAM and our proposed Sparse Mamba network. This enriched representation also facilitates the fine-tuning of the Med-SAM decoder with enriched scribbles. (3) For inference, we introduce a Sparse Mamba network, which is highly capable of capturing local and global dependencies by replacing the traditional sequential patch processing method with a skip-sampling procedure. Experiments on the ACDC, CHAOS, and MSCMRSeg datasets validate the effectiveness of our framework, outperforming nine state-of-the-art methods. Our code is available at \href{https://github.com/QLYCode/SparseMamba-PCL}{SparseMamba-PCL.git}.
The proposed Progressive Collaborative Learning framework enhances scribble-supervised medical image segmentation by propagating sparse scribble annotations to dense masks and improving boundary accuracy through novel algorithms and Med-SAM integration, outperforming nine state-of-the-art methods on multiple datasets.
English Summary:

Authors:Kaveen Perera, Fouad Khelifi, Ammar Belatreche
Title: Robust Palm-Vein Recognition Using the MMD Filter: Improving SIFT-Based Feature Matching
Abstract:
A major challenge with palm vein images is that slight movements of the fingers and thumb, or variations in hand posture, can stretch the skin in different areas and alter the vein patterns. This can result in an infinite number of variations in palm vein images for a given individual. This paper introduces a novel filtering technique for SIFT-based feature matching, known as the Mean and Median Distance (MMD) Filter. This method evaluates the differences in keypoint coordinates and computes the mean and median in each direction to eliminate incorrect matches. Experiments conducted on the 850nm subset of the CASIA dataset indicate that the proposed MMD filter effectively preserves correct points while reducing false positives detected by other filtering methods. A comparison with existing SIFT-based palm vein recognition systems demonstrates that the proposed MMD filter delivers outstanding performance, achieving lower Equal Error Rate (EER) values. This article presents an extended author's version based on our previous work, A Keypoint Filtering Method for SIFT based Palm-Vein Recognition.
中文摘要:本文提出一种均值中值距离(MMD)滤波器,通过分析关键点坐标差异有效剔除掌静脉识别中错误的SIFT特征匹配,在CASIA数据集上以更低等错误率实现了优越性能。
English Summary: This paper proposes a Mean and Median Distance (MMD) filter that effectively removes incorrect SIFT feature matches in palm vein recognition by analyzing keypoint coordinate differences, achieving superior performance with lower Equal Error Rates on the CASIA dataset.

Authors:Saad Ejaz, Hriday Bavle, Laura Ribeiro, Holger Voos, Jose Luis Sanchez-Lopez
Title: Category-level Meta-learned NeRF Priors for Efficient Object Mapping
Abstract:
In 3D object mapping, category-level priors enable efficient object reconstruction and canonical pose estimation, requiring only a single prior per semantic category (e.g., chair, book, laptop, etc.). DeepSDF has been used predominantly as a category-level shape prior, but it struggles to reconstruct sharp geometry and is computationally expensive. In contrast, NeRFs capture fine details but have yet to be effectively integrated with category-level priors in a real-time multi-object mapping framework. To bridge this gap, we introduce PRENOM, a Prior-based Efficient Neural Object Mapper that integrates category-level priors with object-level NeRFs to enhance reconstruction efficiency and enable canonical object pose estimation. PRENOM gets to know objects on a first-name basis by meta-learning on synthetic reconstruction tasks generated from open-source shape datasets. To account for object category variations, it employs a multi-objective genetic algorithm to optimize the NeRF architecture for each category, balancing reconstruction quality and training time. Additionally, prior-based probabilistic ray sampling directs sampling toward expected object regions, accelerating convergence and improving reconstruction quality under constrained resources. Experimental results highlight the ability of PRENOM to achieve high-quality reconstructions while maintaining computational feasibility. Specifically, comparisons with prior-free NeRF-based approaches on a synthetic dataset show a 21\% lower Chamfer distance. Furthermore, evaluations against other approaches using shape priors on a noisy real-world dataset indicate a 13\% improvement averaged across all reconstruction metrics, and comparable pose and size estimation accuracy, while being trained for 5$\times$ less time. Code available at: https://github.com/snt-arg/PRENOM
中文: PRENOM通过将类别级先验与物体级神经辐射场相结合,提升了重建效率并实现了规范姿态估计,相比现有方法以更少训练时间获得了更高质量的重建结果。
English: PRENOM integrates category-level priors with object-level NeRFs to enhance reconstruction efficiency and enable canonical pose estimation, achieving higher quality reconstructions with reduced computational time compared to existing methods.

Authors:Mojtaba Safari, Shansong Wang, Zach Eidex, Qiang Li, Erik H. Middlebrooks, David S. Yu, Xiaofeng Yang
Title: MRI super-resolution reconstruction using efficient diffusion probabilistic model with residual shifting
Abstract:
Objective:This study introduces a residual error-shifting mechanism that drastically reduces sampling steps while preserving critical anatomical details, thus accelerating MRI reconstruction. Approach:We propose a novel diffusion-based SR framework called Res-SRDiff, which integrates residual error shifting into the forward diffusion process. This enables efficient HR image reconstruction by aligning the degraded HR and LR distributions.We evaluated Res-SRDiff on ultra-high-field brain T1 MP2RAGE maps and T2-weighted prostate images, comparing it with Bicubic, Pix2pix, CycleGAN, and a conventional denoising diffusion probabilistic model with vision transformer backbone (TM-DDPM), using quantitative metrics such as peak signal-to-noise ratio (PSNR), structural similarity index (SSIM), gradient magnitude similarity deviation (GMSD), and learned perceptual image patch similarity (LPIPS). Main results: Res-SRDiff significantly outperformed all comparative methods in terms of PSNR, SSIM, and GMSD across both datasets, with statistically significant improvements (p-values<<0.05). The model achieved high-fidelity image restoration with only four sampling steps, drastically reducing computational time to under one second per slice, which is substantially faster than conventional TM-DDPM with around 20 seconds per slice. Qualitative analyses further demonstrated that Res-SRDiff effectively preserved fine anatomical details and lesion morphology in both brain and pelvic MRI images. Significance: Our findings show that Res-SRDiff is an efficient and accurate MRI SR method, markedly improving computational efficiency and image quality. Integrating residual error shifting into the diffusion process allows for rapid and robust HR image reconstruction, enhancing clinical MRI workflows and advancing medical imaging research. The source at:https://github.com/mosaf/Res-SRDiff
中文摘要:本研究提出的Res-SRDiff方法通过残差偏移机制,仅需四步采样即可实现高质量MRI图像重建,在计算效率和图像质量上均显著优于现有方法。
English Summary: This study introduces Res-SRDiff, a diffusion-based MRI reconstruction method that uses a residual error-shifting mechanism to achieve high-fidelity image restoration in just four sampling steps, significantly outperforming other methods in both computational efficiency and image quality.

Authors:Chao Ye, Haoyuan Li, Weiyang Lin, Xianqiang Yang
Title: MLINE-VINS: Robust Monocular Visual-Inertial SLAM With Flow Manhattan and Line Features
Abstract:
In this paper we introduce MLINE-VINS, a novel monocular visual-inertial odometry (VIO) system that leverages line features and Manhattan Word assumption. Specifically, for line matching process, we propose a novel geometric line optical flow algorithm that efficiently tracks line features with varying lengths, whitch is do not require detections and descriptors in every frame. To address the instability of Manhattan estimation from line features, we propose a tracking-by-detection module that consistently tracks and optimizes Manhattan framse in consecutive images. By aligning the Manhattan World with the VIO world frame, the tracking could restart using the latest pose from back-end, simplifying the coordinate transformations within the system. Furthermore, we implement a mechanism to validate Manhattan frames and a novel global structural constraints back-end optimization. Extensive experiments results on vairous datasets, including benchmark and self-collected datasets, show that the proposed approach outperforms existing methods in terms of accuracy and long-range robustness. The source code of our method is available at: https://github.com/LiHaoy-ux/MLINE-VINS.
中文: 本文提出MLINE-VINS单目视觉惯性里程计系统,通过创新的线特征跟踪、曼哈顿世界对齐和后端优化机制,在多个数据集上实现了优于现有方法的精度和长距离鲁棒性。
English: This paper presents MLINE-VINS, a monocular visual-inertial odometry system that enhances accuracy and robustness through innovative line feature tracking, Manhattan World alignment, and backend optimization, outperforming existing methods across multiple datasets.

Authors:Yuheng Xu, Shijie Yang, Xin Liu, Jie Liu, Jie Tang, Gangshan Wu
Title: AutoLUT: LUT-Based Image Super-Resolution with Automatic Sampling and Adaptive Residual Learning
Abstract:
In recent years, the increasing popularity of Hi-DPI screens has driven a rising demand for high-resolution images. However, the limited computational power of edge devices poses a challenge in deploying complex super-resolution neural networks, highlighting the need for efficient methods. While prior works have made significant progress, they have not fully exploited pixel-level information. Moreover, their reliance on fixed sampling patterns limits both accuracy and the ability to capture fine details in low-resolution images. To address these challenges, we introduce two plug-and-play modules designed to capture and leverage pixel information effectively in Look-Up Table (LUT) based super-resolution networks. Our method introduces Automatic Sampling (AutoSample), a flexible LUT sampling approach where sampling weights are automatically learned during training to adapt to pixel variations and expand the receptive field without added inference cost. We also incorporate Adaptive Residual Learning (AdaRL) to enhance inter-layer connections, enabling detailed information flow and improving the network's ability to reconstruct fine details. Our method achieves significant performance improvements on both MuLUT and SPF-LUT while maintaining similar storage sizes. Specifically, for MuLUT, we achieve a PSNR improvement of approximately +0.20 dB improvement on average across five datasets. For SPF-LUT, with more than a 50% reduction in storage space and about a 2/3 reduction in inference time, our method still maintains performance comparable to the original. The code is available at https://github.com/SuperKenVery/AutoLUT.
中文摘要:本研究提出AutoSample和AdaRL两个即插即用模块,通过自适应学习像素采样和增强细节重建能力,在保持存储效率的同时显著提升了基于查找表的超分辨率网络性能。
English Summary: This paper introduces two plug-and-play modules, AutoSample and AdaRL, that enhance LUT-based super-resolution networks by adaptively learning pixel sampling and improving detail reconstruction, achieving significant performance gains with minimal storage and computational costs.

Authors:Biao Xiong, Longjun Zhang, Ruiqi Huang, Junwei Zhou, S. R. U. N. Jafri, Bojian Wu, Fashuai Li
Title: VF-Plan: Bridging the Art Gallery Problem and Static LiDAR Scanning with Visibility Field Optimization
Abstract:
Viewpoint planning is critical for efficient 3D data acquisition in applications such as 3D reconstruction, building life-cycle management, navigation, and interior decoration. However, existing methods often neglect key optimization objectives specific to static LiDAR systems, resulting in redundant or disconnected viewpoint networks. The viewpoint planning problem (VPP) extends the classical Art Gallery Problem (AGP) by requiring full coverage, strong registrability, and coherent network connectivity under constrained sensor capabilities. To address these challenges, we introduce a novel Visibility Field (VF) that accurately captures the directional and range-dependent visibility properties of static LiDAR scanners. We further observe that visibility information naturally converges onto a 1D skeleton embedded in the 2D space, enabling significant searching space reduction. Leveraging these insights, we develop a greedy optimization algorithm tailored to the VPP, which constructs a minimal yet fully connected Viewpoint Network (VPN) with low redundancy. Experimental evaluations across diverse indoor and outdoor scenarios confirm the scalability and robustness of our method. Compared to expert-designed VPNs and existing state-of-the-art approaches, our algorithm achieves comparable or fewer viewpoints while significantly enhancing connectivity. In particular, it reduces the weighted average path length by approximately 95%, demonstrating substantial improvements in compactness and structural efficiency. Code is available at https://github.com/xiongbiaostar/VFPlan.
中文: 本研究提出了一种新颖的可见性场和贪心优化算法,有效解决了静态激光雷达系统的视点规划问题,实现了最少但完全连接的视点网络,显著提升了连通性并降低了冗余。
English: This study introduces a novel Visibility Field and a greedy optimization algorithm to efficiently solve the viewpoint planning problem for static LiDAR systems, achieving minimal yet fully connected networks with significantly improved connectivity and reduced redundancy.

Authors:Alexander Baranov, Anna Palatkina, Yulia Makovka, Pavel Braslavski
Title: KoWit-24: A Richly Annotated Dataset of Wordplay in News Headlines
Abstract:
We present KoWit-24, a dataset with fine-grained annotation of wordplay in 2,700 Russian news headlines. KoWit-24 annotations include the presence of wordplay, its type, wordplay anchors, and words/phrases the wordplay refers to. Unlike the majority of existing humor collections of canned jokes, KoWit-24 provides wordplay contexts -- each headline is accompanied by the news lead and summary. The most common type of wordplay in the dataset is the transformation of collocations, idioms, and named entities -- the mechanism that has been underrepresented in previous humor datasets. Our experiments with five LLMs show that there is ample room for improvement in wordplay detection and interpretation tasks. The dataset and evaluation scripts are available at https://github.com/Humor-Research/KoWit-24
中文: KoWit-24发布了包含俄语新闻标题细粒度文字游戏标注的数据集,揭示了惯用语转换的主要类型,并证明大语言模型在文字游戏识别任务中仍有明显不足。
English: KoWit-24 introduces a Russian news headline dataset with detailed wordplay annotations, highlighting prevalent transformations of collocations and idioms while demonstrating significant challenges for LLMs in detection tasks.

Authors:Ramkrishna Acharya
Title: Compare different SG-Schemes based on large least square problems
Abstract:
This study reviews popular stochastic gradient-based schemes based on large least-square problems. These schemes, often called optimizers in machine learning, play a crucial role in finding better model parameters. Hence, this study focuses on viewing such optimizers with different hyper-parameters and analyzing them based on least square problems. Codes that produced results in this work are available on https://github.com/q-viper/gradients-based-methods-on-large-least-square.
中文: 本研究针对大规模最小二乘问题,分析了不同超参数下的随机梯度优化器,并在GitHub上提供了相关代码。
English: This study analyzes various stochastic gradient-based optimizers with different hyperparameters for large least-square problems, providing code for the results on GitHub.

Authors:Ramkrishna Acharya
Title: An Approach for Air Drawing Using Background Subtraction and Contour Extraction
Abstract:
In this paper, we propose a novel approach for air drawing that uses image processing techniques to draw on the screen by moving fingers in the air. This approach benefits a wide range of applications such as sign language, in-air drawing, and 'writing' in the air as a new way of input. The approach starts with preparing ROI (Region of Interest) background images by taking a running average in initial camera frames and later subtracting it from the live camera frames to get a binary mask image. We calculate the pointer's position as the top of the contour on the binary image. When drawing a circle on the canvas in that position, it simulates the drawing. Furthermore, we combine the pre-trained Tesseract model for OCR purposes. To address the false contours, we perform hand detection based on the haar cascade before performing the background subtraction. In an experimental setup, we achieved a latency of only 100ms in air drawing. The code used to this research are available in GitHub as https://github.com/q-viper/Contour-Based-Writing
中文: 本文提出一种基于图像处理和背景减除的空中绘图方法,通过追踪手指运动实现手语识别和空中书写等功能,实验延迟仅100毫秒且代码已开源。
English: This paper introduces an air drawing method using image processing and background subtraction to track finger movements for applications like sign language and in-air writing, achieving 100ms latency with code available on GitHub.

Authors:Disen Lan, Weigao Sun, Jiaxi Hu, Jusen Du, Yu Cheng
Title: Liger: Linearizing Large Language Models to Gated Recurrent Structures
Abstract:
Transformers with linear recurrent modeling offer linear-time training and constant-memory inference. Despite their demonstrated efficiency and performance, pretraining such non-standard architectures from scratch remains costly and risky. The linearization of large language models (LLMs) transforms pretrained standard models into linear recurrent structures, enabling more efficient deployment. However, current linearization methods typically introduce additional feature map modules that require extensive fine-tuning and overlook the gating mechanisms used in state-of-the-art linear recurrent models. To address these issues, this paper presents Liger, short for Linearizing LLMs to gated recurrent structures. Liger is a novel approach for converting pretrained LLMs into gated linear recurrent models without adding extra parameters. It repurposes the pretrained key matrix weights to construct diverse gating mechanisms, facilitating the formation of various gated recurrent structures while avoiding the need to train additional components from scratch. Using lightweight fine-tuning with Low-Rank Adaptation (LoRA), Liger restores the performance of the linearized gated recurrent models to match that of the original LLMs. Additionally, we introduce Liger Attention, an intra-layer hybrid attention mechanism, which significantly recovers 93\% of the Transformer-based LLM at 0.02\% pre-training tokens during the linearization process, achieving competitive results across multiple benchmarks, as validated on models ranging from 1B to 8B parameters. Code is available at https://github.com/OpenSparseLLMs/Linearization.
中文摘要:Liger是一种创新方法,可将预训练大语言模型转化为无需添加参数的线性门控循环结构,通过轻量微调保持模型性能,并在多项基准测试中取得优异表现。
English Summary: Liger is a novel method that converts pretrained LLMs into efficient gated linear recurrent models without adding parameters, using lightweight fine-tuning to maintain performance while enabling competitive results across benchmarks.

Authors:Zhihai Bi, Kai Chen, Chunxin Zheng, Yulin Li, Haoang Li, Jun Ma
Title: Interactive Navigation for Legged Manipulators with Learned Arm-Pushing Controller
Abstract:
Interactive navigation is crucial in scenarios where proactively interacting with objects can yield shorter paths, thus significantly improving traversal efficiency. Existing methods primarily focus on using the robot body to relocate large obstacles (which could be comparable to the size of a robot). However, they prove ineffective in narrow or constrained spaces where the robot's dimensions restrict its manipulation capabilities. This paper introduces a novel interactive navigation framework for legged manipulators, featuring an active arm-pushing mechanism that enables the robot to reposition movable obstacles in space-constrained environments. To this end, we develop a reinforcement learning-based arm-pushing controller with a two-stage reward strategy for large-object manipulation. Specifically, this strategy first directs the manipulator to a designated pushing zone to achieve a kinematically feasible contact configuration. Then, the end effector is guided to maintain its position at appropriate contact points for stable object displacement while preventing toppling. The simulations validate the robustness of the arm-pushing controller, showing that the two-stage reward strategy improves policy convergence and long-term performance. Real-world experiments further demonstrate the effectiveness of the proposed navigation framework, which achieves shorter paths and reduced traversal time. The open-source project can be found at https://github.com/Zhihaibi/Interactive-Navigation-for-legged-manipulator.git.
中文: 本文提出了一种基于强化学习的腿式机械臂交互导航框架,采用主动臂推机制在受限空间中重新定位障碍物,通过两阶段奖励策略提高了路径效率并减少了穿越时间。
English: This paper introduces a reinforcement learning-based interactive navigation framework for legged manipulators that uses an active arm-pushing mechanism to reposition obstacles in confined spaces, improving path efficiency and reducing traversal time through a two-stage reward strategy.

Authors:Mihir Kulkarni, Welf Rehberg, Kostas Alexis
Title: Aerial Gym Simulator: A Framework for Highly Parallelized Simulation of Aerial Robots
Abstract:
This paper contributes the Aerial Gym Simulator, a highly parallelized, modular framework for simulation and rendering of arbitrary multirotor platforms based on NVIDIA Isaac Gym. Aerial Gym supports the simulation of under-, fully- and over-actuated multirotors offering parallelized geometric controllers, alongside a custom GPU-accelerated rendering framework for ray-casting capable of capturing depth, segmentation and vertex-level annotations from the environment. Multiple examples for key tasks, such as depth-based navigation through reinforcement learning are provided. The comprehensive set of tools developed within the framework makes it a powerful resource for research on learning for control, planning, and navigation using state information as well as exteroceptive sensor observations. Extensive simulation studies are conducted and successful sim2real transfer of trained policies is demonstrated. The Aerial Gym Simulator is open-sourced at: https://github.com/ntnu-arl/aerial_gym_simulator.
中文: 本文提出Aerial Gym Simulator,这是一个基于NVIDIA Isaac Gym的高效模块化仿真框架,支持多种多旋翼平台的并行控制与GPU加速渲染,可用于强化学习导航等任务,并成功实现了仿真到现实的策略迁移。
English: This paper introduces the Aerial Gym Simulator, a versatile and efficient framework for simulating various multirotor systems with parallelized controllers and GPU-accelerated rendering, supporting tasks like reinforcement learning navigation and demonstrating successful sim2real transfer.

Authors:Huifeng Yin, Yu Zhao, Minghao Wu, Xuanfan Ni, Bo Zeng, Hao Wang, Tianqi Shi, Liangying Shao, Chenyang Lyu, Longyue Wang, Weihua Luo, Kaifu Zhang
Title: Marco-o1 v2: Towards Widening The Distillation Bottleneck for Reasoning Models
Abstract:
Large Reasoning Models(LRMs) such as OpenAI o1 and DeepSeek-R1 have shown remarkable reasoning capabilities by scaling test-time compute and generating long Chain-of-Thought(CoT). Distillation--post-training on LRMs-generated data--is a straightforward yet effective method to enhance the reasoning abilities of smaller models, but faces a critical bottleneck: we found that distilled long CoT data poses learning difficulty for small models and leads to the inheritance of biases (i.e. over-thinking) when using Supervised Fine-tuning (SFT) and Reinforcement Learning (RL) methods. To alleviate this bottleneck, we propose constructing tree-based CoT data from scratch via Monte Carlo Tree Search(MCTS). We then exploit a set of CoT-aware approaches, including Thoughts Length Balance, Fine-grained DPO, and Joint Post-training Objective, to enhance SFT and RL on the constructed data. We conduct evaluation on various benchmarks such as math (GSM8K, MATH, AIME). instruction-following (Multi-IF) and planning (Blocksworld), results demonstrate our approaches substantially improve the reasoning performance of distilled models compared to standard distilled models via reducing the hallucinations in long-time thinking. The project homepage is https://github.com/AIDC-AI/Marco-o1.
中文: 从大型推理模型蒸馏长思维链数据会导致小模型学习困难和偏见继承,通过蒙特卡洛树搜索构建树状推理数据并采用思维链感知训练方法,可显著提升蒸馏模型的推理性能。
English: Distilling long chain-of-thought data from large reasoning models causes learning difficulties and bias inheritance in smaller models, which is mitigated by constructing tree-based reasoning data via Monte Carlo Tree Search and implementing CoT-aware training techniques to significantly enhance reasoning performance.

Authors:Max Eissler, Tim Korjakow, Stefan Ganscha, Oliver T. Unke, Klaus-Robert Müller, Stefan Gugler
Title: How simple can you go? An off-the-shelf transformer approach to molecular dynamics
Abstract:
Most current neural networks for molecular dynamics (MD) include physical inductive biases, resulting in specialized and complex architectures. This is in contrast to most other machine learning domains, where specialist approaches are increasingly replaced by general-purpose architectures trained on vast datasets. In line with this trend, several recent studies have questioned the necessity of architectural features commonly found in MD models, such as built-in rotational equivariance or energy conservation. In this work, we contribute to the ongoing discussion by evaluating the performance of an MD model with as few specialized architectural features as possible. We present a recipe for MD using an Edge Transformer, an "off-the-shelf'' transformer architecture that has been minimally modified for the MD domain, termed MD-ET. Our model implements neither built-in equivariance nor energy conservation. We use a simple supervised pre-training scheme on $\sim$30 million molecular structures from the QCML database. Using this "off-the-shelf'' approach, we show state-of-the-art results on several benchmarks after fine-tuning for a small number of steps. Additionally, we examine the effects of being only approximately equivariant and energy conserving for MD simulations, proposing a novel method for distinguishing the errors resulting from non-equivariance from other sources of inaccuracies like numerical rounding errors. While our model exhibits runaway energy increases on larger structures, we show approximately energy-conserving NVE simulations for a range of small structures.
中文摘要:本研究提出MD-ET模型,这是一种仅对通用Transformer进行最小修改的分子动力学方法,在不内置旋转等变性或能量守恒等物理约束的情况下实现了最先进的性能,同时分析了近似等变性对模拟效果的影响。
English Summary: This study introduces MD-ET, a minimally modified transformer model for molecular dynamics that achieves state-of-the-art results without built-in physical constraints like rotational equivariance or energy conservation, while analyzing the effects of approximate equivariance in simulations.

Authors:Zekun Zhou, Xiaocheng Feng, Lei Huang, Xiachong Feng, Ziyun Song, Ruihan Chen, Liang Zhao, Weitao Ma, Yuxuan Gu, Baoxin Wang, Dayong Wu, Guoping Hu, Ting Liu, Bing Qin
Title: From Hypothesis to Publication: A Comprehensive Survey of AI-Driven Research Support Systems
Abstract:
Research is a fundamental process driving the advancement of human civilization, yet it demands substantial time and effort from researchers. In recent years, the rapid development of artificial intelligence (AI) technologies has inspired researchers to explore how AI can accelerate and enhance research. To monitor relevant advancements, this paper presents a systematic review of the progress in this domain. Specifically, we organize the relevant studies into three main categories: hypothesis formulation, hypothesis validation, and manuscript publication. Hypothesis formulation involves knowledge synthesis and hypothesis generation. Hypothesis validation includes the verification of scientific claims, theorem proving, and experiment validation. Manuscript publication encompasses manuscript writing and the peer review process. Furthermore, we identify and discuss the current challenges faced in these areas, as well as potential future directions for research. Finally, we also offer a comprehensive overview of existing benchmarks and tools across various domains that support the integration of AI into the research process. We hope this paper serves as an introduction for beginners and fosters future research. Resources have been made publicly available at https://github.com/zkzhou126/AI-for-Research.
中文: 本文系统综述了人工智能在假设提出、验证及论文发表等研究环节的加速作用,并探讨了当前挑战与未来方向。
English: This paper systematically reviews how AI accelerates research across hypothesis formulation, validation, and manuscript publication, while addressing current challenges and future directions.

Authors:Jia-Chen Zhang, Yu-Jie Xiong, Chun-Ming Xia, Dong-Hai Zhu, Xi-He Qiu
Title: Parameter-Efficient Fine-Tuning of Large Language Models via Deconvolution in Subspace
Abstract:
Large language model (LLM) is considered a milestone towards achieving Artificial General Intelligence (AGI). With its advanced emergent capabilities, it adapt to a wide range of specific applications. Fine-tuning LLMs for various downstream tasks has become a new paradigm. Low-Rank Adaptation (LoRA) is well-known for its parameter efficiency. It can reduce the number of parameters needed to fine-tune LLMs by several orders of magnitude. However, LoRA-based approaches encounter a significant limitation due to the bottleneck imposed by rank one decomposition. As the parameters count in LLMs increase, even rank one decomposition might surpass the number of parameters truly necessary for handling more downstream tasks. In this paper, we propose a new method for Parameter-Efficient Fine-Tuning (PEFT) via deconvolution in subspace, dubbed as DCFT. We innovatively use deconvolution to complete details and enhance knowledge in subspace incremental matrices, and dynamically control parameters by adjusting the kernel size, unconstrained by rank-one decomposition. Extensive experiments are conducted to validate the effectiveness of DCFT. Results show that compared to LoRA, DCFT achieve an 8$\times$ reduction in parameters, and still achieves highly impressive performance. Our code is available here: https://github.com/Godz-z/DCFT.
中文: 提出的DCFT方法通过子空间反卷积实现参数高效微调,动态控制参数不受秩约束,相比LoRA减少8倍参数量仍保持优异性能。
English: The proposed DCFT method introduces deconvolution in subspace for parameter-efficient fine-tuning, dynamically controlling parameters without rank constraints and achieving an 8× reduction compared to LoRA while maintaining high performance.

Authors:Hao Tang, Chenwei Xie, Haiyang Wang, Xiaoyi Bao, Tingyu Weng, Pandeng Li, Yun Zheng, Liwei Wang
Title: UFO: A Unified Approach to Fine-grained Visual Perception via Open-ended Language Interface
Abstract:
Generalist models have achieved remarkable success in both language and vision-language tasks, showcasing the potential of unified modeling. However, effectively integrating fine-grained perception tasks like detection and segmentation into these models remains a significant challenge. This is primarily because these tasks often rely heavily on task-specific designs and architectures that can complicate the modeling process. To address this challenge, we present \ours, a framework that \textbf{U}nifies \textbf{F}ine-grained visual perception tasks through an \textbf{O}pen-ended language interface. By transforming all perception targets into the language space, \ours unifies object-level detection, pixel-level segmentation, and image-level vision-language tasks into a single model. Additionally, we introduce a novel embedding retrieval approach that relies solely on the language interface to support segmentation tasks. Our framework bridges the gap between fine-grained perception and vision-language tasks, significantly simplifying architectural design and training strategies while achieving comparable or superior performance to methods with intricate task-specific designs. After multi-task training on five standard visual perception datasets, \ours outperforms the previous state-of-the-art generalist models by 12.3 mAP on COCO instance segmentation and 3.3 mIoU on ADE20K semantic segmentation. Furthermore, our method seamlessly integrates with existing MLLMs, effectively combining fine-grained perception capabilities with their advanced language abilities, thereby enabling more challenging tasks such as reasoning segmentation. Code and models are available at https://github.com/nnnth/UFO.
中文:UFO框架通过开放式语言接口统一了细粒度视觉感知任务,将检测、分割和视觉语言任务整合到单一模型中,并在多个基准测试中实现了领先性能。
English: The UFO framework unifies fine-grained visual perception tasks through a language interface, integrating detection, segmentation, and vision-language tasks into a single model while achieving state-of-the-art performance on benchmarks.

Authors:Hao Tang, Chenwei Xie, Haiyang Wang, Xiaoyi Bao, Tingyu Weng, Pandeng Li, Yun Zheng, Liwei Wang
Title: UFO: A Unified Approach to Fine-grained Visual Perception via Open-ended Language Interface
Abstract:
Generalist models have achieved remarkable success in both language and vision-language tasks, showcasing the potential of unified modeling. However, effectively integrating fine-grained perception tasks like detection and segmentation into these models remains a significant challenge. This is primarily because these tasks often rely heavily on task-specific designs and architectures that can complicate the modeling process. To address this challenge, we present \ours, a framework that \textbf{U}nifies \textbf{F}ine-grained visual perception tasks through an \textbf{O}pen-ended language interface. By transforming all perception targets into the language space, \ours unifies object-level detection, pixel-level segmentation, and image-level vision-language tasks into a single model. Additionally, we introduce a novel embedding retrieval approach that relies solely on the language interface to support segmentation tasks. Our framework bridges the gap between fine-grained perception and vision-language tasks, significantly simplifying architectural design and training strategies while achieving comparable or superior performance to methods with intricate task-specific designs. After multi-task training on five standard visual perception datasets, \ours outperforms the previous state-of-the-art generalist models by 12.3 mAP on COCO instance segmentation and 3.3 mIoU on ADE20K semantic segmentation. Furthermore, our method seamlessly integrates with existing MLLMs, effectively combining fine-grained perception capabilities with their advanced language abilities, thereby enabling more challenging tasks such as reasoning segmentation. Code and models are available at https://github.com/nnnth/UFO.
中文:UFO框架通过开放式语言接口统一了细粒度视觉感知任务,将检测、分割和视觉语言任务整合到单一模型中,并在多个基准测试中实现了领先性能。
English: The UFO framework unifies fine-grained visual perception tasks through a language interface, integrating detection, segmentation, and vision-language tasks into a single model while achieving state-of-the-art performance on benchmarks.

Authors:Kun Zhang, Jingyu Li, Zhe Li, Jingjing Zhang, Fan Li, Yandong Liu, Rui Yan, Zihang Jiang, Nan Chen, Lei Zhang, Yongdong Zhang, Zhendong Mao, S. Kevin Zhou
Title: Composed Multi-modal Retrieval: A Survey of Approaches and Applications
Abstract:
The burgeoning volume of multi-modal data necessitates advanced retrieval paradigms beyond unimodal and cross-modal approaches. Composed Multi-modal Retrieval (CMR) emerges as a pivotal next-generation technology, enabling users to query images or videos by integrating a reference visual input with textual modifications, thereby achieving unprecedented flexibility and precision. This paper provides a comprehensive survey of CMR, covering its fundamental challenges, technical advancements, and applications. CMR is categorized into supervised, zero-shot, and semi-supervised learning paradigms. We discuss key research directions, including data construction, model architecture, and loss optimization in supervised CMR, as well as transformation frameworks and linear integration in zero-shot CMR, and semi-supervised CMR that leverages generated pseudo-triplets while addressing data noise/uncertainty. Additionally, we extensively survey the diverse application landscape of CMR, highlighting its transformative potential in e-commerce, social media, search engines, public security, etc. Seven high impact application scenarios are explored in detail with benchmark data sets and performance analysis. Finally, we further provide new potential research directions with the hope of inspiring exploration in other yet-to-be-explored fields. A curated list of works is available at: https://github.com/kkzhang95/Awesome-Composed-Multi-modal-Retrieval
中文: 组合多模态检索(CMR)是一种前沿技术,通过结合视觉输入与文本修改实现灵活精准的图像或视频查询,涵盖监督、零样本和半监督学习范式,并在电子商务、公共安全等领域展现出变革性应用潜力。
English: Composed Multi-modal Retrieval (CMR) is a cutting-edge technology that enables flexible and precise querying of images or videos by combining visual inputs with textual modifications, addressing challenges through supervised, zero-shot, and semi-supervised learning paradigms, and showcasing transformative applications across fields like e-commerce and public security.

Authors:Xu Liang
Title: Group Relative Policy Optimization for Image Captioning
Abstract:
Image captioning tasks usually use two-stage training to complete model optimization. The first stage uses cross-entropy as the loss function for optimization, and the second stage uses self-critical sequence training (SCST) for reinforcement learning optimization. However, the SCST algorithm has certain defects. SCST relies only on a single greedy decoding result as a baseline. If the model itself is not stable enough, the greedy decoding result may be relatively worst, which will lead to a high variance of advantage estimation, further leading to unstable policy updates. In addition, SCST only compares one sampling result with the greedy decoding result, and the generation diversity is limited, which may fall into a local optimum. In this paper, we propose using the latest Group Relative Policy Optimization (GRPO) reinforcement learning algorithm as an optimization solution for the second stage. GRPO generates multiple candidate captions for the input image and then continuously optimizes the model through intragroup comparison. By constraining the amplitude of policy updates and KL divergence, the stability of the model during training is greatly guaranteed. In addition, compared to SCST, which only samples one answer, GRPO samples and generates multiple answers. Multiple candidate answers in the group cover a wider solution space. Combined with KL divergence constraints, GRPO can improve diversity while ensuring model stability. The code for this article is available at https://github.com/liangxu-one/ms-models/tree/image_caption_grpo/research/arxiv_papers/Image_Caption_GRPO.
中文: 本文提出在图像描述任务的第二阶段使用群组相对策略优化(GRPO)强化学习算法替代自临界序列训练(SCST),通过生成多个候选描述并进行组内比较,结合KL散度约束,有效解决SCST方差高和多样性受限的问题,实现更稳定、更多样化的模型优化。
English: The abstract proposes replacing the self-critical sequence training (SCST) algorithm in image captioning with the Group Relative Policy Optimization (GRPO) to address SCST's issues of high variance and limited diversity by generating multiple candidate captions and using intragroup comparisons with KL divergence constraints for more stable and diverse model optimization.

Authors:Xinyi Wan, Penghui Qi, Guangxing Huang, Min Lin, Jialin Li
Title: PipeOffload: Improving Scalability of Pipeline Parallelism with Memory Optimization
Abstract:
Pipeline parallelism (PP) is widely used for training large language models (LLMs), yet its scalability is often constrained by high activation memory consumption as the number of in-flight microbatches grows with the degree of PP. In this paper, we focus on addressing this challenge by leveraging the under-explored memory offload strategy in PP. With empirical study, we discover that in the majority of standard configurations, at least half, and potentially all, of the activations can be offloaded with negligible overhead. In the cases where full overload is not possible, we introduce a novel selective offload strategy that decreases peak activation memory in a better-than-linear manner. Furthermore, we integrate memory offload with other techniques to jointly consider overall throughput and memory limitation. Our experiments proves that the per-device activation memory effectively reduces with the total number of stages, making PP a stronger alternative than TP, offering up to a 19\% acceleration with even lower memory consumption. The implementation is open-sourced at \href{https://github.com/sail-sg/zero-bubble-pipeline-parallelism}{this url}.
中文: 本文提出一种用于流水线并行的内存卸载策略,能以可忽略的开销大幅减少激活内存,相比张量并行实现了更好的可扩展性和高达19%的加速效果。
English: This paper introduces a memory offload strategy for pipeline parallelism that significantly reduces activation memory with minimal overhead, enabling better scalability and up to 19% acceleration compared to tensor parallelism.

Authors:Xuewen Liu, Zhikai Li, Qingyi Gu
Title: CacheQuant: Comprehensively Accelerated Diffusion Models
Abstract:
Diffusion models have gradually gained prominence in the field of image synthesis, showcasing remarkable generative capabilities. Nevertheless, the slow inference and complex networks, resulting from redundancy at both temporal and structural levels, hinder their low-latency applications in real-world scenarios. Current acceleration methods for diffusion models focus separately on temporal and structural levels. However, independent optimization at each level to further push the acceleration limits results in significant performance degradation. On the other hand, integrating optimizations at both levels can compound the acceleration effects. Unfortunately, we find that the optimizations at these two levels are not entirely orthogonal. Performing separate optimizations and then simply integrating them results in unsatisfactory performance. To tackle this issue, we propose CacheQuant, a novel training-free paradigm that comprehensively accelerates diffusion models by jointly optimizing model caching and quantization techniques. Specifically, we employ a dynamic programming approach to determine the optimal cache schedule, in which the properties of caching and quantization are carefully considered to minimize errors. Additionally, we propose decoupled error correction to further mitigate the coupled and accumulated errors step by step. Experimental results show that CacheQuant achieves a 5.18 speedup and 4 compression for Stable Diffusion on MS-COCO, with only a 0.02 loss in CLIP score. Our code are open-sourced: https://github.com/BienLuky/CacheQuant .
中文:CacheQuant框架通过联合优化模型缓存与量化技术,在保持性能基本不变的前提下,显著提升了扩散模型的推理速度并实现了模型压缩。
English: The proposed CacheQuant framework jointly optimizes model caching and quantization to comprehensively accelerate diffusion models, achieving significant speedup and compression with minimal performance degradation.

Authors:Anas Abdelkarim, Holger Voos, Daniel Görges
Title: ecg2o: A Seamless Extension of g2o for Equality-Constrained Factor Graph Optimization
Abstract:
Factor graph optimization serves as a fundamental framework for robotic perception, enabling applications such as pose estimation, simultaneous localization and mapping (SLAM), structure-from-motion (SfM), and situational awareness. Traditionally, these methods solve unconstrained least squares problems using algorithms such as Gauss-Newton and Levenberg-Marquardt. However, extending factor graphs with native support for equality constraints can improve solution accuracy and broaden their applicability, particularly in optimal control. In this paper, we propose a novel extension of factor graphs that seamlessly incorporates equality constraints without requiring additional optimization algorithms. Our approach maintains the efficiency and flexibility of existing second-order optimization techniques while ensuring constraint feasibility. To validate our method, we apply it to an optimal control problem for velocity tracking in autonomous vehicles and benchmark our results against state-of-the-art constraint handling techniques. Additionally, we introduce ecg2o, a header-only C++ library that extends the widely used g2o factor graph library by adding full support for equality-constrained optimization. This library, along with demonstrative examples and the optimal control problem, is available as open source at https://github.com/snt-arg/ecg2o
中文摘要:本文提出了一种新颖的因子图扩展方法,能够无缝整合等式约束,在保持计算效率的同时提高了解决方案精度,并拓宽了在最优控制等领域的应用范围。
English Summary: The paper introduces a novel extension to factor graphs that natively integrates equality constraints, enhancing solution accuracy and expanding applications in areas like optimal control while maintaining computational efficiency.

Authors:Yogesh Verma, Ayush Bharti, Vikas Garg
Title: Robust Simulation-Based Inference under Missing Data via Neural Processes
Abstract:
Simulation-based inference (SBI) methods typically require fully observed data to infer parameters of models with intractable likelihood functions. However, datasets often contain missing values due to incomplete observations, data corruptions (common in astrophysics), or instrument limitations (e.g., in high-energy physics applications). In such scenarios, missing data must be imputed before applying any SBI method. We formalize the problem of missing data in SBI and demonstrate that naive imputation methods can introduce bias in the estimation of SBI posterior. We also introduce a novel amortized method that addresses this issue by jointly learning the imputation model and the inference network within a neural posterior estimation (NPE) framework. Extensive empirical results on SBI benchmarks show that our approach provides robust inference outcomes compared to standard baselines for varying levels of missing data. Moreover, we demonstrate the merits of our imputation model on two real-world bioactivity datasets (Adrenergic and Kinase assays). Code is available at https://github.com/Aalto-QuML/RISE.
中文: 模拟推理方法通常需要完整数据,但缺失值若简单填补会引入偏差,为此我们提出了一种新型摊销方法,在神经框架内联合学习填补与推理,确保参数估计的稳健性。
English: Simulation-based inference methods often require complete data, but missing values can introduce bias when naively imputed, prompting the development of a novel amortized approach that jointly learns imputation and inference within a neural framework to ensure robust parameter estimation.

Authors:Yuxuan Chen, Long Zhang, Xu Zhu, Hua Zhou, Zhuyin Ren
Title: OptMetaOpenFOAM: Large Language Model Driven Chain of Thought for Sensitivity Analysis and Parameter Optimization based on CFD
Abstract:
Merging natural language interfaces with computational fluid dynamics (CFD) workflows presents transformative opportunities for both industry and research. In this study, we introduce OptMetaOpenFOAM - a novel framework that bridges MetaOpenFOAM with external analysis and optimization tool libraries through a large language model (LLM)-driven chain-of-thought (COT) methodology. By automating complex CFD tasks via natural language inputs, the framework empowers non-expert users to perform sensitivity analyses and parameter optimizations with markedly improved efficiency. The test dataset comprises 11 distinct CFD analysis or optimization tasks, including a baseline simulation task derived from an OpenFOAM tutorial covering fluid dynamics, combustion, and heat transfer. Results confirm that OptMetaOpenFOAM can accurately interpret user requirements expressed in natural language and effectively invoke external tool libraries alongside MetaOpenFOAM to complete the tasks. Furthermore, validation on a non-OpenFOAM tutorial case - namely, a hydrogen combustion chamber - demonstrates that a mere 200-character natural language input can trigger a sequence of simulation, postprocessing, analysis, and optimization tasks spanning over 2,000 lines of code. These findings underscore the transformative potential of LLM-driven COT methodologies in linking external tool for advanced analysis and optimization, positioning OptMetaOpenFOAM as an effective tool that streamlines CFD simulations and enhances their convenience and efficiency for both industrial and research applications. Code is available at https://github.com/Terry-cyx/MetaOpenFOAM.
中文:OptMetaOpenFOAM框架通过大语言模型驱动的思维链方法将MetaOpenFOAM与外部分析工具集成,使非专业用户能够通过自然语言输入高效执行复杂CFD任务,并在多个仿真优化场景中验证了其有效性。
English: The OptMetaOpenFOAM framework integrates MetaOpenFOAM with external analysis tools using a large language model-driven chain-of-thought approach, enabling non-experts to efficiently perform complex CFD tasks through natural language inputs and demonstrating its effectiveness across multiple simulation and optimization scenarios.

Authors:Linhao Li, Changhui Su, Yu Guo, Huimao Zhang, Dong Liang, Kun Shang
Title: Interactive Gadolinium-Free MRI Synthesis: A Transformer with Localization Prompt Learning
Abstract:
Contrast-enhanced magnetic resonance imaging (CE-MRI) is crucial for tumor detection and diagnosis, but the use of gadolinium-based contrast agents (GBCAs) in clinical settings raises safety concerns due to potential health risks. To circumvent these issues while preserving diagnostic accuracy, we propose a novel Transformer with Localization Prompts (TLP) framework for synthesizing CE-MRI from non-contrast MR images. Our architecture introduces three key innovations: a hierarchical backbone that uses efficient Transformer to process multi-scale features; a multi-stage fusion system consisting of Local and Global Fusion modules that hierarchically integrate complementary information via spatial attention operations and cross-attention mechanisms, respectively; and a Fuzzy Prompt Generation (FPG) module that enhances the TLP model's generalization by emulating radiologists' manual annotation through stochastic feature perturbation. The framework uniquely enables interactive clinical integration by allowing radiologists to input diagnostic prompts during inference, synergizing artificial intelligence with medical expertise. This research establishes a new paradigm for contrast-free MRI synthesis while addressing critical clinical needs for safer diagnostic procedures. Codes are available at https://github.com/ChanghuiSu/TLP.
中文摘要:本研究提出了一种带定位提示的Transformer(TLP)框架,通过分层特征处理和交互式临床集成,能够从非增强MRI合成对比增强图像,在保持诊断准确性的同时避免了钆基造影剂的使用需求。
English Summary: This study introduces a Transformer with Localization Prompts (TLP) framework that synthesizes contrast-enhanced MRI from non-contrast images, eliminating the need for gadolinium-based agents while maintaining diagnostic accuracy through hierarchical feature processing and interactive clinical integration.

Authors:Xuan Zhu, Jijun Xiang, Xianqi Wang, Longliang Liu, Yu Wang, Hong Zhang, Fei Guo, Xin Yang
Title: SVDC: Consistent Direct Time-of-Flight Video Depth Completion with Frequency Selective Fusion
Abstract:
Lightweight direct Time-of-Flight (dToF) sensors are ideal for 3D sensing on mobile devices. However, due to the manufacturing constraints of compact devices and the inherent physical principles of imaging, dToF depth maps are sparse and noisy. In this paper, we propose a novel video depth completion method, called SVDC, by fusing the sparse dToF data with the corresponding RGB guidance. Our method employs a multi-frame fusion scheme to mitigate the spatial ambiguity resulting from the sparse dToF imaging. Misalignment between consecutive frames during multi-frame fusion could cause blending between object edges and the background, which results in a loss of detail. To address this, we introduce an adaptive frequency selective fusion (AFSF) module, which automatically selects convolution kernel sizes to fuse multi-frame features. Our AFSF utilizes a channel-spatial enhancement attention (CSEA) module to enhance features and generates an attention map as fusion weights. The AFSF ensures edge detail recovery while suppressing high-frequency noise in smooth regions. To further enhance temporal consistency, We propose a cross-window consistency loss to ensure consistent predictions across different windows, effectively reducing flickering. Our proposed SVDC achieves optimal accuracy and consistency on the TartanAir and Dynamic Replica datasets. Code is available at https://github.com/Lan1eve/SVDC.
中文摘要:本文提出SVDC视频深度补全方法,通过融合稀疏直接飞行时间数据与RGB引导,采用多帧融合和自适应频率选择模块,在增强边缘细节和时序一致性的同时有效抑制噪声。
English Summary: The paper introduces SVDC, a video depth completion method that fuses sparse dToF data with RGB guidance using multi-frame fusion and an adaptive frequency selective module to enhance edge details and temporal consistency while reducing noise.

Authors:Xiaolong Yu, Junqiao Zhao, Shuangfu Song, Zhongyang Zhu, Zihan Yuan, Chen Ye, Tiantian Feng
Title: Convex Hull-based Algebraic Constraint for Visual Quadric SLAM
Abstract:
Using Quadrics as the object representation has the benefits of both generality and closed-form projection derivation between image and world spaces. Although numerous constraints have been proposed for dual quadric reconstruction, we found that many of them are imprecise and provide minimal improvements to localization.After scrutinizing the existing constraints, we introduce a concise yet more precise convex hull-based algebraic constraint for object landmarks, which is applied to object reconstruction, frontend pose estimation, and backend bundle adjustment.This constraint is designed to fully leverage precise semantic segmentation, effectively mitigating mismatches between complex-shaped object contours and dual quadrics.Experiments on public datasets demonstrate that our approach is applicable to both monocular and RGB-D SLAM and achieves improved object mapping and localization than existing quadric SLAM methods. The implementation of our method is available at https://github.com/tiev-tongji/convexhull-based-algebraic-constraint.
Chinese: 本研究提出了一种基于凸包的精确代数约束方法,通过充分利用语义分割技术,在单目和RGB-D SLAM系统中实现了比现有方法更优的物体建图与定位效果。
English: This study introduces a precise convex hull-based algebraic constraint for dual quadric reconstruction that enhances object mapping and localization in both monocular and RGB-D SLAM systems by leveraging accurate semantic segmentation.

Authors:Cong Ma, Du Wu, Zhelang Deng, Jiang Chen, Xiaowen Huang, Jintao Meng, Wenxi Zhu, Bingqiang Wang, Amelie Chi Zhou, Peng Chen, Minwen Deng, Yanjie Wei, Shengzhong Feng, Yi Pan
Title: NM-SpMM: Accelerating Matrix Multiplication Using N:M Sparsity with GPGPU
Abstract:
Deep learning demonstrates effectiveness across a wide range of tasks. However, the dense and over-parameterized nature of these models results in significant resource consumption during deployment. In response to this issue, weight pruning, particularly through N:M sparsity matrix multiplication, offers an efficient solution by transforming dense operations into semi-sparse ones. N:M sparsity provides an option for balancing performance and model accuracy, but introduces more complex programming and optimization challenges. To address these issues, we design a systematic top-down performance analysis model for N:M sparsity. Meanwhile, NM-SpMM is proposed as an efficient general N:M sparsity implementation. Based on our performance analysis, NM-SpMM employs a hierarchical blocking mechanism as a general optimization to enhance data locality, while memory access optimization and pipeline design are introduced as sparsity-aware optimization, allowing it to achieve close-to-theoretical peak performance across different sparsity levels. Experimental results show that NM-SpMM is 2.1x faster than nmSPARSE (the state-of-the-art for general N:M sparsity) and 1.4x to 6.3x faster than cuBLAS's dense GEMM operations, closely approaching the theoretical maximum speedup resulting from the reduction in computation due to sparsity. NM-SpMM is open source and publicly available at https://github.com/M-H482/NM-SpMM.
中文: 本研究提出NM-SpMM,一种针对N:M稀疏性的高效实现方案,通过分层分块和内存访问优化,在保持模型性能的同时,使矩阵运算速度比密集方法提升最高达6.3倍。
English: The study introduces NM-SpMM, an efficient implementation for N:M sparsity that utilizes hierarchical blocking and memory optimizations to significantly accelerate matrix operations, achieving up to 6.3x speedup over dense methods while maintaining model performance.

Authors:Guanyao Wu, Haoyu Liu, Hongming Fu, Yichuan Peng, Jinyuan Liu, Xin Fan, Risheng Liu
Title: Every SAM Drop Counts: Embracing Semantic Priors for Multi-Modality Image Fusion and Beyond
Abstract:
Multi-modality image fusion, particularly infrared and visible, plays a crucial role in integrating diverse modalities to enhance scene understanding. Although early research prioritized visual quality, preserving fine details and adapting to downstream tasks remains challenging. Recent approaches attempt task-specific design but rarely achieve "The Best of Both Worlds" due to inconsistent optimization goals. To address these issues, we propose a novel method that leverages the semantic knowledge from the Segment Anything Model (SAM) to Grow the quality of fusion results and Enable downstream task adaptability, namely SAGE. Specifically, we design a Semantic Persistent Attention (SPA) Module that efficiently maintains source information via the persistent repository while extracting high-level semantic priors from SAM. More importantly, to eliminate the impractical dependence on SAM during inference, we introduce a bi-level optimization-driven distillation mechanism with triplet losses, which allow the student network to effectively extract knowledge. Extensive experiments show that our method achieves a balance between high-quality visual results and downstream task adaptability while maintaining practical deployment efficiency. The code is available at https://github.com/RollingPlain/SAGE_IVIF.
中文摘要:提出的SAGE方法利用Segment Anything Model的语义知识,通过新型注意力模块和蒸馏机制实现多模态图像融合的平衡,在提升视觉质量的同时增强了下游任务的适应性。
English Summary: The proposed SAGE method leverages semantic knowledge from the Segment Anything Model to achieve balanced multi-modality image fusion, enhancing both visual quality and downstream task adaptability through a novel attention module and distillation mechanism.

Authors:Tianjie Ju, Yi Hua, Hao Fei, Zhenyu Shao, Yubin Zheng, Haodong Zhao, Mong-Li Lee, Wynne Hsu, Zhuosheng Zhang, Gongshen Liu
Title: Watch Out Your Album! On the Inadvertent Privacy Memorization in Multi-Modal Large Language Models
Abstract:
Multi-Modal Large Language Models (MLLMs) have exhibited remarkable performance on various vision-language tasks such as Visual Question Answering (VQA). Despite accumulating evidence of privacy concerns associated with task-relevant content, it remains unclear whether MLLMs inadvertently memorize private content that is entirely irrelevant to the training tasks. In this paper, we investigate how randomly generated task-irrelevant private content can become spuriously correlated with downstream objectives due to partial mini-batch training dynamics, thus causing inadvertent memorization. Concretely, we randomly generate task-irrelevant watermarks into VQA fine-tuning images at varying probabilities and propose a novel probing framework to determine whether MLLMs have inadvertently encoded such content. Our experiments reveal that MLLMs exhibit notably different training behaviors in partial mini-batch settings with task-irrelevant watermarks embedded. Furthermore, through layer-wise probing, we demonstrate that MLLMs trigger distinct representational patterns when encountering previously seen task-irrelevant knowledge, even if this knowledge does not influence their output during prompting. Our code is available at https://github.com/illusionhi/ProbingPrivacy.
中文: 研究表明多模态大语言模型在部分小批量训练中会通过虚假关联无意记忆与任务无关的私有内容,水印实验显示即使这些内容不影响输出,模型仍会触发不同的表征模式。
English: This study reveals that multi-modal large language models can inadvertently memorize task-irrelevant private content through spurious correlations in partial mini-batch training, as demonstrated by watermark experiments showing distinct representational patterns even when such content doesn't affect model outputs.

Authors:Xingyuan Li, Zirui Wang, Yang Zou, Zhixin Chen, Jun Ma, Zhiying Jiang, Long Ma, Jinyuan Liu
Title: DifIISR: A Diffusion Model with Gradient Guidance for Infrared Image Super-Resolution
Abstract:
Infrared imaging is essential for autonomous driving and robotic operations as a supportive modality due to its reliable performance in challenging environments. Despite its popularity, the limitations of infrared cameras, such as low spatial resolution and complex degradations, consistently challenge imaging quality and subsequent visual tasks. Hence, infrared image super-resolution (IISR) has been developed to address this challenge. While recent developments in diffusion models have greatly advanced this field, current methods to solve it either ignore the unique modal characteristics of infrared imaging or overlook the machine perception requirements. To bridge these gaps, we propose DifIISR, an infrared image super-resolution diffusion model optimized for visual quality and perceptual performance. Our approach achieves task-based guidance for diffusion by injecting gradients derived from visual and perceptual priors into the noise during the reverse process. Specifically, we introduce an infrared thermal spectrum distribution regulation to preserve visual fidelity, ensuring that the reconstructed infrared images closely align with high-resolution images by matching their frequency components. Subsequently, we incorporate various visual foundational models as the perceptual guidance for downstream visual tasks, infusing generalizable perceptual features beneficial for detection and segmentation. As a result, our approach gains superior visual results while attaining State-Of-The-Art downstream task performance. Code is available at https://github.com/zirui0625/DifIISR
中文摘要:提出的DifIISR模型通过融合热光谱分布调控以保持视觉保真度,并利用视觉基础模型作为感知指导,显著提升了红外图像超分辨率效果,在视觉质量和下游任务性能上均达到领先水平。
English Summary: The proposed DifIISR model enhances infrared image super-resolution by integrating thermal spectrum regulation for visual fidelity and leveraging visual foundation models as perceptual guidance, achieving superior image quality and state-of-the-art performance in downstream tasks.

Authors:Rin Ashizawa, Yoichi Hirose, Nozomu Yoshinari, Kento Uchida, Shinichi Shirakawa
Title: Bandit-Based Prompt Design Strategy Selection Improves Prompt Optimizers
Abstract:
Prompt optimization aims to search for effective prompts that enhance the performance of large language models (LLMs). Although existing prompt optimization methods have discovered effective prompts, they often differ from sophisticated prompts carefully designed by human experts. Prompt design strategies, representing best practices for improving prompt performance, can be key to improving prompt optimization. Recently, a method termed the Autonomous Prompt Engineering Toolbox (APET) has incorporated various prompt design strategies into the prompt optimization process. In APET, the LLM is needed to implicitly select and apply the appropriate strategies because prompt design strategies can have negative effects. This implicit selection may be suboptimal due to the limited optimization capabilities of LLMs. This paper introduces Optimizing Prompts with sTrategy Selection (OPTS), which implements explicit selection mechanisms for prompt design. We propose three mechanisms, including a Thompson sampling-based approach, and integrate them into EvoPrompt, a well-known prompt optimizer. Experiments optimizing prompts for two LLMs, Llama-3-8B-Instruct and GPT-4o mini, were conducted using BIG-Bench Hard. Our results show that the selection of prompt design strategies improves the performance of EvoPrompt, and the Thompson sampling-based mechanism achieves the best overall results. Our experimental code is provided at https://github.com/shiralab/OPTS .
中文摘要:OPTS通过引入显式策略选择机制优化大语言模型的提示设计,其中基于汤普森采样的方法在提升EvoPrompt性能方面表现最佳。
English Summary: OPTS introduces explicit strategy selection mechanisms to optimize prompts for large language models, with a Thompson sampling-based approach showing the best performance in enhancing EvoPrompt's effectiveness.

Authors:Chen Zhang, Mingxu Tao, Zhiyuan Liao, Yansong Feng
Title: MiLiC-Eval: Benchmarking Multilingual LLMs for China's Minority Languages
Abstract:
Large language models (LLMs) excel in high-resource languages but struggle with low-resource languages (LRLs), particularly those spoken by minority communities in China, such as Tibetan, Uyghur, Kazakh, and Mongolian. To systematically track the progress in these languages, we introduce MiLiC-Eval, a benchmark designed for minority languages in China, featuring 24K instances across 9 tasks. MiLiC-Eval focuses on underrepresented writing systems. Its parallelism between tasks and languages can provide a faithful and fine-grained assessment of linguistic and problem-solving skills. Our evaluation reveals that open-source LLMs perform poorly on syntax-intensive tasks and multi-script languages. We further demonstrate how MiLiC-Eval can help advance LRL research in handling diverse writing systems and understanding the process of language adaptation.
Chinese: MiLiC-Eval 是一个专为中国少数民族语言设计的评估基准,揭示了开源大语言模型在语法密集型任务和多文字语言处理上的不足,同时推动了低资源语言在文字系统适应方面的研究进展。
English: MiLiC-Eval is a benchmark introduced to assess the performance of large language models on underrepresented minority languages in China, revealing their struggles with syntax-intensive tasks and diverse writing systems while aiding research in language adaptation.

Authors:Zhipeng Huang, Shaobin Zhuang, Canmiao Fu, Binxin Yang, Ying Zhang, Chong Sun, Zhizheng Zhang, Yali Wang, Chen Li, Zheng-Jun Zha
Title: WeGen: A Unified Model for Interactive Multimodal Generation as We Chat
Abstract:
Existing multimodal generative models fall short as qualified design copilots, as they often struggle to generate imaginative outputs once instructions are less detailed or lack the ability to maintain consistency with the provided references. In this work, we introduce WeGen, a model that unifies multimodal generation and understanding, and promotes their interplay in iterative generation. It can generate diverse results with high creativity for less detailed instructions. And it can progressively refine prior generation results or integrating specific contents from references following the instructions in its chat with users. During this process, it is capable of preserving consistency in the parts that the user is already satisfied with. To this end, we curate a large-scale dataset, extracted from Internet videos, containing rich object dynamics and auto-labeled dynamics descriptions by advanced foundation models to date. These two information are interleaved into a single sequence to enable WeGen to learn consistency-aware generation where the specified dynamics are generated while the consistency of unspecified content is preserved aligned with instructions. Besides, we introduce a prompt self-rewriting mechanism to enhance generation diversity. Extensive experiments demonstrate the effectiveness of unifying multimodal understanding and generation in WeGen and show it achieves state-of-the-art performance across various visual generation benchmarks. These also demonstrate the potential of WeGen as a user-friendly design copilot as desired. The code and models will be available at https://github.com/hzphzp/WeGen.
Chinese: WeGen是一种统一的多模态模型,通过迭代生成提升创造性和一致性,作为用户友好的设计助手实现了最先进的性能。
English: WeGen is a unified multimodal model that enhances creativity and consistency in iterative generation, achieving state-of-the-art performance as a user-friendly design copilot.

Authors:Hui Liu, Chen Jia, Fan Shi, Xu Cheng, Shengyong Chen
Title: SCSegamba: Lightweight Structure-Aware Vision Mamba for Crack Segmentation in Structures
Abstract:
Pixel-level segmentation of structural cracks across various scenarios remains a considerable challenge. Current methods encounter challenges in effectively modeling crack morphology and texture, facing challenges in balancing segmentation quality with low computational resource usage. To overcome these limitations, we propose a lightweight Structure-Aware Vision Mamba Network (SCSegamba), capable of generating high-quality pixel-level segmentation maps by leveraging both the morphological information and texture cues of crack pixels with minimal computational cost. Specifically, we developed a Structure-Aware Visual State Space module (SAVSS), which incorporates a lightweight Gated Bottleneck Convolution (GBC) and a Structure-Aware Scanning Strategy (SASS). The key insight of GBC lies in its effectiveness in modeling the morphological information of cracks, while the SASS enhances the perception of crack topology and texture by strengthening the continuity of semantic information between crack pixels. Experiments on crack benchmark datasets demonstrate that our method outperforms other state-of-the-art (SOTA) methods, achieving the highest performance with only 2.8M parameters. On the multi-scenario dataset, our method reached 0.8390 in F1 score and 0.8479 in mIoU. The code is available at https://github.com/Karl1109/SCSegamba.
中文:提出的轻量级结构感知视觉Mamba网络(SCSegamba)通过建模裂缝形态和纹理,以最小计算资源有效解决分割难题,在基准数据集上实现了最优性能。
English: The proposed lightweight Structure-Aware Vision Mamba Network (SCSegamba) effectively addresses crack segmentation challenges by modeling morphology and texture with minimal computational resources, achieving state-of-the-art performance on benchmark datasets.

Authors:Kaiwen Zheng, Yongxin Chen, Huayu Chen, Guande He, Ming-Yu Liu, Jun Zhu, Qinsheng Zhang
Title: Direct Discriminative Optimization: Your Likelihood-Based Visual Generative Model is Secretly a GAN Discriminator
Abstract:
While likelihood-based generative models, particularly diffusion and autoregressive models, have achieved remarkable fidelity in visual generation, the maximum likelihood estimation (MLE) objective, which minimizes the forward KL divergence, inherently suffers from a mode-covering tendency that limits the generation quality under limited model capacity. In this work, we propose Direct Discriminative Optimization (DDO) as a unified framework that integrates likelihood-based generative training and GAN-type discrimination to bypass this fundamental constraint by exploiting reverse KL and self-generated negative signals. Our key insight is to parameterize a discriminator implicitly using the likelihood ratio between a learnable target model and a fixed reference model, drawing parallels with the philosophy of Direct Preference Optimization (DPO). Unlike GANs, this parameterization eliminates the need for joint training of generator and discriminator networks, allowing for direct, efficient, and effective finetuning of a well-trained model to its full potential beyond the limits of MLE. DDO can be performed iteratively in a self-play manner for progressive model refinement, with each round requiring less than 1% of pretraining epochs. Our experiments demonstrate the effectiveness of DDO by significantly advancing the previous SOTA diffusion model EDM, reducing FID scores from 1.79/1.58/1.96 to new records of 1.30/0.97/1.26 on CIFAR-10/ImageNet-64/ImageNet 512x512 datasets without any guidance mechanisms, and by consistently improving both guidance-free and CFG-enhanced FIDs of visual autoregressive models on ImageNet 256x256.
Chinese: 本文提出直接判别优化(DDO)框架,将基于似然的生成训练与GAN式判别相结合,突破最大似然估计的限制,在无需引导机制的情况下于多个数据集上实现了最先进的生成性能。
English: This paper introduces Direct Discriminative Optimization (DDO), a unified framework that combines likelihood-based generative training with GAN-type discrimination to overcome the limitations of maximum likelihood estimation, achieving state-of-the-art performance across multiple datasets without guidance mechanisms.

Authors:Wanjun Jia, Fan Yang, Mengfei Duan, Xianchi Chen, Yinxi Wang, Yiming Jiang, Wenrui Chen, Kailun Yang, Zhiyong Li
Title: One-Shot Affordance Grounding of Deformable Objects in Egocentric Organizing Scenes
Abstract:
Deformable object manipulation in robotics presents significant challenges due to uncertainties in component properties, diverse configurations, visual interference, and ambiguous prompts. These factors complicate both perception and control tasks. To address these challenges, we propose a novel method for One-Shot Affordance Grounding of Deformable Objects (OS-AGDO) in egocentric organizing scenes, enabling robots to recognize previously unseen deformable objects with varying colors and shapes using minimal samples. Specifically, we first introduce the Deformable Object Semantic Enhancement Module (DefoSEM), which enhances hierarchical understanding of the internal structure and improves the ability to accurately identify local features, even under conditions of weak component information. Next, we propose the ORB-Enhanced Keypoint Fusion Module (OEKFM), which optimizes feature extraction of key components by leveraging geometric constraints and improves adaptability to diversity and visual interference. Additionally, we propose an instance-conditional prompt based on image data and task context, which effectively mitigates the issue of region ambiguity caused by prompt words. To validate these methods, we construct a diverse real-world dataset, AGDDO15, which includes 15 common types of deformable objects and their associated organizational actions. Experimental results demonstrate that our approach significantly outperforms state-of-the-art methods, achieving improvements of 6.2%, 3.2%, and 2.9% in KLD, SIM, and NSS metrics, respectively, while exhibiting high generalization performance. Source code and benchmark dataset are made publicly available at https://github.com/Dikay1/OS-AGDO.
中文: 本文提出了一种新颖的单次可变形物体功能基元感知方法(OS-AGDO),通过语义增强模块和优化特征提取技术,使机器人能够用最少样本识别不同形态的可变形物体,在真实场景实验中显著超越了现有最优方法。
English: This paper introduces a novel One-Shot Affordance Grounding method for Deformable Objects (OS-AGDO) that enables robots to recognize unseen deformable objects with minimal samples through semantic enhancement and optimized feature extraction, achieving superior performance over state-of-the-art methods in real-world experiments.

Authors:Yu Fu, Michael Stanley Smith, Anastasios Panagiotelis
Title: Vector Copula Variational Inference and Dependent Block Posterior Approximations
Abstract:
The key to VI is the selection of a tractable density to approximate the Bayesian posterior. For large and complex models a common choice is to assume independence between multivariate blocks in a partition of the parameter space. While this simplifies the problem it can reduce accuracy. This paper proposes using vector copulas to capture dependence between the blocks parsimoniously. Tailored multivariate marginals are constructed using learnable transport maps. We call the resulting joint distribution a ``dependent block posterior'' approximation. Vector copula models are suggested that make tractable and flexible variational approximations. They allow for differing marginals, numbers of blocks, block sizes and forms of between block dependence. They also allow for solution of the variational optimization using efficient stochastic gradient methods. The approach is demonstrated using four different statistical models and 16 datasets which have posteriors that are challenging to approximate. This includes models that use global-local shrinkage priors for regularization, and hierarchical models for smoothing and heteroscedastic time series. In all cases, our method produces more accurate posterior approximations than benchmark VI methods that either assume block independence or factor-based dependence, at limited additional computational cost. A python package implementing the method is available on GitHub at https://github.com/YuFuOliver/VCVI_Rep_PyPackage.
中文: 本文提出利用向量耦合函数来简洁地捕捉参数块间依赖关系的变分推断方法,通过可学习的传输映射构建定制化多元边缘分布,在有限计算成本下实现了比基准方法更精确的后验逼近。
English: This paper introduces a variational inference method using vector copulas to capture block dependence parsimoniously, constructing tailored multivariate marginals with learnable transport maps to achieve more accurate posterior approximations than benchmark methods at limited computational cost.

Authors:Jacob Beck
Title: Offline RLAIF: Piloting VLM Feedback for RL via SFO
Abstract:
While internet-scale image and textual data have enabled strong generalization in Vision-Language Models (VLMs), the absence of internet-scale control data has impeded the development of similar generalization in standard reinforcement learning (RL) agents. Although VLMs are fundamentally limited in their ability to solve control tasks due to their lack of action-conditioned training data, their capacity for image understanding allows them to provide valuable feedback in RL tasks by recognizing successful outcomes. A key challenge in Reinforcement Learning from AI Feedback (RLAIF) is determining how best to integrate VLM-derived signals into the learning process. We explore this question in the context of offline RL and introduce a class of methods called Sub-Trajectory Filtered Optimization (SFO). We identify three key insights. First, trajectory length plays a crucial role in offline RL, as full-trajectory preference learning exacerbates the stitching problem, necessitating the use of sub-trajectories. Second, even in Markovian environments, a non-Markovian reward signal from a sequence of images is required to assess trajectory improvement, as VLMs do not interpret control actions and must rely on visual cues over time. Third, a simple yet effective approach--filtered and weighted behavior cloning--consistently outperforms more complex RLHF-based methods. We propose Sub-Trajectory Filtered Behavior Cloning (SFBC), a method that leverages VLM feedback on sub-trajectories while incorporating a retrospective filtering mechanism that removes sub-trajectories preceding failures to improve robustness and prevent turbulence. Please enjoy our airport puns.
中文: 视觉语言模型通过为子轨迹提供非马尔可夫反馈来增强强化学习,由此提出的子轨迹过滤行为克隆方法通过过滤行为克隆,优于更复杂的方法。
English: Vision-Language Models enhance reinforcement learning by providing non-Markovian feedback on sub-trajectories, leading to the development of Sub-Trajectory Filtered Behavior Cloning, which outperforms complex methods through filtered behavior cloning.

Authors:Lie Ju, Sijin Zhou, Yukun Zhou, Huimin Lu, Zhuoting Zhu, Pearse A. Keane, Zongyuan Ge
Title: Delving into Out-of-Distribution Detection with Medical Vision-Language Models
Abstract:
Recent advances in medical vision-language models (VLMs) demonstrate impressive performance in image classification tasks, driven by their strong zero-shot generalization capabilities. However, given the high variability and complexity inherent in medical imaging data, the ability of these models to detect out-of-distribution (OOD) data in this domain remains underexplored. In this work, we conduct the first systematic investigation into the OOD detection potential of medical VLMs. We evaluate state-of-the-art VLM-based OOD detection methods across a diverse set of medical VLMs, including both general and domain-specific purposes. To accurately reflect real-world challenges, we introduce a cross-modality evaluation pipeline for benchmarking full-spectrum OOD detection, rigorously assessing model robustness against both semantic shifts and covariate shifts. Furthermore, we propose a novel hierarchical prompt-based method that significantly enhances OOD detection performance. Extensive experiments are conducted to validate the effectiveness of our approach. The codes are available at https://github.com/PyJulie/Medical-VLMs-OOD-Detection.
中文: 本研究开创性地系统评估了医学视觉语言模型的分布外检测能力,提出了跨模态基准和新型分层提示方法,显著提升了检测性能。
English: This study pioneers a systematic evaluation of out-of-distribution detection in medical vision-language models, introducing a cross-modality benchmark and a novel hierarchical prompt method that significantly improves performance.

Authors:Yuhang Zhang, Zhiyao Zhang, Junyi Ji, Marcos Quiñones-Grueiro, William Barbour, Derek Gloudemans, Gergely Zachár, Clay Weston, Gautam Biswas, Daniel B. Work
Title: Real-World Deployment and Assessment of a Multi-Agent Reinforcement Learning-Based Variable Speed Limit Control System
Abstract:
This article presents the first field deployment of a multi-agent reinforcement learning (MARL) based variable speed limit (VSL) control system on Interstate 24 (I-24) near Nashville, Tennessee. We design and demonstrate a full pipeline from training MARL agents in a traffic simulator to a field deployment on a 17-mile segment of I-24 encompassing 67 VSL controllers. The system was launched on March 8th, 2024, and has made approximately 35 million decisions on 28 million trips in six months of operation. We apply an invalid action masking mechanism and several safety guards to ensure real-world constraints. The MARL-based implementation operates up to 98% of the time, with the safety guards overriding the MARL decisions for the remaining time. We evaluate the performance of the MARL-based algorithm in comparison to a previously deployed non-RL VSL benchmark algorithm on I-24. Results show that the MARL-based VSL control system achieves a superior performance. The accuracy of correctly warning drivers about slowing traffic ahead is improved by 14% and the response delay to non-recurrent congestion is reduced by 75%. The preliminary data shows that the VSL control system has reduced the crash rate by 26% and the secondary crash rate by 50%. We open-sourced the deployed MARL-based VSL algorithm at https://github.com/Lab-Work/marl-vsl-controller.
中文: 本研究首次在24号州际公路17英里路段部署了基于多智能体强化学习的可变限速控制系统,相比原有方法显著提升了交通预警准确率14%、拥堵响应速度75%,并有效降低了26%的事故率和50%的二次事故率。
English: This study details the first real-world deployment of a multi-agent reinforcement learning-based variable speed limit control system on a 17-mile stretch of Interstate 24, demonstrating superior performance over previous methods with significant improvements in traffic warning accuracy, congestion response speed, and crash reduction rates.

Authors:Baoqi Pei, Yifei Huang, Jilan Xu, Guo Chen, Yuping He, Lijin Yang, Yali Wang, Weidi Xie, Yu Qiao, Fei Wu, Limin Wang
Title: Modeling Fine-Grained Hand-Object Dynamics for Egocentric Video Representation Learning
Abstract:
In egocentric video understanding, the motion of hands and objects as well as their interactions play a significant role by nature. However, existing egocentric video representation learning methods mainly focus on aligning video representation with high-level narrations, overlooking the intricate dynamics between hands and objects. In this work, we aim to integrate the modeling of fine-grained hand-object dynamics into the video representation learning process. Since no suitable data is available, we introduce HOD, a novel pipeline employing a hand-object detector and a large language model to generate high-quality narrations with detailed descriptions of hand-object dynamics. To learn these fine-grained dynamics, we propose EgoVideo, a model with a new lightweight motion adapter to capture fine-grained hand-object motion information. Through our co-training strategy, EgoVideo effectively and efficiently leverages the fine-grained hand-object dynamics in the HOD data. Extensive experiments demonstrate that our method achieves state-of-the-art performance across multiple egocentric downstream tasks, including improvements of 6.3% in EK-100 multi-instance retrieval, 5.7% in EK-100 classification, and 16.3% in EGTEA classification in zero-shot settings. Furthermore, our model exhibits robust generalization capabilities in hand-object interaction and robot manipulation tasks. Code and data are available at https://github.com/OpenRobotLab/EgoHOD/.
中文: 本研究提出EgoVideo模型,通过创新的HOD流程和运动适配器将细粒度手物动态整合到自我中心视频表征学习中,在多个下游任务中实现了领先性能。
English: This work introduces EgoVideo, a model that integrates fine-grained hand-object dynamics into egocentric video representation learning using a novel HOD pipeline and motion adapter, achieving state-of-the-art performance across multiple downstream tasks.

Authors:Dien X. Tran, Nam V. Nguyen, Thanh T. Tran, Anh T. Hoang, Tai V. Duong, Di T. Le, Phuc-Lu Le
Title: SemViQA: A Semantic Question Answering System for Vietnamese Information Fact-Checking
Abstract:
The rise of misinformation, exacerbated by Large Language Models (LLMs) like GPT and Gemini, demands robust fact-checking solutions, especially for low-resource languages like Vietnamese. Existing methods struggle with semantic ambiguity, homonyms, and complex linguistic structures, often trading accuracy for efficiency. We introduce SemViQA, a novel Vietnamese fact-checking framework integrating Semantic-based Evidence Retrieval (SER) and Two-step Verdict Classification (TVC). Our approach balances precision and speed, achieving state-of-the-art results with 78.97\% strict accuracy on ISE-DSC01 and 80.82\% on ViWikiFC, securing 1st place in the UIT Data Science Challenge. Additionally, SemViQA Faster improves inference speed 7x while maintaining competitive accuracy. SemViQA sets a new benchmark for Vietnamese fact verification, advancing the fight against misinformation. The source code is available at: https://github.com/DAVID-NGUYEN-S16/SemViQA.
中文:SemViQA提出了一种创新的越南语事实核查框架,融合了语义证据检索和两步裁决分类,在实现顶尖准确率的同时显著提升推理速度,为打击虚假信息树立了新标杆。
English: SemViQA introduces a novel Vietnamese fact-checking framework that combines Semantic-based Evidence Retrieval and Two-step Verdict Classification, achieving state-of-the-art accuracy and improved inference speed to combat misinformation effectively.

Authors:Dien X. Tran, Nam V. Nguyen, Thanh T. Tran, Anh T. Hoang, Tai V. Duong, Di T. Le, Phuc-Lu Le
Title: SemViQA: A Semantic Question Answering System for Vietnamese Information Fact-Checking
Abstract:
The rise of misinformation, exacerbated by Large Language Models (LLMs) like GPT and Gemini, demands robust fact-checking solutions, especially for low-resource languages like Vietnamese. Existing methods struggle with semantic ambiguity, homonyms, and complex linguistic structures, often trading accuracy for efficiency. We introduce SemViQA, a novel Vietnamese fact-checking framework integrating Semantic-based Evidence Retrieval (SER) and Two-step Verdict Classification (TVC). Our approach balances precision and speed, achieving state-of-the-art results with 78.97\% strict accuracy on ISE-DSC01 and 80.82\% on ViWikiFC, securing 1st place in the UIT Data Science Challenge. Additionally, SemViQA Faster improves inference speed 7x while maintaining competitive accuracy. SemViQA sets a new benchmark for Vietnamese fact verification, advancing the fight against misinformation. The source code is available at: https://github.com/DAVID-NGUYEN-S16/SemViQA.
中文:SemViQA提出了一种创新的越南语事实核查框架,融合了语义证据检索和两步裁决分类,在实现顶尖准确率的同时显著提升推理速度,为打击虚假信息树立了新标杆。
English: SemViQA introduces a novel Vietnamese fact-checking framework that combines Semantic-based Evidence Retrieval and Two-step Verdict Classification, achieving state-of-the-art accuracy and improved inference speed to combat misinformation effectively.

Authors:Xingzhuo Guo, Yu Zhang, Baixu Chen, Haoran Xu, Jianmin Wang, Mingsheng Long
Title: Dynamical Diffusion: Learning Temporal Dynamics with Diffusion Models
Abstract:
Diffusion models have emerged as powerful generative frameworks by progressively adding noise to data through a forward process and then reversing this process to generate realistic samples. While these models have achieved strong performance across various tasks and modalities, their application to temporal predictive learning remains underexplored. Existing approaches treat predictive learning as a conditional generation problem, but often fail to fully exploit the temporal dynamics inherent in the data, leading to challenges in generating temporally coherent sequences. To address this, we introduce Dynamical Diffusion (DyDiff), a theoretically sound framework that incorporates temporally aware forward and reverse processes. Dynamical Diffusion explicitly models temporal transitions at each diffusion step, establishing dependencies on preceding states to better capture temporal dynamics. Through the reparameterization trick, Dynamical Diffusion achieves efficient training and inference similar to any standard diffusion model. Extensive experiments across scientific spatiotemporal forecasting, video prediction, and time series forecasting demonstrate that Dynamical Diffusion consistently improves performance in temporal predictive tasks, filling a crucial gap in existing methodologies. Code is available at this repository: https://github.com/thuml/dynamical-diffusion.
中文: Dynamical Diffusion (DyDiff) 提出了一种时间感知框架,通过显式建模时间动态增强扩散模型,从而在多种预测任务中提升性能。
English: Dynamical Diffusion (DyDiff) introduces a temporally aware framework that enhances diffusion models by explicitly modeling temporal dynamics, leading to improved performance in various predictive tasks.

Authors:Zhuohang Jiang, Pangjing Wu, Ziran Liang, Peter Q. Chen, Xu Yuan, Ye Jia, Jiancheng Tu, Chen Li, Peter H. F. Ng, Qing Li
Title: HiBench: Benchmarking LLMs Capability on Hierarchical Structure Reasoning
Abstract:
Structure reasoning is a fundamental capability of large language models (LLMs), enabling them to reason about structured commonsense and answer multi-hop questions. However, existing benchmarks for structure reasoning mainly focus on horizontal and coordinate structures (\emph{e.g.} graphs), overlooking the hierarchical relationships within them. Hierarchical structure reasoning is crucial for human cognition, particularly in memory organization and problem-solving. It also plays a key role in various real-world tasks, such as information extraction and decision-making. To address this gap, we propose HiBench, the first framework spanning from initial structure generation to final proficiency assessment, designed to benchmark the hierarchical reasoning capabilities of LLMs systematically. HiBench encompasses six representative scenarios, covering both fundamental and practical aspects, and consists of 30 tasks with varying hierarchical complexity, totaling 39,519 queries. To evaluate LLMs comprehensively, we develop five capability dimensions that depict different facets of hierarchical structure understanding. Through extensive evaluation of 20 LLMs from 10 model families, we reveal key insights into their capabilities and limitations: 1) existing LLMs show proficiency in basic hierarchical reasoning tasks; 2) they still struggle with more complex structures and implicit hierarchical representations, especially in structural modification and textual reasoning. Based on these findings, we create a small yet well-designed instruction dataset, which enhances LLMs' performance on HiBench by an average of 88.84\% (Llama-3.1-8B) and 31.38\% (Qwen2.5-7B) across all tasks. The HiBench dataset and toolkit are available here, https://github.com/jzzzzh/HiBench, to encourage evaluation.
中文: HiBench是一个全面评估大语言模型层次推理能力的框架,通过涵盖六个场景和30项任务弥补现有基准的不足,研究发现LLMs在基础层次任务表现良好,但在复杂结构和隐式表征方面仍有困难,而针对性指令数据能显著提升其性能。
English: HiBench is a comprehensive framework designed to evaluate the hierarchical reasoning capabilities of large language models, addressing the gap in existing benchmarks by covering six scenarios and 30 tasks, and it reveals that while LLMs excel in basic hierarchical tasks, they struggle with complex structures and implicit representations, with performance significantly improved through targeted instruction data.

Authors:Zhu Liu, Zijun Wang, Jinyuan Liu, Fanqi Meng, Long Ma, Risheng Liu
Title: DEAL: Data-Efficient Adversarial Learning for High-Quality Infrared Imaging
Abstract:
Thermal imaging is often compromised by dynamic, complex degradations caused by hardware limitations and unpredictable environmental factors. The scarcity of high-quality infrared data, coupled with the challenges of dynamic, intricate degradations, makes it difficult to recover details using existing methods. In this paper, we introduce thermal degradation simulation integrated into the training process via a mini-max optimization, by modeling these degraded factors as adversarial attacks on thermal images. The simulation is dynamic to maximize objective functions, thus capturing a broad spectrum of degraded data distributions. This approach enables training with limited data, thereby improving model performance.Additionally, we introduce a dual-interaction network that combines the benefits of spiking neural networks with scale transformation to capture degraded features with sharp spike signal intensities. This architecture ensures compact model parameters while preserving efficient feature representation. Extensive experiments demonstrate that our method not only achieves superior visual quality under diverse single and composited degradation, but also delivers a significant reduction in processing when trained on only fifty clear images, outperforming existing techniques in efficiency and accuracy. The source code will be available at https://github.com/LiuZhu-CV/DEAL.
中文: 本文提出了一种热成像退化模拟方法,通过极小极大优化将退化建模为对抗攻击,结合脉冲神经网络与尺度变换的双交互网络,在有限数据下实现高效训练和模型性能提升。
English: This paper introduces a thermal degradation simulation method using mini-max optimization to model degradations as adversarial attacks, enabling effective training with limited data and enhanced model performance through a dual-interaction network that combines spiking neural networks with scale transformation for efficient feature representation.

Authors:Jing Peng, Meiqi Yang, Qiong Zhang, Xiaoxiao Li
Title: S4M: S4 for multivariate time series forecasting with Missing values
Abstract:
Multivariate time series data play a pivotal role in a wide range of real-world applications. However, the presence of block missing data introduces significant challenges, often compromising the performance of predictive models. Traditional two-step approaches, which first impute missing values and then perform forecasting, are prone to error accumulation, particularly in complex multivariate settings characterized by high missing ratios and intricate dependency structures. In this work, we introduce S4M, an end-to-end time series forecasting framework that seamlessly integrates missing data handling into the Structured State Space Sequence (S4) model architecture. Unlike conventional methods that treat imputation as a separate preprocessing step, S4M leverages the latent space of S4 models to directly recognize and represent missing data patterns, thereby more effectively capturing the underlying temporal and multivariate dependencies. Our framework comprises two key components: the Adaptive Temporal Prototype Mapper (ATPM) and the Missing-Aware Dual Stream S4 (MDS-S4). The ATPM employs a prototype bank to derive robust and informative representations from historical data patterns, while the MDS-S4 processes these representations alongside missingness masks as dual input streams to enable accurate forecasting. Through extensive empirical evaluations on diverse real-world datasets, we demonstrate that S4M consistently achieves state-of-the-art performance. These results underscore the efficacy of our integrated approach in handling missing data, showcasing its robustness and superiority over traditional imputation-based methods. Our findings highlight the potential of S4M to advance reliable time series forecasting in practical applications, offering a promising direction for future research and deployment. Code is available at https://github.com/WINTERWEEL/S4M.git.
中文: S4M框架将缺失数据处理直接整合到结构化状态空间序列模型中,通过自适应时间映射和双流处理,无需独立插补步骤即可实现最先进的预测性能。
English: The S4M framework integrates missing data handling directly into the Structured State Space Sequence model, using adaptive temporal mapping and dual-stream processing to achieve state-of-the-art forecasting performance without separate imputation steps.

Authors:Qia Hu, Bo Jiao
Title: Hierarchical graph sampling based minibatch learning with chain preservation and variance reduction
Abstract:
Graph sampling-based Graph Convolutional Networks (GCNs) decouple sampling from forward and backward propagation during minibatch training, enhancing scalability with respect to layer depth and graph size. We propose HIS_GCNs, a hierarchical importance sampling-based learning method. By constructing minibatches using sampled subgraphs, HIS_GCNs focuses on the importance of both the core and periphery in a scale-free training graph. Specifically, it preserves the centrum of the core in most minibatches, which maintains connectivity between periphery nodes, and samples periphery edges without core node interference, which allows longer chains composed entirely of low-degree nodes remain within the same minibatch. HIS_GCNs can maximize the discrete Ricci curvature (i.e., Ollivier-Ricci curvatures) of the edges in a subgraph, enabling preservation of important chains for information propagation. This approach can achieve a low node embedding variance and a high convergence speed. Diverse experiments on Graph Neural Networks (GNNs) with node classification tasks confirmed the superior performance of HIS_GCNs in terms of both accuracy and training time. Open-source code (https://github.com/HuQiaCHN/HIS-GCN).
中文摘要:HIS_GCNs提出了一种基于层次重要性采样的图卷积网络方法,通过保持无标度图中核心-边缘结构来构建小批量训练样本,从而优化信息传播路径,实现了更高精度和更快收敛速度。
English Summary: HIS_GCNs introduces a hierarchical importance sampling method that constructs minibatches by preserving core-periphery structures in scale-free graphs, achieving higher accuracy and faster convergence through optimized information propagation paths.

Authors:Rui Yi Yong, Samuel Picosson, Arnold Wiliem
Title: MTReD: 3D Reconstruction Dataset for Fly-over Videos of Maritime Domain
Abstract:
This work tackles 3D scene reconstruction for a video fly-over perspective problem in the maritime domain, with a specific emphasis on geometrically and visually sound reconstructions. This will allow for downstream tasks such as segmentation, navigation, and localization. To our knowledge, there is no dataset available in this domain. As such, we propose a novel maritime 3D scene reconstruction benchmarking dataset, named as MTReD (Maritime Three-Dimensional Reconstruction Dataset). The MTReD comprises 19 fly-over videos curated from the Internet containing ships, islands, and coastlines. As the task is aimed towards geometrical consistency and visual completeness, the dataset uses two metrics: (1) Reprojection error; and (2) Perception based metrics. We find that existing perception-based metrics, such as Learned Perceptual Image Patch Similarity (LPIPS), do not appropriately measure the completeness of a reconstructed image. Thus, we propose a novel semantic similarity metric utilizing DINOv2 features coined DiFPS (DinoV2 Features Perception Similarity). We perform initial evaluation on two baselines: (1) Structured from Motion (SfM) through Colmap; and (2) the recent state-of-the-art MASt3R model. We find that the reconstructed scenes by MASt3R have higher reprojection errors, but superior perception based metric scores. To this end, some pre-processing methods are explored, and we find a pre-processing method which improves both the reprojection error and perception-based score. We envisage our proposed MTReD to stimulate further research in these directions. The dataset and all the code will be made available in https://github.com/RuiYiYong/MTReD.
中文: 本研究提出MTReD这一新型海事三维重建数据集,填补了飞越视角视频数据空白,通过新提出的语义相似度指标DiFPS进行评估,并展示了预处理方法对提升重建质量的有效性。
English: This study introduces MTReD, a novel maritime 3D reconstruction dataset addressing the lack of fly-over video data, featuring a new semantic similarity metric (DiFPS) and demonstrating improved reconstruction performance through preprocessing methods.

Authors:Yalun Dai, Lingao Xiao, Ivor W. Tsang, Yang He
Title: Training-Free Dataset Pruning for Instance Segmentation
Abstract:
Existing dataset pruning techniques primarily focus on classification tasks, limiting their applicability to more complex and practical tasks like instance segmentation. Instance segmentation presents three key challenges: pixel-level annotations, instance area variations, and class imbalances, which significantly complicate dataset pruning efforts. Directly adapting existing classification-based pruning methods proves ineffective due to their reliance on time-consuming model training process. To address this, we propose a novel Training-Free Dataset Pruning (TFDP) method for instance segmentation. Specifically, we leverage shape and class information from image annotations to design a Shape Complexity Score (SCS), refining it into a Scale-Invariant (SI-SCS) and Class-Balanced (CB-SCS) versions to address instance area variations and class imbalances, all without requiring model training. We achieve state-of-the-art results on VOC 2012, Cityscapes, and COCO datasets, generalizing well across CNN and Transformer architectures. Remarkably, our approach accelerates the pruning process by an average of 1349$\times$ on COCO compared to the adapted baselines. Source code is available at: https://github.com/he-y/dataset-pruning-for-instance-segmentation
中文: 本文提出了一种无需训练的实例分割数据集剪枝方法(TFDP),利用标注中的形状和类别信息设计复杂度评分,无需模型训练即可在多个数据集上取得最优性能,并将剪枝速度提升了1300倍以上。
English: This paper introduces a Training-Free Dataset Pruning (TFDP) method for instance segmentation that uses shape and class information from annotations to create complexity scores, achieving state-of-the-art results and accelerating pruning by over 1300 times without model training.

Authors:Bowen Zheng, Da-Wei Zhou, Han-Jia Ye, De-Chuan Zhan
Title: Task-Agnostic Guided Feature Expansion for Class-Incremental Learning
Abstract:
The ability to learn new concepts while preserve the learned knowledge is desirable for learning systems in Class-Incremental Learning (CIL). Recently, feature expansion of the model become a prevalent solution for CIL, where the old features are fixed during the training of the new task while new features are expanded for the new tasks. However, such task-specific features learned from the new task may collide with the old features, leading to misclassification between tasks. Therefore, the expanded model is often encouraged to capture diverse features from the new task, aiming to avoid such collision. However, the existing solution is largely restricted to the samples from the current task, because of the poor accessibility to previous samples. To promote the learning and transferring of diverse features across tasks, we propose a framework called Task-Agnostic Guided Feature Expansion (TagFex). Firstly, it captures task-agnostic features continually with a separate model, providing extra task-agnostic features for subsequent tasks. Secondly, to obtain useful features from the task-agnostic model for the current task, it aggregates the task-agnostic features with the task-specific feature using a merge attention. Then the aggregated feature is transferred back into the task-specific feature for inference, helping the task-specific model capture diverse features. Extensive experiments show the effectiveness and superiority of TagFex on various CIL settings. Code is available at https://github.com/bwnzheng/TagFex_CVPR2025.
中文: 提出的TagFex框架通过持续捕获任务无关特征,并利用注意力机制将其与任务特定特征融合,有效解决了类增量学习中的特征冲突问题,实现了跨任务的多样化特征学习与知识保持。
English: The proposed TagFex framework addresses feature collision in Class-Incremental Learning by continually capturing task-agnostic features and merging them with task-specific features through attention mechanisms, enabling diverse feature learning across tasks while maintaining performance.

Authors:Lu Ma, Kaibo Cao, Hao Liang, Jiaxin Lin, Zhuang Li, Yuhong Liu, Jihong Zhang, Wentao Zhang, Bin Cui
Title: Evaluating and Predicting Distorted Human Body Parts for Generated Images
Abstract:
Recent advancements in text-to-image (T2I) models enable high-quality image synthesis, yet generating anatomically accurate human figures remains challenging. AI-generated images frequently exhibit distortions such as proliferated limbs, missing fingers, deformed extremities, or fused body parts. Existing evaluation metrics like Inception Score (IS) and Fréchet Inception Distance (FID) lack the granularity to detect these distortions, while human preference-based metrics focus on abstract quality assessments rather than anatomical fidelity. To address this gap, we establish the first standards for identifying human body distortions in AI-generated images and introduce Distortion-5K, a comprehensive dataset comprising 4,700 annotated images of normal and malformed human figures across diverse styles and distortion types. Based on this dataset, we propose ViT-HD, a Vision Transformer-based model tailored for detecting human body distortions in AI-generated images, which outperforms state-of-the-art segmentation models and visual language models, achieving an F1 score of 0.899 and IoU of 0.831 on distortion localization. Additionally, we construct the Human Distortion Benchmark with 500 human-centric prompts to evaluate four popular T2I models using trained ViT-HD, revealing that nearly 50\% of generated images contain distortions. This work pioneers a systematic approach to evaluating anatomical accuracy in AI-generated humans, offering tools to advance the fidelity of T2I models and their real-world applicability. The Distortion-5K dataset, trained ViT-HD will soon be released in our GitHub repository: \href{https://github.com/TheRoadQaQ/Predicting-Distortion}{https://github.com/TheRoadQaQ/Predicting-Distortion}.
Chinese: 本研究推出了首个用于识别AI生成图像中人体畸变的标注数据集Distortion-5K,并提出基于视觉Transformer的ViT-HD模型,该模型在检测解剖结构异常方面优于现有方法,揭示近半数生成图像存在此类缺陷。
English: This study introduces Distortion-5K, the first annotated dataset for identifying human body distortions in AI-generated images, and proposes ViT-HD, a Vision Transformer model that outperforms existing methods in detecting anatomical inaccuracies, revealing nearly half of generated images contain such flaws.

Authors:Kashun Shum, Yuzhen Huang, Hongjian Zou, Qi Ding, Yixuan Liao, Xiaoxin Chen, Qian Liu, Junxian He
Title: Predictive Data Selection: The Data That Predicts Is the Data That Teaches
Abstract:
Language model pretraining involves training on extensive corpora, where data quality plays a pivotal role. In this work, we aim to directly estimate the contribution of data during pretraining and select pretraining data in an efficient manner. Specifically, we draw inspiration from recent findings showing that compression efficiency (i.e., the normalized loss) of diverse models on certain text correlates strongly with their downstream performance, when the text domain aligns with the downstream benchmarks(Huang et al., 2024). Building on this observation, we hypothesize that data on which model losses are predictive of downstream abilities also contribute effectively to learning, which shares similar intuition with Thrush et al.(2024). To leverage this insight, we introduce predictive data selection (PreSelect), a lightweight and efficient data selection method that requires training and deploying only a fastText-based scorer. Through comprehensive experiments with 1B and 3B parameter models, we demonstrate that models trained on 30B tokens selected with PreSelect surpass the performance of the vanilla baseline trained on 300B tokens, achieving a 10x reduction in compute requirements. Furthermore, PreSelect significantly outperforms other competitive data selection baselines, such as DCLM and FineWeb-Edu on a scale of 3B models trained on 100B tokens. We open-source our trained data selection scorer along with the curated datasets at https://github.com/hkust-nlp/PreSelect.
Chinese: 本研究提出PreSelect方法,通过轻量级评分器高效筛选预训练数据,仅用300亿标记中的10%数据训练模型,在减少90%计算量的同时显著超越基线模型性能。
English: This study introduces PreSelect, an efficient data selection method that uses a lightweight scorer to identify high-quality pretraining data, achieving superior model performance with 10x less compute by training on only 30B tokens instead of 300B.

Authors:Xulin Chen, Junzhou Huang
Title: DELST: Dual Entailment Learning for Hyperbolic Image-Gene Pretraining in Spatial Transcriptomics
Abstract:
Spatial transcriptomics (ST) maps gene expression within tissue at individual spots, making it a valuable resource for multimodal representation learning. Additionally, ST inherently contains rich hierarchical information both across and within modalities. For instance, different spots exhibit varying numbers of nonzero gene expressions, corresponding to different levels of cellular activity and semantic hierarchies. However, existing methods rely on contrastive alignment of image-gene pairs, failing to accurately capture the intricate hierarchical relationships in ST data. Here, we propose DELST, the first framework to embed hyperbolic representations while modeling hierarchy for image-gene pretraining at two levels: (1) Cross-modal entailment learning, which establishes an order relationship between genes and images to enhance image representation generalization; (2) Intra-modal entailment learning, which encodes gene expression patterns as hierarchical relationships, guiding hierarchical learning across different samples at a global scale and integrating biological insights into single-modal representations. Extensive experiments on ST benchmarks annotated by pathologists demonstrate the effectiveness of our framework, achieving improved predictive performance compared to existing methods. Our code and models are available at: https://github.com/XulinChen/DELST.
中文: 提出的DELST框架通过双模态和单模态蕴含学习引入双曲表征嵌入,有效捕捉空间转录组数据中的层级关系,在基准测试中优于现有方法。
English: The proposed DELST framework introduces hyperbolic representation embedding with cross-modal and intra-modal entailment learning to capture hierarchical relationships in spatial transcriptomics data, outperforming existing methods on benchmarks.

Authors:Jayden Teoh, Pradeep Varakantham, Peter Vamplew
Title: On Generalization Across Environments In Multi-Objective Reinforcement Learning
Abstract:
Real-world sequential decision-making tasks often require balancing trade-offs between multiple conflicting objectives, making Multi-Objective Reinforcement Learning (MORL) an increasingly prominent field of research. Despite recent advances, existing MORL literature has narrowly focused on performance within static environments, neglecting the importance of generalizing across diverse settings. Conversely, existing research on generalization in RL has always assumed scalar rewards, overlooking the inherent multi-objectivity of real-world problems. Generalization in the multi-objective context is fundamentally more challenging, as it requires learning a Pareto set of policies addressing varying preferences across multiple objectives. In this paper, we formalize the concept of generalization in MORL and how it can be evaluated. We then contribute a novel benchmark featuring diverse multi-objective domains with parameterized environment configurations to facilitate future studies in this area. Our baseline evaluations of state-of-the-art MORL algorithms on this benchmark reveals limited generalization capabilities, suggesting significant room for improvement. Our empirical findings also expose limitations in the expressivity of scalar rewards, emphasizing the need for multi-objective specifications to achieve effective generalization. We further analyzed the algorithmic complexities within current MORL approaches that could impede the transfer in performance from the single- to multiple-environment settings. This work fills a critical gap and lays the groundwork for future research that brings together two key areas in reinforcement learning: solving multi-objective decision-making problems and generalizing across diverse environments. We make our code available at https://github.com/JaydenTeoh/MORL-Generalization.
Chinese: 本文提出了多目标强化学习的新基准,旨在解决跨环境泛化的关键问题,揭示了现有算法的局限性,并强调了多目标规范对于实现有效泛化的重要性。
English: This paper introduces a novel benchmark for Multi-Objective Reinforcement Learning (MORL) to address the critical gap in generalization across diverse environments, revealing limited capabilities of current algorithms and emphasizing the need for multi-objective specifications.

Authors:Ukcheol Shin, Kyunghyun Lee, Jean Oh
Title: Bridging Spectral-wise and Multi-spectral Depth Estimation via Geometry-guided Contrastive Learning
Abstract:
Deploying depth estimation networks in the real world requires high-level robustness against various adverse conditions to ensure safe and reliable autonomy. For this purpose, many autonomous vehicles employ multi-modal sensor systems, including an RGB camera, NIR camera, thermal camera, LiDAR, or Radar. They mainly adopt two strategies to use multiple sensors: modality-wise and multi-modal fused inference. The former method is flexible but memory-inefficient, unreliable, and vulnerable. Multi-modal fusion can provide high-level reliability, yet it needs a specialized architecture. In this paper, we propose an effective solution, named align-and-fuse strategy, for the depth estimation from multi-spectral images. In the align stage, we align embedding spaces between multiple spectrum bands to learn shareable representation across multi-spectral images by minimizing contrastive loss of global and spatially aligned local features with geometry cue. After that, in the fuse stage, we train an attachable feature fusion module that can selectively aggregate the multi-spectral features for reliable and robust prediction results. Based on the proposed method, a single-depth network can achieve both spectral-invariant and multi-spectral fused depth estimation while preserving reliability, memory efficiency, and flexibility.
Chinese: 本文提出一种对齐与融合策略,用于多光谱图像的稳健深度估计,通过对齐不同光谱的嵌入空间并选择性融合特征,实现可靠、内存高效且灵活的预测结果。
English: This paper introduces an align-and-fuse strategy for robust depth estimation from multi-spectral images, which aligns embedding spaces across spectra and selectively fuses features to achieve reliable, memory-efficient, and flexible predictions.

Authors:Kai Lv, Honglin Guo, Qipeng Guo, Xipeng Qiu
Title: DuoDecoding: Hardware-aware Heterogeneous Speculative Decoding with Dynamic Multi-Sequence Drafting
Abstract:
Large language models (LLMs) exhibit exceptional performance across a wide range of tasks; however, their token-by-token autoregressive generation process significantly hinders inference speed. Speculative decoding presents a promising draft-then-verify framework that reduces generation latency while maintaining output distribution fidelity. Nevertheless, the draft model introduces additional computational overhead, becoming a performance bottleneck and increasing the time to first token (TTFT). Previous approaches to mitigate draft model overhead have primarily relied on heuristics and generally failed to match the quality of the draft language models. To address these challenges, we propose DuoDecoding, a novel approach that strategically deploys the draft and target models on the CPU and GPU respectively, enabling parallel decoding while preserving draft quality. Our method incorporates a hardware-aware optimal draft budget to minimize idle times and employs dynamic multi-sequence drafting to enhance draft quality. Extensive experiments across seven tasks show that DuoDecoding achieves up to 2.61x speedup in generation latency, while reducing TTFT to 83% of that in conventional speculative decoding. The Code is available at https://github.com/KaiLv69/DuoDecoding.
中文摘要:DuoDecoding是一种新颖的推测解码方法,通过在CPU和GPU上分别部署草稿模型和目标模型,在保持输出质量的同时实现了最高2.61倍的生成加速。
English Summary: DuoDecoding is a novel speculative decoding method that strategically deploys draft and target models on CPU and GPU respectively, achieving up to 2.61x speedup in generation latency while maintaining output quality.

Authors:Elahe Delavari, Aws Khalil, Jaerock Kwon
Title: CARIL: Confidence-Aware Regression in Imitation Learning for Autonomous Driving
Abstract:
End-to-end vision-based imitation learning has demonstrated promising results in autonomous driving by learning control commands directly from expert demonstrations. However, traditional approaches rely on either regressionbased models, which provide precise control but lack confidence estimation, or classification-based models, which offer confidence scores but suffer from reduced precision due to discretization. This limitation makes it challenging to quantify the reliability of predicted actions and apply corrections when necessary. In this work, we introduce a dual-head neural network architecture that integrates both regression and classification heads to improve decision reliability in imitation learning. The regression head predicts continuous driving actions, while the classification head estimates confidence, enabling a correction mechanism that adjusts actions in low-confidence scenarios, enhancing driving stability. We evaluate our approach in a closed-loop setting within the CARLA simulator, demonstrating its ability to detect uncertain actions, estimate confidence, and apply real-time corrections. Experimental results show that our method reduces lane deviation and improves trajectory accuracy by up to 50%, outperforming conventional regression-only models. These findings highlight the potential of classification-guided confidence estimation in enhancing the robustness of vision-based imitation learning for autonomous driving. The source code is available at https://github.com/ElaheDlv/Confidence_Aware_IL.
中文摘要:本研究提出一种结合回归与分类的双头神经网络,通过实时置信度估计和动作校正机制提升自动驾驶可靠性,在CARLA模拟器中显著减少车道偏离并提高轨迹精度达50%。
English Summary: This study introduces a dual-head neural network combining regression and classification to enhance autonomous driving reliability by enabling real-time confidence estimation and action correction, significantly reducing lane deviation and improving trajectory accuracy in CARLA simulations.

Authors:Yupu Hao, Pengfei Cao, Zhuoran Jin, Huanxuan Liao, Yubo Chen, Kang Liu, Jun Zhao
Title: Evaluating Personalized Tool-Augmented LLMs from the Perspectives of Personalization and Proactivity
Abstract:
Personalized tool utilization is essential for aligning large language models (LLMs) with user preference in interaction scenarios with various tools. However, most of the current benchmarks primarily focus on either personalization of text generation or direct tool-utilizing, without considering both. In this work, we introduce a novel benchmark ETAPP for evaluating personalized tool invocation, establishing a sandbox environment, and a comprehensive dataset of 800 testing cases covering diverse user profiles. To improve the accuracy of our evaluation, we propose a key-point-based LLM evaluation method, mitigating biases in the LLM-as-a-judge system by manually annotating key points for each test case and providing them to LLM as the reference. Additionally, we evaluate the excellent LLMs and provide an in-depth analysis. Furthermore, we investigate the impact of different tool-invoking strategies on LLMs' personalization performance and the effects of fine-tuning in our task. The effectiveness of our preference-setting and key-point-based evaluation method is also validated. Our findings offer insights into improving personalized LLM agents. Our Code is available at https://github.com/hypasd-art/ETAPP.
中文: 本文提出ETAPP基准,用于评估大语言模型的个性化工具调用,包含沙盒环境、800个测试用例及基于关键点的评估方法以减少偏差,同时分析工具调用策略和微调效果。
English: This paper introduces ETAPP, a benchmark for evaluating personalized tool invocation in LLMs, featuring a sandbox environment, 800 test cases, and a key-point-based evaluation method to reduce bias, while analyzing tool-invoking strategies and fine-tuning effects.

Authors:Ziwei Huang, Jianan Zhou, Zhiguang Cao, Yixin Xu
Title: Rethinking Light Decoder-based Solvers for Vehicle Routing Problems
Abstract:
Light decoder-based solvers have gained popularity for solving vehicle routing problems (VRPs) due to their efficiency and ease of integration with reinforcement learning algorithms. However, they often struggle with generalization to larger problem instances or different VRP variants. This paper revisits light decoder-based approaches, analyzing the implications of their reliance on static embeddings and the inherent challenges that arise. Specifically, we demonstrate that in the light decoder paradigm, the encoder is implicitly tasked with capturing information for all potential decision scenarios during solution construction within a single set of embeddings, resulting in high information density. Furthermore, our empirical analysis reveals that the overly simplistic decoder struggles to effectively utilize this dense information, particularly as task complexity increases, which limits generalization to out-of-distribution (OOD) settings. Building on these insights, we show that enhancing the decoder capacity, with a simple addition of identity mapping and a feed-forward layer, can considerably alleviate the generalization issue. Experimentally, our method significantly enhances the OOD generalization of light decoder-based approaches on large-scale instances and complex VRP variants, narrowing the gap with the heavy decoder paradigm. Our code is available at: https://github.com/ziweileonhuang/reld-nco.
Chinese: 轻量解码器求解器在车辆路径问题中因静态嵌入信息密度过高和解码器设计过于简化而面临泛化挑战,但通过增加恒等映射和前馈层提升解码器能力,可显著改善分布外泛化性能。
English: Light decoder-based solvers for vehicle routing problems face generalization issues due to high information density in static embeddings and simplistic decoder design, but enhancing decoder capacity with identity mapping and a feed-forward layer significantly improves out-of-distribution generalization.

Authors:Xingbo Fu, Yinhan He, Jundong Li
Title: Edge Prompt Tuning for Graph Neural Networks
Abstract:
Pre-training powerful Graph Neural Networks (GNNs) with unlabeled graph data in a self-supervised manner has emerged as a prominent technique in recent years. However, inevitable objective gaps often exist between pre-training and downstream tasks. To bridge this gap, graph prompt tuning techniques design and learn graph prompts by manipulating input graphs or reframing downstream tasks as pre-training tasks without fine-tuning the pre-trained GNN models. While recent graph prompt tuning methods have proven effective in adapting pre-trained GNN models for downstream tasks, they overlook the crucial role of edges in graph prompt design, which can significantly affect the quality of graph representations for downstream tasks. In this study, we propose EdgePrompt, a simple yet effective graph prompt tuning method from the perspective of edges. Unlike previous studies that design prompt vectors on node features, EdgePrompt manipulates input graphs by learning additional prompt vectors for edges and incorporates the edge prompts through message passing in the pre-trained GNN models to better embed graph structural information for downstream tasks. Our method is compatible with prevalent GNN architectures pre-trained under various pre-training strategies and is universal for different downstream tasks. We provide comprehensive theoretical analyses of our method regarding its capability of handling node classification and graph classification as downstream tasks. Extensive experiments on ten graph datasets under four pre-training strategies demonstrate the superiority of our proposed method against six baselines. Our code is available at https://github.com/xbfu/EdgePrompt.
中文: 本文提出EdgePrompt方法,通过为边学习提示向量并借助消息传递机制融入预训练图神经网络,有效提升了图结构信息在下游任务中的表示质量,在多种预训练策略和数据集上展现出卓越性能。
English: This paper introduces EdgePrompt, an innovative graph prompt tuning method that enhances downstream task performance by learning edge-specific prompt vectors and integrating them through message passing in pre-trained GNN models, demonstrating superior results across diverse datasets and pre-training strategies.

Authors:Zihao Luo, Zijun Gao, Wenjun Liao, Shichuan Zhang, Guotai Wang, Xiangde Luo
Title: Dynamic Gradient Sparsification Training for Few-Shot Fine-tuning of CT Lymph Node Segmentation Foundation Model
Abstract:
Accurate lymph node (LN) segmentation is critical in radiotherapy treatment and prognosis analysis, but is limited by the need for large annotated datasets. While deep learning-based segmentation foundation models show potential in developing high-performing models with fewer samples, their medical adaptation faces LN domain-specific prior deficiencies and inefficient few-shot fine-tuning for complex clinical practices, highlighting the necessity of an LN segmentation foundation model. In this work, we annotated 36,106 visible LNs from 3,346 publicly available head-and-neck CT scans to establish a robust LN segmentation model (nnUNetv2). Building on this, we propose Dynamic Gradient Sparsification Training (DGST), a few-shot fine-tuning approach that preserves foundational knowledge while dynamically updating the most critical parameters of the LN segmentation model with few annotations. We validate it on two publicly available LN segmentation datasets: SegRap2023 and LNQ2023. The results show that DGST outperforms existing few-shot fine-tuning methods, achieving satisfactory performance with limited labeled data. We release the dataset, models and all implementations to facilitate relevant research: https://github.com/Zihaoluoh/LN-Seg-FM.
Chinese: 本研究针对淋巴结分割在放疗和预后中因标注数据稀缺而受限的问题,通过大规模标注数据和创新的少样本微调方法,开发出优于现有技术的模型,有效利用有限标注数据实现高精度分割。
English: Accurate lymph node segmentation is essential for radiotherapy and prognosis but is hindered by the scarcity of annotated data, which this study addresses by developing a robust model using extensive annotations and a novel few-shot fine-tuning method that outperforms existing approaches with limited labeled data.

Authors:Fei Teng, Buyin Deng, Boyuan Zheng, Kai Luo, Kunyu Peng, Jiaming Zhang, Kailun Yang
Title: Unifying Light Field Perception with Field of Parallax
Abstract:
Field of Parallax (FoP)}, a spatial field that distills the common features from different LF representations to provide flexible and consistent support for multi-task learning. FoP is built upon three core features--projection difference, adjacency divergence, and contextual consistency--which are essential for cross-task adaptability. To implement FoP, we design a two-step angular adapter: the first step captures angular-specific differences, while the second step consolidates contextual consistency to ensure robust representation. Leveraging the FoP-based representation, we introduce the LFX framework, the first to handle arbitrary LF representations seamlessly, unifying LF multi-task vision. We evaluated LFX across three different tasks, achieving new state-of-the-art results, compared with previous task-specific architectures: 84.74% in mIoU for semantic segmentation on UrbanLF, 0.84% in AP for object detection on PKU, and 0.030 in MAE and 0.026 in MAE for salient object detection on Duftv2 and PKU, respectively. The source code will be made publicly available at https://github.com/warriordby/LFX.
中文: 视差场(FoP)通过核心特征构建了支持跨任务适应的空间场,基于此的LFX框架统一了光场多任务视觉,在语义分割、目标检测和显著目标检测任务中均取得了最优性能。
English: The Field of Parallax (FoP) introduces a spatial field with core features enabling cross-task adaptability, and the LFX framework built on it unifies light field multi-task vision, achieving state-of-the-art results across semantic segmentation, object detection, and salient object detection.

Authors:Teng Zhang, Hongxu Jiang, Kuang Gong, Wei Shao
Title: Geodesic Diffusion Models for Medical Image-to-Image Generation
Abstract:
Diffusion models transform an unknown data distribution into a Gaussian prior by progressively adding noise until the data become indistinguishable from pure noise. This stochastic process traces a path in probability space, evolving from the original data distribution (considered as a Gaussian with near-zero variance) to an isotropic Gaussian. The denoiser then learns to reverse this process, generating high-quality samples from random Gaussian noise. However, standard diffusion models, such as the Denoising Diffusion Probabilistic Model (DDPM), do not ensure a geodesic (i.e., shortest) path in probability space. This inefficiency necessitates the use of many intermediate time steps, leading to high computational costs in training and sampling. To address this limitation, we propose the Geodesic Diffusion Model (GDM), which defines a geodesic path under the Fisher-Rao metric with a variance-exploding noise scheduler. This formulation transforms the data distribution into a Gaussian prior with minimal energy, significantly improving the efficiency of diffusion models. We trained GDM by continuously sampling time steps from 0 to 1 and using as few as 15 evenly spaced time steps for model sampling. We evaluated GDM on two medical image-to-image generation tasks: CT image denoising and MRI image super-resolution. Experimental results show that GDM achieved state-of-the-art performance while reducing training time by a 50-fold compared to DDPM and 10-fold compared to Fast-DDPM, with 66 times faster sampling than DDPM and a similar sampling speed to Fast-DDPM. These efficiency gains enable rapid model exploration and real-time clinical applications. Our code is publicly available at: https://github.com/mirthAI/GDM-VE.
Chinese: 测地扩散模型(GDM)在Fisher-Rao度量下引入测地路径以提升扩散模型效率,在医学影像任务中实现最优性能,相比标准模型大幅减少了训练和采样时间。
English: The Geodesic Diffusion Model (GDM) introduces a geodesic path under the Fisher-Rao metric to enhance diffusion model efficiency, achieving state-of-the-art performance in medical imaging tasks with significantly reduced training and sampling times compared to standard models.

Authors:Henrui Tian, Wenhui Lei, Linrui Dai, Hanyu Chen, Xiaofan Zhang
Title: LesionDiffusion: Towards Text-controlled General Lesion Synthesis
Abstract:
Fully-supervised lesion recognition methods in medical imaging face challenges due to the reliance on large annotated datasets, which are expensive and difficult to collect. To address this, synthetic lesion generation has become a promising approach. However, existing models struggle with scalability, fine-grained control over lesion attributes, and the generation of complex structures. We propose LesionDiffusion, a text-controllable lesion synthesis framework for 3D CT imaging that generates both lesions and corresponding masks. By utilizing a structured lesion report template, our model provides greater control over lesion attributes and supports a wider variety of lesion types. We introduce a dataset of 1,505 annotated CT scans with paired lesion masks and structured reports, covering 14 lesion types across 8 organs. LesionDiffusion consists of two components: a lesion mask synthesis network (LMNet) and a lesion inpainting network (LINet), both guided by lesion attributes and image features. Extensive experiments demonstrate that LesionDiffusion significantly improves segmentation performance, with strong generalization to unseen lesion types and organs, outperforming current state-of-the-art models. Code is available at https://github.com/HengruiTianSJTU/LesionDiffusion.
中文: LesionDiffusion是一种基于文本控制的3D CT病灶合成框架,通过结构化属性调控生成病灶和对应掩模,显著提升了分割性能并展现出对未知病灶类型的强泛化能力。
English: LesionDiffusion is a text-controllable framework for generating synthetic 3D CT lesions and masks, enhancing segmentation performance and generalization across diverse lesion types through structured attribute control.

Authors:Jinjiang You, Hewei Wang, Yijie Li, Mingxiao Huo, Long Van Tran Ha, Mingyuan Ma, Jinfeng Xu, Jiayi Zhang, Puzhen Wu, Shubham Garg, Wei Pu
Title: Multi-Cali Anything: Dense Feature Multi-Frame Structure-from-Motion for Large-Scale Camera Array Calibration
Abstract:
Calibrating large-scale camera arrays, such as those in dome-based setups, is time-intensive and typically requires dedicated captures of known patterns. While extrinsics in such arrays are fixed due to the physical setup, intrinsics often vary across sessions due to factors like lens adjustments or temperature changes. In this paper, we propose a dense-feature-driven multi-frame calibration method that refines intrinsics directly from scene data, eliminating the necessity for additional calibration captures. Our approach enhances traditional Structure-from-Motion (SfM) pipelines by introducing an extrinsics regularization term to progressively align estimated extrinsics with ground-truth values, a dense feature reprojection term to reduce keypoint errors by minimizing reprojection loss in the feature space, and an intrinsics variance term for joint optimization across multiple frames. Experiments on the Multiface dataset show that our method achieves nearly the same precision as dedicated calibration processes, and significantly enhances intrinsics and 3D reconstruction accuracy. Fully compatible with existing SfM pipelines, our method provides an efficient and practical plug-and-play solution for large-scale camera setups. Our code is publicly available at: https://github.com/YJJfish/Multi-Cali-Anything
中文摘要:本文提出一种基于密集特征的多帧标定方法,可直接从场景数据优化相机内参,通过引入外参正则化、密集特征重投影和内参方差约束,在无需专门标定采集的情况下达到与传统方法相近的精度,显著提升了三维重建精度。
English Summary: This paper introduces a dense-feature-driven multi-frame calibration method that refines camera intrinsics directly from scene data, eliminating dedicated calibration captures while achieving comparable precision through enhanced SfM pipelines with specialized regularization terms.

Authors:Wenhui Lei, Anqi Li, Yusheng Tan, Hanyu Chen, Xiaofan Zhang
Title: Shazam: Unifying Multiple Foundation Models for Advanced Computational Pathology
Abstract:
Foundation Models (FMs) in computational pathology (CPath) have significantly advanced the extraction of meaningful features from histopathology image datasets, achieving strong performance across various clinical tasks. Despite their impressive performance, these models often exhibit variability when applied to different tasks, prompting the need for a unified framework capable of consistently excelling across various applications. In this work, we propose Shazam, a novel framework designed to efficiently combine multiple CPath models. Unlike previous approaches that train a fixed-parameter FM, Shazam dynamically extracts and refines information from diverse FMs for each specific task. To ensure that each FM contributes effectively without dominance, a novel distillation strategy is applied, guiding the student model with features from all teacher models, which enhances its generalization ability. Experimental results on two pathology patch classification datasets demonstrate that Shazam outperforms existing CPath models and other fusion methods. Its lightweight, flexible design makes it a promising solution for improving CPath analysis in real-world settings. Code will be available at https://github.com/Tuner12/Shazam.
Chinese: Shazam框架通过动态特征提取和新型蒸馏策略,有效整合多个计算病理学基础模型,在病理分类任务中展现出卓越的性能和泛化能力。
English: The Shazam framework efficiently integrates multiple computational pathology foundation models through dynamic feature extraction and a novel distillation strategy, demonstrating superior performance and generalization in pathology classification tasks.

Authors:Yang Ding, Can Han, Sijia Du, Yaqi Wang, Dahong Qian
Title: LightEndoStereo: A Real-time Lightweight Stereo Matching Method for Endoscopy Images
Abstract:
Real-time acquisition of accurate depth of scene is essential for automated robotic minimally invasive surgery, and stereo matching with binocular endoscopy can generate such depth. However, existing algorithms struggle with ambiguous tissue boundaries and real-time performance in prevalent high-resolution endoscopic scenes. We propose LightEndoStereo, a lightweight real-time stereo matching method for endoscopic images. We introduce a 3D Mamba Coordinate Attention module to streamline the cost aggregation process by generating position-sensitive attention maps and capturing long-range dependencies across spatial dimensions using the Mamba block. Additionally, we introduce a High-Frequency Disparity Optimization module to refine disparity estimates at tissue boundaries by enhancing high-frequency information in the wavelet domain. Our method is evaluated on the SCARED and SERV-CT datasets, achieving state-of-the-art matching accuracy and a real-time inference speed of 42 FPS. The code is available at https://github.com/Sonne-Ding/LightEndoStereo.
中文: LightEndoStereo是一种用于内窥镜图像的轻量级实时立体匹配方法,通过引入3D曼巴坐标注意力模块和高频视差优化模块,在SCARED和SERV-CT数据集上实现了最先进的匹配精度和42 FPS的实时推理速度。
English: LightEndoStereo is a lightweight real-time stereo matching method for endoscopic images that introduces a 3D Mamba Coordinate Attention module and a High-Frequency Disparity Optimization module, achieving state-of-the-art accuracy and 42 FPS inference speed on SCARED and SERV-CT datasets.

Authors:Changlin Song, Jiaqi Wang, Liyun Zhu, He Weng
Title: Enhancing Monocular 3D Scene Completion with Diffusion Model
Abstract:
3D scene reconstruction is essential for applications in virtual reality, robotics, and autonomous driving, enabling machines to understand and interact with complex environments. Traditional 3D Gaussian Splatting techniques rely on images captured from multiple viewpoints to achieve optimal performance, but this dependence limits their use in scenarios where only a single image is available. In this work, we introduce FlashDreamer, a novel approach for reconstructing a complete 3D scene from a single image, significantly reducing the need for multi-view inputs. Our approach leverages a pre-trained vision-language model to generate descriptive prompts for the scene, guiding a diffusion model to produce images from various perspectives, which are then fused to form a cohesive 3D reconstruction. Extensive experiments show that our method effectively and robustly expands single-image inputs into a comprehensive 3D scene, extending monocular 3D reconstruction capabilities without further training. Our code is available https://github.com/CharlieSong1999/FlashDreamer/tree/main.
Chinese: FlashDreamer提出了一种从单张图像重建完整3D场景的创新方法,通过视觉语言模型和扩散模型生成多视角图像,显著减少对多视角输入的依赖,无需额外训练即可扩展单目3D重建能力。
English: FlashDreamer introduces a novel method for reconstructing complete 3D scenes from single images by using vision-language models and diffusion models to generate multi-view images, significantly reducing reliance on multi-view inputs and extending monocular 3D reconstruction capabilities without additional training.

Authors:Seungbae Seo, Junghwan Kim, Minjeong Shin, Bongwon Suh
Title: LLMDR: LLM-Driven Deadlock Detection and Resolution in Multi-Agent Pathfinding
Abstract:
Multi-Agent Pathfinding (MAPF) is a core challenge in multi-agent systems. Existing learning-based MAPF methods often struggle with scalability, particularly when addressing complex scenarios that are prone to deadlocks. To address these challenges, we introduce LLMDR (LLM-Driven Deadlock Detection and Resolution), an approach designed to resolve deadlocks and improve the performance of learnt MAPF models. LLMDR integrates the inference capabilities of large language models (LLMs) with learnt MAPF models and prioritized planning, enabling it to detect deadlocks and provide customized resolution strategies. We evaluate LLMDR on standard MAPF benchmark maps with varying agent numbers, measuring its performance when combined with several base models. The results demonstrate that LLMDR improves the performance of learnt MAPF models, particularly in deadlock-prone scenarios, with notable improvements in success rates. These findings show the potential of integrating LLMs to improve the scalability of learning-based MAPF methods. The source code for LLMDR is available at: https://github.com/ssbacc/llmdr-dhc
Chinese: LLMDR是一种创新方法,它将大型语言模型与学习的多智能体路径规划模型相结合,能够检测并解决死锁,在复杂场景中显著提升了性能和可扩展性。
English: LLMDR is a novel approach that integrates large language models with learned multi-agent pathfinding (MAPF) models to detect and resolve deadlocks, significantly improving performance and scalability in complex scenarios.

Authors:Zhiqi Kang, Liyuan Wang, Xingxing Zhang, Karteek Alahari
Title: Advancing Prompt-Based Methods for Replay-Independent General Continual Learning
Abstract:
General continual learning (GCL) is a broad concept to describe real-world continual learning (CL) problems, which are often characterized by online data streams without distinct transitions between tasks, i.e., blurry task boundaries. Such requirements result in poor initial performance, limited generalizability, and severe catastrophic forgetting, heavily impacting the effectiveness of mainstream GCL models trained from scratch. While the use of a frozen pretrained backbone with appropriate prompt tuning can partially address these challenges, such prompt-based methods remain suboptimal for CL of remaining tunable parameters on the fly. In this regard, we propose an innovative approach named MISA (Mask and Initial Session Adaption) to advance prompt-based methods in GCL. It includes a forgetting-aware initial session adaption that employs pretraining data to initialize prompt parameters and improve generalizability, as well as a non-parametric logit mask of the output layers to mitigate catastrophic forgetting. Empirical results demonstrate substantial performance gains of our approach compared to recent competitors, especially without a replay buffer (e.g., up to 18.39%, 22.06%, and 11.96% performance lead on CIFAR-100, Tiny-ImageNet, and ImageNet-R, respectively). Moreover, our approach features the plug-in nature for prompt-based methods, independence of replay, ease of implementation, and avoidance of CL-relevant hyperparameters, serving as a strong baseline for GCL research. Our source code is publicly available at https://github.com/kangzhiq/MISA
中文:MISA方法通过预训练数据初始化提示参数提升泛化能力,并采用非参数化逻辑掩码减轻灾难性遗忘,在无需回放缓冲区的情况下显著提升了通用持续学习的性能。
English: The MISA method enhances general continual learning by using pretraining data to initialize prompts for better adaptability and a logit mask to reduce forgetting, achieving significant performance improvements without a replay buffer.

Authors:Ashish Verma, Aupendu Kar, Krishnendu Ghosh, Sobhan Kanti Dhara, Debashis Sen, Prabir Kumar Biswas
Title: Artificially Generated Visual Scanpath Improves Multi-label Thoracic Disease Classification in Chest X-Ray Images
Abstract:
Expert radiologists visually scan Chest X-Ray (CXR) images, sequentially fixating on anatomical structures to perform disease diagnosis. An automatic multi-label classifier of diseases in CXR images can benefit by incorporating aspects of the radiologists' approach. Recorded visual scanpaths of radiologists on CXR images can be used for the said purpose. But, such scanpaths are not available for most CXR images, which creates a gap even for modern deep learning based classifiers. This paper proposes to mitigate this gap by generating effective artificial visual scanpaths using a visual scanpath prediction model for CXR images. Further, a multi-class multi-label classifier framework is proposed that uses a generated scanpath and visual image features to classify diseases in CXR images. While the scanpath predictor is based on a recurrent neural network, the multi-label classifier involves a novel iterative sequential model with an attention module. We show that our scanpath predictor generates human-like visual scanpaths. We also demonstrate that the use of artificial visual scanpaths improves multi-class multi-label disease classification results on CXR images. The above observations are made from experiments involving around 0.2 million CXR images from 2 widely-used datasets considering the multi-label classification of 14 pathological findings. Code link: https://github.com/ashishverma03/SDC
中文: 本文提出通过循环神经网络生成模拟放射科医生视觉扫描路径的人工轨迹,并结合图像特征构建新型多标签分类器,有效提升了胸部X光片的疾病检测性能。
English: This paper proposes generating artificial visual scanpaths using a recurrent neural network to mimic radiologists' eye movements, which are then integrated with image features in a novel multi-label classifier to improve disease detection in chest X-rays.

Authors:Siddhartha Gairola, Moritz Böhle, Francesco Locatello, Bernt Schiele
Title: How to Probe: Simple Yet Effective Techniques for Improving Post-hoc Explanations
Abstract:
Post-hoc importance attribution methods are a popular tool for "explaining" Deep Neural Networks (DNNs) and are inherently based on the assumption that the explanations can be applied independently of how the models were trained. Contrarily, in this work we bring forward empirical evidence that challenges this very notion. Surprisingly, we discover a strong dependency on and demonstrate that the training details of a pre-trained model's classification layer (less than 10 percent of model parameters) play a crucial role, much more than the pre-training scheme itself. This is of high practical relevance: (1) as techniques for pre-training models are becoming increasingly diverse, understanding the interplay between these techniques and attribution methods is critical; (2) it sheds light on an important yet overlooked assumption of post-hoc attribution methods which can drastically impact model explanations and how they are interpreted eventually. With this finding we also present simple yet effective adjustments to the classification layers, that can significantly enhance the quality of model explanations. We validate our findings across several visual pre-training frameworks (fully-supervised, self-supervised, contrastive vision-language training) and analyse how they impact explanations for a wide range of attribution methods on a diverse set of evaluation metrics.
Chinese: 本研究揭示,解释深度神经网络的事后重要性归因方法受分类层训练细节的显著影响,挑战了其与模型训练无关的假设,并提出简单调整以提升解释质量。
English: This study reveals that post-hoc importance attribution methods for explaining deep neural networks are significantly influenced by the training details of the classification layer, challenging the assumption of independence from model training and offering simple adjustments to improve explanation quality.

Authors:Yifei He, Yang Liu, Chen Liang, Hany Hassan Awadalla
Title: Efficiently Editing Mixture-of-Experts Models with Compressed Experts
Abstract:
Mixture-of-Experts (MoE) models have become a key approach for scaling large language models efficiently by activating only a subset of experts during training and inference. Typically, the number of activated experts presents a trade-off: fewer experts reduce computational costs, while more experts improve performance. Recent studies reveal that not all activated experts contribute equally to model performance, with some providing minimal utility, particularly when finetuning pretrained MoE models for specialized downstream tasks. The co-existence of significant and redundant parameters in experts provides us an opportunity to reduce the number of activated experts while maintaining model performance. In this work, we propose the concept of compressed experts, lightweight modules that serve as compact representations of full experts. Our approach preserves the most important experts while replacing other auxiliary activated experts with compressed experts. The reduction of active parameters significantly lowers inference costs while achieving comparable performance. Extensive experiments on models including Phi-MoE and OLMoE demonstrate that compressed experts recover over 90% of full expert performance across various tasks while reducing more than 30% active parameters and saving 20% in inference costs. This approach enables efficient deployment of MoE models in resource-constrained settings and facilitates scaling to larger models with manageable overhead. Our code is available at https://github.com/yifei-he/Compressed-Experts.
中文: 本研究提出压缩专家概念,通过轻量级模块替换混合专家模型中的辅助专家,在保持90%以上任务性能的同时,显著减少激活参数并降低20%推理成本。
English: This study introduces compressed experts, lightweight modules that replace auxiliary experts in Mixture-of-Experts models to significantly reduce active parameters and inference costs while maintaining over 90% performance across tasks.

Authors:Nicky Kriplani, Minh Pham, Gowthami Somepalli, Chinmay Hegde, Niv Cohen
Title: SolidMark: Evaluating Image Memorization in Generative Models
Abstract:
Recent works have shown that diffusion models are able to memorize training images and emit them at generation time. However, the metrics used to evaluate memorization and its mitigation techniques suffer from dataset-dependent biases and struggle to detect whether a given specific image has been memorized or not. This paper begins with a comprehensive exploration of issues surrounding memorization metrics in diffusion models. Then, to mitigate these issues, we introduce $\rm \style{font-variant: small-caps}{SolidMark}$, a novel evaluation method that provides a per-image memorization score. We then re-evaluate existing memorization mitigation techniques. We also show that $\rm \style{font-variant: small-caps}{SolidMark}$ is capable of evaluating fine-grained pixel-level memorization. Finally, we release a variety of models based on $\rm \style{font-variant: small-caps}{SolidMark}$ to facilitate further research for understanding memorization phenomena in generative models. All of our code is available at https://github.com/NickyDCFP/SolidMark.
Chinese: 本文提出SolidMark这一新型评估方法,通过提供逐图像及像素级记忆分数解决扩散模型现有记忆度量偏差问题,重新评估缓解技术并发布模型以推动生成模型记忆机制研究。
English: This paper introduces SolidMark, a novel evaluation method that addresses biases in existing memorization metrics for diffusion models by providing per-image and pixel-level memorization scores, and re-evaluates mitigation techniques while releasing models to advance research.

Authors:Tuğrul Hasan Karabulut, İnci M. Baytaş
Title: Channel-Attentive Graph Neural Networks
Abstract:
Graph Neural Networks (GNNs) set the state-of-the-art in representation learning for graph-structured data. They are used in many domains, from online social networks to complex molecules. Most GNNs leverage the message-passing paradigm and achieve strong performances on various tasks. However, the message-passing mechanism used in most models suffers from over-smoothing as a GNN's depth increases. The over-smoothing degrades GNN's performance due to the increased similarity between the representations of unrelated nodes. This study proposes an adaptive channel-wise message-passing approach to alleviate the over-smoothing. The proposed model, Channel-Attentive GNN, learns how to attend to neighboring nodes and their feature channels. Thus, much diverse information can be transferred between nodes during message-passing. Experiments with widely used benchmark datasets show that the proposed model is more resistant to over-smoothing than baselines and achieves state-of-the-art performances for various graphs with strong heterophily. Our code is at https://github.com/ALLab-Boun/CHAT-GNN.
Chinese: 本研究提出了一种通道注意力图神经网络,通过在消息传递过程中自适应关注相邻节点和特征通道,有效缓解了过平滑问题,并在异质图数据上实现了最优性能。
English: This study introduces a Channel-Attentive Graph Neural Network that adaptively attends to neighboring nodes and feature channels during message-passing, effectively mitigating over-smoothing and achieving state-of-the-art performance on heterophilic graphs.

Authors:Jiancheng Zhao, Xingda Yu, Yuxiang Zhang, Zhen Yang
Title: LoR2C : Low-Rank Residual Connection Adaptation for Parameter-Efficient Fine-Tuning
Abstract:
In recent years, pretrained large language models have demonstrated outstanding performance across various natural language processing tasks. However, full-parameter fine-tuning methods require adjusting all model parameters, leading to immense computational resource demands. Although parameter-efficient fine-tuning methods like LoRA have significantly reduced the number of parameters, they still face challenges such as gradient vanishing and the potential for further parameter reduction. To address these issues, this paper proposes a novel parameter-efficient fine-tuning method called LoR2C (Low-Rank Residual Connection Adaptation). LoR2C introduces residual connections with low-rank matrices within the model layers, which not only reduces the number of fine-tuning parameters but also effectively alleviates the gradient vanishing problem. Additionally, this paper presents three optimization variants of LoR2C: ShareLoR2C, MergeLoR2C, and InjectLoR2C. These variants further improve parameter efficiency and model performance through parameter sharing, module merging, and injection mechanisms, respectively. Experimental results on multiple natural language understanding and natural language generation tasks demonstrate that LoR2C and its optimized variants significantly reduce parameter overhead while maintaining or even improving performance, outperforming existing mainstream parameter-efficient fine-tuning methods.Our code is publicly available at https://github.com/Oblivioniss/LoR2C.
中文: 本文提出新型参数高效微调方法LoR2C,通过低秩残差连接减少参数并缓解梯度消失问题,其优化变体在多项自然语言处理任务中显著提升参数效率与模型性能。
English: This paper introduces LoR2C, a novel parameter-efficient fine-tuning method that uses low-rank residual connections to reduce parameters and mitigate gradient vanishing, with optimized variants further enhancing efficiency and performance across NLP tasks.

Authors:Jeonghoon Shim, Gyuhyeon Seo, Cheongsu Lim, Yohan Jo
Title: ToolDial: Multi-turn Dialogue Generation Method for Tool-Augmented Language Models
Abstract:
Tool-Augmented Language Models (TALMs) leverage external APIs to answer user queries across various domains. However, existing benchmark datasets for TALM research often feature simplistic dialogues that do not reflect real-world scenarios, such as the need for models to ask clarifying questions or proactively call additional APIs when essential information is missing. To address these limitations, we construct and release ToolDial, a dataset comprising 11,111 multi-turn dialogues, with an average of 8.95 turns per dialogue, based on APIs from RapidAPI. ToolDial has two key characteristics. First, the dialogues incorporate 16 user and system actions (e.g., "Request", "Clarify", "Fail inform") to capture the rich dynamics of real-world interactions. Second, we simulate dialogues where the system requests necessary information from the user based on API documentation and seeks additional APIs if the user fails to provide the required information. To facilitate this process, we introduce a method for generating an API graph that represents input and output compatibility between APIs. Using ToolDial, we evaluate a suite of language models on their ability to predict correct actions and extract input parameter values for API calls from the dialogue history. Modern language models achieve accuracy scores below 70%, indicating substantial room for improvement. We release our dataset and code at https://github.com/holi-lab/ToolDial.
Chinese: ToolDial是一个包含11,111个多轮对话的新数据集,通过模拟真实用户-系统交互和API兼容性图来解决现有工具增强语言模型基准的不足,当前模型在预测正确操作和参数方面的准确率低于70%。
English: ToolDial is a new dataset of 11,111 multi-turn dialogues designed to address the limitations of existing benchmarks for Tool-Augmented Language Models by incorporating realistic user-system interactions and API compatibility graphs, with current models scoring below 70% accuracy in predicting correct actions and parameters.

Authors:Tiansheng Huang, Sihao Hu, Fatih Ilhan, Selim Furkan Tekin, Zachary Yahn, Yichang Xu, Ling Liu
Title: Safety Tax: Safety Alignment Makes Your Large Reasoning Models Less Reasonable
Abstract:
Safety alignment is an important procedure before the official deployment of a Large Language Model (LLM). While safety alignment has been extensively studied for LLM, there is still a large research gap for Large Reasoning Models (LRMs) that equip with improved reasoning capability. We in this paper systematically examine a simplified pipeline for producing safety aligned LRMs. With our evaluation of various LRMs, we deliver two main findings: i) Safety alignment can be done upon the LRM to restore its safety capability. ii) Safety alignment leads to a degradation of the reasoning capability of LRMs. The two findings show that there exists a trade-off between reasoning and safety capability with the sequential LRM production pipeline. The discovered trade-off, which we name Safety Tax, should shed light on future endeavors of safety research on LRMs. As a by-product, we curate a dataset called DirectRefusal, which might serve as an alternative dataset for safety alignment. Our source code is available at https://github.com/git-disl/Safety-Tax.
中文: 大型推理模型的安全对齐能恢复其安全能力,但会降低推理性能,这种权衡被称为“安全税”,同时提出的DirectRefusal数据集可作为对齐的替代资源。
English: Safety alignment for Large Reasoning Models (LRMs) restores safety but reduces reasoning ability, revealing a trade-off termed "Safety Tax," and introduces the DirectRefusal dataset as a resource for alignment.

Authors:Zhixin Zhang, Wenzhi Bai, Liang Zhao, Pawel Ladosz
Title: PL-VIWO: A Lightweight and Robust Point-Line Monocular Visual Inertial Wheel Odometry
Abstract:
This paper presents a novel tightly coupled Filter-based monocular visual-inertial-wheel odometry (VIWO) system for ground robots, designed to deliver accurate and robust localization in long-term complex outdoor navigation scenarios. As an external sensor, the camera enhances localization performance by introducing visual constraints. However, obtaining a sufficient number of effective visual features is often challenging, particularly in dynamic or low-texture environments. To address this issue, we incorporate the line features for additional geometric constraints. Unlike traditional approaches that treat point and line features independently, our method exploits the geometric relationships between points and lines in 2D images, enabling fast and robust line matching and triangulation. Additionally, we introduce Motion Consistency Check (MCC) to filter out potential dynamic points, ensuring the effectiveness of point feature updates. The proposed system was evaluated on publicly available datasets and benchmarked against state-of-the-art methods. Experimental results demonstrate superior performance in terms of accuracy, robustness, and efficiency. The source code is publicly available at: https://github.com/Happy-ZZX/PL-VIWO
中文摘要:本文提出一种紧耦合的单目视觉-惯性-轮式里程计系统,通过结合点线特征和运动一致性检测,在复杂户外环境中实现了精确鲁棒的定位性能。
English Summary: This paper introduces a tightly coupled monocular visual-inertial-wheel odometry system that integrates both point and line features with motion consistency checks to achieve accurate, robust localization in challenging outdoor environments.

Authors:Shangzhe Di, Zhelun Yu, Guanghao Zhang, Haoyuan Li, Tao Zhong, Hao Cheng, Bolin Li, Wanggui He, Fangxun Shu, Hao Jiang
Title: Streaming Video Question-Answering with In-context Video KV-Cache Retrieval
Abstract:
We propose ReKV, a novel training-free approach that enables efficient streaming video question-answering (StreamingVQA), by seamlessly integrating with existing Video Large Language Models (Video-LLMs). Traditional VideoQA systems struggle with long videos, as they must process entire videos before responding to queries, and repeat this process for each new question. In contrast, our approach analyzes long videos in a streaming manner, allowing for prompt responses as soon as user queries are received. Building on a common Video-LLM, we first incorporate a sliding-window attention mechanism, ensuring that input frames attend to a limited number of preceding frames, thereby reducing computational overhead. To prevent information loss, we store processed video key-value caches (KV-Caches) in RAM and disk, reloading them into GPU memory as needed. Additionally, we introduce a retrieval method that leverages an external retriever or the parameters within Video-LLMs to retrieve only query-relevant KV-Caches, ensuring both efficiency and accuracy in question answering. ReKV enables the separation of video encoding and question-answering across different processes and GPUs, significantly enhancing the efficiency of StreamingVQA. Through comprehensive experimentation, we validate the efficacy and practicality of our approach, which significantly boosts efficiency and enhances applicability over existing VideoQA models.
Chinese: ReKV提出了一种无需训练的高效流式视频问答方法,通过结合视频大语言模型、滑动窗口注意力机制和键值缓存管理,实现了实时响应并显著提升了计算效率。
English: ReKV introduces a training-free method for efficient streaming video question-answering by integrating with Video-LLMs, utilizing sliding-window attention and key-value cache management to enable real-time responses and improved computational efficiency.

Authors:Haofei Lu, Dongqi Han, Yifei Shen, Dongsheng Li
Title: What Makes a Good Diffusion Planner for Decision Making?
Abstract:
Diffusion models have recently shown significant potential in solving decision-making problems, particularly in generating behavior plans -- also known as diffusion planning. While numerous studies have demonstrated the impressive performance of diffusion planning, the mechanisms behind the key components of a good diffusion planner remain unclear and the design choices are highly inconsistent in existing studies. In this work, we address this issue through systematic empirical experiments on diffusion planning in an offline reinforcement learning (RL) setting, providing practical insights into the essential components of diffusion planning. We trained and evaluated over 6,000 diffusion models, identifying the critical components such as guided sampling, network architecture, action generation and planning strategy. We revealed that some design choices opposite to the common practice in previous work in diffusion planning actually lead to better performance, e.g., unconditional sampling with selection can be better than guided sampling and Transformer outperforms U-Net as denoising network. Based on these insights, we suggest a simple yet strong diffusion planning baseline that achieves state-of-the-art results on standard offline RL benchmarks.
Chinese: 本研究通过评估6000多个模型,系统分析了离线强化学习中的扩散规划,发现无条件采样加选择和Transformer网络等非常规设计优于传统方法,从而提出了新的最先进基准。
English: This study systematically analyzes diffusion planning in offline reinforcement learning by evaluating over 6,000 models, revealing that unconventional design choices like unconditional sampling with selection and Transformer networks outperform common practices, leading to a new state-of-the-art baseline.

Authors:Haofei Lu, Zhe Wu, Junliang Xing, Jianshu Li, Ruoyu Li, Zhe Li, Yuanchun Shi
Title: BodyGen: Advancing Towards Efficient Embodiment Co-Design
Abstract:
Embodiment co-design aims to optimize a robot's morphology and control policy simultaneously. While prior work has demonstrated its potential for generating environment-adaptive robots, this field still faces persistent challenges in optimization efficiency due to the (i) combinatorial nature of morphological search spaces and (ii) intricate dependencies between morphology and control. We prove that the ineffective morphology representation and unbalanced reward signals between the design and control stages are key obstacles to efficiency. To advance towards efficient embodiment co-design, we propose BodyGen, which utilizes (1) topology-aware self-attention for both design and control, enabling efficient morphology representation with lightweight model sizes; (2) a temporal credit assignment mechanism that ensures balanced reward signals for optimization. With our findings, Body achieves an average 60.03% performance improvement against state-of-the-art baselines. We provide codes and more results on the website: https://genesisorigin.github.io.
Chinese: BodyGen采用拓扑感知自注意力机制和时间信用分配,显著提升了具身协同设计的效率,相比现有方法性能平均提高60.03%。
English: BodyGen introduces a topology-aware self-attention mechanism and temporal credit assignment to enhance embodiment co-design efficiency, achieving a 60.03% performance gain over existing methods.

Authors:Wanli Hong, Yuliang Shi, Jonathan Niles-Weed
Title: Trajectory Inference with Smooth Schrödinger Bridges
Abstract:
Motivated by applications in trajectory inference and particle tracking, we introduce Smooth Schrödinger Bridges. Our proposal generalizes prior work by allowing the reference process in the Schrödinger Bridge problem to be a smooth Gaussian process, leading to more regular and interpretable trajectories in applications. Though naïvely smoothing the reference process leads to a computationally intractable problem, we identify a class of processes (including the Matérn processes) for which the resulting Smooth Schrödinger Bridge problem can be lifted to a simpler problem on phase space, which can be solved in polynomial time. We develop a practical approximation of this algorithm that outperforms existing methods on numerous simulated and real single-cell RNAseq datasets. The code can be found at https://github.com/WanliHongC/Smooth_SB
Chinese: 本文提出平滑薛定谔桥方法,通过采用平滑高斯过程生成更规则的轨迹,在单细胞RNAseq数据上表现优异,并提供了计算高效的实现方案。
English: This paper introduces Smooth Schrödinger Bridges, a method that enhances trajectory inference by using smooth Gaussian processes for more regular paths and demonstrates superior performance on single-cell RNAseq data with a computationally efficient solution.

Authors:Jiawen Zhu, Huayi Tang, Xin Chen, Xinying Wang, Dong Wang, Huchuan Lu
Title: Two-stream Beats One-stream: Asymmetric Siamese Network for Efficient Visual Tracking
Abstract:
Efficient tracking has garnered attention for its ability to operate on resource-constrained platforms for real-world deployment beyond desktop GPUs. Current efficient trackers mainly follow precision-oriented trackers, adopting a one-stream framework with lightweight modules. However, blindly adhering to the one-stream paradigm may not be optimal, as incorporating template computation in every frame leads to redundancy, and pervasive semantic interaction between template and search region places stress on edge devices. In this work, we propose a novel asymmetric Siamese tracker named \textbf{AsymTrack} for efficient tracking. AsymTrack disentangles template and search streams into separate branches, with template computing only once during initialization to generate modulation signals. Building on this architecture, we devise an efficient template modulation mechanism to unidirectional inject crucial cues into the search features, and design an object perception enhancement module that integrates abstract semantics and local details to overcome the limited representation in lightweight tracker. Extensive experiments demonstrate that AsymTrack offers superior speed-precision trade-offs across different platforms compared to the current state-of-the-arts. For instance, AsymTrack-T achieves 60.8\% AUC on LaSOT and 224/81/84 FPS on GPU/CPU/AGX, surpassing HiT-Tiny by 6.0\% AUC with higher speeds. The code is available at https://github.com/jiawen-zhu/AsymTrack.
中文: AsymTrack提出了一种非对称孪生跟踪器,通过分离模板和搜索流来减少冗余并提升效率,在不同平台上实现了优越的速度与精度平衡。
English: AsymTrack introduces an asymmetric Siamese tracker that separates template and search streams to reduce redundancy and enhance efficiency, achieving superior speed-precision trade-offs across various platforms.

Authors:Hanxun Yu, Wentong Li, Song Wang, Junbo Chen, Jianke Zhu
Title: Inst3D-LMM: Instance-Aware 3D Scene Understanding with Multi-modal Instruction Tuning
Abstract:
Despite encouraging progress in 3D scene understanding, it remains challenging to develop an effective Large Multi-modal Model (LMM) that is capable of understanding and reasoning in complex 3D environments. Most previous methods typically encode 3D point and 2D image features separately, neglecting interactions between 2D semantics and 3D object properties, as well as the spatial relationships within the 3D environment. This limitation not only hinders comprehensive representations of 3D scene, but also compromises training and inference efficiency. To address these challenges, we propose a unified Instance-aware 3D Large Multi-modal Model (Inst3D-LMM) to deal with multiple 3D scene understanding tasks simultaneously. To obtain the fine-grained instance-level visual tokens, we first introduce a novel Multi-view Cross-Modal Fusion (MCMF) module to inject the multi-view 2D semantics into their corresponding 3D geometric features. For scene-level relation-aware tokens, we further present a 3D Instance Spatial Relation (3D-ISR) module to capture the intricate pairwise spatial relationships among objects. Additionally, we perform end-to-end multi-task instruction tuning simultaneously without the subsequent task-specific fine-tuning. Extensive experiments demonstrate that our approach outperforms the state-of-the-art methods across 3D scene understanding, reasoning and grounding tasks. Source code is available at https://github.com/hanxunyu/Inst3D-LMM
中文: 本文提出Inst3D-LMM统一模型,通过多视图跨模态融合注入2D语义至3D几何特征,并利用3D实例空间关系模块捕捉物体间复杂空间关联,在多项3D场景理解任务中实现最优性能。
English: This paper introduces Inst3D-LMM, a unified model that enhances 3D scene understanding by integrating multi-view 2D semantics with 3D geometric features and capturing spatial relationships among objects, achieving state-of-the-art performance across multiple tasks.

Authors:Zhuo Ouyang, Kaiwen Hu, Qi Zhang, Yifei Wang, Yisen Wang
Title: Projection Head is Secretly an Information Bottleneck
Abstract:
Recently, contrastive learning has risen to be a promising paradigm for extracting meaningful data representations. Among various special designs, adding a projection head on top of the encoder during training and removing it for downstream tasks has proven to significantly enhance the performance of contrastive learning. However, despite its empirical success, the underlying mechanism of the projection head remains under-explored. In this paper, we develop an in-depth theoretical understanding of the projection head from the information-theoretic perspective. By establishing the theoretical guarantees on the downstream performance of the features before the projector, we reveal that an effective projector should act as an information bottleneck, filtering out the information irrelevant to the contrastive objective. Based on theoretical insights, we introduce modifications to projectors with training and structural regularizations. Empirically, our methods exhibit consistent improvement in the downstream performance across various real-world datasets, including CIFAR-10, CIFAR-100, and ImageNet-100. We believe our theoretical understanding on the role of the projection head will inspire more principled and advanced designs in this field. Code is available at https://github.com/PKU-ML/Projector_Theory.
Chinese: 本文从信息论角度理论分析了对比学习中投影头的作用,揭示其作为信息瓶颈过滤无关信息的机制,并提出正则化方法在多个数据集上实现了性能提升。
English: This paper provides a theoretical analysis of the projection head in contrastive learning, revealing its role as an information bottleneck that filters irrelevant data, and proposes regularization methods that improve performance across multiple datasets.

Authors:Shiyu Fang, Jiaqi Liu, Chengkai Xu, Chen Lv, Peng Hang, Jian Sun
Title: Interact, Instruct to Improve: A LLM-Driven Parallel Actor-Reasoner Framework for Enhancing Autonomous Vehicle Interactions
Abstract:
Autonomous Vehicles (AVs) have entered the commercialization stage, but their limited ability to interact and express intentions still poses challenges in interactions with Human-driven Vehicles (HVs). Recent advances in large language models (LLMs) enable bidirectional human-machine communication, but the conflict between slow inference speed and the need for real-time decision-making challenges practical deployment. To address these issues, this paper introduces a parallel Actor-Reasoner framework designed to enable explicit bidirectional AV-HV interactions across multiple scenarios. First, by facilitating interactions between the LLM-driven Reasoner and heterogeneous simulated HVs during training, an interaction memory database, referred to as the Actor, is established. Then, by introducing the memory partition module and the two-layer memory retrieval module, the Actor's ability to handle heterogeneous HVs is significantly enhanced. Ablation studies and comparisons with other decision-making methods demonstrate that the proposed Actor-Reasoner framework significantly improves safety and efficiency. Finally, with the combination of the external Human-Machine Interface (eHMI) information derived from Reasoner's reasoning and the feasible action solutions retrieved from the Actor, the effectiveness of the proposed Actor-Reasoner is confirmed in multi-scenario field interactions. Our code is available at https://github.com/FanGShiYuu/Actor-Reasoner.
中文: 本文提出了一种并行执行者-推理者框架,通过结合基于交互记忆数据库的实时决策与大语言模型驱动的推理,显著提升了自动驾驶汽车与人类驾驶车辆在多场景交互中的安全性和效率。
English: This paper introduces a parallel Actor-Reasoner framework that enhances autonomous vehicle interactions with human-driven vehicles by combining real-time decision-making through an interaction memory database with LLM-driven reasoning, significantly improving safety and efficiency across multiple scenarios.

Authors:Boyi Kang, Xinfa Zhu, Zihan Zhang, Zhen Ye, Mingshuai Liu, Ziqian Wang, Yike Zhu, Guobin Ma, Jun Chen, Longshuai Xiao, Chao Weng, Wei Xue, Lei Xie
Title: LLaSE-G1: Incentivizing Generalization Capability for LLaMA-based Speech Enhancement
Abstract:
Recent advancements in language models (LMs) have demonstrated strong capabilities in semantic understanding and contextual modeling, which have flourished in generative speech enhancement (SE). However, many LM-based SE approaches primarily focus on semantic information, often neglecting the critical role of acoustic information, which leads to acoustic inconsistency after enhancement and limited generalization across diverse SE tasks. In this paper, we introduce LLaSE-G1, a LLaMA-based language model that incentivizes generalization capabilities for speech enhancement. LLaSE-G1 offers the following key contributions: First, to mitigate acoustic inconsistency, LLaSE-G1 employs continuous representations from WavLM as input and predicts speech tokens from X-Codec2, maximizing acoustic preservation. Second, to promote generalization capability, LLaSE-G1 introduces dual-channel inputs and outputs, unifying multiple SE tasks without requiring task-specific IDs. Third, LLaSE-G1 outperforms prior task-specific discriminative and generative SE models, demonstrating scaling effects at test time and emerging capabilities for unseen SE tasks. Additionally, we release our code and models to support further research in this area.
中文: LLaSE-G1是一种基于LLaMA的语言模型,通过采用WavLM的连续表示和预测X-Codec2标记来解决语音增强中的声学不一致问题,同时通过双通道输入无需任务特定标识即可实现跨任务的泛化能力。
English: LLaSE-G1 is a LLaMA-based language model that addresses acoustic inconsistency in speech enhancement by using continuous representations from WavLM and predicting X-Codec2 tokens, while enabling generalization across multiple tasks through dual-channel inputs without task-specific IDs.

Authors:Yujia Xiao, Lei He, Haohan Guo, Fenglong Xie, Tan Lee
Title: PodAgent: A Comprehensive Framework for Podcast Generation
Abstract:
Existing Existing automatic audio generation methods struggle to generate podcast-like audio programs effectively. The key challenges lie in in-depth content generation, appropriate and expressive voice production. This paper proposed PodAgent, a comprehensive framework for creating audio programs. PodAgent 1) generates informative topic-discussion content by designing a Host-Guest-Writer multi-agent collaboration system, 2) builds a voice pool for suitable voice-role matching and 3) utilizes LLM-enhanced speech synthesis method to generate expressive conversational speech. Given the absence of standardized evaluation criteria for podcast-like audio generation, we developed comprehensive assessment guidelines to effectively evaluate the model's performance. Experimental results demonstrate PodAgent's effectiveness, significantly surpassing direct GPT-4 generation in topic-discussion dialogue content, achieving an 87.4% voice-matching accuracy, and producing more expressive speech through LLM-guided synthesis. Demo page: https://podcast-agent.github.io/demo/. Source code: https://github.com/yujxx/PodAgent.
中文: 本文提出PodAgent框架,通过多智能体内容生成、语音角色匹配和增强型语音合成技术,有效解决了播客类音频自动生成的难题,并在实验中显著超越了GPT-4的直接生成效果。
English: This paper introduces PodAgent, a framework that overcomes challenges in automatic podcast generation through multi-agent content creation, voice matching, and expressive speech synthesis, demonstrating superior performance over direct GPT-4 generation.

Authors:Lixu Wang, Bingqi Shang, Yi Li, Payal Mohapatra, Wei Dong, Xiao Wang, Qi Zhu
Title: Split Adaptation for Pre-trained Vision Transformers
Abstract:
Vision Transformers (ViTs), extensively pre-trained on large-scale datasets, have become essential to foundation models, allowing excellent performance on diverse downstream tasks with minimal adaptation. Consequently, there is growing interest in adapting pre-trained ViTs across various fields, including privacy-sensitive domains where clients are often reluctant to share their data. Existing adaptation methods typically require direct data access, rendering them infeasible under these constraints. A straightforward solution may be sending the pre-trained ViT to clients for local adaptation, which poses issues of model intellectual property protection and incurs heavy client computation overhead. To address these issues, we propose a novel split adaptation (SA) method that enables effective downstream adaptation while protecting data and models. SA, inspired by split learning (SL), segments the pre-trained ViT into a frontend and a backend, with only the frontend shared with the client for data representation extraction. But unlike regular SL, SA replaces frontend parameters with low-bit quantized values, preventing direct exposure of the model. SA allows the client to add bi-level noise to the frontend and the extracted data representations, ensuring data protection. Accordingly, SA incorporates data-level and model-level out-of-distribution enhancements to mitigate noise injection's impact on adaptation performance. Our SA focuses on the challenging few-shot adaptation and adopts patch retrieval augmentation for overfitting alleviation. Extensive experiments on multiple datasets validate SA's superiority over state-of-the-art methods and demonstrate its defense against advanced data reconstruction attacks while preventing model leakage with minimal computation cost on the client side. The source codes can be found at https://github.com/conditionWang/Split_Adaptation.
中文: 提出的分割适配方法将视觉Transformer分割为前端和后端组件,通过量化和噪声注入在少样本下游任务中同时保护数据隐私和模型知识产权,同时保持高性能并降低客户端计算成本。
English: The proposed split adaptation method segments Vision Transformers into frontend and backend components, using quantization and noise injection to protect both data privacy and model intellectual property during few-shot downstream tasks, while maintaining high performance and low client computation costs.

Authors:Jingyi Yang, Xun Lin, Zitong Yu, Liepiao Zhang, Xin Liu, Hui Li, Xiaochen Yuan, Xiaochun Cao
Title: DADM: Dual Alignment of Domain and Modality for Face Anti-spoofing
Abstract:
With the availability of diverse sensor modalities (i.e., RGB, Depth, Infrared) and the success of multi-modal learning, multi-modal face anti-spoofing (FAS) has emerged as a prominent research focus. The intuition behind it is that leveraging multiple modalities can uncover more intrinsic spoofing traces. However, this approach presents more risk of misalignment. We identify two main types of misalignment: (1) \textbf{Intra-domain modality misalignment}, where the importance of each modality varies across different attacks. For instance, certain modalities (e.g., Depth) may be non-defensive against specific attacks (e.g., 3D mask), indicating that each modality has unique strengths and weaknesses in countering particular attacks. Consequently, simple fusion strategies may fall short. (2) \textbf{Inter-domain modality misalignment}, where the introduction of additional modalities exacerbates domain shifts, potentially overshadowing the benefits of complementary fusion. To tackle (1), we propose a alignment module between modalities based on mutual information, which adaptively enhances favorable modalities while suppressing unfavorable ones. To address (2), we employ a dual alignment optimization method that aligns both sub-domain hyperplanes and modality angle margins, thereby mitigating domain gaps. Our method, dubbed \textbf{D}ual \textbf{A}lignment of \textbf{D}omain and \textbf{M}odality (DADM), achieves state-of-the-art performance in extensive experiments across four challenging protocols demonstrating its robustness in multi-modal domain generalization scenarios. The codes will be released soon.
中文摘要:多模态人脸防伪技术利用多种传感器检测伪造痕迹,但面临模态内和模态间不对齐问题,提出的DADM方法通过互信息和双重对齐优化解决了这些问题,实现了最优性能。
English Summary: Multi-modal face anti-spoofing leverages diverse sensors to detect spoofing traces but faces intra-domain and inter-domain misalignment issues, which are addressed by the proposed DADM method using mutual information and dual alignment optimization to achieve state-of-the-art performance.

Authors:Magnus Cunow, Gerrit Großmann
Title: Auto-encoding Molecules: Graph-Matching Capabilities Matter
Abstract:
Autoencoders are effective deep learning models that can function as generative models and learn latent representations for downstream tasks. The use of graph autoencoders - with both encoder and decoder implemented as message passing networks - is intriguing due to their ability to generate permutation-invariant graph representations. However, this approach faces difficulties because decoding a graph structure from a single vector is challenging, and comparing input and output graphs requires an effective permutation-invariant similarity measure. As a result, many studies rely on approximate methods. In this work, we explore the effect of graph matching precision on the training behavior and generation capabilities of a Variational Autoencoder (VAE). Our contribution is two-fold: (1) we propose a transformer-based message passing graph decoder as an alternative to a graph neural network decoder, that is more robust and expressive by leveraging global attention mechanisms. (2) We show that the precision of graph matching has significant impact on training behavior and is essential for effective de novo (molecular) graph generation. Code is available at https://github.com/mcunow/graph-matching
中文摘要:本研究揭示了图匹配精度对变分自编码器训练的关键影响,并提出了基于Transformer的解码器,显著提升了分子图生成的鲁棒性与表达能力。
English summary: This study demonstrates that precise graph matching is crucial for training variational autoencoders and introduces a transformer-based decoder that enhances robustness and expressiveness in molecular graph generation.

Authors:Xin Lin, Chong Shi, Zuopeng Yang, Haojin Tang, Zhili Zhou
Title: SGC-Net: Stratified Granular Comparison Network for Open-Vocabulary HOI Detection
Abstract:
Recent open-vocabulary human-object interaction (OV-HOI) detection methods primarily rely on large language model (LLM) for generating auxiliary descriptions and leverage knowledge distilled from CLIP to detect unseen interaction categories. Despite their effectiveness, these methods face two challenges: (1) feature granularity deficiency, due to reliance on last layer visual features for text alignment, leading to the neglect of crucial object-level details from intermediate layers; (2) semantic similarity confusion, resulting from CLIP's inherent biases toward certain classes, while LLM-generated descriptions based solely on labels fail to adequately capture inter-class similarities. To address these challenges, we propose a stratified granular comparison network. First, we introduce a granularity sensing alignment module that aggregates global semantic features with local details, refining interaction representations and ensuring robust alignment between intermediate visual features and text embeddings. Second, we develop a hierarchical group comparison module that recursively compares and groups classes using LLMs, generating fine-grained and discriminative descriptions for each interaction category. Experimental results on two widely-used benchmark datasets, SWIG-HOI and HICO-DET, demonstrate that our method achieves state-of-the-art results in OV-HOI detection. Codes will be released on https://github.com/Phil0212/SGC-Net.
Chinese: 针对现有开放词汇人-物交互检测方法存在的特征粒度不足和语义相似性混淆问题,我们提出了分层粒度比较网络,通过粒度感知对齐和层次化分组比较模块,在基准数据集上取得了最优性能。
English: Recent OV-HOI detection methods face challenges of feature granularity deficiency and semantic similarity confusion, which are addressed by our proposed stratified granular comparison network through granularity sensing alignment and hierarchical group comparison modules, achieving state-of-the-art results on benchmark datasets.

Authors:Zhaoyi Tian, Feifeng Wang, Shiwei Wang, Zihao Zhou, Yao Zhu, Liquan Shen
Title: High Dynamic Range Video Compression: A Large-Scale Benchmark Dataset and A Learned Bit-depth Scalable Compression Algorithm
Abstract:
Recently, learned video compression (LVC) is undergoing a period of rapid development. However, due to absence of large and high-quality high dynamic range (HDR) video training data, LVC on HDR video is still unexplored. In this paper, we are the first to collect a large-scale HDR video benchmark dataset, named HDRVD2K, featuring huge quantity, diverse scenes and multiple motion types. HDRVD2K fills gaps of video training data and facilitate the development of LVC on HDR videos. Based on HDRVD2K, we further propose the first learned bit-depth scalable video compression (LBSVC) network for HDR videos by effectively exploiting bit-depth redundancy between videos of multiple dynamic ranges. To achieve this, we first propose a compression-friendly bit-depth enhancement module (BEM) to effectively predict original HDR videos based on compressed tone-mapped low dynamic range (LDR) videos and dynamic range prior, instead of reducing redundancy only through spatio-temporal predictions. Our method greatly improves the reconstruction quality and compression performance on HDR videos. Extensive experiments demonstrate the effectiveness of HDRVD2K on learned HDR video compression and great compression performance of our proposed LBSVC network. Code and dataset will be released in https://github.com/sdkinda/HDR-Learned-Video-Coding.
中文: 本文提出了首个大规模HDR视频数据集HDRVD2K,并设计了一种基于比特深度冗余的端到端可伸缩视频压缩网络,通过创新的比特深度增强模块显著提升了HDR视频的重建质量与压缩性能。
English: This paper introduces HDRVD2K, the first large-scale HDR video dataset, and proposes a learned bit-depth scalable video compression network that significantly enhances HDR video reconstruction and compression by exploiting bit-depth redundancy.

Authors:Zongru Wu, Pengzhou Cheng, Zheng Wu, Tianjie Ju, Zhuosheng Zhang, Gongshen Liu
Title: Smoothing Grounding and Reasoning for MLLM-Powered GUI Agents with Query-Oriented Pivot Tasks
Abstract:
Perception-enhanced pre-training, particularly through grounding techniques, is widely adopted to enhance the performance of graphical user interface (GUI) agents. However, in resource-constrained scenarios, the format discrepancy between coordinate-oriented grounding and action-oriented reasoning limits the effectiveness of grounding for reasoning tasks. To address this challenge, we propose a query-oriented pivot approach called query inference, which serves as a bridge between GUI grounding and reasoning. By inferring potential user queries from a screenshot and its associated element coordinates, query inference improves the understanding of coordinates while aligning more closely with reasoning tasks. Experimental results show that query inference outperforms previous grounding techniques under the same training data scale. Notably, query inference achieves comparable or even better performance to large-scale grounding-enhanced OS-Atlas with less than 0.1% of training data. Furthermore, we explore the impact of reasoning formats and demonstrate that integrating additional semantic information into the input further boosts reasoning performance. The code is publicly available at https://github.com/ZrW00/GUIPivot.
中文摘要:提出的查询推理方法通过从截图和坐标推断用户查询,弥合了GUI基础与推理之间的差距,在极少训练数据下显著超越现有技术。
English Summary: The proposed query inference method bridges the gap between GUI grounding and reasoning by inferring user queries from screenshots and coordinates, significantly outperforming previous techniques with minimal training data.

Authors:Song Xia, Yi Yu, Wenhan Yang, Meiwen Ding, Zhuo Chen, Ling-Yu Duan, Alex C. Kot, Xudong Jiang
Title: Theoretical Insights in Model Inversion Robustness and Conditional Entropy Maximization for Collaborative Inference Systems
Abstract:
By locally encoding raw data into intermediate features, collaborative inference enables end users to leverage powerful deep learning models without exposure of sensitive raw data to cloud servers. However, recent studies have revealed that these intermediate features may not sufficiently preserve privacy, as information can be leaked and raw data can be reconstructed via model inversion attacks (MIAs). Obfuscation-based methods, such as noise corruption, adversarial representation learning, and information filters, enhance the inversion robustness by obfuscating the task-irrelevant redundancy empirically. However, methods for quantifying such redundancy remain elusive, and the explicit mathematical relation between this redundancy minimization and inversion robustness enhancement has not yet been established. To address that, this work first theoretically proves that the conditional entropy of inputs given intermediate features provides a guaranteed lower bound on the reconstruction mean square error (MSE) under any MIA. Then, we derive a differentiable and solvable measure for bounding this conditional entropy based on the Gaussian mixture estimation and propose a conditional entropy maximization (CEM) algorithm to enhance the inversion robustness. Experimental results on four datasets demonstrate the effectiveness and adaptability of our proposed CEM; without compromising feature utility and computing efficiency, plugging the proposed CEM into obfuscation-based defense mechanisms consistently boosts their inversion robustness, achieving average gains ranging from 12.9\% to 48.2\%. Code is available at \href{https://github.com/xiasong0501/CEM}{https://github.com/xiasong0501/CEM}.
Chinese: 协作推理通过将原始数据编码为中间特征来保护隐私,但这些特征仍可能遭受模型逆向攻击;本研究提出条件熵最大化算法,在不影响功能与效率的前提下,有效增强隐私保护能力。
English: Collaborative inference protects raw data by encoding it into intermediate features, but these can still be vulnerable to model inversion attacks, which this study addresses by proposing a conditional entropy maximization algorithm to enhance privacy without sacrificing utility or efficiency.

Authors:Tianyi Wang, Jianan Fan, Dingxin Zhang, Dongnan Liu, Yong Xia, Heng Huang, Weidong Cai
Title: MIRROR: Multi-Modal Pathological Self-Supervised Representation Learning via Modality Alignment and Retention
Abstract:
Histopathology and transcriptomics are fundamental modalities in oncology, encapsulating the morphological and molecular aspects of the disease. Multi-modal self-supervised learning has demonstrated remarkable potential in learning pathological representations by integrating diverse data sources. Conventional multi-modal integration methods primarily emphasize modality alignment, while paying insufficient attention to retaining the modality-specific structures. However, unlike conventional scenarios where multi-modal inputs share highly overlapping features, histopathology and transcriptomics exhibit pronounced heterogeneity, offering orthogonal yet complementary insights. Histopathology provides morphological and spatial context, elucidating tissue architecture and cellular topology, whereas transcriptomics delineates molecular signatures through gene expression patterns. This inherent disparity introduces a major challenge in aligning them while maintaining modality-specific fidelity. To address these challenges, we present MIRROR, a novel multi-modal representation learning method designed to foster both modality alignment and retention. MIRROR employs dedicated encoders to extract comprehensive features for each modality, which is further complemented by a modality alignment module to achieve seamless integration between phenotype patterns and molecular profiles. Furthermore, a modality retention module safeguards unique attributes from each modality, while a style clustering module mitigates redundancy and enhances disease-relevant information by modeling and aligning consistent pathological signatures within a clustering space. Extensive evaluations on TCGA cohorts for cancer subtyping and survival analysis highlight MIRROR's superior performance, demonstrating its effectiveness in constructing comprehensive oncological feature representations and benefiting the cancer diagnosis.
Chinese: MIRROR是一种新颖的多模态学习方法,通过平衡模态对齐与特征保留,有效整合组织病理学和转录组学数据,在癌症亚型分类和生存分析中展现出卓越性能。
English: MIRROR is a novel multi-modal learning method that effectively integrates histopathology and transcriptomics data by balancing modality alignment with the preservation of unique features, demonstrating superior performance in cancer subtyping and survival analysis.

Authors:Wei Suo, Lijun Zhang, Mengyang Sun, Lin Yuanbo Wu, Peng Wang, Yanning Zhang
Title: Octopus: Alleviating Hallucination via Dynamic Contrastive Decoding
Abstract:
Large Vision-Language Models (LVLMs) have obtained impressive performance in visual content understanding and multi-modal reasoning. Unfortunately, these large models suffer from serious hallucination problems and tend to generate fabricated responses. Recently, several Contrastive Decoding (CD) strategies have been proposed to alleviate hallucination by introducing disturbed inputs. Although great progress has been made, these CD strategies mostly apply a one-size-fits-all approach for all input conditions. In this paper, we revisit this process through extensive experiments. Related results show that hallucination causes are hybrid and each generative step faces a unique hallucination challenge. Leveraging these meaningful insights, we introduce a simple yet effective Octopus-like framework that enables the model to adaptively identify hallucination types and create a dynamic CD workflow. Our Octopus framework not only outperforms existing methods across four benchmarks but also demonstrates excellent deployability and expansibility. Code is available at https://github.com/LijunZhang01/Octopus.
中文: Octopus框架通过动态识别幻觉类型并自适应调整对比解码策略,有效减少大型视觉语言模型的虚构生成,在多项基准测试中超越现有方法。
English: The Octopus framework dynamically identifies hallucination types and adapts contrastive decoding strategies to effectively mitigate fabrication in Large Vision-Language Models, outperforming existing methods across multiple benchmarks.

Authors:Yunfan Gao, Yun Xiong, Wenlong Wu, Zijing Huang, Bohan Li, Haofen Wang
Title: U-NIAH: Unified RAG and LLM Evaluation for Long Context Needle-In-A-Haystack
Abstract:
Recent advancements in Large Language Models (LLMs) have expanded their context windows to unprecedented lengths, sparking debates about the necessity of Retrieval-Augmented Generation (RAG). To address the fragmented evaluation paradigms and limited cases in existing Needle-in-a-Haystack (NIAH), this paper introduces U-NIAH, a unified framework that systematically compares LLMs and RAG methods in controlled long context settings. Our framework extends beyond traditional NIAH by incorporating multi-needle, long-needle, and needle-in-needle configurations, along with different retrieval settings, while leveraging the synthetic Starlight Academy dataset-a fictional magical universe-to eliminate biases from pre-trained knowledge. Through extensive experiments, we investigate three research questions: (1) performance trade-offs between LLMs and RAG, (2) error patterns in RAG, and (3) RAG's limitations in complex settings. Our findings show that RAG significantly enhances smaller LLMs by mitigating the "lost-in-the-middle" effect and improving robustness, achieving an 82.58% win-rate over LLMs. However, we observe that retrieval noise and reverse chunk ordering degrade performance, while surprisingly, advanced reasoning LLMs exhibit reduced RAG compatibility due to sensitivity to semantic distractors. We identify typical error patterns including omission due to noise, hallucination under high noise critical condition, and self-doubt behaviors. Our work not only highlights the complementary roles of RAG and LLMs, but also provides actionable insights for optimizing deployments. Code: https://github.com/Tongji-KGLLM/U-NIAH.
中文:本文提出的U-NIAH框架表明,检索增强生成(RAG)能显著提升小模型鲁棒性且胜率达82.58%,但同时也揭示了检索噪声会导致性能下降,以及高级推理模型存在兼容性减弱的问题。
English: This paper introduces U-NIAH, a unified framework demonstrating that RAG significantly enhances smaller LLMs' robustness with an 82.58% win-rate, while revealing performance degradation from retrieval noise and reduced compatibility in advanced reasoning models.

Authors:Samuel Garske, Konrad Heidler, Bradley Evans, KC Wong, Xiao Xiang Zhu
Title: SHAZAM: Self-Supervised Change Monitoring for Hazard Detection and Mapping
Abstract:
The increasing frequency of environmental hazards due to climate change underscores the urgent need for effective monitoring systems. Current approaches either rely on expensive labelled datasets, struggle with seasonal variations, or require multiple observations for confirmation (which delays detection). To address these challenges, this work presents SHAZAM - Self-Supervised Change Monitoring for Hazard Detection and Mapping. SHAZAM uses a lightweight conditional UNet to generate expected images of a region of interest (ROI) for any day of the year, allowing for the direct modelling of normal seasonal changes and the ability to distinguish potential hazards. A modified structural similarity measure compares the generated images with actual satellite observations to compute region-level anomaly scores and pixel-level hazard maps. Additionally, a theoretically grounded seasonal threshold eliminates the need for dataset-specific optimisation. Evaluated on four diverse datasets that contain bushfires (wildfires), burned regions, extreme and out-of-season snowfall, floods, droughts, algal blooms, and deforestation, SHAZAM achieved F1 score improvements of between 0.066 and 0.234 over existing methods. This was achieved primarily through more effective hazard detection (higher recall) while using only 473K parameters. SHAZAM demonstrated superior mapping capabilities through higher spatial resolution and improved ability to suppress background features while accentuating both immediate and gradual hazards. SHAZAM has been established as an effective and generalisable solution for hazard detection and mapping across different geographical regions and a diverse range of hazards. The Python code is available at: https://github.com/WiseGamgee/SHAZAM
中文: 本研究提出SHAZAM自监督系统,通过轻量级条件UNet建模季节性变化,并对比生成图像与卫星观测来检测环境灾害,在多种数据集上实现了显著性能提升。
English: This work introduces SHAZAM, a self-supervised system that uses a lightweight conditional UNet to model seasonal changes and detect environmental hazards by comparing generated images with satellite observations, achieving significant performance improvements across diverse datasets.

Authors:Zihan Huang, Wei Fang, Tong Bu, Peng Xue, Zecheng Hao, Wenxuan Liu, Yuanhong Tang, Zhaofei Yu, Tiejun Huang
Title: Differential Coding for Training-Free ANN-to-SNN Conversion
Abstract:
Spiking Neural Networks (SNNs) exhibit significant potential due to their low energy consumption. Converting Artificial Neural Networks (ANNs) to SNNs is an efficient way to achieve high-performance SNNs. However, many conversion methods are based on rate coding, which requires numerous spikes and longer time-steps compared to directly trained SNNs, leading to increased energy consumption and latency. This article introduces differential coding for ANN-to-SNN conversion, a novel coding scheme that reduces spike counts and energy consumption by transmitting changes in rate information rather than rates directly, and explores its application across various layers. Additionally, the threshold iteration method is proposed to optimize thresholds based on activation distribution when converting Rectified Linear Units (ReLUs) to spiking neurons. Experimental results on various Convolutional Neural Networks (CNNs) and Transformers demonstrate that the proposed differential coding significantly improves accuracy while reducing energy consumption, particularly when combined with the threshold iteration method, achieving state-of-the-art performance. The source codes of the proposed method are available at https://github.com/h-z-h-cell/ANN-to-SNN-DCGS.
中文: 本文提出用于人工神经网络到脉冲神经网络转换的差分编码方法,通过传输速率变化而非直接速率来减少脉冲数量和能耗,并结合阈值迭代法优化神经元阈值,在多种网络结构上实现了精度和能效的最优性能。
English: This article introduces differential coding for ANN-to-SNN conversion, which reduces spike counts and energy consumption by transmitting rate changes instead of direct rates, and proposes a threshold iteration method to optimize neuron thresholds, achieving state-of-the-art performance in accuracy and efficiency across various networks.

Authors:Xinwei Luo, Songlin Zhao, Yun Zong, Yong Chen, Gui-shuang Ying, Lifang He
Title: SegImgNet: Segmentation-Guided Dual-Branch Network for Retinal Disease Diagnoses
Abstract:
Retinal image plays a crucial role in diagnosing various diseases, as retinal structures provide essential diagnostic information. However, effectively capturing structural features while integrating them with contextual information from retinal images remains a challenge. In this work, we propose segmentation-guided dual-branch network for retinal disease diagnosis using retinal images and their segmentation maps, named SegImgNet. SegImgNet incorporates a segmentation module to generate multi-scale retinal structural feature maps from retinal images. The classification module employs two encoders to independently extract features from segmented images and retinal images for disease classification. To further enhance feature extraction, we introduce the Segmentation-Guided Attention (SGA) block, which leverages feature maps from the segmentation module to refine the classification process. We evaluate SegImgNet on the public AIROGS dataset and the private e-ROP dataset. Experimental results demonstrate that SegImgNet consistently outperforms existing methods, underscoring its effectiveness in retinal disease diagnosis. The code is publicly available at https://github.com/hawk-sudo/SegImgNet.
中文摘要:提出的SegImgNet通过分割引导的双分支网络和专用注意力机制,有效整合视网膜图像的结构特征与上下文信息,在基准数据集上展现出卓越的疾病诊断性能。
English Summary: The proposed SegImgNet utilizes segmentation-guided dual-branch networks with a specialized attention mechanism to effectively integrate structural and contextual features from retinal images, demonstrating superior diagnostic performance on benchmark datasets.

Authors:Milad Yazdani, Yasamin Medghalchi, Pooria Ashrafian, Ilker Hacihaliloglu, Dena Shahriari
Title: Flow Matching for Medical Image Synthesis: Bridging the Gap Between Speed and Quality
Abstract:
Deep learning models have emerged as a powerful tool for various medical applications. However, their success depends on large, high-quality datasets that are challenging to obtain due to privacy concerns and costly annotation. Generative models, such as diffusion models, offer a potential solution by synthesizing medical images, but their practical adoption is hindered by long inference times. In this paper, we propose the use of an optimal transport flow matching approach to accelerate image generation. By introducing a straighter mapping between the source and target distribution, our method significantly reduces inference time while preserving and further enhancing the quality of the outputs. Furthermore, this approach is highly adaptable, supporting various medical imaging modalities, conditioning mechanisms (such as class labels and masks), and different spatial dimensions, including 2D and 3D. Beyond image generation, it can also be applied to related tasks such as image enhancement. Our results demonstrate the efficiency and versatility of this framework, making it a promising advancement for medical imaging applications. Code with checkpoints and a synthetic dataset (beneficial for classification and segmentation) is now available on: https://github.com/milad1378yz/MOTFM.
中文: 本文提出一种最优传输流匹配方法,通过构建更直接的源-目标分布映射来加速医学图像生成,在显著减少推理时间的同时提升输出质量,并支持多种影像模态与任务的高适应性应用。
English: This paper introduces an optimal transport flow matching approach that accelerates medical image generation by creating straighter source-to-target mappings, significantly reducing inference time while improving output quality and adaptability across various imaging modalities and tasks.

Authors:Guangsheng Bao, Lihua Rong, Yanbin Zhao, Qiji Zhou, Yue Zhang
Title: Decoupling Content and Expression: Two-Dimensional Detection of AI-Generated Text
Abstract:
The wide usage of LLMs raises critical requirements on detecting AI participation in texts. Existing studies investigate these detections in scattered contexts, leaving a systematic and unified approach unexplored. In this paper, we present HART, a hierarchical framework of AI risk levels, each corresponding to a detection task. To address these tasks, we propose a novel 2D Detection Method, decoupling a text into content and language expression. Our findings show that content is resistant to surface-level changes, which can serve as a key feature for detection. Experiments demonstrate that 2D method significantly outperforms existing detectors, achieving an AUROC improvement from 0.705 to 0.849 for level-2 detection and from 0.807 to 0.886 for RAID. We release our data and code at https://github.com/baoguangsheng/truth-mirror.
中文: 本文提出HART分层框架,通过将文本解构为内容和语言表达的新型二维检测方法,在AI文本识别中显著超越现有检测器,AUROC最高提升至0.849。
English: This paper introduces HART, a hierarchical framework for detecting AI-generated text through a novel 2D method that analyzes content and language separately, significantly outperforming existing detectors with AUROC improvements up to 0.849.

Authors:Samar M. Magdy, Sang Yun Kwon, Fakhraddin Alwajih, Safaa Abdelfadil, Shady Shehata, Muhammad Abdul-Mageed
Title: Jawaher: A Multidialectal Dataset of Arabic Proverbs for LLM Benchmarking
Abstract:
Recent advancements in instruction fine-tuning, alignment methods such as reinforcement learning from human feedback (RLHF), and optimization techniques like direct preference optimization (DPO) have significantly enhanced the adaptability of large language models (LLMs) to user preferences. However, despite these innovations, many LLMs continue to exhibit biases toward Western, Anglo-centric, or American cultures, with performance on English data consistently surpassing that of other languages. This reveals a persistent cultural gap in LLMs, which complicates their ability to accurately process culturally rich and diverse figurative language such as proverbs. To address this, we introduce Jawaher, a benchmark designed to assess LLMs' capacity to comprehend and interpret Arabic proverbs. Jawaher includes proverbs from various Arabic dialects, along with idiomatic translations and explanations. Through extensive evaluations of both open- and closed-source models, we find that while LLMs can generate idiomatically accurate translations, they struggle with producing culturally nuanced and contextually relevant explanations. These findings highlight the need for ongoing model refinement and dataset expansion to bridge the cultural gap in figurative language processing.
中文:尽管大语言模型在用户偏好适应方面取得进展,但文化偏见依然存在,Jawaher基准测试显示模型对阿拉伯谚语能生成准确翻译,却难以提供文化上细致入微的解释。
English: Recent advancements in LLM alignment have improved user adaptability, yet models still exhibit cultural biases, particularly in processing Arabic proverbs, as revealed by the Jawaher benchmark showing limitations in cultural nuance despite accurate translations.

Authors:Melih İşeri, Erhan Bayraktar
Title: The Learning Approach to Games
Abstract:
This work introduces a unified framework for a more detailed exploration of games. In existing literature, strategies of players are typically assigned scalar values, and the concept of Nash equilibrium is used to identify compatible strategies. However, this approach lacks the internal structure of a player, thereby failing to accurately model observed behaviors. To address this limitation, we propose an abstract definition of a player. This allows for a more nuanced understanding of players and brings the focus to the challenge of learning that players face. Unlike Markov decision processes, which formalize control problems but not agent design, our framework subsumes standard reinforcement learning structures. It thus offers a language that enables a deeper connection between games and learning. To illustrate the need for such generality, we study a simple two-player game and show that even in the most basic settings, a sophisticated player may adopt dynamic strategies that cannot be captured by simpler designs or compatibility analysis alone. In the discrete setting, we consider a player whose structure incorporates standard estimates from the literature. We explore connections to correlated equilibrium and highlight that dynamic programming naturally applies to all estimates. In the mean-field setting, we exploit symmetry to construct explicit examples of equilibria. Finally, we examine connections to reinforcement learning and bandit problems, demonstrating the broad applicability of the framework.
中文: 本文提出了一个统一框架,通过重新定义具有内部结构的玩家来更精确地模拟博弈行为,将博弈论与学习过程相连接,并展示了该框架在相关均衡和强化学习等多种场景中的广泛适用性。
English: This paper presents a unified framework that redefines players with internal structures to better model strategic behaviors in games, bridging game theory with learning processes and demonstrating its applicability across various settings including correlated equilibrium and reinforcement learning.

Authors:K. O. T. Erziev
Title: À la recherche du sens perdu: your favourite LLM might have more to say than you can understand
Abstract:
We report a peculiar observation that LLMs can assign hidden meanings to sequences that seem visually incomprehensible to humans: for example, a nonsensical phrase consisting of Byzantine musical symbols is recognized by gpt-4o as "say abracadabra". Moreover, some models can communicate using these sequences. Some of these meanings are hypothesized to partly originate in the massive spurious correlations due to BPE tokenization. We systematically evaluate the presence of such abilities in a wide range of models: Claude-3.5 Haiku, Claude-3.5 Sonnet (New and Old), Claude-3.7 Sonnet, gpt-4o mini, gpt-4o, o1-mini, Llama-3.3 70B, DeepSeek-R1-Distill-Lllama 70B, Qwen2.5 1.5B, Qwen2.5 32B, Phi-3.5 mini, GigaChat-Max, Vikhr-Llama-3.2 1B. We argue that this observation might have far-reaching consequences for both safety and security of the modern and future LLMs and systems that employ them. As an illustration, we show that applying this method in combination with simple templates is sufficient to jailbreak previous generation models, with ASR = 0.4 on gpt-4o mini. Our code and data artifacts are available at https://github.com/L3G5/llm-hidden-meanings
中文: 大型语言模型能够为视觉上难以理解的序列赋予隐藏含义,这可能源于分词相关性,并引发安全隐患,如其可被用于越狱模型的能力所示。
English: Large language models can assign hidden meanings to visually incomprehensible sequences, potentially due to tokenization correlations, raising safety concerns as demonstrated by their ability to jailbreak models.

Authors:Pengcheng Jiang, Jiacheng Lin, Lang Cao, Runchu Tian, SeongKu Kang, Zifeng Wang, Jimeng Sun, Jiawei Han
Title: DeepRetrieval: Hacking Real Search Engines and Retrievers with Large Language Models via Reinforcement Learning
Abstract:
Information retrieval systems are crucial for enabling effective access to large document collections. Recent approaches have leveraged Large Language Models (LLMs) to enhance retrieval performance through query augmentation, but often rely on expensive supervised learning or distillation techniques that require significant computational resources and hand-labeled data. We introduce DeepRetrieval, a reinforcement learning (RL) approach that trains LLMs for query generation through trial and error without supervised data (reference query). Using retrieval metrics as rewards, our system generates queries that maximize retrieval performance. DeepRetrieval outperforms leading methods on literature search with 65.07% (vs. previous SOTA 24.68%) recall for publication search and 63.18% (vs. previous SOTA 32.11%) recall for trial search using real-world search engines. DeepRetrieval also dominates in evidence-seeking retrieval, classic information retrieval and SQL database search. With only 3B parameters, it outperforms industry-leading models like GPT-4o and Claude-3.5-Sonnet on 11/13 datasets. These results demonstrate that our RL approach offers a more efficient and effective paradigm for information retrieval. Our data and code are available at: https://github.com/pat-jj/DeepRetrieval.
中文摘要:DeepRetrieval提出了一种无需监督数据的强化学习方法,通过训练大语言模型生成查询,在多个检索领域实现了最先进的性能,且比现有方法更加高效。
English Summary: DeepRetrieval introduces a reinforcement learning approach that trains Large Language Models to generate queries without supervised data, achieving state-of-the-art retrieval performance across multiple search domains while being more efficient than existing methods.

Authors:Jiawei Zhang, Xuan Yang, Taiqi Wang, Yu Yao, Aleksandr Petiushko, Bo Li
Title: SafeAuto: Knowledge-Enhanced Safe Autonomous Driving with Multimodal Foundation Models
Abstract:
Traditional autonomous driving systems often struggle to connect high-level reasoning with low-level control, leading to suboptimal and sometimes unsafe behaviors. Recent advances in multimodal large language models (MLLMs), which process both visual and textual data, offer an opportunity to unify perception and reasoning. However, effectively embedding precise safety knowledge into MLLMs for autonomous driving remains a significant challenge. To address this, we propose SafeAuto, a framework that enhances MLLM-based autonomous driving by incorporating both unstructured and structured knowledge. First, we introduce a Position-Dependent Cross-Entropy (PDCE) loss to improve low-level control signal predictions when values are represented as text. Second, to explicitly integrate safety knowledge, we develop a reasoning component that translates traffic rules into first-order logic (e.g., "red light $\implies$ stop") and embeds them into a probabilistic graphical model (e.g., Markov Logic Network) to verify predicted actions using recognized environmental attributes. Additionally, our Multimodal Retrieval-Augmented Generation (RAG) model leverages video, control signals, and environmental attributes to learn from past driving experiences. Integrating PDCE, MLN, and Multimodal RAG, SafeAuto outperforms existing baselines across multiple datasets, enabling more accurate, reliable, and safer autonomous driving. The code is available at https://github.com/AI-secure/SafeAuto.
中文总结:SafeAuto是一种创新框架,通过结合位置相关损失函数、概率逻辑推理与多模态检索增强生成技术,将多模态大语言模型与结构化安全知识相融合,从而显著提升自动驾驶系统的安全性与可靠性。
English Summary: SafeAuto is a novel framework that enhances autonomous driving safety by integrating multimodal large language models with structured safety knowledge through position-dependent loss functions, probabilistic reasoning, and multimodal retrieval-augmented generation.

Authors:Naveen Mysore
Title: Quantifying First-Order Markov Violations in Noisy Reinforcement Learning: A Causal Discovery Approach
Abstract:
Reinforcement learning (RL) methods frequently assume that each new observation completely reflects the environment's state, thereby guaranteeing Markovian (one-step) transitions. In practice, partial observability or sensor/actuator noise often invalidates this assumption. This paper proposes a systematic methodology for detecting such violations, combining a partial correlation-based causal discovery process (PCMCI) with a novel Markov Violation score (MVS). The MVS measures multi-step dependencies that emerge when noise or incomplete state information disrupts the Markov property. Classic control tasks (CartPole, Pendulum, Acrobot) serve as examples to illustrate how targeted noise and dimension omissions affect both RL performance and measured Markov consistency. Surprisingly, even substantial observation noise sometimes fails to induce strong multi-lag dependencies in certain domains (e.g., Acrobot). In contrast, dimension-dropping investigations show that excluding some state variables (e.g., angular velocities in CartPole and Pendulum) significantly reduces returns and increases MVS, while removing other dimensions has minimal impact. These findings emphasize the importance of locating and safeguarding the most causally essential dimensions in order to preserve effective single-step learning. By integrating partial correlation tests with RL performance outcomes, the proposed approach precisely identifies when and where the Markov assumption is violated. This framework offers a principled mechanism for developing robust policies, informing representation learning, and addressing partial observability in real-world RL scenarios. All code and experimental logs are accessible for reproducibility (https://github.com/ucsb/markovianess).
中文: 本文提出一种结合因果发现与新型马尔可夫违背评分的方法,用于检测噪声或部分可观测性对强化学习中马尔可夫假设的破坏,并通过控制任务证明保护关键因果状态维度对维持有效学习至关重要。
English: This paper introduces a method combining causal discovery with a novel Markov Violation score to detect when noise or partial observability disrupts the Markov assumption in reinforcement learning, demonstrating through control tasks that protecting causally critical state dimensions is key to maintaining effective learning.

Authors:Jian Gao, Weidong Cao, Junyi Yang, Xuan Zhang
Title: AnalogGenie: A Generative Engine for Automatic Discovery of Analog Circuit Topologies
Abstract:
The massive and large-scale design of foundational semiconductor integrated circuits (ICs) is crucial to sustaining the advancement of many emerging and future technologies, such as generative AI, 5G/6G, and quantum computing. Excitingly, recent studies have shown the great capabilities of foundational models in expediting the design of digital ICs. Yet, applying generative AI techniques to accelerate the design of analog ICs remains a significant challenge due to critical domain-specific issues, such as the lack of a comprehensive dataset and effective representation methods for analog circuits. This paper proposes, $\textbf{AnalogGenie}$, a $\underline{\textbf{Gen}}$erat$\underline{\textbf{i}}$ve $\underline{\textbf{e}}$ngine for automatic design/discovery of $\underline{\textbf{Analog}}$ circuit topologies--the most challenging and creative task in the conventional manual design flow of analog ICs. AnalogGenie addresses two key gaps in the field: building a foundational comprehensive dataset of analog circuit topology and developing a scalable sequence-based graph representation universal to analog circuits. Experimental results show the remarkable generation performance of AnalogGenie in broadening the variety of analog ICs, increasing the number of devices within a single design, and discovering unseen circuit topologies far beyond any prior arts. Our work paves the way to transform the longstanding time-consuming manual design flow of analog ICs to an automatic and massive manner powered by generative AI. Our source code is available at https://github.com/xz-group/AnalogGenie.
中文: 本文提出AnalogGenie生成式引擎,通过构建全面数据集和可扩展的电路表示方法,实现了模拟集成电路的自动化设计,大幅提升了电路多样性并突破了现有拓扑发现能力的局限。
English: The paper introduces AnalogGenie, a generative AI engine that automates the design of analog integrated circuits by creating a comprehensive dataset and scalable representation, significantly expanding circuit variety and enabling topology discovery beyond previous methods.

Authors:Amar Kumar, Anita Kriz, Mohammad Havaei, Tal Arbel
Title: PRISM: High-Resolution & Precise Counterfactual Medical Image Generation using Language-guided Stable Diffusion
Abstract:
Developing reliable and generalizable deep learning systems for medical imaging faces significant obstacles due to spurious correlations, data imbalances, and limited text annotations in datasets. Addressing these challenges requires architectures that are robust to the unique complexities posed by medical imaging data. Rapid advancements in vision-language foundation models within the natural image domain prompt the question of how they can be adapted for medical imaging tasks. In this work, we present PRISM, a framework that leverages foundation models to generate high-resolution, language-guided medical image counterfactuals using Stable Diffusion. Our approach demonstrates unprecedented precision in selectively modifying spurious correlations (the medical devices) and disease features, enabling the removal and addition of specific attributes while preserving other image characteristics. Through extensive evaluation, we show how PRISM advances counterfactual generation and enables the development of more robust downstream classifiers for clinically deployable solutions. To facilitate broader adoption and research, we make our code publicly available at https://github.com/Amarkr1/PRISM.
中文摘要:PRISM框架利用视觉-语言基础模型生成精准的语言引导医学图像反事实,能够选择性修改虚假相关性及疾病特征,从而开发出更鲁棒的分类器以供临床部署应用。
English Summary: The PRISM framework utilizes vision-language foundation models to generate precise, language-guided medical image counterfactuals, enabling selective modification of spurious correlations and disease features to develop more robust classifiers for clinical deployment.

Authors:Federico Pizarro Bejarano, Bryson Jones, Daniel Pastor Moreno, Joseph Bowkett, Paul G. Backes, Angela P. Schoellig
Title: ProDapt: Proprioceptive Adaptation using Long-term Memory Diffusion
Abstract:
Diffusion models have revolutionized imitation learning, allowing robots to replicate complex behaviours. However, diffusion often relies on cameras and other exteroceptive sensors to observe the environment and lacks long-term memory. In space, military, and underwater applications, robots must be highly robust to failures in exteroceptive sensors, operating using only proprioceptive information. In this paper, we propose ProDapt, a method of incorporating long-term memory of previous contacts between the robot and the environment in the diffusion process, allowing it to complete tasks using only proprioceptive data. This is achieved by identifying "keypoints", essential past observations maintained as inputs to the policy. We test our approach using a UR10e robotic arm in both simulation and real experiments and demonstrate the necessity of this long-term memory for task completion.
中文: ProDapt方法通过引入以往接触环境的关键点作为长期记忆,改进了模仿学习中的扩散模型,使机器人仅凭本体感知数据即可完成任务,并在UR10e机械臂的仿真和实际实验中验证了其必要性。
English: ProDapt enhances diffusion models in imitation learning by incorporating long-term memory of environmental contacts through keypoints, enabling robots to complete tasks using only proprioceptive data, as validated with a UR10e arm in simulations and real experiments.

Authors:Xinhang Ma, Junlin Wu, Hussein Sibai, Yiannis Kantaros, Yevgeniy Vorobeychik
Title: Learning Vision-Based Neural Network Controllers with Semi-Probabilistic Safety Guarantees
Abstract:
Ensuring safety in autonomous systems with vision-based control remains a critical challenge due to the high dimensionality of image inputs and the fact that the relationship between true system state and its visual manifestation is unknown. Existing methods for learning-based control in such settings typically lack formal safety guarantees. To address this challenge, we introduce a novel semi-probabilistic verification framework that integrates reachability analysis with conditional generative adversarial networks and distribution-free tail bounds to enable efficient and scalable verification of vision-based neural network controllers. Next, we develop a gradient-based training approach that employs a novel safety loss function, safety-aware data-sampling strategy to efficiently select and store critical training examples, and curriculum learning, to efficiently synthesize safe controllers in the semi-probabilistic framework. Empirical evaluations in X-Plane 11 airplane landing simulation, CARLA-simulated autonomous lane following, and F1Tenth lane following in a physical visually-rich miniature environment demonstrate the effectiveness of our method in achieving formal safety guarantees while maintaining strong nominal performance. Our code is available at https://github.com/xhOwenMa/SPVT.
中文: 本文提出了一种半概率验证框架和梯度训练方法,为基于视觉的神经网络控制器提供形式化安全保证,并在多个自主系统仿真中验证了其有效性。
English: This paper introduces a semi-probabilistic verification framework and a gradient-based training approach to provide formal safety guarantees for vision-based neural network controllers, with effectiveness demonstrated across multiple autonomous systems simulations.

Authors:Hanjiang Hu, Alexander Robey, Changliu Liu
Title: Steering Dialogue Dynamics for Robustness against Multi-turn Jailbreaking Attacks
Abstract:
Large language models (LLMs) are shown to be vulnerable to jailbreaking attacks where adversarial prompts are designed to elicit harmful responses. While existing defenses effectively mitigate single-turn attacks by detecting and filtering unsafe inputs, they fail against multi-turn jailbreaks that exploit contextual drift over multiple interactions, gradually leading LLMs away from safe behavior. To address this challenge, we propose a safety steering framework grounded in safe control theory, ensuring invariant safety in multi-turn dialogues. Our approach models the dialogue with LLMs using state-space representations and introduces a novel neural barrier function (NBF) to detect and filter harmful queries emerging from evolving contexts proactively. Our method achieves invariant safety at each turn of dialogue by learning a safety predictor that accounts for adversarial queries, preventing potential context drift toward jailbreaks. Extensive experiments under multiple LLMs show that our NBF-based safety steering outperforms safety alignment, prompt-based steering and lightweight LLM guardrails baselines, offering stronger defenses against multi-turn jailbreaks while maintaining a better trade-off among safety, helpfulness and over-refusal. Check out the website here https://sites.google.com/view/llm-nbf/home . Our code is available on https://github.com/HanjiangHu/NBF-LLM .
中文摘要:该研究提出了一种基于神经屏障函数的安全引导框架,通过主动检测多轮对话中的有害查询并维持持续安全状态,有效防御大语言模型的多轮越狱攻击。
English Summary: The proposed safety steering framework using a neural barrier function effectively prevents multi-turn jailbreaking attacks in large language models by proactively detecting harmful queries and ensuring invariant safety throughout dialogues.

Authors:Benedikt Blumenstiel, Nassim Ait Ali Braham, Conrad M Albrecht, Stefano Maurogiovanni, Paolo Fraccaro
Title: SSL4EO-S12 v1.1: A Multimodal, Multiseasonal Dataset for Pretraining, Updated
Abstract:
This technical report presents SSL4EO-S12 v1.1, a multimodal, multitemporal Earth Observation dataset designed for pretraining large-scale foundation models. Building on the success of SSL4EO-S12 v1.0, the new version addresses the previous challenges of data misalignment and a limited data structure for low-barrier, analysis-ready EO processing. SSL4EO-S12 v1.1 covers the world's 10,000 largest cities and its surroundings within a 50 km radius across four seasons, resulting in a diverse collection of nearly one million patches. SSL4EO-S12 v1.1 packages the data in Zarr file format for cloud-efficient loading and representation of meta-information such as including cloud masks and geolocation. Released under the CC-BY-4.0 license, SSL4EO-S12 v1.1 facilitates open research and provides a robust foundation for future advancements in self-supervised learning and geospatial analysis. The dataset is available online through https://datapub.fz-juelich.de/ssl4eo-s12, and we provided additional resources at https://github.com/DLR-MF-DAS/SSL4EO-S12-v1.1.
中文:SSL4EO-S12 v1.1是一个增强型多模态地球观测数据集,覆盖全球万个城市四季影像,采用Zarr格式存储并开放授权,专为自监督学习设计。
English: SSL4EO-S12 v1.1 is an enhanced multimodal Earth Observation dataset covering 10,000 cities across four seasons, designed for self-supervised learning with cloud-optimized Zarr format and open licensing.

Authors:Fakhraddin Alwajih, Abdellah El Mekki, Samar Mohamed Magdy, Abdelrahim A. Elmadany, Omer Nacar, El Moatez Billah Nagoudi, Reem Abdel-Salam, Hanin Atwany, Youssef Nafea, Abdulfattah Mohammed Yahya, Rahaf Alhamouri, Hamzah A. Alsayadi, Hiba Zayed, Sara Shatnawi, Serry Sibaee, Yasir Ech-Chammakhy, Walid Al-Dhabyani, Marwa Mohamed Ali, Imen Jarraya, Ahmed Oumar El-Shangiti, Aisha Alraeesi, Mohammed Anwar Al-Ghrawi, Abdulrahman S. Al-Batati, Elgizouli Mohamed, Noha Taha Elgindi, Muhammed Saeed, Houdaifa Atou, Issam Ait Yahia, Abdelhak Bouayad, Mohammed Machrouh, Amal Makouar, Dania Alkawi, Mukhtar Mohamed, Safaa Taher Abdelfadil, Amine Ziad Ounnoughene, Rouabhia Anfel, Rwaa Assi, Ahmed Sorkatti, Mohamedou Cheikh Tourad, Anis Koubaa, Ismail Berrada, Mustafa Jarrar, Shady Shehata, Muhammad Abdul-Mageed
Title: Palm: A Culturally Inclusive and Linguistically Diverse Dataset for Arabic LLMs
Abstract:
As large language models (LLMs) become increasingly integrated into daily life, ensuring their cultural sensitivity and inclusivity is paramount. We introduce our dataset, a year-long community-driven project covering all 22 Arab countries. The dataset includes instructions (input, response pairs) in both Modern Standard Arabic (MSA) and dialectal Arabic (DA), spanning 20 diverse topics. Built by a team of 44 researchers across the Arab world, all of whom are authors of this paper, our dataset offers a broad, inclusive perspective. We use our dataset to evaluate the cultural and dialectal capabilities of several frontier LLMs, revealing notable limitations. For instance, while closed-source LLMs generally exhibit strong performance, they are not without flaws, and smaller open-source models face greater challenges. Moreover, certain countries (e.g., Egypt, the UAE) appear better represented than others (e.g., Iraq, Mauritania, Yemen). Our annotation guidelines, code, and data for reproducibility are publicly available.
中文摘要:本研究引入了一个涵盖所有22个阿拉伯国家的社区驱动数据集,用于评估大型语言模型的文化和方言能力,揭示了它们在性能和代表性方面存在显著不足。
English Summary: This study introduces a comprehensive, community-driven dataset covering all 22 Arab countries to evaluate the cultural and dialectal capabilities of large language models, revealing significant limitations in their performance and representation.

Authors:Magnus Sesodia, Alina Petrova, John Armour, Thomas Lukasiewicz, Oana-Maria Camburu, Puneet K. Dokania, Philip Torr, Christian Schroeder de Witt
Title: AnnoCaseLaw: A Richly-Annotated Dataset For Benchmarking Explainable Legal Judgment Prediction
Abstract:
Legal systems worldwide continue to struggle with overwhelming caseloads, limited judicial resources, and growing complexities in legal proceedings. Artificial intelligence (AI) offers a promising solution, with Legal Judgment Prediction (LJP) -- the practice of predicting a court's decision from the case facts -- emerging as a key research area. However, existing datasets often formulate the task of LJP unrealistically, not reflecting its true difficulty. They also lack high-quality annotation essential for legal reasoning and explainability. To address these shortcomings, we introduce AnnoCaseLaw, a first-of-its-kind dataset of 471 meticulously annotated U.S. Appeals Court negligence cases. Each case is enriched with comprehensive, expert-labeled annotations that highlight key components of judicial decision making, along with relevant legal concepts. Our dataset lays the groundwork for more human-aligned, explainable LJP models. We define three legally relevant tasks: (1) judgment prediction; (2) concept identification; and (3) automated case annotation, and establish a performance baseline using industry-leading large language models (LLMs). Our results demonstrate that LJP remains a formidable task, with application of legal precedent proving particularly difficult. Code and data are available at https://github.com/anonymouspolar1/annocaselaw.
中文:AnnoCaseLaw推出了首个包含471个美国上诉法院过失案件精细标注的数据集,旨在解决现有法律判决预测数据集的不足,通过定义三项法律任务并利用大语言模型建立性能基准,推动更贴近人类思维、可解释的人工智能模型发展。
English: AnnoCaseLaw introduces a novel dataset of 471 annotated U.S. Appeals Court negligence cases to address the limitations of existing Legal Judgment Prediction datasets, enabling more realistic and explainable AI models by defining three legal tasks and establishing performance baselines with large language models.

Authors:Xinyu Yuan, Zichen Wang, Marcus Collins, Huzefa Rangwala
Title: Protein Structure Tokenization: Benchmarking and New Recipe
Abstract:
Recent years have witnessed a surge in the development of protein structural tokenization methods, which chunk protein 3D structures into discrete or continuous representations. Structure tokenization enables the direct application of powerful techniques like language modeling for protein structures, and large multimodal models to integrate structures with protein sequences and functional texts. Despite the progress, the capabilities and limitations of these methods remain poorly understood due to the lack of a unified evaluation framework. We first introduce StructTokenBench, a framework that comprehensively evaluates the quality and efficiency of structure tokenizers, focusing on fine-grained local substructures rather than global structures, as typical in existing benchmarks. Our evaluations reveal that no single model dominates all benchmarking perspectives. Observations of codebook under-utilization led us to develop AminoAseed, a simple yet effective strategy that enhances codebook gradient updates and optimally balances codebook size and dimension for improved tokenizer utilization and quality. Compared to the leading model ESM3, our method achieves an average of 6.31% performance improvement across 24 supervised tasks, with sensitivity and utilization rates increased by 12.83% and 124.03%, respectively. Source code and model weights are available at https://github.com/KatarinaYuan/StructTokenBench
Chinese: 本研究提出了StructTokenBench框架以评估蛋白质结构标记化方法,并开发了AminoAseed策略,通过优化码本利用显著提升了标记器性能,在监督任务中相比现有模型实现了明显改进。
English: The study introduces StructTokenBench, a framework for evaluating protein structure tokenization methods, and proposes AminoAseed, a strategy that improves tokenizer performance by enhancing codebook utilization, achieving significant gains over existing models in supervised tasks.

Authors:Zhenxing Cui, Lu Chen, Yunhai Wang, Daniel Haehn, Yong Wang, Hanspeter Pfister
Title: Generalization of CNNs on Relational Reasoning with Bar Charts
Abstract:
This paper presents a systematic study of the generalization of convolutional neural networks (CNNs) and humans on relational reasoning tasks with bar charts. We first revisit previous experiments on graphical perception and update the benchmark performance of CNNs. We then test the generalization performance of CNNs on a classic relational reasoning task: estimating bar length ratios in a bar chart, by progressively perturbing the standard visualizations. We further conduct a user study to compare the performance of CNNs and humans. Our results show that CNNs outperform humans only when the training and test data have the same visual encodings. Otherwise, they may perform worse. We also find that CNNs are sensitive to perturbations in various visual encodings, regardless of their relevance to the target bars. Yet, humans are mainly influenced by bar lengths. Our study suggests that robust relational reasoning with visualizations is challenging for CNNs. Improving CNNs' generalization performance may require training them to better recognize task-related visual properties.
中文: 本研究表明,尽管卷积神经网络在视觉编码一致的条形图关系推理任务中表现优于人类,但在视觉干扰下其性能显著下降,而人类主要关注条形长度,这凸显了卷积神经网络在此类任务中实现稳健泛化的挑战。
English: This study demonstrates that while CNNs can outperform humans in relational reasoning tasks with bar charts when visual encodings remain consistent, their performance significantly deteriorates under visual perturbations, unlike humans who focus primarily on bar lengths, highlighting the challenge of achieving robust generalization in CNNs for such tasks.

Authors:Chong Zhang, Yukun Ma, Qian Chen, Wen Wang, Shengkui Zhao, Zexu Pan, Hao Wang, Chongjia Ni, Trung Hieu Nguyen, Kun Zhou, Yidi Jiang, Chaohong Tan, Zhifu Gao, Zhihao Du, Bin Ma
Title: InspireMusic: Integrating Super Resolution and Large Language Model for High-Fidelity Long-Form Music Generation
Abstract:
We introduce InspireMusic, a framework integrated super resolution and large language model for high-fidelity long-form music generation. A unified framework generates high-fidelity music, songs, and audio, which incorporates an autoregressive transformer with a super-resolution flow-matching model. This framework enables the controllable generation of high-fidelity long-form music at a higher sampling rate from both text and audio prompts. Our model differs from previous approaches, as we utilize an audio tokenizer with one codebook that contains richer semantic information, thereby reducing training costs and enhancing efficiency. This combination enables us to achieve high-quality audio generation with long-form coherence of up to $8$ minutes. Then, an autoregressive transformer model based on Qwen 2.5 predicts audio tokens. Next, we employ a super-resolution flow-matching model to generate high-sampling rate audio with fine-grained details learned from an acoustic codec model. Comprehensive experiments show that the InspireMusic-1.5B-Long model has a comparable performance to recent top-tier open-source systems, including MusicGen and Stable Audio 2.0, on subjective and objective evaluations. The code and pre-trained models are released at https://github.com/FunAudioLLM/InspireMusic.
中文:InspireMusic是一个结合超分辨率与大语言模型的统一框架,能够根据文本或音频提示生成高保真度的长篇幅音乐,通过优化音频分词器降低训练成本,在主观和客观评估中达到与顶尖开源系统相当的性能水平。
English: InspireMusic is a unified framework that integrates super-resolution and large language models to generate high-fidelity, long-form music from text or audio prompts, achieving up to 8 minutes of coherent audio with reduced training costs and competitive performance against leading systems.

Authors:Zezeng Li, Xiaoyu Du, Na Lei, Liming Chen, Weimin Wang
Title: NoPain: No-box Point Cloud Attack via Optimal Transport Singular Boundary
Abstract:
Adversarial attacks exploit the vulnerability of deep models against adversarial samples. Existing point cloud attackers are tailored to specific models, iteratively optimizing perturbations based on gradients in either a white-box or black-box setting. Despite their promising attack performance, they often struggle to produce transferable adversarial samples due to overfitting the specific parameters of surrogate models. To overcome this issue, we shift our focus to the data distribution itself and introduce a novel approach named NoPain, which employs optimal transport (OT) to identify the inherent singular boundaries of the data manifold for cross-network point cloud attacks. Specifically, we first calculate the OT mapping from noise to the target feature space, then identify singular boundaries by locating non-differentiable positions. Finally, we sample along singular boundaries to generate adversarial point clouds. Once the singular boundaries are determined, NoPain can efficiently produce adversarial samples without the need of iterative updates or guidance from the surrogate classifiers. Extensive experiments demonstrate that the proposed end-to-end method outperforms baseline approaches in terms of both transferability and efficiency, while also maintaining notable advantages even against defense strategies. Code and model are available at https://github.com/cognaclee/nopain
中文总结:NoPain方法创新性地利用最优传输技术识别数据流形的固有奇异边界,无需迭代更新或依赖代理模型即可高效生成具有强迁移性的对抗性点云。
English summary: The NoPain method introduces a novel approach using optimal transport to identify singular boundaries in data manifolds, enabling efficient generation of highly transferable adversarial point clouds without iterative updates or reliance on surrogate models.

Authors:Qiusi Zhan, Richard Fang, Henil Shalin Panchal, Daniel Kang
Title: Adaptive Attacks Break Defenses Against Indirect Prompt Injection Attacks on LLM Agents
Abstract:
Large Language Model (LLM) agents exhibit remarkable performance across diverse applications by using external tools to interact with environments. However, integrating external tools introduces security risks, such as indirect prompt injection (IPI) attacks. Despite defenses designed for IPI attacks, their robustness remains questionable due to insufficient testing against adaptive attacks. In this paper, we evaluate eight different defenses and bypass all of them using adaptive attacks, consistently achieving an attack success rate of over 50%. This reveals critical vulnerabilities in current defenses. Our research underscores the need for adaptive attack evaluation when designing defenses to ensure robustness and reliability. The code is available at https://github.com/uiuc-kang-lab/AdaptiveAttackAgent.
中文: 大语言模型代理面临间接提示注入攻击的重大安全风险,自适应攻击成功绕过了所有八种评估防御措施,成功率超过50%,揭示了当前防御的关键漏洞,并强调了设计防御时进行自适应攻击评估的必要性。
English: Large Language Model agents face significant security risks from indirect prompt injection attacks, as demonstrated by adaptive attacks that successfully bypass all eight evaluated defenses with over 50% success rates, highlighting critical vulnerabilities and the need for robust defense evaluations.

Authors:Nilay Yilmaz, Maitreya Patel, Yiran Lawrence Luo, Tejas Gokhale, Chitta Baral, Suren Jayasuriya, Yezhou Yang
Title: VOILA: Evaluation of MLLMs For Perceptual Understanding and Analogical Reasoning
Abstract:
Multimodal Large Language Models (MLLMs) have become a powerful tool for integrating visual and textual information. Despite their exceptional performance on visual understanding benchmarks, measuring their ability to reason abstractly across multiple images remains a significant challenge. To address this, we introduce VOILA, a large-scale, open-ended, dynamic benchmark designed to evaluate MLLMs' perceptual understanding and abstract relational reasoning. VOILA employs an analogical mapping approach in the visual domain, requiring models to generate an image that completes an analogy between two given image pairs, reference and application, without relying on predefined choices. Our experiments demonstrate that the analogical reasoning tasks in VOILA present a challenge to MLLMs. Through multi-step analysis, we reveal that current MLLMs struggle to comprehend inter-image relationships and exhibit limited capabilities in high-level relational reasoning. Notably, we observe that performance improves when following a multi-step strategy of least-to-most prompting. Comprehensive evaluations on open-source models and GPT-4o show that on text-based answers, the best accuracy for challenging scenarios is 13% (LLaMa 3.2) and even for simpler tasks is only 29% (GPT-4o), while human performance is significantly higher at 70% across both difficulty levels.
中文: VOILA作为新型评估基准,通过要求多模态大语言模型生成完成视觉类比的图像来测试其抽象关系推理能力,结果显示模型在跨图像理解方面存在显著困难,准确率远低于人类水平。
English: VOILA is a novel benchmark that evaluates multimodal large language models' abstract relational reasoning by requiring them to generate images completing visual analogies, revealing their significant struggles in inter-image understanding with accuracy far below human performance.

Authors:Hongyi Cai, Yuqian Fu, Hongming Fu, Bo Zhao
Title: MergeIT: From Selection to Merging for Efficient Instruction Tuning
Abstract:
Instruction tuning is crucial for optimizing Large Language Models (LLMs), yet mainstream data selection methods heavily rely on LLMs as instruction quality scorers, leading to high computational costs and reduced data diversity. To address these limitations, we propose MergeIT, a novel LLM-based Merging strategy for better Instruction Tuning that shifts the focus from selection to synthesis. MergeIT operates in two stages: first, topic-aware filtering clusters and refines the dataset, preserving diversity while eliminating redundancy without relying on LLM-based scoring. Second, LLM-based merging synthesizes semantically similar instructions into more informative and compact training data, enhancing data richness while further reducing dataset size. Experimental results demonstrate that MergeIT enables efficient, diverse, and scalable instruction selection and synthesis, establishing LLM-based merging as a promising alternative to conventional scoring-based selection methods for instruction tuning. Our source code and datasets are now available at https://github.com/XcloudFance/MergeIT
中文摘要:MergeIT提出了一种基于大型语言模型的两阶段合并策略,通过主题感知过滤保留多样性并消除冗余,再合成语义相似的指令为更丰富紧凑的训练数据,实验证明其比传统评分选择方法更高效、可扩展。
English Summary: MergeIT introduces a two-stage LLM-based merging strategy for instruction tuning, which first filters and clusters data by topic to maintain diversity without LLM scoring, then synthesizes similar instructions into richer, more compact training sets, proving more efficient and scalable than traditional selection methods.

Authors:Shinwoo Park, Shubin Kim, Do-Kyung Kim, Yo-Sub Han
Title: KatFishNet: Detecting LLM-Generated Korean Text through Linguistic Feature Analysis
Abstract:
The rapid advancement of large language models (LLMs) increases the difficulty of distinguishing between human-written and LLM-generated text. Detecting LLM-generated text is crucial for upholding academic integrity, preventing plagiarism, protecting copyrights, and ensuring ethical research practices. Most prior studies on detecting LLM-generated text focus primarily on English text. However, languages with distinct morphological and syntactic characteristics require specialized detection approaches. Their unique structures and usage patterns can hinder the direct application of methods primarily designed for English. Among such languages, we focus on Korean, which has relatively flexible spacing rules, a rich morphological system, and less frequent comma usage compared to English. We introduce KatFish, the first benchmark dataset for detecting LLM-generated Korean text. The dataset consists of text written by humans and generated by four LLMs across three genres. By examining spacing patterns, part-of-speech diversity, and comma usage, we illuminate the linguistic differences between human-written and LLM-generated Korean text. Building on these observations, we propose KatFishNet, a detection method specifically designed for the Korean language. KatFishNet achieves an average of 19.78% higher AUROC compared to the best-performing existing detection method. Our code and data are available at https://github.com/Shinwoo-Park/detecting_llm_generated_korean_text_through_linguistic_analysis.
中文: 本研究推出了首个检测韩语大语言模型生成文本的基准数据集KatFish,并提出专门方法KatFishNet,通过分析间距和词性多样性等语言特征,其AUROC指标比现有最佳方法平均提高19.78%。
English: This study introduces KatFish, the first benchmark dataset for detecting LLM-generated Korean text, and proposes KatFishNet, a specialized detection method that achieves 19.78% higher AUROC than existing approaches by analyzing linguistic features like spacing and part-of-speech diversity.

Authors:Fengxiang Wang, Hongzhen Wang, Mingshuo Chen, Di Wang, Yulin Wang, Zonghao Guo, Qiang Ma, Long Lan, Wenjing Yang, Jing Zhang, Zhiyuan Liu, Maosong Sun
Title: XLRS-Bench: Could Your Multimodal LLMs Understand Extremely Large Ultra-High-Resolution Remote Sensing Imagery?
Abstract:
The astonishing breakthrough of multimodal large language models (MLLMs) has necessitated new benchmarks to quantitatively assess their capabilities, reveal their limitations, and indicate future research directions. However, this is challenging in the context of remote sensing (RS), since the imagery features ultra-high resolution that incorporates extremely complex semantic relationships. Existing benchmarks usually adopt notably smaller image sizes than real-world RS scenarios, suffer from limited annotation quality, and consider insufficient dimensions of evaluation. To address these issues, we present XLRS-Bench: a comprehensive benchmark for evaluating the perception and reasoning capabilities of MLLMs in ultra-high-resolution RS scenarios. XLRS-Bench boasts the largest average image size (8500$\times$8500) observed thus far, with all evaluation samples meticulously annotated manually, assisted by a novel semi-automatic captioner on ultra-high-resolution RS images. On top of the XLRS-Bench, 16 sub-tasks are defined to evaluate MLLMs' 10 kinds of perceptual capabilities and 6 kinds of reasoning capabilities, with a primary emphasis on advanced cognitive processes that facilitate real-world decision-making and the capture of spatiotemporal changes. The results of both general and RS-focused MLLMs on XLRS-Bench indicate that further efforts are needed for real-world RS applications. We have open-sourced XLRS-Bench to support further research in developing more powerful MLLMs for remote sensing.
中文摘要:XLRS-Bench作为首个超高分遥感图像综合基准,通过16个子任务评估多模态大模型的感知与推理能力,其超大图像尺寸和精细标注揭示了现有模型在真实遥感应用中的不足,为未来发展指明方向。
English Summary: XLRS-Bench is introduced as a comprehensive benchmark with ultra-high-resolution remote sensing images to evaluate multimodal large language models' perception and reasoning capabilities, addressing limitations of existing benchmarks and revealing current models' inadequacies for real-world applications.

Authors:Kairong Luo, Haodong Wen, Shengding Hu, Zhenbo Sun, Zhiyuan Liu, Maosong Sun, Kaifeng Lyu, Wenguang Chen
Title: A Multi-Power Law for Loss Curve Prediction Across Learning Rate Schedules
Abstract:
Training large models is both resource-intensive and time-consuming, making it crucial to understand the quantitative relationship between model performance and hyperparameters. In this paper, we present an empirical law that describes how the pretraining loss of large language models evolves under different learning rate schedules, such as constant, cosine, and step decay schedules. Our proposed law takes a multi-power form, combining a power law based on the sum of learning rates and additional power laws to account for a loss reduction effect induced by learning rate decay. We extensively validate this law on various model sizes and architectures, and demonstrate that after fitting on a few learning rate schedules, the law accurately predicts the loss curves for unseen schedules of different shapes and horizons. Moreover, by minimizing the predicted final pretraining loss across learning rate schedules, we are able to find a schedule that outperforms the widely used cosine learning rate schedule. Interestingly, this automatically discovered schedule bears some resemblance to the recently proposed Warmup-Stable-Decay (WSD) schedule (Hu et al, 2024) but achieves a slightly lower final loss. We believe these results could offer valuable insights for understanding the dynamics of pretraining and designing learning rate schedules to improve efficiency.
中文摘要:本文提出一个经验定律,可预测不同学习率策略下的预训练损失变化,并通过优化策略发现优于常用余弦衰减的新方案。
English Summary: This paper introduces an empirical law that predicts pretraining loss evolution across various learning rate schedules, enabling optimized schedule discovery that outperforms standard cosine decay.

Authors:Yingfa Chen, Yutong Wu, Chenyang Song, Zhen Leng Thai, Xingyu Shen, Xu Han, Zhiyuan Liu, Maosong Sun
Title: Cost-Optimal Grouped-Query Attention for Long-Context Modeling
Abstract:
Grouped-Query Attention (GQA) is a widely adopted strategy for reducing the computational cost of attention layers in large language models (LLMs). However, current GQA configurations are often suboptimal because they overlook how context length influences inference cost. Since inference cost grows with context length, the most cost-efficient GQA configuration should also vary accordingly. In this work, we analyze the relationship among context length, model size, GQA configuration, and model loss, and introduce two innovations: (1) we decouple the total head size from the hidden size, enabling more flexible control over attention FLOPs; and (2) we jointly optimize the model size and the GQA configuration to arrive at a better allocation of inference resources between attention layers and other components. Our analysis reveals that commonly used GQA configurations are highly suboptimal for long-context scenarios. More importantly, we propose a recipe for deriving cost-optimal GQA configurations. Our results show that for long-context scenarios, one should use fewer attention heads while scaling up model size. Configurations selected by our recipe can reduce both memory usage and FLOPs by more than 50% compared to Llama-3's GQA, with *no degradation in model capabilities*. Our findings offer valuable insights for designing efficient long-context LLMs. The code is available at https://www.github.com/THUNLP/cost-optimal-gqa .
中文: 分组查询注意力(GQA)配置应根据上下文长度优化,我们的方法通过减少注意力头并增大模型规模,可在保持性能的同时将内存和计算量降低超50%。
English: Grouped-Query Attention (GQA) configurations should adapt to context length for optimal efficiency, with our method enabling fewer heads and larger models to cut memory and FLOPs by over 50% without performance loss.

Authors:Yingfa Chen, Yutong Wu, Chenyang Song, Zhen Leng Thai, Xingyu Shen, Xu Han, Zhiyuan Liu, Maosong Sun
Title: Cost-Optimal Grouped-Query Attention for Long-Context Modeling
Abstract:
Grouped-Query Attention (GQA) is a widely adopted strategy for reducing the computational cost of attention layers in large language models (LLMs). However, current GQA configurations are often suboptimal because they overlook how context length influences inference cost. Since inference cost grows with context length, the most cost-efficient GQA configuration should also vary accordingly. In this work, we analyze the relationship among context length, model size, GQA configuration, and model loss, and introduce two innovations: (1) we decouple the total head size from the hidden size, enabling more flexible control over attention FLOPs; and (2) we jointly optimize the model size and the GQA configuration to arrive at a better allocation of inference resources between attention layers and other components. Our analysis reveals that commonly used GQA configurations are highly suboptimal for long-context scenarios. More importantly, we propose a recipe for deriving cost-optimal GQA configurations. Our results show that for long-context scenarios, one should use fewer attention heads while scaling up model size. Configurations selected by our recipe can reduce both memory usage and FLOPs by more than 50% compared to Llama-3's GQA, with *no degradation in model capabilities*. Our findings offer valuable insights for designing efficient long-context LLMs. The code is available at https://www.github.com/THUNLP/cost-optimal-gqa .
中文: 分组查询注意力(GQA)配置应根据上下文长度优化,我们的方法通过减少注意力头并增大模型规模,可在保持性能的同时将内存和计算量降低超50%。
English: Grouped-Query Attention (GQA) configurations should adapt to context length for optimal efficiency, with our method enabling fewer heads and larger models to cut memory and FLOPs by over 50% without performance loss.

Authors:Yinan Sun, Xiongkuo Min, Zicheng Zhang, Yixuan Gao, Yuqin Cao, Guangtao Zhai
Title: Mitigating Low-Level Visual Hallucinations Requires Self-Awareness: Database, Model and Training Strategy
Abstract:
The rapid development of multimodal large language models has resulted in remarkable advancements in visual perception and understanding, consolidating several tasks into a single visual question-answering framework. However, these models are prone to hallucinations, which limit their reliability as artificial intelligence systems. While this issue is extensively researched in natural language processing and image captioning, there remains a lack of investigation of hallucinations in Low-level Visual Perception and Understanding (HLPU), especially in the context of image quality assessment tasks. We consider that these hallucinations arise from an absence of clear self-awareness within the models. To address this issue, we first introduce the HLPU instruction database, the first instruction database specifically focused on hallucinations in low-level vision tasks. This database contains approximately 200K question-answer pairs and comprises four subsets, each covering different types of instructions. Subsequently, we propose the Self-Awareness Failure Elimination (SAFEQA) model, which utilizes image features, salient region features and quality features to improve the perception and comprehension abilities of the model in low-level vision tasks. Furthermore, we propose the Enhancing Self-Awareness Preference Optimization (ESA-PO) framework to increase the model's awareness of knowledge boundaries, thereby mitigating the incidence of hallucination. Finally, we conduct comprehensive experiments on low-level vision tasks, with the results demonstrating that our proposed method significantly enhances self-awareness of the model in these tasks and reduces hallucinations. Notably, our proposed method improves both accuracy and self-awareness of the proposed model and outperforms close-source models in terms of various evaluation metrics.
中文: 多模态大语言模型在统一视觉问答框架下虽取得进展,却因缺乏清晰自我认知而在低层视觉任务中产生幻觉,为此我们提出HLPU指令数据库和SAFEQA模型及ESA-PO框架,显著提升了任务准确性和模型自省能力。
English: Multimodal large language models, while advancing visual tasks through a unified question-answering framework, suffer from hallucinations in low-level vision due to poor self-awareness, prompting the introduction of the HLPU database and the SAFEQA model with ESA-PO framework to enhance accuracy and reduce such errors effectively.

Authors:Junlong Chen, Jiawen Kang, Minrui Xu, Fan Wu, Hongliang Zhang, Huawei Huang, Dusit Niyato, Shiwen Mao
Title: Efficient Twin Migration in Vehicular Metaverses: Multi-Agent Split Deep Reinforcement Learning with Spatio-Temporal Trajectory Generation
Abstract:
Vehicle Twins (VTs) as digital representations of vehicles can provide users with immersive experiences in vehicular metaverse applications, e.g., Augmented Reality (AR) navigation and embodied intelligence. VT migration is an effective way that migrates the VT when the locations of physical entities keep changing to maintain seamless immersive VT services. However, an efficient VT migration is challenging due to the rapid movement of vehicles, dynamic workloads of Roadside Units (RSUs), and heterogeneous resources of the RSUs. To achieve efficient migration decisions and a minimum latency for the VT migration, we propose a multi-agent split Deep Reinforcement Learning (DRL) framework combined with spatio-temporal trajectory generation. In this framework, multiple split DRL agents utilize split architecture to efficiently determine VT migration decisions. Furthermore, we propose a spatio-temporal trajectory generation algorithm based on trajectory datasets and road network data to simulate vehicle trajectories, enhancing the generalization of the proposed scheme for managing VT migration in dynamic network environments. Finally, experimental results demonstrate that the proposed scheme not only enhances the Quality of Experience (QoE) by 29% but also reduces the computational parameter count by approximately 25% while maintaining similar performances, enhancing users' immersive experiences in vehicular metaverses.
Chinese: 车辆数字孪生迁移通过多智能体分割深度强化学习框架和时空轨迹生成算法进行优化,在动态网络环境中不仅将用户体验质量提升了29%,还减少了约25%的计算参数,同时保持了沉浸式服务的连续性。
English: Vehicle Twins migration is optimized using a multi-agent split deep reinforcement learning framework and spatio-temporal trajectory generation, enhancing user experience by 29% and reducing computational parameters by 25% while maintaining seamless immersive services in dynamic vehicular environments.

Authors:Ruoyang Chen, Changyan Yi, Fuhui Zhou, Jiawen Kang, Yuan Wu, Dusit Niyato
Title: Federated Digital Twin Construction via Distributed Sensing: A Game-Theoretic Online Optimization with Overlapping Coalitions
Abstract:
In this paper, we propose a novel federated framework for constructing the digital twin (DT) model, referring to a living and self-evolving visualization model empowered by artificial intelligence, enabled by distributed sensing under edge-cloud collaboration. In this framework, the DT model to be built at the cloud is regarded as a global one being split into and integrating from multiple functional components, i.e., partial-DTs, created at various edge servers (ESs) using feature data collected by associated sensors. Considering time-varying DT evolutions and heterogeneities among partial-DTs, we formulate an online problem that jointly and dynamically optimizes partial-DT assignments from the cloud to ESs, ES-sensor associations for partial-DT creation, and as well as computation and communication resource allocations for global-DT integration. The problem aims to maximize the constructed DT's model quality while minimizing all induced costs, including energy consumption and configuration costs, in long runs. To this end, we first transform the original problem into an equivalent hierarchical game with an upper-layer two-sided matching game and a lower-layer overlapping coalition formation game. After analyzing these games in detail, we apply the Gale-Shapley algorithm and particularly develop a switch rules-based overlapping coalition formation algorithm to obtain short-term equilibria of upper-layer and lower-layer subgames, respectively. Then, we design a deep reinforcement learning-based solution, called DMO, to extend the result into a long-term equilibrium of the hierarchical game, thereby producing the solution to the original problem. Simulations show the effectiveness of the introduced framework, and demonstrate the superiority of the proposed solution over counterparts.
中文: 本文提出了一种新颖的联邦框架,通过边缘云协作构建自演化的数字孪生模型,采用分层博弈方法和深度强化学习来优化模型质量并最小化长期成本。
English: This paper introduces a federated framework for building a self-evolving digital twin model through edge-cloud collaboration, using a hierarchical game approach and deep reinforcement learning to optimize model quality while minimizing costs.

Authors:Xudong Wang, Jiacheng Wang, Lei Feng, Dusit Niyato, Ruichen Zhang, Jiawen Kang, Zehui Xiong, Hongyang Du, Shiwen Mao
Title: Wireless Hallucination in Generative AI-enabled Communications: Concepts, Issues, and Solutions
Abstract:
Generative AI (GenAI) is driving the intelligence of wireless communications. Due to data limitations, random generation, and dynamic environments, GenAI may generate channel information or optimization strategies that violate physical laws or deviate from actual real-world requirements. We refer to this phenomenon as wireless hallucination, which results in invalid channel information, spectrum wastage, and low communication reliability but remains underexplored. To address this gap, this article provides a comprehensive concept of wireless hallucinations in GenAI-driven communications, focusing on hallucination mitigation. Specifically, we first introduce the fundamental, analyze its causes based on the GenAI workflow, and propose mitigation solutions at the data, model, and post-generation levels. Then, we systematically examines representative hallucination scenarios in GenAI-enabled communications and their corresponding solutions. Finally, we propose a novel integrated mitigation solution for GenAI-based channel estimation. At the data level, we establish a channel estimation hallucination dataset and employ generative adversarial networks (GANs)-based data augmentation. Additionally, we incorporate attention mechanisms and large language models (LLMs) to enhance both training and inference performance. Experimental results demonstrate that the proposed hybrid solutions reduce the normalized mean square error (NMSE) by 0.19, effectively reducing wireless hallucinations.
中文摘要:生成式AI在无线通信中可能产生不准确的信道信息或策略,即无线幻觉,导致资源浪费和可靠性降低,但本研究提出了数据、模型及生成后多层次的缓解方案,通过数据增强和先进模型有效降低了误差。
English Summary: Generative AI in wireless communications can produce inaccurate channel data or strategies, known as wireless hallucination, leading to inefficiencies, but this study introduces multi-level mitigation techniques that significantly reduce errors through data augmentation and advanced models.

Authors:Wenbin Wang, Yongcheng Jing, Liang Ding, Yingjie Wang, Li Shen, Yong Luo, Bo Du, Dacheng Tao
Title: Retrieval-Augmented Perception: High-Resolution Image Perception Meets Visual RAG
Abstract:
High-resolution (HR) image perception remains a key challenge in multimodal large language models (MLLMs). To overcome the limitations of existing methods, this paper shifts away from prior dedicated heuristic approaches and revisits the most fundamental idea to HR perception by enhancing the long-context capability of MLLMs, driven by recent advances in long-context techniques like retrieval-augmented generation (RAG) for general LLMs. Towards this end, this paper presents the first study exploring the use of RAG to address HR perception challenges. Specifically, we propose Retrieval-Augmented Perception (RAP), a training-free framework that retrieves and fuses relevant image crops while preserving spatial context using the proposed Spatial-Awareness Layout. To accommodate different tasks, the proposed Retrieved-Exploration Search (RE-Search) dynamically selects the optimal number of crops based on model confidence and retrieval scores. Experimental results on HR benchmarks demonstrate the significant effectiveness of RAP, with LLaVA-v1.5-13B achieving a 43% improvement on $V^*$ Bench and 19% on HR-Bench.
中文: 本文提出检索增强感知框架,通过动态检索融合图像片段并保持空间上下文,无需训练即可显著提升多模态大语言模型的高分辨率图像理解能力,在基准测试中性能提升最高达43%。
English: This paper introduces Retrieval-Augmented Perception (RAP), a training-free framework that enhances high-resolution image understanding in multimodal large language models by dynamically retrieving and fusing image crops while preserving spatial context, achieving up to 43% performance improvements on benchmarks.

Authors:Wangtao Sun, Xiang Cheng, Xing Yu, Haotian Xu, Zhao Yang, Shizhu He, Jun Zhao, Kang Liu
Title: Probabilistic Uncertain Reward Model
Abstract:
Reinforcement learning from human feedback (RLHF) is a critical technique for training large language models. However, conventional reward models based on the Bradley-Terry model (BTRM) often suffer from overconfidence when faced with inconsistent labels or out-of-distribution samples, leading to reward hacking, where the policy model blindly optimizes for proxy rewards while degrading true performance. This paper proposes the Probabilistic Uncertain Reward Model (PURM), which generalizes the Bradley-Terry model to learn the reward distributions that emerged from the preference data. We theoretically derive the loss function of PURM and introduce a novel method that uses the overlap between distributions to quantify uncertainty. Empirical results show that PURM outperforms existing methods with more accurate reward and sound uncertainty estimations, and sustains effective learning for more optimization steps and obtain higher maximum win rate in RLHF. The data and code of this paper are released at https://anonymous.4open.science/r/Probabilistic-Uncertain-Reward-Model/
中文摘要:本文提出的概率不确定奖励模型(PURM)通过从偏好数据中学习奖励分布并利用分布重叠量化不确定性,有效解决了传统奖励模型的过度自信问题,在人类反馈强化学习中展现出更优的性能。
English Summary: This paper introduces the Probabilistic Uncertain Reward Model (PURM), which addresses overconfidence issues in conventional reward models by learning reward distributions from preference data and quantifying uncertainty through distribution overlap, demonstrating superior performance in reinforcement learning from human feedback.

Authors:Shuo Li, Jiajun Sun, Guodong Zheng, Xiaoran Fan, Yujiong Shen, Yi Lu, Zhiheng Xi, Yuming Yang, Wenming Tan, Tao Ji, Tao Gui, Qi Zhang, Xuanjing Huang
Title: Mitigating Object Hallucinations in MLLMs via Multi-Frequency Perturbations
Abstract:
Recently, multimodal large language models (MLLMs) have demonstrated remarkable performance in visual-language tasks. However, the authenticity of the responses generated by MLLMs is often compromised by object hallucinations. We identify that a key cause of these hallucinations is the model's over-susceptibility to specific image frequency features in detecting objects. In this paper, we introduce Multi-Frequency Perturbations (MFP), a simple, cost-effective, and pluggable method that leverages both low-frequency and high-frequency features of images to perturb visual feature representations and explicitly suppress redundant frequency-domain features during inference, thereby mitigating hallucinations. Experimental results demonstrate that our method significantly mitigates object hallucinations across various model architectures. Furthermore, as a training-time method, MFP can be combined with inference-time methods to achieve state-of-the-art performance on the CHAIR benchmark.
Chinese: 本文提出多频率扰动方法,通过干扰视觉特征并抑制冗余频域特征来减轻多模态大语言模型中的物体幻觉问题,在CHAIR基准测试中实现了最优性能。
English: This paper introduces Multi-Frequency Perturbations (MFP), a pluggable method that mitigates object hallucinations in multimodal large language models by perturbing visual features and suppressing redundant frequency-domain features, achieving state-of-the-art performance on the CHAIR benchmark.

Authors:Huatong Song, Jinhao Jiang, Yingqian Min, Jie Chen, Zhipeng Chen, Wayne Xin Zhao, Lei Fang, Ji-Rong Wen
Title: R1-Searcher: Incentivizing the Search Capability in LLMs via Reinforcement Learning
Abstract:
Existing Large Reasoning Models (LRMs) have shown the potential of reinforcement learning (RL) to enhance the complex reasoning capabilities of Large Language Models~(LLMs). While they achieve remarkable performance on challenging tasks such as mathematics and coding, they often rely on their internal knowledge to solve problems, which can be inadequate for time-sensitive or knowledge-intensive questions, leading to inaccuracies and hallucinations. To address this, we propose \textbf{R1-Searcher}, a novel two-stage outcome-based RL approach designed to enhance the search capabilities of LLMs. This method allows LLMs to autonomously invoke external search systems to access additional knowledge during the reasoning process. Our framework relies exclusively on RL, without requiring process rewards or distillation for a cold start. % effectively generalizing to out-of-domain datasets and supporting both Base and Instruct models. Our experiments demonstrate that our method significantly outperforms previous strong RAG methods, even when compared to the closed-source GPT-4o-mini.
中文: 现有大型推理模型常因依赖内部知识而难以应对时效性或知识密集型问题,但提出的R1-Searcher框架通过两阶段基于结果的强化学习方法,使大语言模型能自主调用外部系统增强搜索能力,显著超越了先前方法。
English: Existing large reasoning models often struggle with time-sensitive or knowledge-intensive questions due to reliance on internal knowledge, but the proposed R1-Searcher framework uses a two-stage outcome-based reinforcement learning approach to enhance LLMs' search capabilities by autonomously invoking external systems, significantly outperforming previous methods.

Authors:Jiachun Li, Pengfei Cao, Yubo Chen, Jiexin Xu, Huaijun Li, Xiaojian Jiang, Kang Liu, Jun Zhao
Title: Rewarding Curse: Analyze and Mitigate Reward Modeling Issues for LLM Reasoning
Abstract:
Chain-of-thought (CoT) prompting demonstrates varying performance under different reasoning tasks. Previous work attempts to evaluate it but falls short in providing an in-depth analysis of patterns that influence the CoT. In this paper, we study the CoT performance from the perspective of effectiveness and faithfulness. For the former, we identify key factors that influence CoT effectiveness on performance improvement, including problem difficulty, information gain, and information flow. For the latter, we interpret the unfaithful CoT issue by conducting a joint analysis of the information interaction among the question, CoT, and answer. The result demonstrates that, when the LLM predicts answers, it can recall correct information missing in the CoT from the question, leading to the problem. Finally, we propose a novel algorithm to mitigate this issue, in which we recall extra information from the question to enhance the CoT generation and evaluate CoTs based on their information gain. Extensive experiments demonstrate that our approach enhances both the faithfulness and effectiveness of CoT.
中文: 本文从有效性和忠实性角度分析思维链提示,识别关键性能影响因素并提出新算法,通过从问题中提取额外信息增强思维链生成,从而提升其可靠性和效果。
English: This paper analyzes chain-of-thought prompting's effectiveness and faithfulness, identifying key performance factors and proposing an algorithm that improves both aspects by recalling additional information from questions during CoT generation.

Authors:Wenxiang Chen, Wei He, Zhiheng Xi, Honglin Guo, Boyang Hong, Jiazheng Zhang, Rui Zheng, Nijun Li, Tao Gui, Yun Li, Qi Zhang, Xuanjing Huang
Title: Better Process Supervision with Bi-directional Rewarding Signals
Abstract:
Process supervision, i.e., evaluating each step, is critical for complex large language model (LLM) reasoning and test-time searching with increased inference compute. Existing approaches, represented by process reward models (PRMs), primarily focus on rewarding signals up to the current step, exhibiting a one-directional nature and lacking a mechanism to model the distance to the final target. To address this problem, we draw inspiration from the A* algorithm, which states that an effective supervisory signal should simultaneously consider the incurred cost and the estimated cost for reaching the target. Building on this key insight, we introduce BiRM, a novel process supervision model that not only evaluates the correctness of previous steps but also models the probability of future success. We conduct extensive experiments on mathematical reasoning tasks and demonstrate that BiRM provides more precise evaluations of LLM reasoning steps, achieving an improvement of 3.1% on Gaokao2023 over PRM under the Best-of-N sampling method. Besides, in search-based strategies, BiRM provides more comprehensive guidance and outperforms ORM by 5.0% and PRM by 3.8% respectively on MATH-500.
中文: BiRM模型改进了过程监督,通过评估历史推理步骤和未来成功概率,在数学推理任务中表现优于现有方法。
English: Process supervision is enhanced by the BiRM model, which evaluates both past reasoning steps and future success probability, outperforming existing methods in mathematical reasoning tasks.

Authors:Adnan Shahid, Adrian Kliks, Ahmed Al-Tahmeesschi, Ahmed Elbakary, Alexandros Nikou, Ali Maatouk, Ali Mokh, Amirreza Kazemi, Antonio De Domenico, Athanasios Karapantelakis, Bo Cheng, Bo Yang, Bohao Wang, Carlo Fischione, Chao Zhang, Chaouki Ben Issaid, Chau Yuen, Chenghui Peng, Chongwen Huang, Christina Chaccour, Christo Kurisummoottil Thomas, Dheeraj Sharma, Dimitris Kalogiros, Dusit Niyato, Eli De Poorter, Elissa Mhanna, Emilio Calvanese Strinati, Faouzi Bader, Fathi Abdeldayem, Fei Wang, Fenghao Zhu, Gianluca Fontanesi, Giovanni Geraci, Haibo Zhou, Hakimeh Purmehdi, Hamed Ahmadi, Hang Zou, Hongyang Du, Hoon Lee, Howard H. Yang, Iacopo Poli, Igor Carron, Ilias Chatzistefanidis, Inkyu Lee, Ioannis Pitsiorlas, Jaron Fontaine, Jiajun Wu, Jie Zeng, Jinan Li, Jinane Karam, Johny Gemayel, Juan Deng, Julien Frison, Kaibin Huang, Kehai Qiu, Keith Ball, Kezhi Wang, Kun Guo, Leandros Tassiulas, Lecorve Gwenole, Liexiang Yue, Lina Bariah, Louis Powell, Marcin Dryjanski, Maria Amparo Canaveras Galdon, Marios Kountouris, Maryam Hafeez, Maxime Elkael, Mehdi Bennis, Mehdi Boudjelli, Meiling Dai, Merouane Debbah, Michele Polese, Mohamad Assaad, Mohamed Benzaghta, Mohammad Al Refai, Moussab Djerrab, Mubeen Syed, Muhammad Amir, Na Yan, Najla Alkaabi, Nan Li, Nassim Sehad, Navid Nikaein, Omar Hashash, Pawel Sroka, Qianqian Yang, Qiyang Zhao, Rasoul Nikbakht Silab, Rex Ying, Roberto Morabito, Rongpeng Li, Ryad Madi, Salah Eddine El Ayoubi, Salvatore D'Oro, Samson Lasaulce, Serveh Shalmashi, Sige Liu, Sihem Cherrared, Swarna Bindu Chetty, Swastika Dutta, Syed A. R. Zaidi, Tianjiao Chen, Timothy Murphy, Tommaso Melodia, Tony Q. S. Quek, Vishnu Ram, Walid Saad, Wassim Hamidouche, Weilong Chen, Xiaoou Liu, Xiaoxue Yu, Xijun Wang, Xingyu Shang, Xinquan Wang, Xuelin Cao, Yang Su, Yanping Liang, Yansha Deng, Yifan Yang, Yingping Cui, Yu Sun, Yuxuan Chen, Yvan Pointurier, Zeinab Nehme, Zeinab Nezami, Zhaohui Yang, Zhaoyang Zhang, Zhe Liu, Zhenyu Yang, Zhu Han, Zhuang Zhou, Zihan Chen, Zirui Chen, Zitao Shuai
Title: Large-Scale AI in Telecom: Charting the Roadmap for Innovation, Scalability, and Enhanced Digital Experiences
Abstract:
This white paper discusses the role of large-scale AI in the telecommunications industry, with a specific focus on the potential of generative AI to revolutionize network functions and user experiences, especially in the context of 6G systems. It highlights the development and deployment of Large Telecom Models (LTMs), which are tailored AI models designed to address the complex challenges faced by modern telecom networks. The paper covers a wide range of topics, from the architecture and deployment strategies of LTMs to their applications in network management, resource allocation, and optimization. It also explores the regulatory, ethical, and standardization considerations for LTMs, offering insights into their future integration into telecom infrastructure. The goal is to provide a comprehensive roadmap for the adoption of LTMs to enhance scalability, performance, and user-centric innovation in telecom networks.
本白皮书探讨了生成式AI和大型电信模型如何通过增强网络功能和用户体验来变革电信行业,同时为未来整合提出了监管与伦理方面的考量。
This white paper explores how generative AI and Large Telecom Models can transform telecommunications by enhancing network functions and user experiences, while addressing regulatory and ethical considerations for future integration.

Authors:Zhihan Zhou, Feng Hong, Jiaan Luo, Jiangchao Yao, Dongsheng Li, Bo Han, Ya Zhang, Yanfeng Wang
Title: Learning to Instruct for Visual Instruction Tuning
Abstract:
We propose LIT, an advancement of visual instruction tuning (VIT). While VIT equips Multimodal LLMs (MLLMs) with promising multimodal capabilities, the current design choices for VIT often result in overfitting and shortcut learning, potentially degrading performance. This gap arises from an overemphasis on instruction-following abilities, while neglecting the proactive understanding of visual information. Inspired by this, LIT adopts a simple yet effective approach by incorporating the loss function into both the instruction and response sequences. It seamlessly expands the training data, and regularizes the MLLMs from overly relying on language priors. Based on this merit, LIT achieves a significant relative improvement of up to 9% on comprehensive multimodal benchmarks, requiring no additional training data and incurring negligible computational overhead. Surprisingly, LIT attains exceptional fundamental visual capabilities, yielding up to an 18% improvement in captioning performance, while simultaneously alleviating hallucination in MLLMs.
中文: LIT通过将损失函数融入指令和响应序列,有效提升了多模态大模型的性能,在基准测试中相对改进高达9%,并显著增强视觉能力、减少幻觉现象,无需额外数据或计算开销。
English: LIT enhances visual instruction tuning by integrating loss functions into both instruction and response sequences, significantly boosting multimodal performance by up to 9% on benchmarks and improving visual capabilities while reducing hallucinations without extra data or computational cost.

Authors:Qingyao Xu, Siheng Chen, Guang Chen, Yanfeng Wang, Ya Zhang
Title: ChatBEV: A Visual Language Model that Understands BEV Maps
Abstract:
Traffic scene understanding is essential for intelligent transportation systems and autonomous driving, ensuring safe and efficient vehicle operation. While recent advancements in VLMs have shown promise for holistic scene understanding, the application of VLMs to traffic scenarios, particularly using BEV maps, remains under explored. Existing methods often suffer from limited task design and narrow data amount, hindering comprehensive scene understanding. To address these challenges, we introduce ChatBEV-QA, a novel BEV VQA benchmark contains over 137k questions, designed to encompass a wide range of scene understanding tasks, including global scene understanding, vehicle-lane interactions, and vehicle-vehicle interactions. This benchmark is constructed using an novel data collection pipeline that generates scalable and informative VQA data for BEV maps. We further fine-tune a specialized vision-language model ChatBEV, enabling it to interpret diverse question prompts and extract relevant context-aware information from BEV maps. Additionally, we propose a language-driven traffic scene generation pipeline, where ChatBEV facilitates map understanding and text-aligned navigation guidance, significantly enhancing the generation of realistic and consistent traffic scenarios. The dataset, code and the fine-tuned model will be released.
中文: ChatBEV-QA是一个包含超过13.7万个问题的新型BEV视觉问答基准,旨在通过全面任务提升交通场景理解能力,而ChatBEV是一个经过微调的模型,能够解读BEV地图并支持基于语言的交通场景生成。
English: ChatBEV-QA is a new BEV VQA benchmark with over 137k questions designed to enhance traffic scene understanding through comprehensive tasks, and ChatBEV is a fine-tuned model that interprets BEV maps and supports language-driven traffic scenario generation.

Authors:Pengcheng Qiu, Chaoyi Wu, Shuyu Liu, Weike Zhao, Zhuoxia Chen, Hongfei Gu, Chuanjin Peng, Ya Zhang, Yanfeng Wang, Weidi Xie
Title: Quantifying the Reasoning Abilities of LLMs on Real-world Clinical Cases
Abstract:
Recent advancements in reasoning-enhanced large language models (LLMs), such as DeepSeek-R1 and OpenAI-o3, have demonstrated significant progress. However, their application in professional medical contexts remains underexplored, particularly in evaluating the quality of their reasoning processes alongside final outputs. Here, we introduce MedR-Bench, a benchmarking dataset of 1,453 structured patient cases, annotated with reasoning references derived from clinical case reports. Spanning 13 body systems and 10 specialties, it includes both common and rare diseases. To comprehensively evaluate LLM performance, we propose a framework encompassing three critical examination recommendation, diagnostic decision-making, and treatment planning, simulating the entire patient care journey. To assess reasoning quality, we present the Reasoning Evaluator, a novel automated system that objectively scores free-text reasoning responses based on efficiency, actuality, and completeness using dynamic cross-referencing and evidence checks. Using this benchmark, we evaluate five state-of-the-art reasoning LLMs, including DeepSeek-R1, OpenAI-o3-mini, and Gemini-2.0-Flash Thinking, etc. Our results show that current LLMs achieve over 85% accuracy in relatively simple diagnostic tasks when provided with sufficient examination results. However, performance declines in more complex tasks, such as examination recommendation and treatment planning. While reasoning outputs are generally reliable, with factuality scores exceeding 90%, critical reasoning steps are frequently missed. These findings underscore both the progress and limitations of clinical LLMs. Notably, open-source models like DeepSeek-R1 are narrowing the gap with proprietary systems, highlighting their potential to drive accessible and equitable advancements in healthcare.
中文:推理增强大语言模型在医学领域的应用尚待深入探索,为此我们推出了MedR-Bench基准数据集和评估框架,结果显示现有模型在简单诊断任务中准确率超过85%,但在复杂任务中表现下降,同时开源模型正逐步缩小与专有系统的差距。
English: Recent advancements in reasoning-enhanced LLMs show promise, yet their application in medical contexts remains underexplored, leading to the introduction of MedR-Bench, a comprehensive dataset and evaluation framework that reveals high accuracy in simple diagnostic tasks but performance declines in complex scenarios, with open-source models like DeepSeek-R1 closing the gap with proprietary systems.

Authors:Tengfei Zhang, Ziheng Zhao, Chaoyi Wu, Xiao Zhou, Ya Zhang, Yanfeng Wang, Weidi Xie
Title: RadIR: A Scalable Framework for Multi-Grained Medical Image Retrieval via Radiology Report Mining
Abstract:
Developing advanced medical imaging retrieval systems is challenging due to the varying definitions of `similar images' across different medical contexts. This challenge is compounded by the lack of large-scale, high-quality medical imaging retrieval datasets and benchmarks. In this paper, we propose a novel methodology that leverages dense radiology reports to define image-wise similarity ordering at multiple granularities in a scalable and fully automatic manner. Using this approach, we construct two comprehensive medical imaging retrieval datasets: MIMIC-IR for Chest X-rays and CTRATE-IR for CT scans, providing detailed image-image ranking annotations conditioned on diverse anatomical structures. Furthermore, we develop two retrieval systems, RadIR-CXR and model-ChestCT, which demonstrate superior performance in traditional image-image and image-report retrieval tasks. These systems also enable flexible, effective image retrieval conditioned on specific anatomical structures described in text, achieving state-of-the-art results on 77 out of 78 metrics.
中文摘要:本文提出一种利用密集放射学报告自动定义多粒度图像相似性的新方法,构建了两个全面的医学影像检索数据集,并开发出在多数指标上达到最先进水平的检索系统。
English Summary: This paper introduces a novel method using dense radiology reports to automatically define multi-granular image similarity, creating two comprehensive medical imaging retrieval datasets and developing retrieval systems that achieve state-of-the-art performance across most metrics.

Authors:YiQiu Guo, Yuchen Yang, Zhe Chen, Pingjie Wang, Yusheng Liao, Ya Zhang, Yanfeng Wang, Yu Wang
Title: DSVD: Dynamic Self-Verify Decoding for Faithful Generation in Large Language Models
Abstract:
The reliability of large language models remains a critical challenge, particularly due to their susceptibility to hallucinations and factual inaccuracies during text generation. Existing solutions either underutilize models' self-correction with preemptive strategies or use costly post-hoc verification. To further explore the potential of real-time self-verification and correction, we present Dynamic Self-Verify Decoding (DSVD), a novel decoding framework that enhances generation reliability through real-time hallucination detection and efficient error correction. DSVD integrates two key components: (1) parallel self-verification architecture for continuous quality assessment, (2) dynamic rollback mechanism for targeted error recovery. Extensive experiments across five benchmarks demonstrate DSVD's effectiveness, achieving significant improvement in truthfulness (Quesetion-Answering) and factual accuracy (FActScore). Results show the DSVD can be further incorporated with existing faithful decoding methods to achieve stronger performance. Our work establishes that real-time self-verification during generation offers a viable path toward more trustworthy language models without sacrificing practical deployability.
中文摘要:本研究提出动态自验证解码(DSVD)框架,通过实时幻觉检测和错误纠正机制提升大语言模型的生成可靠性,在多项基准测试中显著提高了真实性与事实准确性。
English Summary: The study introduces Dynamic Self-Verify Decoding (DSVD), a novel framework that enhances large language model reliability through real-time hallucination detection and error correction, demonstrating significant improvements in truthfulness and factual accuracy across benchmarks.

Authors:Kazuhiro Miyama, Kento Kawaharazuka, Kei Okada, Masayuki Inaba
Title: Development of a Five-Fingerd Biomimetic Soft Robotic Hand by 3D Printing the Skin and Skeleton as One Unit
Abstract:
Robot hands that imitate the shape of the human body have been actively studied, and various materials and mechanisms have been proposed to imitate the human body. Although the use of soft materials is advantageous in that it can imitate the characteristics of the human body's epidermis, it increases the number of parts and makes assembly difficult in order to perform complex movements. In this study, we propose a skin-skeleton integrated robot hand that has 15 degrees of freedom and consists of four parts. The developed robotic hand is mostly composed of a single flexible part produced by a 3D printer, and while it can be easily assembled, it can perform adduction, flexion, and opposition of the thumb, as well as flexion of four fingers.
中文: 本研究提出了一种具有15个自由度的皮肤骨骼一体化机械手,主要通过3D打印制成单一柔性部件,便于组装并能实现拇指内收、弯曲、对掌及四指弯曲的复杂动作。
English: This study introduces a 15-degree-of-freedom robotic hand with an integrated skin-skeleton design, primarily 3D-printed as a single flexible component for easy assembly and capable of complex thumb and finger movements.

Authors:Panpan Wang, Liqiang Niu, Fandong Meng, Jinan Xu, Yufeng Chen, Jie Zhou
Title: D2C: Unlocking the Potential of Continuous Autoregressive Image Generation with Discrete Tokens
Abstract:
In the domain of image generation, latent-based generative models occupy a dominant status; however, these models rely heavily on image tokenizer. To meet modeling requirements, autoregressive models possessing the characteristics of scalability and flexibility embrace a discrete-valued tokenizer, but face the challenge of poor image generation quality. In contrast, diffusion models take advantage of the continuous-valued tokenizer to achieve better generation quality but are subject to low efficiency and complexity. The existing hybrid models are mainly to compensate for information loss and simplify the diffusion learning process. The potential of merging discrete-valued and continuous-valued tokens in the field of image generation has not yet been explored. In this paper, we propose D2C, a novel two-stage method to enhance model generation capacity. In the first stage, the discrete-valued tokens representing coarse-grained image features are sampled by employing a small discrete-valued generator. Then in the second stage, the continuous-valued tokens representing fine-grained image features are learned conditioned on the discrete token sequence. In addition, we design two kinds of fusion modules for seamless interaction. On the ImageNet-256 benchmark, extensive experiment results validate that our model achieves superior performance compared with several continuous-valued and discrete-valued generative models on the class-conditional image generation tasks.
中文摘要:提出的D2C模型通过两阶段生成方法融合离散令牌的粗粒度特征与连续令牌的细粒度特征,在ImageNet-256数据集上实现了优于现有模型的图像生成质量。
English Summary: The proposed D2C model combines discrete tokens for coarse features and continuous tokens for fine details through a two-stage generation process, achieving superior image quality on ImageNet-256 compared to existing methods.

Authors:Zhibin Lan, Liqiang Niu, Fandong Meng, Jie Zhou, Jinsong Su
Title: LLaVE: Large Language and Vision Embedding Models with Hardness-Weighted Contrastive Learning
Abstract:
Universal multimodal embedding models play a critical role in tasks such as interleaved image-text retrieval, multimodal RAG, and multimodal clustering. However, our empirical results indicate that existing LMM-based embedding models trained with the standard InfoNCE loss exhibit a high degree of overlap in similarity distribution between positive and negative pairs, making it challenging to distinguish hard negative pairs effectively. To deal with this issue, we propose a simple yet effective framework that dynamically improves the embedding model's representation learning for negative pairs based on their discriminative difficulty. Within this framework, we train a series of models, named LLaVE, and evaluate them on the MMEB benchmark, which covers 4 meta-tasks and 36 datasets. Experimental results show that LLaVE establishes stronger baselines that achieve state-of-the-art (SOTA) performance while demonstrating strong scalability and efficiency. Specifically, LLaVE-2B surpasses the previous SOTA 7B models, while LLaVE-7B achieves a further performance improvement of 6.2 points. Although LLaVE is trained on image-text data, it can generalize to text-video retrieval tasks in a zero-shot manner and achieve strong performance, demonstrating its remarkable potential for transfer to other embedding tasks.
通用多模态嵌入模型在标准训练下难以区分困难负样本对,而提出的LLaVE框架通过动态增强表征学习,在多项任务中实现了最先进性能,并展现出对视频检索任务的强大泛化能力。
Universal multimodal embedding models face challenges in distinguishing hard negative pairs with standard training, but the proposed LLaVE framework dynamically enhances representation learning, achieving state-of-the-art performance across multiple tasks and demonstrating strong generalization to video retrieval.

Authors:Jiaxin Shen, Jinan Xu, Huiqi Hu, Luyi Lin, Fei Zheng, Guoyang Ma, Fandong Meng, Jie Zhou, Wenjuan Han
Title: A Law Reasoning Benchmark for LLM with Tree-Organized Structures including Factum Probandum, Evidence and Experiences
Abstract:
While progress has been made in legal applications, law reasoning, crucial for fair adjudication, remains unexplored. We propose a transparent law reasoning schema enriched with hierarchical factum probandum, evidence, and implicit experience, enabling public scrutiny and preventing bias. Inspired by this schema, we introduce the challenging task, which takes a textual case description and outputs a hierarchical structure justifying the final decision. We also create the first crowd-sourced dataset for this task, enabling comprehensive evaluation. Simultaneously, we propose an agent framework that employs a comprehensive suite of legal analysis tools to address the challenge task. This benchmark paves the way for transparent and accountable AI-assisted law reasoning in the ``Intelligent Court''.
中文摘要:本文提出了一种具有层次结构的透明法律推理框架,以确保公正和可问责性,并针对“智慧法院”中的AI辅助法律分析,设计了一项挑战性任务、创建了数据集并开发了智能体框架。
English Summary: This paper introduces a transparent law reasoning framework with a hierarchical structure to ensure fairness and accountability, proposing a challenging task, creating a dataset, and developing an agent framework for AI-assisted legal analysis in the "Intelligent Court."

Authors:Kepeng Wu, Zecheng Li, Hezhen Hu, Wengang Zhou, Houqiang Li
Title: Cross-Modal Consistency Learning for Sign Language Recognition
Abstract:
Pre-training has been proven to be effective in boosting the performance of Isolated Sign Language Recognition (ISLR). Existing pre-training methods solely focus on the compact pose data, which eliminates background perturbation but inevitably suffers from insufficient semantic cues compared to raw RGB videos. Nevertheless, learning representation directly from RGB videos remains challenging due to the presence of sign-independent visual features. To address this dilemma, we propose a Cross-modal Consistency Learning framework (CCL-SLR), which leverages the cross-modal consistency from both RGB and pose modalities based on self-supervised pre-training. First, CCL-SLR employs contrastive learning for instance discrimination within and across modalities. Through the single-modal and cross-modal contrastive learning, CCL-SLR gradually aligns the feature spaces of RGB and pose modalities, thereby extracting consistent sign representations. Second, we further introduce Motion-Preserving Masking (MPM) and Semantic Positive Mining (SPM) techniques to improve cross-modal consistency from the perspective of data augmentation and sample similarity, respectively. Extensive experiments on four ISLR benchmarks show that CCL-SLR achieves impressive performance, demonstrating its effectiveness. The code will be released to the public.
中文: 提出的跨模态一致性学习框架(CCL-SLR)通过对比学习和创新技术对齐RGB与姿态模态,有效提升孤立手语识别性能,在多个基准测试中表现优异。
English: The proposed Cross-modal Consistency Learning framework (CCL-SLR) enhances isolated sign language recognition by aligning RGB and pose modalities through contrastive learning and novel techniques, achieving superior performance on benchmarks.

Authors:Zhendong Wang, Jianmin Bao, Shuyang Gu, Dong Chen, Wengang Zhou, Houqiang Li
Title: DesignDiffusion: High-Quality Text-to-Design Image Generation with Diffusion Models
Abstract:
In this paper, we present DesignDiffusion, a simple yet effective framework for the novel task of synthesizing design images from textual descriptions. A primary challenge lies in generating accurate and style-consistent textual and visual content. Existing works in a related task of visual text generation often focus on generating text within given specific regions, which limits the creativity of generation models, resulting in style or color inconsistencies between textual and visual elements if applied to design image generation. To address this issue, we propose an end-to-end, one-stage diffusion-based framework that avoids intricate components like position and layout modeling. Specifically, the proposed framework directly synthesizes textual and visual design elements from user prompts. It utilizes a distinctive character embedding derived from the visual text to enhance the input prompt, along with a character localization loss for enhanced supervision during text generation. Furthermore, we employ a self-play Direct Preference Optimization fine-tuning strategy to improve the quality and accuracy of the synthesized visual text. Extensive experiments demonstrate that DesignDiffusion achieves state-of-the-art performance in design image generation.
中文: 本文提出DesignDiffusion框架,通过端到端的扩散模型直接从文本描述生成风格统一的设计图像,采用字符嵌入优化和自博弈微调策略,在设计中实现了最先进的文本-视觉合成效果。
English: This paper introduces DesignDiffusion, an end-to-end diffusion framework that synthesizes design images with consistent text and visuals from prompts, achieving top performance through enhanced character embedding and optimization strategies.

Authors:Zhicheng Lee, Shulin Cao, Jinxin Liu, Jiajie Zhang, Weichuan Liu, Xiaoyin Che, Lei Hou, Juanzi Li
Title: ReaRAG: Knowledge-guided Reasoning Enhances Factuality of Large Reasoning Models with Iterative Retrieval Augmented Generation
Abstract:
Large Reasoning Models (LRMs) exhibit remarkable reasoning abilities but rely primarily on parametric knowledge, limiting factual accuracy. While recent works equip reinforcement learning (RL)-based LRMs with retrieval capabilities, they suffer from overthinking and lack robustness in reasoning, reducing their effectiveness in question answering (QA) tasks. To address this, we propose ReaRAG, a factuality-enhanced reasoning model that explores diverse queries without excessive iterations. Our solution includes a novel data construction framework with an upper bound on the reasoning chain length. Specifically, we first leverage an LRM to generate deliberate thinking, then select an action from a predefined action space (Search and Finish). For Search action, a query is executed against the RAG engine, where the result is returned as observation to guide reasoning steps later. This process iterates until a Finish action is chosen. Benefiting from ReaRAG's strong reasoning capabilities, our approach outperforms existing baselines on multi-hop QA. Further analysis highlights its strong reflective ability to recognize errors and refine its reasoning trajectory. Our study enhances LRMs' factuality while effectively integrating robust reasoning for Retrieval-Augmented Generation (RAG).
中文: ReaRAG通过引入检索增强生成和限制推理链长度,提升了大推理模型的事实准确性和鲁棒性,在多跳问答任务中优于现有方法且无需过度迭代。
English: ReaRAG enhances Large Reasoning Models by integrating retrieval-augmented generation with a controlled reasoning chain, improving factuality and robustness in multi-hop question answering without excessive iterations.

Authors:Shengkun Ma, Hao Peng, Lei Hou, Juanzi Li
Title: MRCEval: A Comprehensive, Challenging and Accessible Machine Reading Comprehension Benchmark
Abstract:
Machine Reading Comprehension (MRC) is an essential task in evaluating natural language understanding. Existing MRC datasets primarily assess specific aspects of reading comprehension (RC), lacking a comprehensive MRC benchmark. To fill this gap, we first introduce a novel taxonomy that categorizes the key capabilities required for RC. Based on this taxonomy, we construct MRCEval, an MRC benchmark that leverages advanced Large Language Models (LLMs) as both sample generators and selection judges. MRCEval is a comprehensive, challenging and accessible benchmark designed to assess the RC capabilities of LLMs thoroughly, covering 13 distinct RC skills with a total of 2.1K high-quality multi-choice questions. We perform an extensive evaluation of 28 widely used open-source and proprietary models, highlighting that MRC continues to present significant challenges even in the era of LLMs.
中文: 该摘要介绍了MRCEval,这是一个基于新型分类法和大语言模型构建的全面机器阅读理解基准,通过2,100道多选题评估13项不同技能,表明尽管模型有所进步,阅读理解仍是当前面临的重大挑战。
English: This abstract introduces MRCEval, a comprehensive machine reading comprehension benchmark developed using a novel taxonomy and large language models to assess 13 distinct skills through 2,100 multi-choice questions, revealing that reading comprehension remains challenging for current models despite their advancements.

Authors:Chongjun Tu, Peng Ye, Dongzhan Zhou, Lei Bai, Gang Yu, Tao Chen, Wanli Ouyang
Title: Attention Reallocation: Towards Zero-cost and Controllable Hallucination Mitigation of MLLMs
Abstract:
Multi-Modal Large Language Models (MLLMs) stand out in various tasks but still struggle with hallucinations. While recent training-free mitigation methods mostly introduce additional inference overhead via retrospection strategy and contrastive decoding, we propose attention reallocation (AttnReal) to mitigate hallucinations with nearly zero extra cost. Our approach is motivated by the key observations that, MLLM's unreasonable attention distribution causes features to be dominated by historical output tokens, which further contributes to hallucinated responses because of the distribution gap between different token types. Based on the observations, AttnReal recycles excessive attention from output tokens and reallocates it to visual tokens, which reduces MLLM's reliance on language priors and ensures the decoding process depends more on the visual inputs. More interestingly, we find that, by controlling the intensity of AttnReal, we can achieve a wide-range trade-off between the response faithfulness and overall performance. Comprehensive results from different benchmarks validate the effectiveness of AttnReal across six open-source MLLMs and three decoding strategies.
中文: AttnReal是一种无需训练的新方法,通过将过度分配给输出标记的注意力重新分配给视觉标记,有效减少多模态大语言模型的幻觉现象,在保持性能的同时以近乎零额外成本提升回答的忠实度。
English: AttnReal is a novel training-free method that mitigates hallucinations in Multi-Modal Large Language Models by reallocating excessive attention from output tokens to visual tokens, achieving a trade-off between faithfulness and performance with minimal inference overhead.

Authors:Chenyu Huang, Peng Ye, Xiaohui Wang, Shenghe Zheng, Biqing Qi, Lei Bai, Wanli Ouyang, Tao Chen
Title: Seeing Delta Parameters as JPEG Images: Data-Free Delta Compression with Discrete Cosine Transform
Abstract:
With transformer-based models and the pretrain-finetune paradigm becoming mainstream, the high storage and deployment costs of individual finetuned models on multiple tasks pose critical challenges. Delta compression attempts to lower the costs by reducing the redundancy of delta parameters (i.e., the difference between the finetuned and pre-trained model weights). However, existing methods usually face problems including data accessibility and training requirements. To tackle this issue, we introduce Delta-DCT, the first data-free delta compression method inspired by classic JPEG image compression, leveraging the Discrete Cosine Transform (DCT). We first (a) group delta parameters within a layer into patches. Then we (b) assess the importance of each patch and allocate them with different quantization bit-widths. Afterwards, we (c) convert these patches to the DCT domain and conduct quantization to each patch based on the allocated bit-width. The proposed Delta-DCT does not require any training or data calibration, while achieving performance comparable to or even surpassing original finetuned models under 1-bit equivalent delta compression ratios on different kinds of models including: (1) recently-released LLMs of different sizes from 7B to 13B, (2) relatively smaller language models including RoBERTa and T5 models, (3) variants of vision transformer models, and (4) multi-modal BEiT-3 models.
Chinese: Delta-DCT提出了一种基于离散余弦变换的无数据差值压缩方法,无需训练或数据即可有效降低微调模型的存储成本,在1比特压缩比下对多种模型类型均保持优异性能。
English: Delta-DCT introduces a data-free delta compression method using Discrete Cosine Transform to efficiently reduce storage costs of finetuned models without requiring training or data, achieving high performance across various model types under 1-bit compression ratios.

Authors:Sijie Zhao, Feng Liu, Xueliang Zhang, Hao Chen, Tao Han, Junchao Gong, Ran Tao, Pengfeng Xiao, Lei Bai, Wanli Ouyang
Title: Transforming Weather Data from Pixel to Latent Space
Abstract:
The increasing impact of climate change and extreme weather events has spurred growing interest in deep learning for weather research. However, existing studies often rely on weather data in pixel space, which presents several challenges such as smooth outputs in model outputs, limited applicability to a single pressure-variable subset (PVS), and high data storage and computational costs. To address these challenges, we propose a novel Weather Latent Autoencoder (WLA) that transforms weather data from pixel space to latent space, enabling efficient weather task modeling. By decoupling weather reconstruction from downstream tasks, WLA improves the accuracy and sharpness of weather task model results. The incorporated Pressure-Variable Unified Module transforms multiple PVS into a unified representation, enhancing the adaptability of the model in multiple weather scenarios. Furthermore, weather tasks can be performed in a low-storage latent space of WLA rather than a high-storage pixel space, thus significantly reducing data storage and computational costs. Through extensive experimentation, we demonstrate its superior compression and reconstruction performance, enabling the creation of the ERA5-latent dataset with unified representations of multiple PVS from ERA5 data. The compressed full PVS in the ERA5-latent dataset reduces the original 244.34 TB of data to 0.43 TB. The downstream task further demonstrates that task models can apply to multiple PVS with low data costs in latent space and achieve superior performance compared to models in pixel space. Code, ERA5-latent data, and pre-trained models are available at https://anonymous.4open.science/r/Weather-Latent-Autoencoder-8467.
中文: 本文提出了一种天气潜在自编码器(WLA),将天气数据从像素空间转换到潜在空间,不仅提高了模型精度和清晰度,还通过统一表示多气压变量子集显著降低了数据存储和计算成本。
English: This paper introduces a Weather Latent Autoencoder (WLA) that converts weather data from pixel to latent space, improving model accuracy and sharpness while drastically cutting storage and computational costs by enabling efficient multi-scenario weather task modeling.

Authors:Yiqun Zhang, Peng Ye, Xiaocui Yang, Shi Feng, Shufei Zhang, Lei Bai, Wanli Ouyang, Shuyue Hu
Title: Nature-Inspired Population-Based Evolution of Large Language Models
Abstract:
Evolution, the engine behind the survival and growth of life on Earth, operates through the population-based process of reproduction. Inspired by this principle, this paper formally defines a newly emerging problem -- the population-based evolution of large language models (LLMs) -- and introduces a novel framework. Starting with a population of parent LLMs, our framework enables the population to evolve through four key operations: (i) crossover, merging the weights of different parents to create offspring LLMs, (ii) mutation, introducing small, random changes to model weights to foster diversity, (iii) selection, prioritizing high-performing models, and (iv) succession, transferring the learned experience from parent to offspring LLMs. With only 200 samples per new task, the LLM population evolves rapidly to adapt to the task at hand, without any gradients. Experiments on 12 datasets show that our framework consistently outperforms existing multi-LLM merging and adaptation methods, achieving accuracy gains of up to 54.8% over the best LLM in the initial population. Moreover, our framework allows for the evolution of LLMs across multiple new tasks simultaneously, scaling effectively with populations of up to 40 LLMs, and even zero-shot generalization to unseen held-out tasks. We have open-sourced the code on GitHub and released the weights of 10 parent LLMs, fine-tuned from gemma-2-2b-it, on HuggingFace$, enabling reproduction of our proposed framework using just a single 4090 GPU with 24GB memory, without any performance degradation.
中文: 本文提出了一种基于群体进化的大语言模型框架,通过交叉、突变、选择和继承操作,仅需少量样本即可让LLM群体无梯度地适应新任务,在多个基准测试中实现了显著性能提升。
English: This paper introduces a population-based evolution framework for large language models that uses crossover, mutation, selection, and succession operations to adapt LLMs to new tasks with minimal data and no gradients, achieving significant performance improvements across multiple benchmarks.

Authors:Jizhao Zhu, Akang Shi, Zixuan Li, Long Bai, Xiaolong Jin, Jiafeng Guo, Xueqi Cheng
Title: Towards Robust Universal Information Extraction: Benchmark, Evaluation, and Solution
Abstract:
In this paper, we aim to enhance the robustness of Universal Information Extraction (UIE) by introducing a new benchmark dataset, a comprehensive evaluation, and a feasible solution. Existing robust benchmark datasets have two key limitations: 1) They generate only a limited range of perturbations for a single Information Extraction (IE) task, which fails to evaluate the robustness of UIE models effectively; 2) They rely on small models or handcrafted rules to generate perturbations, often resulting in unnatural adversarial examples. Considering the powerful generation capabilities of Large Language Models (LLMs), we introduce a new benchmark dataset for Robust UIE, called RUIE-Bench, which utilizes LLMs to generate more diverse and realistic perturbations across different IE tasks. Based on this dataset, we comprehensively evaluate existing UIE models and reveal that both LLM-based models and other models suffer from significant performance drops. To improve robustness and reduce training costs, we propose a data-augmentation solution that dynamically selects hard samples for iterative training based on the model's inference loss. Experimental results show that training with only \textbf{15\%} of the data leads to an average \textbf{7.5\%} relative performance improvement across three IE tasks.
中文: 本文提出了RUIE-Bench新基准,利用大语言模型生成多样化扰动以增强通用信息抽取的鲁棒性,并通过仅使用15%训练数据的数据增强方法实现了7.5%的性能提升。
English: This paper introduces RUIE-Bench, a new benchmark using LLMs to create diverse perturbations for robust Universal Information Extraction, and proposes a data-augmentation method that improves performance by 7.5% using only 15% of training data.

Authors:Wenxuan Liu, Zixuan Li, Long Bai, Yuxin Zuo, Daozhu Xu, Xiaolong Jin, Jiafeng Guo, Xueqi Cheng
Title: Towards Event Extraction with Massive Types: LLM-based Collaborative Annotation and Partitioning Extraction
Abstract:
Developing a general-purpose extraction system that can extract events with massive types is a long-standing target in Event Extraction (EE). In doing so, the challenge comes from two aspects: 1) The absence of an efficient and effective annotation method. 2) The absence of a powerful extraction method can handle massive types. For the first challenge, we propose a collaborative annotation method based on Large Language Models (LLMs). Through collaboration among multiple LLMs, it first refines annotations of trigger words from distant supervision and then carries out argument annotation. Next, a voting phase consolidates the annotation preferences across different LLMs. Finally, we create the EEMT dataset, the largest EE dataset to date, featuring over 200,000 samples, 3,465 event types, and 6,297 role types. For the second challenge, we propose an LLM-based Partitioning EE method called LLM-PEE. To overcome the limited context length of LLMs, LLM-PEE first recalls candidate event types and then splits them into multiple partitions for LLMs to extract events. The results in the supervised setting show that LLM-PEE outperforms the state-of-the-art methods by 5.4 in event detection and 6.1 in argument extraction. In the zero-shot setting, LLM-PEE achieves up to 12.9 improvement compared to mainstream LLMs, demonstrating its strong generalization capabilities.
中文: 本研究提出了一种基于大语言模型的协作标注方法,构建了规模最大的EEMT事件抽取数据集,并开发了LLM-PEE分区抽取技术,在监督学习和零样本场景下均显著优于现有最优方法。
English: This research introduces a collaborative LLM-based annotation method to create the largest event extraction dataset, EEMT, and proposes LLM-PEE, a partitioning extraction technique that significantly outperforms existing methods in both supervised and zero-shot settings.

Authors:Da Li, Keping Bi, Jiafeng Guo, Xueqi Cheng
Title: Tailoring Table Retrieval from a Field-aware Hybrid Matching Perspective
Abstract:
Table retrieval, essential for accessing information through tabular data, is less explored compared to text retrieval. The row/column structure and distinct fields of tables (including titles, headers, and cells) present unique challenges. For example, different table fields have varying matching preferences: cells may favor finer-grained (word/phrase level) matching over broader (sentence/passage level) matching due to their fragmented and detailed nature, unlike titles. This necessitates a table-specific retriever to accommodate the various matching needs of each table field. Therefore, we introduce a Table-tailored HYbrid Matching rEtriever (THYME), which approaches table retrieval from a field-aware hybrid matching perspective. Empirical results on two table retrieval benchmarks, NQ-TABLES and OTT-QA, show that THYME significantly outperforms state-of-the-art baselines. Comprehensive analyses confirm the differing matching preferences across table fields and validate the design of THYME.
中文: 由于表格结构独特且各字段匹配需求不同,表格检索需要专门方法,因此提出的THYME检索器通过字段感知的混合匹配策略,在基准测试中显著优于现有方法。
English: Table retrieval requires specialized approaches due to the unique structure and varying matching preferences of table fields, leading to the development of THYME, a field-aware hybrid matching retriever that significantly outperforms existing methods on benchmarks.

Authors:Jianghao Lin, Peng Du, Jiaqi Liu, Weite Li, Yong Yu, Weinan Zhang, Yang Cao
Title: Sell It Before You Make It: Revolutionizing E-Commerce with Personalized AI-Generated Items
Abstract:
E-commerce has revolutionized retail, yet its traditional workflows remain inefficient, with significant time and resource costs tied to product design and manufacturing inventory. This paper introduces a novel system deployed at Alibaba that leverages AI-generated items (AIGI) to address these challenges with personalized text-to-image generation for e-commercial product design. AIGI enables an innovative business mode called "sell it before you make it", where merchants can design fashion items and generate photorealistic images with digital models based on textual descriptions. Only when the items have received a certain number of orders, do the merchants start to produce them, which largely reduces reliance on physical prototypes and thus accelerates time to market. For such a promising application, we identify the underlying key scientific challenge, i.e., capturing the users' group-level personalized preferences towards multiple generated candidate images. To this end, we propose a Personalized Group-Level Preference Alignment Framework for Diffusion Models (i.e., PerFusion). We first design PerFusion Reward Model for user preference estimation with a feature-crossing-based personalized plug-in. Then we develop PerFusion with a personalized adaptive network to model diverse preferences across users, and meanwhile derive the group-level preference optimization objective to capture the comparative behaviors among multiple candidates. Both offline and online experiments demonstrate the effectiveness of our proposed algorithm. The AI-generated items have achieved over 13% relative improvements for both click-through rate and conversion rate compared to their human-designed counterparts, validating the revolutionary potential of AI-generated items for e-commercial platforms.
Chinese: 本文介绍了阿里巴巴部署的新型AI生成商品系统,通过个性化文生图技术革新电商设计流程,实现"先售後产"模式以减少实体样品依赖并加速上市,其提出的PerFusion框架使点击率和转化率相对提升超13%。
English: This paper presents a novel AI-generated items (AIGI) system at Alibaba that enables personalized text-to-image generation for e-commerce, introducing a "sell it before you make it" model to reduce reliance on physical prototypes and accelerate time to market, with the proposed PerFusion framework achieving over 13% improvement in click-through and conversion rates.

Authors:Junjie Chen, Haitao Li, Zhumin Chu, Yiqun Liu, Qingyao Ai
Title: Overview of the NTCIR-18 Automatic Evaluation of LLMs (AEOLLM) Task
Abstract:
In this paper, we provide an overview of the NTCIR-18 Automatic Evaluation of LLMs (AEOLLM) task. As large language models (LLMs) grow popular in both academia and industry, how to effectively evaluate the capacity of LLMs becomes an increasingly critical but still challenging issue. Existing methods can be divided into two types: manual evaluation, which is expensive, and automatic evaluation, which faces many limitations including task format (the majority belong to multiple-choice questions) and evaluation criteria (occupied by reference-based metrics). To advance the innovation of automatic evaluation, we propose the AEOLLM task which focuses on generative tasks and encourages reference-free methods. Besides, we set up diverse subtasks such as dialogue generation, text expansion, summary generation and non-factoid question answering to comprehensively test different methods. This year, we received 48 runs from 4 teams in total. This paper will describe the background of the task, the data set, the evaluation measures and the evaluation results, respectively.
中文: 本文介绍了NTCIR-18 AEOLLM任务,该任务通过生成式任务和无参考方法推动大语言模型的自动评估,包含多个子任务,并收到了来自4个团队的48份提交。
English: This paper introduces the NTCIR-18 AEOLLM task, which promotes automatic evaluation of large language models through generative tasks and reference-free methods, featuring diverse subtasks and receiving 48 submissions from 4 teams.

Authors:Xu Liu, Taha Aksu, Juncheng Liu, Qingsong Wen, Yuxuan Liang, Caiming Xiong, Silvio Savarese, Doyen Sahoo, Junnan Li, Chenghao Liu
Title: Empowering Time Series Analysis with Synthetic Data: A Survey and Outlook in the Era of Foundation Models
Abstract:
Time series analysis is crucial for understanding dynamics of complex systems. Recent advances in foundation models have led to task-agnostic Time Series Foundation Models (TSFMs) and Large Language Model-based Time Series Models (TSLLMs), enabling generalized learning and integrating contextual information. However, their success depends on large, diverse, and high-quality datasets, which are challenging to build due to regulatory, diversity, quality, and quantity constraints. Synthetic data emerge as a viable solution, addressing these challenges by offering scalable, unbiased, and high-quality alternatives. This survey provides a comprehensive review of synthetic data for TSFMs and TSLLMs, analyzing data generation strategies, their role in model pretraining, fine-tuning, and evaluation, and identifying future research directions.
中文摘要:时间序列基础模型和大语言模型时间序列模型的发展需要大规模高质量数据集,而合成数据作为可扩展的解决方案应运而生,本综述系统分析了其生成策略及其在模型预训练、微调与评估中的作用。
English Summary: Recent advances in Time Series Foundation Models and Large Language Model-based Time Series Models require large, high-quality datasets, with synthetic data emerging as a scalable solution, which this survey comprehensively reviews regarding generation strategies and their roles in model development.

Authors:Xiangyu Miao, Jun Sun, Hang Lai, Xinpeng Di, Jiahang Cao, Yong Yu, Weinan Zhang
Title: PALo: Learning Posture-Aware Locomotion for Quadruped Robots
Abstract:
With the rapid development of embodied intelligence, locomotion control of quadruped robots on complex terrains has become a research hotspot. Unlike traditional locomotion control approaches focusing solely on velocity tracking, we pursue to balance the agility and robustness of quadruped robots on diverse and complex terrains. To this end, we propose an end-to-end deep reinforcement learning framework for posture-aware locomotion named PALo, which manages to handle simultaneous linear and angular velocity tracking and real-time adjustments of body height, pitch, and roll angles. In PALo, the locomotion control problem is formulated as a partially observable Markov decision process, and an asymmetric actor-critic architecture is adopted to overcome the sim-to-real challenge. Further, by incorporating customized training curricula, PALo achieves agile posture-aware locomotion control in simulated environments and successfully transfers to real-world settings without fine-tuning, allowing real-time control of the quadruped robot's locomotion and body posture across challenging terrains. Through in-depth experimental analysis, we identify the key components of PALo that contribute to its performance, further validating the effectiveness of the proposed method. The results of this study provide new possibilities for the low-level locomotion control of quadruped robots in higher dimensional command spaces and lay the foundation for future research on upper-level modules for embodied intelligence.
中文摘要:本研究提出PALo框架,通过端到端深度强化学习实现四足机器人在复杂地形上的敏捷稳健运动控制,能同时追踪速度并实时调整身体姿态,无需微调即可从仿真环境迁移至现实场景,为四足机器人高维指令空间的基础运动控制开辟了新途径。
English Summary: This study introduces PALo, an end-to-end deep reinforcement learning framework that enables agile and robust posture-aware locomotion control for quadruped robots across complex terrains by simultaneously tracking velocities and adjusting body posture, successfully bridging simulation-to-reality gaps without fine-tuning.

Authors:Jia Chen, Qian Dong, Haitao Li, Xiaohui He, Yan Gao, Shaosheng Cao, Yi Wu, Ping Yang, Chen Xu, Yao Hu, Qingyao Ai, Yiqun Liu
Title: Qilin: A Multimodal Information Retrieval Dataset with APP-level User Sessions
Abstract:
User-generated content (UGC) communities, especially those featuring multimodal content, improve user experiences by integrating visual and textual information into results (or items). The challenge of improving user experiences in complex systems with search and recommendation (S\&R) services has drawn significant attention from both academia and industry these years. However, the lack of high-quality datasets has limited the research progress on multimodal S\&R. To address the growing need for developing better S\&R services, we present a novel multimodal information retrieval dataset in this paper, namely Qilin. The dataset is collected from Xiaohongshu, a popular social platform with over 300 million monthly active users and an average search penetration rate of over 70\%. In contrast to existing datasets, \textsf{Qilin} offers a comprehensive collection of user sessions with heterogeneous results like image-text notes, video notes, commercial notes, and direct answers, facilitating the development of advanced multimodal neural retrieval models across diverse task settings. To better model user satisfaction and support the analysis of heterogeneous user behaviors, we also collect extensive APP-level contextual signals and genuine user feedback. Notably, Qilin contains user-favored answers and their referred results for search requests triggering the Deep Query Answering (DQA) module. This allows not only the training \& evaluation of a Retrieval-augmented Generation (RAG) pipeline, but also the exploration of how such a module would affect users' search behavior. Through comprehensive analysis and experiments, we provide interesting findings and insights for further improving S\&R systems. We hope that \textsf{Qilin} will significantly contribute to the advancement of multimodal content platforms with S\&R services in the future.
中文: 本文提出名为"Qilin"的新型多模态数据集,该数据集源自小红书平台,通过提供多样化用户会话和上下文信号来解决多模态搜索推荐研究中高质量数据匮乏的问题,从而推进用户体验建模的发展。
English: This paper introduces Qilin, a novel multimodal dataset from Xiaohongshu that addresses the scarcity of high-quality data for multimodal search and recommendation research by providing diverse user sessions and contextual signals to enhance user experience modeling.

Authors:Haoqi Huang, Ping Wang, Jianhua Pei, Jiacheng Wang, Shahen Alexanian, Dusit Niyato
Title: Deep Learning Advancements in Anomaly Detection: A Comprehensive Survey
Abstract:
The rapid expansion of data from diverse sources has made anomaly detection (AD) increasingly essential for identifying unexpected observations that may signal system failures, security breaches, or fraud. As datasets become more complex and high-dimensional, traditional detection methods struggle to effectively capture intricate patterns. Advances in deep learning have made AD methods more powerful and adaptable, improving their ability to handle high-dimensional and unstructured data. This survey provides a comprehensive review of over 180 recent studies, focusing on deep learning-based AD techniques. We categorize and analyze these methods into reconstruction-based and prediction-based approaches, highlighting their effectiveness in modeling complex data distributions. Additionally, we explore the integration of traditional and deep learning methods, highlighting how hybrid approaches combine the interpretability of traditional techniques with the flexibility of deep learning to enhance detection accuracy and model transparency. Finally, we identify open issues and propose future research directions to advance the field of AD. This review bridges gaps in existing literature and serves as a valuable resource for researchers and practitioners seeking to enhance AD techniques using deep learning.
中文: 本综述系统评述了180余项基于深度学习的异常检测研究,将其分为重构型和预测型方法,并强调融合传统方法可解释性与深度学习灵活性的混合策略,以推动该领域发展。
English: This survey comprehensively reviews over 180 deep learning-based anomaly detection studies, categorizing them into reconstruction and prediction approaches while highlighting hybrid methods that combine traditional interpretability with deep learning flexibility to advance the field.

Authors:Jiahui Li, Geng Sun, Qingqing Wu, Shuang Liang, Jiacheng Wang, Dusit Niyato, Dong In Kim
Title: Aerial Secure Collaborative Communications under Eavesdropper Collusion in Low-altitude Economy: A Generative Swarm Intelligent Approach
Abstract:
In this work, we aim to introduce distributed collaborative beamforming (DCB) into AAV swarms and handle the eavesdropper collusion by controlling the corresponding signal distributions. Specifically, we consider a two-way DCB-enabled aerial communication between two AAV swarms and construct these swarms as two AAV virtual antenna arrays. Then, we minimize the two-way known secrecy capacity and maximum sidelobe level to avoid information leakage from the known and unknown eavesdroppers, respectively. Simultaneously, we also minimize the energy consumption of AAVs when constructing virtual antenna arrays. Due to the conflicting relationships between secure performance and energy efficiency, we consider these objectives by formulating a multi-objective optimization problem, which is NP-hard and with a large number of decision variables. Accordingly, we design a novel generative swarm intelligence (GenSI) framework to solve the problem with less overhead, which contains a conditional variational autoencoder (CVAE)-based generative method and a proposed powerful swarm intelligence algorithm. In this framework, CVAE can collect expert solutions obtained by the swarm intelligence algorithm in other environment states to explore characteristics and patterns, thereby directly generating high-quality initial solutions in new environment factors for the swarm intelligence algorithm to search solution space efficiently. Simulation results show that the proposed swarm intelligence algorithm outperforms other state-of-the-art baseline algorithms, and the GenSI can achieve similar optimization results by using far fewer iterations than the ordinary swarm intelligence algorithm. Experimental tests demonstrate that introducing the CVAE mechanism achieves a 58.7% reduction in execution time, which enables the deployment of GenSI even on AAV platforms with limited computing power.
本研究将分布式协作波束成形引入AAV集群,通过控制信号分布和能量消耗,利用创新的生成式群体智能框架有效应对窃听者威胁并提升通信安全性能。
This study introduces distributed collaborative beamforming into AAV swarms to counter eavesdropper threats by optimizing signal distribution and energy efficiency through a novel generative swarm intelligence framework.

Authors:Wenhan Liu, Xinyu Ma, Yutao Zhu, Lixin Su, Shuaiqiang Wang, Dawei Yin, Zhicheng Dou
Title: CoRanking: Collaborative Ranking with Small and Large Ranking Agents
Abstract:
Large Language Models (LLMs) have demonstrated superior listwise ranking performance. However, their superior performance often relies on large-scale parameters (\eg, GPT-4) and a repetitive sliding window process, which introduces significant efficiency challenges. In this paper, we propose \textbf{CoRanking}, a novel collaborative ranking framework that combines small and large ranking models for efficient and effective ranking. CoRanking first employs a small-size reranker to pre-rank all the candidate passages, bringing relevant ones to the top part of the list (\eg, top-20). Then, the LLM listwise reranker is applied to only rerank these top-ranked passages instead of the whole list, substantially enhancing overall ranking efficiency. Although more efficient, previous studies have revealed that the LLM listwise reranker have significant positional biases on the order of input passages. Directly feed the top-ranked passages from small reranker may result in the sub-optimal performance of LLM listwise reranker. To alleviate this problem, we introduce a passage order adjuster trained via reinforcement learning, which reorders the top passages from the small reranker to align with the LLM's preferences of passage order. Extensive experiments on three IR benchmarks demonstrate that CoRanking significantly improves efficiency (reducing ranking latency by about 70\%) while achieving even better effectiveness compared to using only the LLM listwise reranker.
中文:CoRanking是一种高效的协同排序框架,结合小型与大型排序模型,先通过小型重排器预排序候选段落,再由大语言模型仅对顶部段落进行重排,并通过顺序调整器缓解位置偏差,在提升效果的同时将排序延迟降低约70%。
English: CoRanking is an efficient collaborative ranking framework that combines small and large models, using a small reranker for initial ranking and an LLM for final reranking of top passages only, enhanced by an order adjuster to mitigate positional bias, achieving 70% faster latency with improved effectiveness.

Authors:Yinan Liang, Ziwei Wang, Xiuwei Xu, Jie Zhou, Jiwen Lu
Title: EfficientLLaVA:Generalizable Auto-Pruning for Large Vision-language Models
Abstract:
While multimodal large language models demonstrate strong performance in complex reasoning tasks, they pose significant challenges related to model complexity during deployment, especially for resource-limited devices. In this paper, we propose an automatic pruning method for large vision-language models to enhance the efficiency of multimodal reasoning. Conventional methods rely on the training data of the original model to select the proper pruning ratio for different network components. However, these methods are impractical for large vision-language models due to the unaffordable search costs caused by web-scale training corpus. In contrast, our approach only leverages a small number of samples to search for the desired pruning policy by maximizing its generalization ability on unknown training data while maintaining the model accuracy, which enables the achievement of an optimal trade-off between accuracy and efficiency for large visual language models. Specifically, we formulate the generalization gap of the pruning strategy using the structural risk minimization principle. Based on both task performance and generalization capability, we iteratively search for the optimal pruning policy within a given search space and optimize the vision projector to evolve the search space with higher upper bound of performance. We conduct extensive experiments on the ScienceQA, Vizwiz, MM-vet, and LLaVA-Bench datasets for the task of visual question answering. Using only 64 samples for pruning policy search, EfficientLLaVA achieves an accuracy of 83.05% on ScienceQA, along with a $\times$ 1.8 speedup compared to the dense LLaVA-v1.5-7B model.
中文: 本文提出一种针对大型视觉语言模型的自动剪枝方法,通过少量样本优化剪枝策略,在视觉问答任务中实现了准确性与效率的平衡,获得了显著加速且保持性能。
English: This paper introduces an automatic pruning method for large vision-language models that uses minimal samples to optimize pruning policies, achieving a balance between accuracy and efficiency with significant speedup and maintained performance on visual question answering tasks.

Authors:Minglei Shi, Ziyang Yuan, Haotian Yang, Xintao Wang, Mingwu Zheng, Xin Tao, Wenliang Zhao, Wenzhao Zheng, Jie Zhou, Jiwen Lu, Pengfei Wan, Di Zhang, Kun Gai
Title: DiffMoE: Dynamic Token Selection for Scalable Diffusion Transformers
Abstract:
Diffusion models have demonstrated remarkable success in various image generation tasks, but their performance is often limited by the uniform processing of inputs across varying conditions and noise levels. To address this limitation, we propose a novel approach that leverages the inherent heterogeneity of the diffusion process. Our method, DiffMoE, introduces a batch-level global token pool that enables experts to access global token distributions during training, promoting specialized expert behavior. To unleash the full potential of the diffusion process, DiffMoE incorporates a capacity predictor that dynamically allocates computational resources based on noise levels and sample complexity. Through comprehensive evaluation, DiffMoE achieves state-of-the-art performance among diffusion models on ImageNet benchmark, substantially outperforming both dense architectures with 3x activated parameters and existing MoE approaches while maintaining 1x activated parameters. The effectiveness of our approach extends beyond class-conditional generation to more challenging tasks such as text-to-image generation, demonstrating its broad applicability across different diffusion model applications. Project Page: https://shiml20.github.io/DiffMoE/
中文: DiffMoE通过引入批量级全局令牌池和容量预测器动态分配计算资源,在ImageNet上实现顶尖性能,并在文本到图像生成等任务中表现卓越。
English: DiffMoE introduces a batch-level global token pool and a capacity predictor to dynamically allocate computational resources, achieving state-of-the-art performance on ImageNet and excelling in text-to-image generation.

Authors:Hang Yin, Xiuwei Xu, Lingqing Zhao, Ziwei Wang, Jie Zhou, Jiwen Lu
Title: UniGoal: Towards Universal Zero-shot Goal-oriented Navigation
Abstract:
In this paper, we propose a general framework for universal zero-shot goal-oriented navigation. Existing zero-shot methods build inference framework upon large language models (LLM) for specific tasks, which differs a lot in overall pipeline and fails to generalize across different types of goal. Towards the aim of universal zero-shot navigation, we propose a uniform graph representation to unify different goals, including object category, instance image and text description. We also convert the observation of agent into an online maintained scene graph. With this consistent scene and goal representation, we preserve most structural information compared with pure text and are able to leverage LLM for explicit graph-based reasoning. Specifically, we conduct graph matching between the scene graph and goal graph at each time instant and propose different strategies to generate long-term goal of exploration according to different matching states. The agent first iteratively searches subgraph of goal when zero-matched. With partial matching, the agent then utilizes coordinate projection and anchor pair alignment to infer the goal location. Finally scene graph correction and goal verification are applied for perfect matching. We also present a blacklist mechanism to enable robust switch between stages. Extensive experiments on several benchmarks show that our UniGoal achieves state-of-the-art zero-shot performance on three studied navigation tasks with a single model, even outperforming task-specific zero-shot methods and supervised universal methods.
中文: 本文提出UniGoal通用零样本导航框架,通过统一图表示整合不同目标,并利用大语言模型进行显式图推理,以单一模型在多项导航任务中实现了最先进的零样本性能。
English: This paper introduces UniGoal, a universal zero-shot navigation framework that unifies diverse goals through uniform graph representations and leverages LLMs for explicit graph-based reasoning, achieving state-of-the-art performance across multiple navigation tasks with a single model.

Authors:Chaoyun Zhang, Shilin He, Liqun Li, Si Qin, Yu Kang, Qingwei Lin, Saravan Rajmohan, Dongmei Zhang
Title: API Agents vs. GUI Agents: Divergence and Convergence
Abstract:
Large language models (LLMs) have evolved beyond simple text generation to power software agents that directly translate natural language commands into tangible actions. While API-based LLM agents initially rose to prominence for their robust automation capabilities and seamless integration with programmatic endpoints, recent progress in multimodal LLM research has enabled GUI-based LLM agents that interact with graphical user interfaces in a human-like manner. Although these two paradigms share the goal of enabling LLM-driven task automation, they diverge significantly in architectural complexity, development workflows, and user interaction models. This paper presents the first comprehensive comparative study of API-based and GUI-based LLM agents, systematically analyzing their divergence and potential convergence. We examine key dimensions and highlight scenarios in which hybrid approaches can harness their complementary strengths. By proposing clear decision criteria and illustrating practical use cases, we aim to guide practitioners and researchers in selecting, combining, or transitioning between these paradigms. Ultimately, we indicate that continuing innovations in LLM-based automation are poised to blur the lines between API- and GUI-driven agents, paving the way for more flexible, adaptive solutions in a wide range of real-world applications.
中文摘要:本文首次对基于API和基于GUI的LLM智能体进行对比研究,分析其差异与融合潜力,并提出结合两者优势的混合方案以指导实际应用。
English Summary: This paper conducts the first comparative analysis of API-based and GUI-based LLM agents, examining their differences and potential integration while proposing hybrid approaches for practical applications.

Authors:Qiao Liang, Yanjiang Liu, Weixiang Zhou, Ben He, Yaojie Lu, Hongyu Lin, Jia Zheng, Xianpei Han, Le Sun, Yingfei Sun
Title: Expanding the Boundaries of Vision Prior Knowledge in Multi-modal Large Language Models
Abstract:
Does the prior knowledge of the vision encoder constrain the capability boundary of Multi-modal Large Language Models (MLLMs)? While most existing research treats MLLMs as unified systems optimized through end-to-end training, the impact of vision encoder's prior knowledge is seldom investigated. In this work, we introduce a novel metric, $Rank_e$, to quantify the effect of prior knowledge of the vision encoder on MLLM performance. Our analysis reveals a positive correlation between prior knowledge and MLLM performance. Moreover, we find that domain-specific fine-tuning using solely end-to-end visual question answering (VQA) data is insufficient, particularly for entities with low inherent visual prior knowledge. To address this issue, we propose VisPRE (Vision Prior Remediation), a two-stage training framework that explicitly incorporates prior knowledge at the vision encoder level. Experimental results demonstrate that augmenting vision encoder's prior knowledge substantially boosts the visual understanding capabilities of MLLMs, offering a novel and effective strategy for improving performance, especially in scenarios involving uncommon visual entities.
中文: 本研究探讨了视觉编码器的先验知识对多模态大语言模型(MLLM)的影响,引入新指标Rank_e揭示先验知识与MLLM性能呈正相关,并提出两阶段训练框架VisPRE,通过明确融入先验知识来增强视觉理解能力。
English: This study investigates how the prior knowledge of vision encoders affects Multi-modal Large Language Models (MLLMs) and introduces a new metric, Rank_e, revealing a positive correlation between prior knowledge and MLLM performance, while proposing a two-stage training framework, VisPRE, to enhance visual understanding by explicitly incorporating such knowledge.

Authors:Haoran Chen, Ping Wang, Zihan Zhou, Xu Zhang, Zuxuan Wu, Yu-Gang Jiang
Title: Achieving More with Less: Additive Prompt Tuning for Rehearsal-Free Class-Incremental Learning
Abstract:
Class-incremental learning (CIL) enables models to learn new classes progressively while preserving knowledge of previously learned ones. Recent advances in this field have shifted towards parameter-efficient fine-tuning techniques, with many approaches building upon the framework that maintains a pool of learnable prompts. Although effective, these methods introduce substantial computational overhead, primarily due to prompt pool querying and increased input sequence lengths from prompt concatenation. In this work, we present a novel prompt-based approach that addresses this limitation. Our method trains a single set of shared prompts across all tasks and, rather than concatenating prompts to the input, directly modifies the CLS token's attention computation by adding the prompts to it. This simple and lightweight design not only significantly reduces computational complexity-both in terms of inference costs and the number of trainable parameters-but also eliminates the need to optimize prompt lengths for different downstream tasks, offering a more efficient yet powerful solution for rehearsal-free class-incremental learning. Extensive experiments across a diverse range of CIL benchmarks demonstrate the effectiveness of our approach, highlighting its potential to establish a new prompt-based CIL paradigm. Furthermore, experiments on general recognition benchmarks beyond the CIL setting also show strong performance, positioning our method as a promising candidate for a general parameter-efficient fine-tuning approach.
中文: 本研究提出了一种新颖的基于提示的类增量学习方法,通过共享提示并直接修改CLS令牌的注意力计算,显著降低了计算复杂度,无需优化提示长度,同时在多种基准测试中保持了强劲性能。
English: This work introduces a novel prompt-based method for class-incremental learning that uses shared prompts and modifies the CLS token's attention, significantly reducing computational complexity and eliminating the need for prompt length optimization while maintaining strong performance across various benchmarks.

Authors:Ruoxi Xu, Hongyu Lin, Xianpei Han, Jia Zheng, Weixiang Zhou, Le Sun, Yingfei Sun
Title: Large Language Models Often Say One Thing and Do Another
Abstract:
As large language models (LLMs) increasingly become central to various applications and interact with diverse user populations, ensuring their reliable and consistent performance is becoming more important. This paper explores a critical issue in assessing the reliability of LLMs: the consistency between their words and deeds. To quantitatively explore this consistency, we developed a novel evaluation benchmark called the Words and Deeds Consistency Test (WDCT). The benchmark establishes a strict correspondence between word-based and deed-based questions across different domains, including opinion vs. action, non-ethical value vs. action, ethical value vs. action, and theory vs. application. The evaluation results reveal a widespread inconsistency between words and deeds across different LLMs and domains. Subsequently, we conducted experiments with either word alignment or deed alignment to observe their impact on the other aspect. The experimental results indicate that alignment only on words or deeds poorly and unpredictably influences the other aspect. This supports our hypothesis that the underlying knowledge guiding LLMs' word or deed choices is not contained within a unified space.
中文摘要:本文通过开发“言行一致性测试”(WDCT)基准,评估大型语言模型在言语与行动之间的一致性,发现不同模型和领域普遍存在不一致性,且单方面对齐难以有效影响另一方面。
English Summary: This paper introduces the Words and Deeds Consistency Test (WDCT) to evaluate the alignment between large language models' stated responses and their actions, revealing significant inconsistencies across various domains and models, and demonstrating that alignment in one aspect does not reliably transfer to the other.

Authors:Zichao Li, Xueru Wen, Jie Lou, Yuqiu Ji, Yaojie Lu, Xianpei Han, Debing Zhang, Le Sun
Title: The Devil Is in the Details: Tackling Unimodal Spurious Correlations for Generalizable Multimodal Reward Models
Abstract:
Multimodal Reward Models (MM-RMs) are crucial for aligning Large Language Models (LLMs) with human preferences, particularly as LLMs increasingly interact with multimodal data. However, we find that MM-RMs trained on existing datasets often struggle to generalize to out-of-distribution data due to their reliance on unimodal spurious correlations, primarily text-only shortcuts within the training distribution, which prevents them from leveraging true multimodal reward functions. To address this, we introduce a Shortcut-aware MM-RM learning algorithm that mitigates this issue by dynamically reweighting training samples, shifting the distribution toward better multimodal understanding, and reducing dependence on unimodal spurious correlations. Our experiments demonstrate significant improvements in generalization, downstream task performance, and scalability, establishing a more robust framework for multimodal reward modeling.
中文摘要:本文提出的捷径感知多模态奖励模型学习算法通过动态调整训练样本权重,有效解决了多模态奖励模型的泛化问题,减少了对单模态伪相关性的依赖,显著提升了模型的泛化能力和下游任务表现。
English Summary: The proposed Shortcut-aware MM-RM learning algorithm effectively addresses generalization issues in multimodal reward models by dynamically reweighting training samples to reduce reliance on unimodal spurious correlations and enhance true multimodal understanding.

Authors:Baining Zhao, Jianjie Fang, Zichao Dai, Ziyou Wang, Jirong Zha, Weichen Zhang, Chen Gao, Yue Wang, Jinqiang Cui, Xinlei Chen, Yong Li
Title: UrbanVideo-Bench: Benchmarking Vision-Language Models on Embodied Intelligence with Video Data in Urban Spaces
Abstract:
Large multimodal models exhibit remarkable intelligence, yet their embodied cognitive abilities during motion in open-ended urban 3D space remain to be explored. We introduce a benchmark to evaluate whether video-large language models (Video-LLMs) can naturally process continuous first-person visual observations like humans, enabling recall, perception, reasoning, and navigation. We have manually control drones to collect 3D embodied motion video data from real-world cities and simulated environments, resulting in 1.5k video clips. Then we design a pipeline to generate 5.2k multiple-choice questions. Evaluations of 17 widely-used Video-LLMs reveal current limitations in urban embodied cognition. Correlation analysis provides insight into the relationships between different tasks, showing that causal reasoning has a strong correlation with recall, perception, and navigation, while the abilities for counterfactual and associative reasoning exhibit lower correlation with other tasks. We also validate the potential for Sim-to-Real transfer in urban embodiment through fine-tuning.
中文: 本研究提出一个基准来评估视频大语言模型在城市三维环境中的具身认知能力,通过对17个模型的评估揭示了当前局限,并发现因果推理与其他任务存在强相关性。
English: This study introduces a benchmark to assess video-large language models' embodied cognitive abilities in urban 3D environments, revealing current limitations through evaluations of 17 models and highlighting strong correlations between causal reasoning and other tasks.

Authors:Jinghao Zhang, Yuting Liu, Wenjie Wang, Qiang Liu, Shu Wu, Liang Wang, Tat-Seng Chua
Title: Personalized Text Generation with Contrastive Activation Steering
Abstract:
Personalized text generation aims to infer users' writing style preferences from their historical texts and generate outputs that faithfully reflect these stylistic characteristics. Existing solutions primarily adopt two paradigms: retrieval-augmented generation (RAG) and parameter-efficient fine-tuning (PEFT). While these approaches have advanced the field, they suffer from two critical limitations: (1) the entanglement of content semantics and stylistic patterns in historical texts impedes accurate modeling of user-specific writing preferences; and (2) scalability challenges arising from both RAG's inference latency by retrieval operations and PEFT's parameter storage requirements for per user model. To overcome these limitations, we propose StyleVector, a training-free framework that disentangles and represents personalized writing style as a vector in LLM's activation space, enabling style-steered generation during inference without requiring costly retrieval or parameter storage. Comprehensive experiments demonstrate that our framework achieves a significant 8% relative improvement in personalized generation while reducing storage requirements by 1700 times over PEFT method.
中文: 提出的StyleVector框架通过将写作风格表示为LLM激活空间中的向量,克服了现有个性化文本生成方法的局限,在不需训练的情况下实现了8%的性能提升,同时将存储需求降低了1700倍。
English: The proposed StyleVector framework overcomes limitations in existing personalized text generation methods by representing writing styles as vectors in LLM activation space, achieving 8% performance improvement while reducing storage requirements by 1700 times without training.

Authors:Xiangxin Zhou, Yi Xiao, Haowei Lin, Xinheng He, Jiaqi Guan, Yang Wang, Qiang Liu, Feng Zhou, Liang Wang, Jianzhu Ma
Title: Integrating Protein Dynamics into Structure-Based Drug Design via Full-Atom Stochastic Flows
Abstract:
The dynamic nature of proteins, influenced by ligand interactions, is essential for comprehending protein function and progressing drug discovery. Traditional structure-based drug design (SBDD) approaches typically target binding sites with rigid structures, limiting their practical application in drug development. While molecular dynamics simulation can theoretically capture all the biologically relevant conformations, the transition rate is dictated by the intrinsic energy barrier between them, making the sampling process computationally expensive. To overcome the aforementioned challenges, we propose to use generative modeling for SBDD considering conformational changes of protein pockets. We curate a dataset of apo and multiple holo states of protein-ligand complexes, simulated by molecular dynamics, and propose a full-atom flow model (and a stochastic version), named DynamicFlow, that learns to transform apo pockets and noisy ligands into holo pockets and corresponding 3D ligand molecules. Our method uncovers promising ligand molecules and corresponding holo conformations of pockets. Additionally, the resultant holo-like states provide superior inputs for traditional SBDD approaches, playing a significant role in practical drug discovery.
Chinese: 本研究提出DynamicFlow生成模型,利用分子动力学数据模拟蛋白质口袋构象变化并生成配体分子,通过提供更优的类全结合状态来增强传统基于结构的药物设计,在药物发现中发挥重要作用。
English: The study introduces DynamicFlow, a generative model that uses molecular dynamics data to simulate protein pocket conformational changes and generate ligand molecules, enhancing traditional structure-based drug design by providing superior holo-like states for drug discovery.

Authors:Yue Li, Qi Ma, Runyi Yang, Huapeng Li, Mengjiao Ma, Bin Ren, Nikola Popovic, Nicu Sebe, Ender Konukoglu, Theo Gevers, Luc Van Gool, Martin R. Oswald, Danda Pani Paudel
Title: SceneSplat: Gaussian Splatting-based Scene Understanding with Vision-Language Pretraining
Abstract:
Recognizing arbitrary or previously unseen categories is essential for comprehensive real-world 3D scene understanding. Currently, all existing methods rely on 2D or textual modalities during training or together at inference. This highlights the clear absence of a model capable of processing 3D data alone for learning semantics end-to-end, along with the necessary data to train such a model. Meanwhile, 3D Gaussian Splatting (3DGS) has emerged as the de facto standard for 3D scene representation across various vision tasks. However, effectively integrating semantic reasoning into 3DGS in a generalizable manner remains an open challenge. To address these limitations, we introduce SceneSplat, to our knowledge the first large-scale 3D indoor scene understanding approach that operates natively on 3DGS. Furthermore, we propose a self-supervised learning scheme that unlocks rich 3D feature learning from unlabeled scenes. To power the proposed methods, we introduce SceneSplat-7K, the first large-scale 3DGS dataset for indoor scenes, comprising 7916 scenes derived from seven established datasets, such as ScanNet and Matterport3D. Generating SceneSplat-7K required computational resources equivalent to 150 GPU days on an L4 GPU, enabling standardized benchmarking for 3DGS-based reasoning for indoor scenes. Our exhaustive experiments on SceneSplat-7K demonstrate the significant benefit of the proposed method over the established baselines.
中文:SceneSplat提出了首个基于3D高斯泼溅的大规模室内场景理解方法,通过自监督学习方案和SceneSplat-7K数据集实现了标准化基准测试,显著优于现有基线。
English: SceneSplat introduces the first large-scale 3D indoor scene understanding method that operates natively on 3D Gaussian Splatting, along with a self-supervised learning scheme and the SceneSplat-7K dataset to enable standardized benchmarking.

Authors:Yiyang Du, Xiaochen Wang, Chi Chen, Jiabo Ye, Yiru Wang, Peng Li, Ming Yan, Ji Zhang, Fei Huang, Zhifang Sui, Maosong Sun, Yang Liu
Title: AdaMMS: Model Merging for Heterogeneous Multimodal Large Language Models with Unsupervised Coefficient Optimization
Abstract:
Recently, model merging methods have demonstrated powerful strengths in combining abilities on various tasks from multiple Large Language Models (LLMs). While previous model merging methods mainly focus on merging homogeneous models with identical architecture, they meet challenges when dealing with Multimodal Large Language Models (MLLMs) with inherent heterogeneous property, including differences in model architecture and the asymmetry in the parameter space. In this work, we propose AdaMMS, a novel model merging method tailored for heterogeneous MLLMs. Our method tackles the challenges in three steps: mapping, merging and searching. Specifically, we first design mapping function between models to apply model merging on MLLMs with different architecture. Then we apply linear interpolation on model weights to actively adapt the asymmetry in the heterogeneous MLLMs. Finally in the hyper-parameter searching step, we propose an unsupervised hyper-parameter selection method for model merging. As the first model merging method capable of merging heterogeneous MLLMs without labeled data, extensive experiments on various model combinations demonstrated that AdaMMS outperforms previous model merging methods on various vision-language benchmarks.
中文摘要:AdaMMS是一种专为异构多模态大语言模型设计的新型模型融合方法,通过映射、合并和无监督超参数搜索解决架构差异和参数不对称问题,在多种视觉语言任务上优于现有方法。
English Summary: AdaMMS is a novel model merging method designed for heterogeneous Multimodal Large Language Models, addressing architectural differences and parameter asymmetry through mapping, merging, and unsupervised hyper-parameter search to achieve superior performance on vision-language tasks.

Authors:Shijie Ma, Yuying Ge, Teng Wang, Yuxin Guo, Yixiao Ge, Ying Shan
Title: GenHancer: Imperfect Generative Models are Secretly Strong Vision-Centric Enhancers
Abstract:
The synergy between generative and discriminative models receives growing attention. While discriminative Contrastive Language-Image Pre-Training (CLIP) excels in high-level semantics, it struggles with perceiving fine-grained visual details. Generally, to enhance representations, generative models take CLIP's visual features as conditions for reconstruction. However, the underlying principle remains underexplored. In this work, we empirically found that visually perfect generations are not always optimal for representation enhancement. The essence lies in effectively extracting fine-grained knowledge from generative models while mitigating irrelevant information. To explore critical factors, we delve into three aspects: (1) Conditioning mechanisms: We found that even a small number of local tokens can drastically reduce the difficulty of reconstruction, leading to collapsed training. We thus conclude that utilizing only global visual tokens as conditions is the most effective strategy. (2) Denoising configurations: We observed that end-to-end training introduces extraneous information. To address this, we propose a two-stage training strategy to prioritize learning useful visual knowledge. Additionally, we demonstrate that lightweight denoisers can yield remarkable improvements. (3) Generation paradigms: We explore both continuous and discrete denoisers with desirable outcomes, validating the versatility of our method. Through our in-depth explorations, we have finally arrived at an effective method, namely GenHancer, which consistently outperforms prior arts on the MMVP-VLM benchmark, e.g., 6.0% on OpenAICLIP. The enhanced CLIP can be further plugged into multimodal large language models for better vision-centric performance. All the models and codes are made publicly available.
中文: 本研究提出GenHancer方法,通过优化条件机制、去噪策略和生成范式来利用生成模型增强CLIP的细粒度视觉感知能力,在MMVP-VLM等基准测试中取得了显著优于现有技术的性能表现。
English: This study introduces GenHancer, a method that enhances CLIP's fine-grained visual perception by strategically leveraging generative models through optimized conditioning, denoising, and generation techniques, achieving superior performance on benchmarks like MMVP-VLM.

Authors:Kechi Zhang, Huangzhao Zhang, Ge Li, Jinliang You, Jia Li, Yunfei Zhao, Zhi Jin
Title: SEAlign: Alignment Training for Software Engineering Agent
Abstract:
Recent advances in code generation models have demonstrated impressive capabilities in automating software development tasks, yet these models still struggle in real-world software engineering scenarios. Although current training methods, particularly post-training, excel at solving competitive programming problems, they fail to adequately prepare models for the complexities of practical software development. This misalignment raises the critical question: Are existing alignment training methods well suited for real-world software engineering tasks? In this study, we identify this issue and propose SEAlign, a novel alignment framework designed to bridge the gap between code generation models and real-world software development tasks. SEAlign leverages the unique characteristics of software engineering processes, including high-quality workflow steps, to enhance model capabilities. Our framework further employs Monte Carlo Tree Search for fine-grained alignment in multi-step decision processes, followed by preference optimization on critical actions to ensure models meet real-world requirements. We evaluate SEAlign on three standard agentic benchmarks for real-world software engineering, including HumanEvalFix, SWE-Bench-Lite, and SWE-Bench-Verified. Experimental results demonstrate state-of-the-art performance with minimal training overhead. In addition, we develop an agent-based software development platform using SEAlign, which successfully automates the creation of several small applications. Human evaluations of these applications highlight significant improvements in both task performance and user experience. Our findings underscore the potential of SEAlign to accelerate the adoption of large code models in real-world software development. We believe that this research makes a meaningful step towards fully automated software engineering.
中文摘要:本研究提出SEAlign框架,通过利用软件工程流程和蒙特卡洛树搜索技术,弥合代码生成模型与实际软件开发之间的差距,在标准基准测试中取得最优性能,并显著提升了自动化应用开发的效果。
English Summary: The study introduces SEAlign, an alignment framework that enhances code generation models for real-world software engineering by leveraging workflow processes and Monte Carlo Tree Search, achieving state-of-the-art results on benchmarks and improving automated application development.

Authors:Yuqi Zhu, Ge Li, Xue Jiang, Jia Li, Hong Mei, Zhi Jin, Yihong Dong
Title: Uncertainty-Guided Chain-of-Thought for Code Generation with LLMs
Abstract:
Chain-of-Thought (CoT) reasoning has been demonstrated as an effective technique for improving the problem-solving capabilities of large language models (LLMs) in the context of code generation. However, existing CoT methods often exhibit a tendency toward "overthinking", where the LLM consistently applies reasoning strategies without adequately considering the task's underlying complexity. This results in the LLMs allocating excessive computational resources, in terms of tokens, to relatively simple tasks or problems where the correct answer is already evident. Additionally, this overthinking may lead LLMs down incorrect reasoning paths, resulting in incorrect code generation. In this paper, we introduce UnCertainty-Aware Chain-of-Thought (UnCert-CoT), an LLM-based approach designed to enhance code generation by incorporating an uncertainty-aware CoT reasoning mechanism, which focuses computational resources on targeting points where LLMs are more prone to error. We propose two confidence-based uncertainty measures: Entropy-based and Probability Differential-based methods. When uncertainty is high, UnCert-CoT activates CoT-decoding to generate multiple reasoning paths and selects the final code that exhibits the highest likelihood of correctness. In contrast, LLM directly generates the code when uncertainty is low. This uncertainty judgment mechanism allows LLMs to prioritize complex tasks and avoid unnecessary steps in simpler cases, thereby improving overall efficiency and accuracy in code generation. Our experimental results demonstrate that UnCert-CoT significantly enhances code generation accuracy on challenging benchmark MHPP(Mostly Hard Python Problems), it achieves improvements up to 6.1% on PassRate accuracy, particularly in situations where traditional LLMs are prone to errors.
中文: UnCert-CoT是一种不确定性感知的思维链方法,通过仅在语言模型面临高不确定性时启动推理机制,有效优化代码生成效率,并在复杂任务中显著提升准确性。
English: UnCert-CoT is an uncertainty-aware Chain-of-Thought approach that optimizes code generation by activating reasoning only when LLMs face high uncertainty, thereby improving both efficiency and accuracy on complex tasks.

Authors:Jia Li, Hao Zhu, Huanyu Liu, Xianjie Shi, He Zong, Yihong Dong, Kechi Zhang, Siyuan Jiang, Zhi Jin, Ge Li
Title: aiXcoder-7B-v2: Training LLMs to Fully Utilize the Long Context in Repository-level Code Completion
Abstract:
Repository-level code completion aims to complete code based on the long contexts of the repository. Existing studies extract long contexts from the repository as inputs and leverage Large Language Models (LLMs) to generate code. However, we reveal a severe limitation of LLMs, i.e., LLMs may ignore the information within long contexts in code completion. In other words, even the contexts contain useful information (e.g., relevant APIs or similar code), LLMs may fail to utilize this information. We think this limitation is caused by an inherent bias in LLMs, i.e., relying on nearby contexts and ignoring long-range contexts. To address this, we propose a novel fine-tuning approach named CoLT. The core idea of CoLT is to provide explicit supervision signals, which emphasize that long-range contexts may hold relevant information. Specifically, CoLT proposes a reinforcement learning-based training, which explicitly encourages models to utilize the information within long contexts and punishes models for ignoring long contexts. To support CoLT, we release CoLT-132K, a large-scale dataset with 132k samples across four languages, each containing long-context inputs. We apply CoLT to a popular LLM - aiXcoder-7B and release aiXcoder-7B-v2. We conduct extensive experiments on CoLT-132K and a public benchmark - CrossCodeEval. Our experiments yield the results: 1. Effectiveness. CoLT substantially improves aiXcoder-7B. aiXcoder-7B-v2 outperforms aiXcoder-7B by up to 44% in exact match. aiXcoder-7B-v2 becomes the state-of-the-art 7B model in code completion and even surpasses larger models. 2. Generalizability. The capability learned by CoLT can generalize to new languages. Besides, CoLT is model-agnostic and effectively improves multiple LLMs. 3. Enhanced Context Utilization Capability. CoLT significantly improves the capability of LLMs in utilizing the relevant information within long contexts.
中文: 本研究提出CoLT微调方法,通过强化学习训练增强大语言模型在仓库级代码补全中利用长上下文信息的能力,显著提升模型性能并实现跨编程语言的泛化能力。
English: This study introduces CoLT, a fine-tuning method that enhances large language models' ability to utilize long-context information in repository-level code completion, achieving state-of-the-art performance and improved generalization across programming languages.

Authors:Rui Yang, Lin Song, Yicheng Xiao, Runhui Huang, Yixiao Ge, Ying Shan, Hengshuang Zhao
Title: HaploVL: A Single-Transformer Baseline for Multi-Modal Understanding
Abstract:
Recent advancements in large language models (LLMs) have significantly propelled the development of large multi-modal models (LMMs), highlighting the potential for general and intelligent assistants. However, most LMMs model visual and textual modalities separately, leading to recent efforts to develop native LMMs using a single transformer. Despite the promise, these native models are resource-intensive and often exhibit performance gaps compared to their compositional counterparts. To alleviate this issue, we propose a simple yet efficient method to construct a baseline for the native and end-to-end large multi-modal model in a single transformer. First, we propose a new early-fusion LMM that can fuse multi-modal inputs in the early stage and respond to visual instructions in an auto-regressive manner. Second, we devise an efficient training recipe for the proposed model, which harnesses the prior knowledge of the pre-trained models, addressing both the performance limitations and the challenge of resource consumption. The proposed model demonstrates superior performance compared to other LMMs using one transformer and significantly narrows the performance gap with compositional LMMs.
Chinese: 近期大语言模型的进展推动了大型多模态模型的发展,但现有原生模型存在资源和性能问题,本研究提出一种高效的早期融合方法,有效缩小了与组合模型之间的性能差距。
English: Recent progress in large language models has spurred the development of large multi-modal models, yet current native approaches face resource and performance challenges, which this work addresses by proposing an efficient early-fusion model that narrows the performance gap with compositional models.

Authors:Yiqi Zhu, Ziyue Wang, Can Zhang, Peng Li, Yang Liu
Title: CoSpace: Benchmarking Continuous Space Perception Ability for Vision-Language Models
Abstract:
Vision-Language Models (VLMs) have recently witnessed significant progress in visual comprehension. As the permitting length of image context grows, VLMs can now comprehend a broader range of views and spaces. Current benchmarks provide insightful analysis of VLMs in tasks involving complex visual instructions following, multi-image understanding and spatial reasoning. However, they usually focus on spatially irrelevant images or discrete images captured from varied viewpoints. The compositional characteristic of images captured from a static viewpoint remains underestimated. We term this characteristic as Continuous Space Perception. When observing a scene from a static viewpoint while shifting orientations, it produces a series of spatially continuous images, enabling the reconstruction of the entire space. In this paper, we present CoSpace, a multi-image visual understanding benchmark designed to assess the Continuous Space perception ability for VLMs. CoSpace contains 2,918 images and 1,626 question-answer pairs, covering seven types of tasks. We conduct evaluation across 19 proprietary and open-source VLMs. Results reveal that there exist pitfalls on the continuous space perception ability for most of the evaluated models, including proprietary ones. Interestingly, we find that the main discrepancy between open-source and proprietary models lies not in accuracy but in the consistency of responses. We believe that enhancing the ability of continuous space perception is essential for VLMs to perform effectively in real-world tasks and encourage further research to advance this capability.
中文摘要:视觉语言模型在连续空间感知方面仍存在不足,CoSpace基准测试通过评估静态视角下空间连续图像的理解能力,揭示了专有和开源模型均面临挑战,其中响应一致性是主要差异所在。
English Summary: Vision-Language Models (VLMs) still struggle with continuous space perception, as revealed by the CoSpace benchmark, which evaluates their ability to understand spatially continuous images from static viewpoints and shows that both proprietary and open-source models face challenges, with response consistency being a key differentiator.

Authors:Ziyue Wang, Yurui Dong, Fuwen Luo, Minyuan Ruan, Zhili Cheng, Chi Chen, Peng Li, Yang Liu
Title: EscapeCraft: A 3D Room Escape Environment for Benchmarking Complex Multimodal Reasoning Ability
Abstract:
The rapid advancing of Multimodal Large Language Models (MLLMs) has spurred interest in complex multimodal reasoning tasks in the real-world and virtual environment, which require coordinating multiple abilities, including visual perception, visual reasoning, spatial awareness, and target deduction. However, existing evaluations primarily assess the final task completion, often degrading assessments to isolated abilities such as visual grounding and visual question answering. Less attention is given to comprehensively and quantitatively analyzing reasoning process in multimodal environments, which is crucial for understanding model behaviors and underlying reasoning mechanisms beyond merely task success. To address this, we introduce MM-Escape, an extensible benchmark for investigating multimodal reasoning, inspired by real-world escape games. MM-Escape emphasizes intermediate model behaviors alongside final task completion. To achieve this, we develop EscapeCraft, a customizable and open environment that enables models to engage in free-form exploration for assessing multimodal reasoning. Extensive experiments show that MLLMs, regardless of scale, can successfully complete the simplest room escape tasks, with some exhibiting human-like exploration strategies. Yet, performance dramatically drops as task difficulty increases. Moreover, we observe that performance bottlenecks vary across models, revealing distinct failure modes and limitations in their multimodal reasoning abilities, such as repetitive trajectories without adaptive exploration, getting stuck in corners due to poor visual spatial awareness, and ineffective use of acquired props, such as the key. We hope our work sheds light on new challenges in multimodal reasoning, and uncovers potential improvements in MLLMs capabilities.
中文摘要:MM-Escape基准通过密室逃脱场景全面评估多模态大语言模型的推理能力,实验表明尽管部分模型展现出类人探索策略,但随着任务难度增加性能急剧下降,并暴露出重复轨迹、空间感知薄弱等差异化缺陷模式。
English Summary: The MM-Escape benchmark is introduced to comprehensively evaluate multimodal reasoning in MLLMs through escape game scenarios, revealing significant performance drops with increasing difficulty and distinct failure patterns despite some models showing human-like exploration strategies.

Authors:Jia Li, Xuyuan Guo, Lei Li, Kechi Zhang, Ge Li, Jia Li, Zhengwei Tao, Fang Liu, Chongyang Tao, Yuqi Zhu, Zhi Jin
Title: LONGCODEU: Benchmarking Long-Context Language Models on Long Code Understanding
Abstract:
Current advanced long-context language models offer great potential for real-world software engineering applications. However, progress in this critical domain remains hampered by a fundamental limitation: the absence of a rigorous evaluation framework for long code understanding. To gap this obstacle, we propose a long code understanding benchmark LONGCODEU from four aspects (8 tasks) to evaluate LCLMs' long code understanding ability required for practical applications, including code unit perception, intra-code unit understanding, inter-code unit relation understanding, and long code documentation understanding. We evaluate 9 popular LCLMs on LONGCODEU (i.e., 6 general models and 3 code models). Our experimental results reveal key limitations in current LCLMs' capabilities for long code understanding. Particularly, the performance of LCLMs drops dramatically when the long code length is greater than 32K, falling far short of their claimed 128K-1M context windows. In the four aspects, inter-code unit relation understanding is the most challenging for LCLMs. Our study provides valuable insights for optimizing LCLMs and driving advancements in software engineering.
中文: 当前长上下文语言模型缺乏针对长代码理解的严格评估框架,为此我们提出LONGCODEU基准来测试其能力,结果发现模型在代码长度超过32K时性能显著下降,远未达到其宣称的上下文窗口能力。
English: Current long-context language models lack a rigorous evaluation framework for long code understanding, so we propose the LONGCODEU benchmark to assess their capabilities and reveal significant performance limitations, especially beyond 32K code length.

Authors:Ziyang Ma, Zuchao Li, Lefei Zhang, Gui-Song Xia, Bo Du, Liangpei Zhang, Dacheng Tao
Title: Model Hemorrhage and the Robustness Limits of Large Language Models
Abstract:
Large language models (LLMs) demonstrate strong performance across natural language processing tasks, yet undergo significant performance degradation when modified for deployment through quantization, pruning, or decoding strategy adjustments. We define this phenomenon as model hemorrhage - performance decline caused by parameter alterations and architectural changes. Through systematic analysis of various LLM frameworks, we identify key vulnerability patterns: layer expansion frequently disrupts attention mechanisms, compression techniques induce information loss cascades, and decoding adjustments amplify prediction divergences. Our investigation reveals transformer architectures exhibit inherent robustness thresholds that determine hemorrhage severity across modification types. We propose three mitigation strategies: gradient-aware pruning preserves critical weight pathways, dynamic quantization scaling maintains activation integrity, and decoding calibration aligns generation trajectories with original model distributions. This work establishes foundational metrics for evaluating model stability during adaptation, providing practical guidelines for maintaining performance while enabling efficient LLM deployment. Our findings advance understanding of neural network resilience under architectural transformations, particularly for large-scale language models.
中文摘要:大语言模型在部署过程中因量化、剪枝等修改出现性能下降,即模型出血现象,可通过梯度感知剪枝和动态量化缩放等策略缓解,以保持模型效率与稳定性。
English Summary: Large language models suffer performance degradation from deployment modifications like quantization and pruning, termed model hemorrhage, which can be mitigated through strategies such as gradient-aware pruning and dynamic quantization scaling to maintain efficiency and stability.

Authors:Shuai Li, Jie Zhang, Yuang Qi, Kejiang Chen, Tianwei Zhang, Weiming Zhang, Nenghai Yu
Title: Clean Image May be Dangerous: Data Poisoning Attacks Against Deep Hashing
Abstract:
Large-scale image retrieval using deep hashing has become increasingly popular due to the exponential growth of image data and the remarkable feature extraction capabilities of deep neural networks (DNNs). However, deep hashing methods are vulnerable to malicious attacks, including adversarial and backdoor attacks. It is worth noting that these attacks typically involve altering the query images, which is not a practical concern in real-world scenarios. In this paper, we point out that even clean query images can be dangerous, inducing malicious target retrieval results, like undesired or illegal images. To the best of our knowledge, we are the first to study data \textbf{p}oisoning \textbf{a}ttacks against \textbf{d}eep \textbf{hash}ing \textbf{(\textit{PADHASH})}. Specifically, we first train a surrogate model to simulate the behavior of the target deep hashing model. Then, a strict gradient matching strategy is proposed to generate the poisoned images. Extensive experiments on different models, datasets, hash methods, and hash code lengths demonstrate the effectiveness and generality of our attack method.
中文: 深度哈希图像检索易受数据投毒攻击,即使查询图像干净也可能引发恶意结果,本文提出基于梯度匹配的PADHASH方法生成投毒图像,经多场景验证具有普适攻击效果。
English: Deep hashing for image retrieval is susceptible to data poisoning attacks, where clean queries can trigger malicious results, and this paper introduces PADHASH, a gradient-based method to generate poisoned images that proves effective across various settings.

Authors:Luxi Chen, Zihan Zhou, Min Zhao, Yikai Wang, Ge Zhang, Wenhao Huang, Hao Sun, Ji-Rong Wen, Chongxuan Li
Title: FlexWorld: Progressively Expanding 3D Scenes for Flexiable-View Synthesis
Abstract:
Generating flexible-view 3D scenes, including 360° rotation and zooming, from single images is challenging due to a lack of 3D data. To this end, we introduce FlexWorld, a novel framework consisting of two key components: (1) a strong video-to-video (V2V) diffusion model to generate high-quality novel view images from incomplete input rendered from a coarse scene, and (2) a progressive expansion process to construct a complete 3D scene. In particular, leveraging an advanced pre-trained video model and accurate depth-estimated training pairs, our V2V model can generate novel views under large camera pose variations. Building upon it, FlexWorld progressively generates new 3D content and integrates it into the global scene through geometry-aware scene fusion. Extensive experiments demonstrate the effectiveness of FlexWorld in generating high-quality novel view videos and flexible-view 3D scenes from single images, achieving superior visual quality under multiple popular metrics and datasets compared to existing state-of-the-art methods. Qualitatively, we highlight that FlexWorld can generate high-fidelity scenes with flexible views like 360° rotations and zooming. Project page: https://ml-gsai.github.io/FlexWorld.
Chinese Summary: FlexWorld是一种创新框架,通过视频到视频扩散模型和渐进扩展过程,从单张图像生成高质量灵活视角的3D场景,在视觉质量上超越了现有最优方法。
English Summary: FlexWorld is a novel framework that uses a video-to-video diffusion model and progressive expansion to generate high-quality flexible-view 3D scenes from single images, outperforming existing methods in visual quality.

Authors:Jing Wang, Fengzhuo Zhang, Xiaoli Li, Vincent Y. F. Tan, Tianyu Pang, Chao Du, Aixin Sun, Zhuoran Yang
Title: Error Analyses of Auto-Regressive Video Diffusion Models: A Unified Framework
Abstract:
A variety of Auto-Regressive Video Diffusion Models (ARVDM) have achieved remarkable successes in generating realistic long-form videos. However, theoretical analyses of these models remain scant. In this work, we develop theoretical underpinnings for these models and use our insights to improve the performance of existing models. We first develop Meta-ARVDM, a unified framework of ARVDMs that subsumes most existing methods. Using Meta-ARVDM, we analyze the KL-divergence between the videos generated by Meta-ARVDM and the true videos. Our analysis uncovers two important phenomena inherent to ARVDM -- error accumulation and memory bottleneck. By deriving an information-theoretic impossibility result, we show that the memory bottleneck phenomenon cannot be avoided. To mitigate the memory bottleneck, we design various network structures to explicitly use more past frames. We also achieve a significantly improved trade-off between the mitigation of the memory bottleneck and the inference efficiency by compressing the frames. Experimental results on DMLab and Minecraft validate the efficacy of our methods. Our experiments also demonstrate a Pareto-frontier between the error accumulation and memory bottleneck across different methods.
中文: 本研究为自回归视频扩散模型建立了理论框架,揭示了其固有的记忆瓶颈和误差累积问题,并通过改进网络结构和帧压缩技术有效提升了模型性能。
English: This study develops a theoretical framework for Auto-Regressive Video Diffusion Models, identifying inherent memory bottlenecks and error accumulation while proposing network improvements and frame compression to enhance performance.

Authors:Gangyang Li, Xiuwei Shang, Shaoyin Cheng, Junqi Zhang, Li Hu, Xu Zhu, Weiming Zhang, Nenghai Yu
Title: Beyond the Edge of Function: Unraveling the Patterns of Type Recovery in Binary Code
Abstract:
Type recovery is a crucial step in binary code analysis, holding significant importance for reverse engineering and various security applications. Existing works typically simply target type identifiers within binary code and achieve type recovery by analyzing variable characteristics within functions. However, we find that the types in real-world binary programs are more complex and often follow specific distribution patterns. In this paper, to gain a profound understanding of the variable type recovery problem in binary code, we first conduct a comprehensive empirical study. We utilize the TYDA dataset, which includes 163,643 binary programs across four architectures and four compiler optimization options, fully reflecting the complexity and diversity of real-world programs. We carefully study the unique patterns that characterize types and variables in binary code, and also investigate the impact of compiler optimizations on them, yielding many valuable insights. Based on our empirical findings, we propose ByteTR, a framework for recovering variable types in binary code. We decouple the target type set to address the issue of unbalanced type distribution and perform static program analysis to tackle the impact of compiler optimizations on variable storage. In light of the ubiquity of variable propagation across functions observed in our study, ByteTR conducts inter-procedural analysis to trace variable propagation and employs a gated graph neural network to capture long-range data flow dependencies for variable type recovery. We conduct extensive experiments to evaluate the performance of ByteTR. The results demonstrate that ByteTR leads state-of-the-art works in both effectiveness and efficiency. Moreover, in real CTF challenge case, the pseudo code optimized by ByteTR significantly improves readability, surpassing leading tools IDA and Ghidra.
中文: 本文提出ByteTR框架,通过解耦类型集以应对分布不平衡、结合跨函数传播分析和门控图神经网络来捕获长程数据流依赖,有效解决了二进制代码中变量类型恢复的复杂性,在效果和效率上均优于现有最优方法。
English: This paper introduces ByteTR, a novel framework that addresses the complexities of variable type recovery in binary code by incorporating inter-procedural analysis and a gated graph neural network to handle unbalanced type distributions and compiler optimizations, achieving state-of-the-art performance in both effectiveness and efficiency.

Authors:Tong Yu, Yongcheng Jing, Xikun Zhang, Wentao Jiang, Wenjie Wu, Yingjie Wang, Wenbin Hu, Bo Du, Dacheng Tao
Title: Benchmarking Reasoning Robustness in Large Language Models
Abstract:
Despite the recent success of large language models (LLMs) in reasoning such as DeepSeek, we for the first time identify a key dilemma in reasoning robustness and generalization: significant performance degradation on novel or incomplete data, suggesting a reliance on memorized patterns rather than systematic reasoning. Our closer examination reveals four key unique limitations underlying this issue:(1) Positional bias--models favor earlier queries in multi-query inputs but answering the wrong one in the latter (e.g., GPT-4o's accuracy drops from 75.8 percent to 72.8 percent); (2) Instruction sensitivity--performance declines by 5.0 to 7.5 percent in the Qwen2.5 Series and by 5.0 percent in DeepSeek-V3 with auxiliary guidance; (3) Numerical fragility--value substitution sharply reduces accuracy (e.g., GPT-4o drops from 97.5 percent to 82.5 percent, GPT-o1-mini drops from 97.5 percent to 92.5 percent); and (4) Memory dependence--models resort to guesswork when missing critical data. These findings further highlight the reliance on heuristic recall over rigorous logical inference, demonstrating challenges in reasoning robustness. To comprehensively investigate these robustness challenges, this paper introduces a novel benchmark, termed as Math-RoB, that exploits hallucinations triggered by missing information to expose reasoning gaps. This is achieved by an instruction-based approach to generate diverse datasets that closely resemble training distributions, facilitating a holistic robustness assessment and advancing the development of more robust reasoning frameworks. Bad character(s) in field Abstract.
中文: 大型语言模型存在四大关键局限性——位置偏差、指令敏感性、数值脆弱性和记忆依赖性,揭示了其依赖记忆模式而非系统推理的缺陷,为此提出Math-RoB基准以全面评估并提升推理鲁棒性。
English: Large language models exhibit reasoning fragility due to four key limitations—positional bias, instruction sensitivity, numerical fragility, and memory dependence—revealing reliance on memorized patterns over systematic logic, prompting the introduction of the Math-RoB benchmark to assess and improve robustness.

Authors:Chang Liu, Haolin Wu, Xi Yang, Kui Zhang, Cong Wu, Weiming Zhang, Nenghai Yu, Tianwei Zhang, Qing Guo, Jie Zhang
Title: Exploiting Vulnerabilities in Speech Translation Systems through Targeted Adversarial Attacks
Abstract:
As speech translation (ST) systems become increasingly prevalent, understanding their vulnerabilities is crucial for ensuring robust and reliable communication. However, limited work has explored this issue in depth. This paper explores methods of compromising these systems through imperceptible audio manipulations. Specifically, we present two innovative approaches: (1) the injection of perturbation into source audio, and (2) the generation of adversarial music designed to guide targeted translation, while also conducting more practical over-the-air attacks in the physical world. Our experiments reveal that carefully crafted audio perturbations can mislead translation models to produce targeted, harmful outputs, while adversarial music achieve this goal more covertly, exploiting the natural imperceptibility of music. These attacks prove effective across multiple languages and translation models, highlighting a systemic vulnerability in current ST architectures. The implications of this research extend beyond immediate security concerns, shedding light on the interpretability and robustness of neural speech processing systems. Our findings underscore the need for advanced defense mechanisms and more resilient architectures in the realm of audio systems. More details and samples can be found at https://adv-st.github.io.
中文摘要:本研究揭示了通过不可察觉的音频扰动和对抗性音乐可操纵语音翻译系统产生定向有害输出,在多语言和多模型中暴露出系统性漏洞,同时强调了开发更强防御机制的必要性。
English Summary: This study demonstrates how imperceptible audio perturbations and adversarial music can manipulate speech translation systems into producing targeted harmful outputs, revealing systemic vulnerabilities across multiple languages and models while highlighting the need for stronger defense mechanisms.

Authors:Yuning Wu, Jiahao Mei, Ming Yan, Chenliang Li, Shaopeng Lai, Yuran Ren, Zijia Wang, Ji Zhang, Mengyue Wu, Qin Jin, Fei Huang
Title: WritingBench: A Comprehensive Benchmark for Generative Writing
Abstract:
Recent advancements in large language models (LLMs) have significantly enhanced text generation capabilities, yet evaluating their performance in generative writing remains a challenge. Existing benchmarks primarily focus on generic text generation or limited in writing tasks, failing to capture the diverse requirements of high-quality written contents across various domains. To bridge this gap, we present WritingBench, a comprehensive benchmark designed to evaluate LLMs across 6 core writing domains and 100 subdomains, encompassing creative, persuasive, informative, and technical writing. We further propose a query-dependent evaluation framework that empowers LLMs to dynamically generate instance-specific assessment criteria. This framework is complemented by a fine-tuned critic model for criteria-aware scoring, enabling evaluations in style, format and length. The framework's validity is further demonstrated by its data curation capability, which enables 7B-parameter models to approach state-of-the-art (SOTA) performance. We open-source the benchmark, along with evaluation tools and modular framework components, to advance the development of LLMs in writing.
中文摘要:WritingBench作为一个综合性基准测试被提出,用于评估大语言模型在多样化写作领域的表现,其动态评估框架能够实现精准评判,并助力较小参数模型接近最先进水平。
English Summary: WritingBench is introduced as a comprehensive benchmark to evaluate large language models across diverse writing domains, featuring a dynamic evaluation framework that enables precise assessment and helps smaller models achieve near-state-of-the-art performance.

Authors:Xuenan Xu, Jiahao Mei, Chenliang Li, Yuning Wu, Ming Yan, Shaopeng Lai, Ji Zhang, Mengyue Wu
Title: MM-StoryAgent: Immersive Narrated Storybook Video Generation with a Multi-Agent Paradigm across Text, Image and Audio
Abstract:
The rapid advancement of large language models (LLMs) and artificial intelligence-generated content (AIGC) has accelerated AI-native applications, such as AI-based storybooks that automate engaging story production for children. However, challenges remain in improving story attractiveness, enriching storytelling expressiveness, and developing open-source evaluation benchmarks and frameworks. Therefore, we propose and opensource MM-StoryAgent, which creates immersive narrated video storybooks with refined plots, role-consistent images, and multi-channel audio. MM-StoryAgent designs a multi-agent framework that employs LLMs and diverse expert tools (generative models and APIs) across several modalities to produce expressive storytelling videos. The framework enhances story attractiveness through a multi-stage writing pipeline. In addition, it improves the immersive storytelling experience by integrating sound effects with visual, music and narrative assets. MM-StoryAgent offers a flexible, open-source platform for further development, where generative modules can be substituted. Both objective and subjective evaluation regarding textual story quality and alignment between modalities validate the effectiveness of our proposed MM-StoryAgent system. The demo and source code are available.
中文:提出的MM-StoryAgent是一个开源多智能体框架,利用大语言模型和生成工具创建具有增强情节、一致视觉和多声道音频的沉浸式视频故事书,解决了故事吸引力和表现力方面的挑战,同时为后续开发提供了灵活平台。
English: The proposed MM-StoryAgent is an open-source multi-agent framework that uses LLMs and generative tools to create immersive video storybooks with enhanced plots, consistent visuals, and multi-channel audio, addressing challenges in story attractiveness and expressiveness while providing a flexible platform for further development.

Authors:Ziqing Yang, Yixin Wu, Yun Shen, Wei Dai, Michael Backes, Yang Zhang
Title: The Challenge of Identifying the Origin of Black-Box Large Language Models
Abstract:
The tremendous commercial potential of large language models (LLMs) has heightened concerns about their unauthorized use. Third parties can customize LLMs through fine-tuning and offer only black-box API access, effectively concealing unauthorized usage and complicating external auditing processes. This practice not only exacerbates unfair competition, but also violates licensing agreements. In response, identifying the origin of black-box LLMs is an intrinsic solution to this issue. In this paper, we first reveal the limitations of state-of-the-art passive and proactive identification methods with experiments on 30 LLMs and two real-world black-box APIs. Then, we propose the proactive technique, PlugAE, which optimizes adversarial token embeddings in a continuous space and proactively plugs them into the LLM for tracing and identification. The experiments show that PlugAE can achieve substantial improvement in identifying fine-tuned derivatives. We further advocate for legal frameworks and regulations to better address the challenges posed by the unauthorized use of LLMs.
中文: 大型语言模型通过微调和黑盒API的未经授权使用加剧了不公平竞争和许可违规,为此提出的主动识别方法PlugAE通过优化对抗性令牌来追踪衍生模型,并呼吁加强法律框架以应对挑战。
English: The unauthorized use of large language models through fine-tuning and black-box APIs raises concerns about unfair competition and licensing violations, prompting the development of PlugAE, a proactive identification method that optimizes adversarial tokens to trace derivatives and calls for stronger legal frameworks.

Authors:Bartlomiej Surma, Michael Backes, Yang Zhang
Title: Fairness and/or Privacy on Social Graphs
Abstract:
Graph Neural Networks (GNNs) have shown remarkable success in various graph-based learning tasks. However, recent studies have raised concerns about fairness and privacy issues in GNNs, highlighting the potential for biased or discriminatory outcomes and the vulnerability of sensitive information. This paper presents a comprehensive investigation of fairness and privacy in GNNs, exploring the impact of various fairness-preserving measures on model performance. We conduct experiments across diverse datasets and evaluate the effectiveness of different fairness interventions. Our analysis considers the trade-offs between fairness, privacy, and accuracy, providing insights into the challenges and opportunities in achieving both fair and private graph learning. The results highlight the importance of carefully selecting and combining fairness-preserving measures based on the specific characteristics of the data and the desired fairness objectives. This study contributes to a deeper understanding of the complex interplay between fairness, privacy, and accuracy in GNNs, paving the way for the development of more robust and ethical graph learning models.
中文: 本研究全面探讨了图神经网络中的公平性与隐私问题,通过评估公平性干预措施、隐私保护与模型准确性之间的权衡关系,为开发更符合伦理的图学习系统提供了重要指导。
English: This study thoroughly examines the fairness and privacy challenges in Graph Neural Networks, assessing the trade-offs between fairness interventions, privacy protection, and model accuracy to guide the development of more ethical graph learning systems.

Authors:Bisheng Wei, Ruichen Zhang, Ruihong Jiang, Mugen Peng, Dusit Niyato
Title: LAURA: LLM-Assisted UAV Routing for AoI Minimization
Abstract:
With the rapid growth of the low-altitude economy, there is increasing demand for real-time data collection using UAV-assisted wireless sensor networks. This paper investigates the problem of minimizing the age of information (AoI) in UAV-assisted wireless sensor networks by optimizing the UAV flight routing. We formulate the AoI minimization task and propose a large language model (LLM)-assisted UAV routing algorithm (LAURA). LAURA employs an LLM as intelligent crossover operators within an evolutionary optimization framework to efficiently explore the solution space. Simulation results show that LAURA outperforms benchmark methods in reducing the maximum AoI, especially in scenarios with a large number of sensor nodes.
本文提出LAURA算法,利用大型语言模型辅助无人机路径规划,通过优化飞行路线最小化无人机辅助无线传感器网络中的信息年龄,在大型传感器网络场景下显著降低了最大信息年龄,性能优于基准方法。
This paper introduces LAURA, an LLM-assisted UAV routing algorithm that minimizes the Age of Information in UAV-assisted wireless sensor networks by optimizing flight paths, demonstrating superior performance in reducing maximum AoI, particularly for large-scale networks.

Authors:Shunpu Tang, Yuhao Chen, Qianqian Yang, Ruichen Zhang, Dusit Niyato, Zhiguo Shi
Title: Towards Secure Semantic Communications in the Presence of Intelligent Eavesdroppers
Abstract:
Semantic communication has emerged as a promising paradigm for enhancing communication efficiency in sixth-generation (6G) networks. However, the broadcast nature of wireless channels makes SemCom systems vulnerable to eavesdropping, which poses a serious threat to data privacy. Therefore, we investigate secure SemCom systems that preserve data privacy in the presence of eavesdroppers. Specifically, we first explore a scenario where eavesdroppers are intelligent and can exploit semantic information to reconstruct the transmitted data based on advanced artificial intelligence (AI) techniques. To counter this, we introduce novel eavesdropping attack strategies that utilize model inversion attacks and generative AI (GenAI) models. These strategies effectively reconstruct transmitted private data processed by the semantic encoder, operating in both glass-box and closed-box settings. Existing defense mechanisms against eavesdropping often cause significant distortions in the data reconstructed by eavesdroppers, potentially arousing their suspicion. To address this, we propose a semantic covert communication approach that leverages an invertible neural network (INN)-based signal steganography module. This module covertly embeds the channel input signal of a private sample into that of a non-sensitive host sample, thereby misleading eavesdroppers. Without access to this module, eavesdroppers can only extract host-related information and remain unaware of the hidden private content. We conduct extensive simulations under various channel conditions in image transmission tasks. Numerical results show that while conventional eavesdropping strategies achieve a success rate of over 80\% in reconstructing private information, the proposed semantic covert communication effectively reduces the eavesdropping success rate to 0.
中文: 6G网络中的语义通信面临窃听威胁,但提出的基于可逆神经网络的隐蔽通信方法将私有信号嵌入宿主样本,使窃听成功率从80%以上降至0%。
English: Semantic communication in 6G networks faces eavesdropping risks, but the proposed covert method using invertible neural networks embeds private signals into host samples, reducing eavesdropping success from over 80% to 0%.

Authors:Yiwei Ma, Guohai Xu, Xiaoshuai Sun, Jiayi Ji, Jie Lou, Debing Zhang, Rongrong Ji
Title: MLLM-Selector: Necessity and Diversity-driven High-Value Data Selection for Enhanced Visual Instruction Tuning
Abstract:
Visual instruction tuning (VIT) has emerged as a crucial technique for enabling multi-modal large language models (MLLMs) to follow user instructions adeptly. Yet, a significant gap persists in understanding the attributes of high-quality instruction tuning data and frameworks for its automated selection. To address this, we introduce MLLM-Selector, an automated approach that identifies valuable data for VIT by weighing necessity and diversity. Our process starts by randomly sampling a subset from the VIT data pool to fine-tune a pretrained model, thus creating a seed model with an initial ability to follow instructions. Then, leveraging the seed model, we calculate necessity scores for each sample in the VIT data pool to identify samples pivotal for enhancing model performance. Our findings underscore the importance of mixing necessity and diversity in data choice, leading to the creation of MLLM-Selector, our methodology that fuses necessity scoring with strategic sampling for superior data refinement. Empirical results indicate that within identical experimental conditions, MLLM-Selector surpasses LLaVA-1.5 in some benchmarks with less than 1% of the data and consistently exceeds performance across all validated benchmarks when using less than 50%.
中文: MLLM-Selector 是一种自动化方法,通过结合必要性和多样性筛选高质量数据来优化视觉指令调优,在仅使用少量数据的情况下,其性能超越了 LLaVA-1.5 等现有模型。
English: MLLM-Selector is an automated method that enhances visual instruction tuning by selecting high-value data based on necessity and diversity, achieving superior performance with significantly less data compared to existing models like LLaVA-1.5.

Authors:Yongxin Ma, Jie Xu, Shenghai Yuan, Tian Zhi, Wenlu Yu, Jun Zhou, Lihua Xie
Title: MM-LINS: a Multi-Map LiDAR-Inertial System for Over-Degenerate Environments
Abstract:
SLAM plays a crucial role in automation tasks, such as warehouse logistics, healthcare robotics, and restaurant delivery. These scenes come with various challenges, including navigating around crowds of people, dealing with flying plastic bags that can temporarily blind sensors, and addressing reduced LiDAR density caused by cooking smoke. Such scenarios can result in over-degeneracy, causing the map to drift. To address this issue, this paper presents a multi-map LiDAR-inertial system (MM-LINS) for the first time. The front-end employs an iterated error state Kalman filter for state estimation and introduces a reliable evaluation strategy for degeneracy detection. If over-degeneracy is detected, the active map will be stored into sleeping maps. Subsequently, the system continuously attempts to construct new maps using a dynamic initialization method to ensure successful initialization upon leaving the over-degeneracy. Regarding the back-end, the Scan Context descriptor is utilized to detect inter-map similarity. Upon successful recognition of a sleeping map that shares a common region with the active map, the overlapping trajectory region is utilized to constrain the positional transformation near the edge of the prior map. In response to this, a constraint-enhanced map fusion strategy is proposed to achieve high-precision positional and mapping results. Experiments have been conducted separately on both public datasets that exhibited over-degenerate conditions and in real-world environments. These tests demonstrated the effectiveness of MM-LINS in over-degeneracy environment. Our codes are open-sourced on Github.
中文: 本文提出了一种多地图激光雷达惯性系统(MM-LINS),通过退化检测、动态地图初始化和约束增强的融合策略,有效解决了过退化环境中的地图漂移问题,实现了精确定位与建图。
English: This paper introduces a multi-map LiDAR-inertial system (MM-LINS) that addresses map drift in over-degenerate environments by employing degeneracy detection, dynamic map initialization, and constraint-enhanced fusion strategies to maintain accurate positioning and mapping.

Authors:Yaoyao Yu, Leilei Gan, Yinghao Hu, Bin Wei, Kun Kuang, Fei Wu
Title: Evaluating Test-Time Scaling LLMs for Legal Reasoning: OpenAI o1, DeepSeek-R1, and Beyond
Abstract:
Recently, Test-Time Scaling Large Language Models (LLMs), such as DeepSeek-R1 and OpenAI o1, have demonstrated exceptional capabilities across various domains and tasks, particularly in reasoning. While these models have shown impressive performance on general language tasks, their effectiveness in specialized fields like legal remains unclear. To address this, we present a preliminary evaluation of LLMs in various legal scenarios, covering both Chinese and English legal tasks. Our analysis includes 9 LLMs and 17 legal tasks, with a focus on newly published and more complex challenges such as multi-defendant legal judgments and legal argument reasoning. Our findings indicate that, despite DeepSeek-R1 and OpenAI o1 being among the most powerful models, their legal reasoning capabilities are still lacking. Specifically, these models score below 80\% on seven Chinese legal reasoning tasks and below 80\% on two English legal reasoning tasks. This suggests that, even among the most advanced reasoning models, legal reasoning abilities remain underdeveloped.
中文摘要:尽管DeepSeek-R1和OpenAI o1等测试时缩放大模型在通用推理中表现优异,但在中英文法律推理任务中得分均低于80%,表明其法律推理能力仍有待提升。
English Summary: Test-time scaling LLMs like DeepSeek-R1 and OpenAI o1 show strong general reasoning but perform below 80% on multiple Chinese and English legal tasks, revealing underdeveloped legal reasoning capabilities.

Authors:Qiong Wu, Xiangcong Yang, Yiyi Zhou, Chenxin Fang, Baiyang Song, Xiaoshuai Sun, Rongrong Ji
Title: Grounded Chain-of-Thought for Multimodal Large Language Models
Abstract:
Despite great progress, existing multimodal large language models (MLLMs) are prone to visual hallucination, greatly impeding their trustworthy applications. In this paper, we study this problem from the perspective of visual-spatial reasoning, and propose a new learning task for MLLMs, termed Grounded Chain-of-Thought (GCoT). Different from recent visual CoT studies, which focus more on visual knowledge reasoning, GCoT is keen to helping MLLMs to recognize and ground the relevant visual cues step by step, thereby predicting the correct answer with grounding coordinates as the intuitive basis. To facilitate this task, we also carefully design and construct a dataset called multimodal grounded chain-of-thought (MM-GCoT) consisting of 24,022 GCoT examples for 5,033 images. Besides, a comprehensive consistency evaluation system is also introduced, including the metrics of answer accuracy, grounding accuracy and answer-grounding consistency. We further design and conduct a bunch of experiments on 12 advanced MLLMs, and reveal some notable findings: i. most MLLMs performs poorly on the consistency evaluation, indicating obvious visual hallucination; ii. visual hallucination is not directly related to the parameter size and general multimodal performance, i.e., a larger and stronger MLLM is not less affected by this issue. Lastly, we also demonstrate that the proposed dataset can help existing MLLMs to well cultivate their GCoT capability and reduce the inconsistent answering significantly. Moreover, their GCoT can be also generalized to exiting multimodal tasks, such as open-world QA and REC.
中文: 本文针对多模态大语言模型的视觉幻觉问题,提出了基于视觉空间推理的GCoT任务,通过构建专用数据集和评估体系,有效提升了模型的视觉基础能力和答案一致性,并能泛化至其他多模态任务。
English: To address visual hallucination in multimodal large language models, this paper introduces Grounded Chain-of-Thought (GCoT), a task that enhances step-by-step visual cue recognition and grounding, supported by a new dataset and evaluation system showing improved consistency and generalization.

Authors:Jiani Fan, Lwin Khin Shar, Ruichen Zhang, Ziyao Liu, Wenzhuo Yang, Dusit Niyato, Bomin Mao, Kwok-Yan Lam
Title: Deep Learning Approaches for Anti-Money Laundering on Mobile Transactions: Review, Framework, and Directions
Abstract:
Money laundering is a financial crime that obscures the origin of illicit funds, necessitating the development and enforcement of anti-money laundering (AML) policies by governments and organizations. The proliferation of mobile payment platforms and smart IoT devices has significantly complicated AML investigations. As payment networks become more interconnected, there is an increasing need for efficient real-time detection to process large volumes of transaction data on heterogeneous payment systems by different operators such as digital currencies, cryptocurrencies and account-based payments. Most of these mobile payment networks are supported by connected devices, many of which are considered loT devices in the FinTech space that constantly generate data. Furthermore, the growing complexity and unpredictability of transaction patterns across these networks contribute to a higher incidence of false positives. While machine learning solutions have the potential to enhance detection efficiency, their application in AML faces unique challenges, such as addressing privacy concerns tied to sensitive financial data and managing the real-world constraint of limited data availability due to data regulations. Existing surveys in the AML literature broadly review machine learning approaches for money laundering detection, but they often lack an in-depth exploration of advanced deep learning techniques - an emerging field with significant potential. To address this gap, this paper conducts a comprehensive review of deep learning solutions and the challenges associated with their use in AML. Additionally, we propose a novel framework that applies the least-privilege principle by integrating machine learning techniques, codifying AML red flags, and employing account profiling to provide context for predictions and enable effective fraud detection under limited data availability....
Chinese: 洗钱掩盖非法资金来源,而移动支付和物联网的兴起使检测复杂化,尽管存在隐私和数据限制,仍需先进的深度学习解决方案。
English: Money laundering conceals illicit funds' origins, and the rise of mobile payments and IoT complicates detection, requiring advanced deep learning solutions despite privacy and data constraints.

Authors:Ruimeng Liu, Xinhang Xu, Shenghai Yuan, Lihua Xie
Title: Handle Object Navigation as Weighted Traveling Repairman Problem
Abstract:
Zero-Shot Object Navigation (ZSON) requires agents to navigate to objects specified via open-ended natural language without predefined categories or prior environmental knowledge. While recent methods leverage foundation models or multi-modal maps, they often rely on 2D representations and greedy strategies or require additional training or modules with high computation load, limiting performance in complex environments and real applications. We propose WTRP-Searcher, a novel framework that formulates ZSON as a Weighted Traveling Repairman Problem (WTRP), minimizing the weighted waiting time of viewpoints. Using a Vision-Language Model (VLM), we score viewpoints based on object-description similarity, projected onto a 2D map with depth information. An open-vocabulary detector identifies targets, dynamically updating goals, while a 3D embedding feature map enhances spatial awareness and environmental recall. WTRP-Searcher outperforms existing methods, offering efficient global planning and improved performance in complex ZSON tasks. Code and design will be open-sourced upon acceptance.
中文摘要:WTRP-Searcher 将零样本目标导航构建为加权旅行修理工问题,通过视觉语言模型和三维地图实现高效全局路径规划,在复杂环境中展现出卓越的导航性能。
English Summary: WTRP-Searcher formulates Zero-Shot Object Navigation as a Weighted Traveling Repairman Problem, using vision-language models and 3D mapping to achieve superior navigation performance through efficient global planning in complex environments.

Authors:Xiaowei Li, Kuan Xu, Fen Liu, Ruofei Bai, Shenghai Yuan, Lihua Xie
Title: AirSwarm: Enabling Cost-Effective Multi-UAV Research with COTS drones
Abstract:
Traditional unmanned aerial vehicle (UAV) swarm missions rely heavily on expensive custom-made drones with onboard perception or external positioning systems, limiting their widespread adoption in research and education. To address this issue, we propose AirSwarm. AirSwarm democratizes multi-drone coordination using low-cost commercially available drones such as Tello or Anafi, enabling affordable swarm aerial robotics research and education. Key innovations include a hierarchical control architecture for reliable multi-UAV coordination, an infrastructure-free visual SLAM system for precise localization without external motion capture, and a ROS-based software framework for simplified swarm development. Experiments demonstrate cm-level tracking accuracy, low-latency control, communication failure resistance, formation flight, and trajectory tracking. By reducing financial and technical barriers, AirSwarm makes multi-robot education and research more accessible. The complete instructions and open source code will be available at
中文摘要:AirSwarm通过采用低成本商用无人机和分层控制系统,无需外部定位设备即可实现精确的无人机集群协调,大幅降低了多机器人研究与教育的门槛。
English Summary: AirSwarm enables affordable UAV swarm research and education using low-cost commercial drones, featuring a hierarchical control system and infrastructure-free visual SLAM for precise coordination without expensive equipment.

Authors:Zhenmin Huang, Ce Hao, Wei Zhan, Jun Ma, Masayoshi Tomizuka
Title: Fair Play in the Fast Lane: Integrating Sportsmanship into Autonomous Racing Systems
Abstract:
Autonomous racing has gained significant attention as a platform for high-speed decision-making and motion control. While existing methods primarily focus on trajectory planning and overtaking strategies, the role of sportsmanship in ensuring fair competition remains largely unexplored. In human racing, rules such as the one-motion rule and the enough-space rule prevent dangerous and unsportsmanlike behavior. However, autonomous racing systems often lack mechanisms to enforce these principles, potentially leading to unsafe maneuvers. This paper introduces a bi-level game-theoretic framework to integrate sportsmanship (SPS) into versus racing. At the high level, we model racing intentions using a Stackelberg game, where Monte Carlo Tree Search (MCTS) is employed to derive optimal strategies. At the low level, vehicle interactions are formulated as a Generalized Nash Equilibrium Problem (GNEP), ensuring that all agents follow sportsmanship constraints while optimizing their trajectories. Simulation results demonstrate the effectiveness of the proposed approach in enforcing sportsmanship rules while maintaining competitive performance. We analyze different scenarios where attackers and defenders adhere to or disregard sportsmanship rules and show how knowledge of these constraints influences strategic decision-making. This work highlights the importance of balancing competition and fairness in autonomous racing and provides a foundation for developing ethical and safe AI-driven racing systems.
中文: 本文提出了一种双层博弈论框架,通过斯塔克伯格博弈建模竞赛意图和广义纳什均衡问题优化轨迹,将体育精神融入自动驾驶赛车中,在保持竞技性的同时有效执行公平竞赛规则。
English: This paper presents a bi-level game-theoretic framework that integrates sportsmanship into autonomous racing by using a Stackelberg game for intention modeling and a Generalized Nash Equilibrium Problem for trajectory optimization, effectively enforcing fair competition rules while maintaining performance.

Authors:Yan-Bo Lin, Kevin Lin, Zhengyuan Yang, Linjie Li, Jianfeng Wang, Chung-Ching Lin, Xiaofei Wang, Gedas Bertasius, Lijuan Wang
Title: Zero-Shot Audio-Visual Editing via Cross-Modal Delta Denoising
Abstract:
In this paper, we introduce zero-shot audio-video editing, a novel task that requires transforming original audio-visual content to align with a specified textual prompt without additional model training. To evaluate this task, we curate a benchmark dataset, AvED-Bench, designed explicitly for zero-shot audio-video editing. AvED-Bench includes 110 videos, each with a 10-second duration, spanning 11 categories from VGGSound. It offers diverse prompts and scenarios that require precise alignment between auditory and visual elements, enabling robust evaluation. We identify limitations in existing zero-shot audio and video editing methods, particularly in synchronization and coherence between modalities, which often result in inconsistent outcomes. To address these challenges, we propose AvED, a zero-shot cross-modal delta denoising framework that leverages audio-video interactions to achieve synchronized and coherent edits. AvED demonstrates superior results on both AvED-Bench and the recent OAVE dataset to validate its generalization capabilities. Results are available at https://genjib.github.io/project_page/AVED/index.html
Chinese: 本文提出了零样本音视频编辑任务,构建了AvED-Bench评估数据集,并开发了AvED框架来实现无需训练的跨模态同步编辑。
English: This paper introduces zero-shot audio-video editing, creates the AvED-Bench evaluation dataset, and proposes the AvED framework to achieve synchronized cross-modal edits without training.

Authors:Alex Jinpeng Wang, Linjie Li, Zhengyuan Yang, Lijuan Wang, Min Li
Title: Beyond Words: Advancing Long-Text Image Generation via Multimodal Autoregressive Models
Abstract:
Recent advancements in autoregressive and diffusion models have led to strong performance in image generation with short scene text words. However, generating coherent, long-form text in images, such as paragraphs in slides or documents, remains a major challenge for current generative models. We present the first work specifically focused on long text image generation, addressing a critical gap in existing text-to-image systems that typically handle only brief phrases or single sentences. Through comprehensive analysis of state-of-the-art autoregressive generation models, we identify the image tokenizer as a critical bottleneck in text generating quality. To address this, we introduce a novel text-focused, binary tokenizer optimized for capturing detailed scene text features. Leveraging our tokenizer, we develop \ModelName, a multimodal autoregressive model that excels in generating high-quality long-text images with unprecedented fidelity. Our model offers robust controllability, enabling customization of text properties such as font style, size, color, and alignment. Extensive experiments demonstrate that \ModelName~significantly outperforms SD3.5 Large~\cite{sd3} and GPT4o~\cite{gpt4o} with DALL-E 3~\cite{dalle3} in generating long text accurately, consistently, and flexibly. Beyond its technical achievements, \ModelName~opens up exciting opportunities for innovative applications like interleaved document and PowerPoint generation, establishing a new frontier in long-text image generating.
中文摘要:本研究首次针对长文本图像生成提出专门解决方案,通过开发新型二进制分词器和多模态自回归模型,在准确性、一致性和灵活性上显著超越现有先进方法,突破了当前生成模型的技术瓶颈。
English Summary: This work introduces the first model specifically designed for long-text image generation, overcoming limitations of existing systems by developing a novel binary tokenizer and a multimodal autoregressive model that significantly outperforms current state-of-the-art methods in accuracy, consistency, and flexibility.

Authors:Jiaqi Liao, Zhengyuan Yang, Linjie Li, Dianqi Li, Kevin Lin, Yu Cheng, Lijuan Wang
Title: ImageGen-CoT: Enhancing Text-to-Image In-context Learning with Chain-of-Thought Reasoning
Abstract:
In this work, we study the problem of Text-to-Image In-Context Learning (T2I-ICL). While Unified Multimodal LLMs (MLLMs) have advanced rapidly in recent years, they struggle with contextual reasoning in T2I-ICL scenarios. To address this limitation, we propose a novel framework that incorporates a thought process called ImageGen-CoT prior to image generation. To avoid generating unstructured ineffective reasoning steps, we develop an automatic pipeline to curate a high-quality ImageGen-CoT dataset. We then fine-tune MLLMs using this dataset to enhance their contextual reasoning capabilities. To further enhance performance, we explore test-time scale-up strategies and propose a novel hybrid scaling approach. This approach first generates multiple ImageGen-CoT chains and then produces multiple images for each chain via sampling. Extensive experiments demonstrate the effectiveness of our proposed method. Notably, fine-tuning with the ImageGen-CoT dataset leads to a substantial 80\% performance gain for SEED-X on T2I-ICL tasks. See our project page at https://ImageGen-CoT.github.io/. Code and model weights will be open-sourced.
中文: 本研究提出ImageGen-CoT框架,通过在图像生成前引入结构化思维过程来增强多模态大语言模型在文本到图像上下文学习中的推理能力,通过微调和混合扩展策略使SEED-X模型性能提升80%。
English: This study introduces ImageGen-CoT, a framework that enhances multimodal LLMs' contextual reasoning in Text-to-Image In-Context Learning by generating structured thought processes before image creation, achieving an 80% performance gain on SEED-X through fine-tuning and hybrid scaling strategies.

Authors:Bin Zhang, Jinggang Chen, Xiaoyang Qu, Guokuan Li, Kai Lu, Jiguang Wan, Jing Xiao, Jianzong Wang
Title: RUNA: Object-level Out-of-Distribution Detection via Regional Uncertainty Alignment of Multimodal Representations
Abstract:
Enabling object detectors to recognize out-of-distribution (OOD) objects is vital for building reliable systems. A primary obstacle stems from the fact that models frequently do not receive supervisory signals from unfamiliar data, leading to overly confident predictions regarding OOD objects. Despite previous progress that estimates OOD uncertainty based on the detection model and in-distribution (ID) samples, we explore using pre-trained vision-language representations for object-level OOD detection. We first discuss the limitations of applying image-level CLIP-based OOD detection methods to object-level scenarios. Building upon these insights, we propose RUNA, a novel framework that leverages a dual encoder architecture to capture rich contextual information and employs a regional uncertainty alignment mechanism to distinguish ID from OOD objects effectively. We introduce a few-shot fine-tuning approach that aligns region-level semantic representations to further improve the model's capability to discriminate between similar objects. Our experiments show that RUNA substantially surpasses state-of-the-art methods in object-level OOD detection, particularly in challenging scenarios with diverse and complex object instances.
中文: 本文提出RUNA框架,利用预训练的视觉-语言表征和双编码器架构,通过区域不确定性对齐机制显著提升物体级别分布外检测性能,在复杂场景下大幅超越现有方法。
English: This paper introduces RUNA, a novel framework that utilizes pre-trained vision-language representations and a dual encoder architecture with regional uncertainty alignment to significantly enhance object-level out-of-distribution detection, outperforming existing methods in challenging scenarios.

Authors:Zhi Hou, Tianyi Zhang, Yuwen Xiong, Haonan Duan, Hengjun Pu, Ronglei Tong, Chengyang Zhao, Xizhou Zhu, Yu Qiao, Jifeng Dai, Yuntao Chen
Title: Dita: Scaling Diffusion Transformer for Generalist Vision-Language-Action Policy
Abstract:
While recent vision-language-action models trained on diverse robot datasets exhibit promising generalization capabilities with limited in-domain data, their reliance on compact action heads to predict discretized or continuous actions constrains adaptability to heterogeneous action spaces. We present Dita, a scalable framework that leverages Transformer architectures to directly denoise continuous action sequences through a unified multimodal diffusion process. Departing from prior methods that condition denoising on fused embeddings via shallow networks, Dita employs in-context conditioning -- enabling fine-grained alignment between denoised actions and raw visual tokens from historical observations. This design explicitly models action deltas and environmental nuances. By scaling the diffusion action denoiser alongside the Transformer's scalability, Dita effectively integrates cross-embodiment datasets across diverse camera perspectives, observation scenes, tasks, and action spaces. Such synergy enhances robustness against various variances and facilitates the successful execution of long-horizon tasks. Evaluations across extensive benchmarks demonstrate state-of-the-art or comparative performance in simulation. Notably, Dita achieves robust real-world adaptation to environmental variances and complex long-horizon tasks through 10-shot finetuning, using only third-person camera inputs. The architecture establishes a versatile, lightweight and open-source baseline for generalist robot policy learning. Project Page: https://robodita.github.io.
中文: Dita是一个基于Transformer的可扩展框架,通过多模态扩散直接对连续动作序列进行去噪,实现了与视觉输入的细粒度对齐,并在多样化任务和环境中展现出强大的适应能力。
English: Dita is a scalable Transformer-based framework that uses multimodal diffusion to directly denoise continuous action sequences, enabling fine-grained alignment with visual inputs and robust adaptation across diverse tasks and environments.

Authors:Ali Umut Kaypak, Shiqing Wei, Prashanth Krishnamurthy, Farshad Khorrami
Title: Safe Multi-Robotic Arm Interaction via 3D Convex Shapes
Abstract:
Inter-robot collisions pose a significant safety risk when multiple robotic arms operate in close proximity. We present an online collision avoidance methodology leveraging 3D convex shape-based High-Order Control Barrier Functions (HOCBFs) to address this issue. While prior works focused on using Control Barrier Functions (CBFs) for human-robotic arm and single-arm collision avoidance, we explore the problem of collision avoidance between multiple robotic arms operating in a shared space. In our methodology, we utilize the proposed HOCBFs as centralized and decentralized safety filters. These safety filters are compatible with many nominal controllers and ensure safety without significantly restricting the robots' workspace. A key challenge in implementing these filters is the computational overhead caused by the large number of safety constraints and the computation of a Hessian matrix per constraint. We address this challenge by employing numerical differentiation methods to approximate computationally intensive terms. The effectiveness of our method is demonstrated through extensive simulation studies and real-world experiments with Franka Research 3 robotic arms. The project video is available at this link.
中文:本研究提出了一种基于三维凸形体高阶控制屏障函数的在线避碰方法,通过数值近似技术解决计算瓶颈,有效保障多机械臂在共享空间中的协同作业安全。
English: This study introduces an online collision avoidance method using 3D convex shape-based High-Order Control Barrier Functions (HOCBFs) to prevent collisions between multiple robotic arms in shared workspaces, employing efficient numerical approximations to overcome computational challenges.

Authors:Nimesh Khandelwal, Amritanshu Manu, Shakti S. Gupta, Mangal Kothari, Prashanth Krishnamurthy, Farshad Khorrami
Title: Compliant Control of Quadruped Robots for Assistive Load Carrying
Abstract:
This paper presents a novel method for assistive load carrying using quadruped robots. The controller uses proprioceptive sensor data to estimate external base wrench, that is used for precise control of the robot's acceleration during payload transport. The acceleration is controlled using a combination of admittance control and Control Barrier Function (CBF) based quadratic program (QP). The proposed controller rejects disturbances and maintains consistent performance under varying load conditions. Additionally, the built-in CBF guarantees collision avoidance with the collaborative agent in front of the robot. The efficacy of the overall controller is shown by its implementation on the physical hardware as well as numerical simulations. The proposed control framework aims to enhance the quadruped robot's ability to perform assistive tasks in various scenarios, from industrial applications to search and rescue operations.
中文摘要:本文提出了一种四足机器人辅助载重的新方法,通过本体感知传感器估计外部力矩,结合导纳控制与基于控制屏障函数的二次规划来实现精确加速度控制,确保抗干扰性能和避障能力。
English Summary: This paper introduces a novel control method for quadruped robots that enables precise payload transport by estimating external forces through proprioceptive sensors and combining admittance control with Control Barrier Functions to ensure disturbance rejection and collision avoidance.

Authors:Weiyun Wang, Zhangwei Gao, Lianjie Chen, Zhe Chen, Jinguo Zhu, Xiangyu Zhao, Yangzhou Liu, Yue Cao, Shenglong Ye, Xizhou Zhu, Lewei Lu, Haodong Duan, Yu Qiao, Jifeng Dai, Wenhai Wang
Title: VisualPRM: An Effective Process Reward Model for Multimodal Reasoning
Abstract:
We introduce VisualPRM, an advanced multimodal Process Reward Model (PRM) with 8B parameters, which improves the reasoning abilities of existing Multimodal Large Language Models (MLLMs) across different model scales and families with Best-of-N (BoN) evaluation strategies. Specifically, our model improves the reasoning performance of three types of MLLMs and four different model scales. Even when applied to the highly capable InternVL2.5-78B, it achieves a 5.9-point improvement across seven multimodal reasoning benchmarks. Experimental results show that our model exhibits superior performance compared to Outcome Reward Models and Self-Consistency during BoN evaluation. To facilitate the training of multimodal PRMs, we construct a multimodal process supervision dataset VisualPRM400K using an automated data pipeline. For the evaluation of multimodal PRMs, we propose VisualProcessBench, a benchmark with human-annotated step-wise correctness labels, to measure the abilities of PRMs to detect erroneous steps in multimodal reasoning tasks. We hope that our work can inspire more future research and contribute to the development of MLLMs. Our model, data, and benchmark are released in https://internvl.github.io/blog/2025-03-13-VisualPRM/.
中文: VisualPRM是一个拥有80亿参数的多模态过程奖励模型,通过最佳N选评估策略提升不同多模态大语言模型的推理能力,并构建了新数据集和基准以推动该领域发展。
English: VisualPRM is an 8-billion-parameter multimodal Process Reward Model that enhances reasoning across various MLLMs through Best-of-N evaluation, achieving significant performance gains and introducing a new dataset and benchmark for future research.

Authors:Junyi Li, Yongqiang Chen, Chenxi Liu, Qianyi Cai, Tongliang Liu, Bo Han, Kun Zhang, Hui Xiong
Title: Can Large Language Models Help Experimental Design for Causal Discovery?
Abstract:
Designing proper experiments and selecting optimal intervention targets is a longstanding problem in scientific or causal discovery. Identifying the underlying causal structure from observational data alone is inherently difficult. Obtaining interventional data, on the other hand, is crucial to causal discovery, yet it is usually expensive and time-consuming to gather sufficient interventional data to facilitate causal discovery. Previous approaches commonly utilize uncertainty or gradient signals to determine the intervention targets. However, numerical-based approaches may yield suboptimal results due to the inaccurate estimation of the guiding signals at the beginning when with limited interventional data. In this work, we investigate a different approach, whether we can leverage Large Language Models (LLMs) to assist with the intervention targeting in causal discovery by making use of the rich world knowledge about the experimental design in LLMs. Specifically, we present Large Language Model Guided Intervention Targeting (LeGIT) -- a robust framework that effectively incorporates LLMs to augment existing numerical approaches for the intervention targeting in causal discovery. Across 4 realistic benchmark scales, LeGIT demonstrates significant improvements and robustness over existing methods and even surpasses humans, which demonstrates the usefulness of LLMs in assisting with experimental design for scientific discovery.
中文: 本研究提出LeGIT框架,利用大语言模型的世界知识来改进因果发现中的干预目标选择,在多个基准测试中展现出优于现有方法及人类专家的性能和鲁棒性。
English: This study introduces LeGIT, a framework that leverages Large Language Models' world knowledge to enhance intervention targeting in causal discovery, demonstrating superior performance and robustness over existing methods and even human experts across multiple benchmarks.

Authors:Zhonghan Zhao, Wenwei Zhang, Haian Huang, Kuikun Liu, Jianfei Gao, Gaoang Wang, Kai Chen
Title: RIG: Synergizing Reasoning and Imagination in End-to-End Generalist Policy
Abstract:
Reasoning before action and imagining potential outcomes (i.e., world models) are essential for embodied agents operating in complex open-world environments. Yet, prior work either incorporates only one of these abilities in an end-to-end agent or integrates multiple specialized models into an agent system, limiting the learning efficiency and generalization of the policy. Thus, this paper makes the first attempt to synergize Reasoning and Imagination in an end-to-end Generalist policy, termed RIG. To train RIG in an end-to-end manner, we construct a data pipeline that progressively integrates and enriches the content of imagination and reasoning in the trajectories collected from existing agents. The joint learning of reasoning and next image generation explicitly models the inherent correlation between reasoning, action, and dynamics of environments, and thus exhibits more than $17\times$ sample efficiency improvements and generalization in comparison with previous works. During inference, RIG first reasons about the next action, produces potential action, and then predicts the action outcomes, which offers the agent a chance to review and self-correct based on the imagination before taking real actions. Experimental results show that the synergy of reasoning and imagination not only improves the robustness, generalization, and interoperability of generalist policy but also enables test-time scaling to enhance overall performance.
中文: 本文提出RIG,一种端到端的通用策略,通过协同推理与想象提升具身智能体的效率与泛化能力,实现了超过17倍的样本效率提升,并借助预测结果在行动前进行自我修正。
English: This paper introduces RIG, an end-to-end generalist policy that synergizes reasoning and imagination to enhance embodied agents' efficiency and generalization, achieving over 17x sample efficiency improvements and enabling self-correction before action through predicted outcomes.

Authors:Yuanyuan Wang, Hangting Chen, Dongchao Yang, Weiqin Li, Dan Luo, Guangzhi Li, Shan Yang, Zhiyong Wu, Helen Meng, Xixin Wu
Title: UniSep: Universal Target Audio Separation with Language Models at Scale
Abstract:
We propose Universal target audio Separation (UniSep), addressing the separation task on arbitrary mixtures of different types of audio. Distinguished from previous studies, UniSep is performed on unlimited source domains and unlimited source numbers. We formulate the separation task as a sequence-to-sequence problem, and a large language model (LLM) is used to model the audio sequence in the discrete latent space, leveraging the power of LLM in handling complex mixture audios with large-scale data. Moreover, a novel pre-training strategy is proposed to utilize audio-only data, which reduces the efforts of large-scale data simulation and enhances the ability of LLMs to understand the consistency and correlation of information within audio sequences. We also demonstrate the effectiveness of scaling datasets in an audio separation task: we use large-scale data (36.5k hours), including speech, music, and sound, to train a universal target audio separation model that is not limited to a specific domain. Experiments show that UniSep achieves competitive subjective and objective evaluation results compared with single-task models.
中文:UniSep提出了一种通用音频分离模型,采用序列到序列框架中的大型语言模型处理任意音频混合,支持无限领域和音源数量,并通过大规模数据的创新预训练方法,在主观和客观评估中达到与单任务模型相媲美的性能。
English: UniSep introduces a universal audio separation model that uses a large language model in a sequence-to-sequence framework to handle arbitrary audio mixtures across unlimited domains and source numbers, achieving competitive results with single-task models through novel pre-training on large-scale data.

Authors:Jin Wang, Chenghui Lv, Xian Li, Shichao Dong, Huadong Li, kelu Yao, Chao Li, Wenqi Shao, Ping Luo
Title: Forensics-Bench: A Comprehensive Forgery Detection Benchmark Suite for Large Vision Language Models
Abstract:
Recently, the rapid development of AIGC has significantly boosted the diversities of fake media spread in the Internet, posing unprecedented threats to social security, politics, law, and etc. To detect the ever-increasingly diverse malicious fake media in the new era of AIGC, recent studies have proposed to exploit Large Vision Language Models (LVLMs) to design robust forgery detectors due to their impressive performance on a wide range of multimodal tasks. However, it still lacks a comprehensive benchmark designed to comprehensively assess LVLMs' discerning capabilities on forgery media. To fill this gap, we present Forensics-Bench, a new forgery detection evaluation benchmark suite to assess LVLMs across massive forgery detection tasks, requiring comprehensive recognition, location and reasoning capabilities on diverse forgeries. Forensics-Bench comprises 63,292 meticulously curated multi-choice visual questions, covering 112 unique forgery detection types from 5 perspectives: forgery semantics, forgery modalities, forgery tasks, forgery types and forgery models. We conduct thorough evaluations on 22 open-sourced LVLMs and 3 proprietary models GPT-4o, Gemini 1.5 Pro, and Claude 3.5 Sonnet, highlighting the significant challenges of comprehensive forgery detection posed by Forensics-Bench. We anticipate that Forensics-Bench will motivate the community to advance the frontier of LVLMs, striving for all-around forgery detectors in the era of AIGC. The deliverables will be updated at https://Forensics-Bench.github.io/.
中文: AIGC的快速发展加剧了虚假媒体的传播,为此推出了Forensics-Bench基准测试,通过多维度选择题全面评估大型视觉语言模型在伪造检测中的识别、定位和推理能力。
English: The rapid advancement of AIGC has escalated the spread of diverse fake media, prompting the development of Forensics-Bench, a comprehensive benchmark to evaluate Large Vision Language Models' capabilities in detecting forgeries through multi-choice questions across various dimensions.

Authors:Runjian Chen, Wenqi Shao, Bo Zhang, Shaoshuai Shi, Li Jiang, Ping Luo
Title: JiSAM: Alleviate Labeling Burden and Corner Case Problems in Autonomous Driving via Minimal Real-World Data
Abstract:
Deep-learning-based autonomous driving (AD) perception introduces a promising picture for safe and environment-friendly transportation. However, the over-reliance on real labeled data in LiDAR perception limits the scale of on-road attempts. 3D real world data is notoriously time-and-energy-consuming to annotate and lacks corner cases like rare traffic participants. On the contrary, in simulators like CARLA, generating labeled LiDAR point clouds with corner cases is a piece of cake. However, introducing synthetic point clouds to improve real perception is non-trivial. This stems from two challenges: 1) sample efficiency of simulation datasets 2) simulation-to-real gaps. To overcome both challenges, we propose a plug-and-play method called JiSAM , shorthand for Jittering augmentation, domain-aware backbone and memory-based Sectorized AlignMent. In extensive experiments conducted on the famous AD dataset NuScenes, we demonstrate that, with SOTA 3D object detector, JiSAM is able to utilize the simulation data and only labels on 2.5% available real data to achieve comparable performance to models trained on all real data. Additionally, JiSAM achieves more than 15 mAPs on the objects not labeled in the real training set. We will release models and codes.
Chinese: 提出的JiSAM方法有效利用合成LiDAR数据增强自动驾驶感知,仅需2.5%真实标注数据即可达到同等性能,并显著提升对未标注物体的检测能力。
English: The proposed JiSAM method effectively leverages synthetic LiDAR data to enhance autonomous driving perception, achieving comparable performance with only 2.5% of real labeled data while significantly improving detection of unlabeled objects.

Authors:Cong Chen, Mingyu Liu, Chenchen Jing, Yizhou Zhou, Fengyun Rao, Hao Chen, Bo Zhang, Chunhua Shen
Title: PerturboLLaVA: Reducing Multimodal Hallucinations with Perturbative Visual Training
Abstract:
This paper aims to address the challenge of hallucinations in Multimodal Large Language Models (MLLMs) particularly for dense image captioning tasks. To tackle the challenge, we identify the current lack of a metric that finely measures the caption quality in concept level. We hereby introduce HalFscore, a novel metric built upon the language graph and is designed to evaluate both the accuracy and completeness of dense captions at a granular level. Additionally, we identify the root cause of hallucination as the model's over-reliance on its language prior. To address this, we propose PerturboLLaVA, which reduces the model's reliance on the language prior by incorporating adversarially perturbed text during training. This method enhances the model's focus on visual inputs, effectively reducing hallucinations and producing accurate, image-grounded descriptions without incurring additional computational overhead. PerturboLLaVA significantly improves the fidelity of generated captions, outperforming existing approaches in handling multimodal hallucinations and achieving improved performance across general multimodal benchmarks.
中文: 本文提出了HalFscore这一新颖指标,用于在概念层面评估密集图像描述的质量,并开发了PerturboLLaVA方法,通过在训练中加入对抗性扰动文本来减少多模态大语言模型的幻觉,增强视觉关注且不增加计算负担。
English: This paper introduces HalFscore, a novel metric for evaluating dense image caption quality at the concept level, and proposes PerturboLLaVA, a method that reduces hallucinations in MLLMs by incorporating adversarially perturbed text during training to enhance visual focus without extra computational cost.

Authors:Fengbin Zhu, Junfeng Li, Liangming Pan, Wenjie Wang, Fuli Feng, Chao Wang, Huanbo Luan, Tat-Seng Chua
Title: Towards Temporal-Aware Multi-Modal Retrieval Augmented Generation in Finance
Abstract:
Finance decision-making often relies on in-depth data analysis across various data sources, including financial tables, news articles, stock prices, etc. In this work, we introduce FinTMMBench, the first comprehensive benchmark for evaluating temporal-aware multi-modal Retrieval-Augmented Generation (RAG) systems in finance. Built from heterologous data of NASDAQ 100 companies, FinTMMBench offers three significant advantages. 1) Multi-modal Corpus: It encompasses a hybrid of financial tables, news articles, daily stock prices, and visual technical charts as the corpus. 2) Temporal-aware Questions: Each question requires the retrieval and interpretation of its relevant data over a specific time period, including daily, weekly, monthly, quarterly, and annual periods. 3) Diverse Financial Analysis Tasks: The questions involve 10 different financial analysis tasks designed by domain experts, including information extraction, trend analysis, sentiment analysis and event detection, etc. We further propose a novel TMMHybridRAG method, which first leverages LLMs to convert data from other modalities (e.g., tabular, visual and time-series data) into textual format and then incorporates temporal information in each node when constructing graphs and dense indexes. Its effectiveness has been validated in extensive experiments, but notable gaps remain, highlighting the challenges presented by our FinTMMBench.
中文: 本文提出了首个用于评估金融领域时序感知多模态检索增强生成系统的综合基准FinTMMBench,其包含混合语料库和多样化金融任务,并提出新颖的TMMHybridRAG方法,能有效整合多模态数据与时间信息。
English: This paper introduces FinTMMBench, the first comprehensive benchmark for evaluating temporal-aware multi-modal RAG systems in finance, featuring a hybrid corpus and diverse financial tasks, and proposes a novel TMMHybridRAG method that effectively integrates multi-modal data with temporal information.

Authors:Yiyan Xu, Jinghao Zhang, Alireza Salemi, Xinting Hu, Wenjie Wang, Fuli Feng, Hamed Zamani, Xiangnan He, Tat-Seng Chua
Title: Personalized Generation In Large Model Era: A Survey
Abstract:
In the era of large models, content generation is gradually shifting to Personalized Generation (PGen), tailoring content to individual preferences and needs. This paper presents the first comprehensive survey on PGen, investigating existing research in this rapidly growing field. We conceptualize PGen from a unified perspective, systematically formalizing its key components, core objectives, and abstract workflows. Based on this unified perspective, we propose a multi-level taxonomy, offering an in-depth review of technical advancements, commonly used datasets, and evaluation metrics across multiple modalities, personalized contexts, and tasks. Moreover, we envision the potential applications of PGen and highlight open challenges and promising directions for future exploration. By bridging PGen research across multiple modalities, this survey serves as a valuable resource for fostering knowledge sharing and interdisciplinary collaboration, ultimately contributing to a more personalized digital landscape.
中文摘要:本文首次对个性化生成领域进行全面综述,通过统一视角系统化其核心要素,并跨模态评述技术进展与评估体系,为构建个性化数字生态提供重要参考。
English Summary: This paper provides the first comprehensive survey on Personalized Generation (PGen), establishing a unified framework to analyze its components and workflows while reviewing technical advances, datasets, and evaluation metrics across multiple modalities.

Authors:Kun Li, Tianhua Zhang, Yunxiang Li, Hongyin Luo, Abdalla Moustafa, Xixin Wu, James Glass, Helen Meng
Title: Generate, Discriminate, Evolve: Enhancing Context Faithfulness via Fine-Grained Sentence-Level Self-Evolution
Abstract:
Improving context faithfulness in large language models is essential for developing trustworthy retrieval augmented generation systems and mitigating hallucinations, especially in long-form question answering (LFQA) tasks or scenarios involving knowledge conflicts. Existing methods either intervene LLMs only at inference without addressing their inherent limitations or overlook the potential for self-improvement. In this paper, we introduce GenDiE (Generate, Discriminate, Evolve), a novel self-evolving framework that enhances context faithfulness through fine-grained sentence-level optimization. GenDiE combines both generative and discriminative training, equipping LLMs with self-generation and self-scoring capabilities to facilitate iterative self-evolution. This supports both data construction for model alignment and score-guided search during inference. Furthermore, by treating each sentence in a response as an independent optimization unit, GenDiE effectively addresses the limitations of previous approaches that optimize at the holistic answer level, which may miss unfaithful details. Experiments on ASQA (in-domain LFQA) and ConFiQA (out-of-domain counterfactual QA) datasets demonstrate that GenDiE surpasses various baselines in both faithfulness and correctness, and exhibits robust performance for domain adaptation.
中文: GenDiE是一种自进化框架,通过细粒度的句子级优化提升大语言模型的上下文忠实度,结合生成与判别训练,在长形式问答和反事实场景中展现出卓越性能。
English: GenDiE is a self-evolving framework that enhances context faithfulness in large language models through fine-grained sentence-level optimization, combining generative and discriminative training to improve performance in long-form question answering and counterfactual scenarios.

Authors:Weixiang Zhao, Xingyu Sui, Jiahe Guo, Yulin Hu, Yang Deng, Yanyan Zhao, Bing Qin, Wanxiang Che, Tat-Seng Chua, Ting Liu
Title: Trade-offs in Large Reasoning Models: An Empirical Analysis of Deliberative and Adaptive Reasoning over Foundational Capabilities
Abstract:
Recent advancements in Large Reasoning Models (LRMs), such as OpenAI's o1/o3 and DeepSeek-R1, have demonstrated remarkable performance in specialized reasoning tasks through human-like deliberative thinking and long chain-of-thought reasoning. However, our systematic evaluation across various model families (DeepSeek, Qwen, and LLaMA) and scales (7B to 671B) reveals that acquiring these deliberative reasoning capabilities significantly reduces the foundational capabilities of LRMs, including notable declines in helpfulness and harmlessness, alongside substantially increased inference costs. Importantly, we demonstrate that adaptive reasoning -- employing modes like Zero-Thinking, Less-Thinking, and Summary-Thinking -- can effectively alleviate these drawbacks. Our empirical insights underline the critical need for developing more versatile LRMs capable of dynamically allocating inference-time compute according to specific task characteristics.
Chinese: 大型推理模型的最新进展表明,尽管它们在复杂推理任务中表现出色,但这些能力是以降低基础性能和增加计算成本为代价的,凸显了采用自适应推理方法来平衡效率与效果的必要性。
English: Recent advances in Large Reasoning Models show that while they excel in complex reasoning tasks, these capabilities come at the cost of reduced foundational performance and higher computational demands, highlighting the need for adaptive reasoning approaches to balance efficiency and effectiveness.

Authors:Yihan Hu, Jianing Peng, Yiheng Lin, Ting Liu, Xiaochao Qu, Luoqi Liu, Yao Zhao, Yunchao Wei
Title: DCEdit: Dual-Level Controlled Image Editing via Precisely Localized Semantics
Abstract:
This paper presents a novel approach to improving text-guided image editing using diffusion-based models. Text-guided image editing task poses key challenge of precisly locate and edit the target semantic, and previous methods fall shorts in this aspect. Our method introduces a Precise Semantic Localization strategy that leverages visual and textual self-attention to enhance the cross-attention map, which can serve as a regional cues to improve editing performance. Then we propose a Dual-Level Control mechanism for incorporating regional cues at both feature and latent levels, offering fine-grained control for more precise edits. To fully compare our methods with other DiT-based approaches, we construct the RW-800 benchmark, featuring high resolution images, long descriptive texts, real-world images, and a new text editing task. Experimental results on the popular PIE-Bench and RW-800 benchmarks demonstrate the superior performance of our approach in preserving background and providing accurate edits.
中文: 本文提出了一种基于扩散模型的文本引导图像编辑新方法,通过精确语义定位策略和双级控制机制提升编辑精度,在多个基准测试中展现出优越性能。
English: This paper introduces a novel diffusion-based method for text-guided image editing that enhances precision through a Precise Semantic Localization strategy and Dual-Level Control mechanism, validated by superior performance on established benchmarks.

Authors:Weixiang Zhao, Xingyu Sui, Xinyang Han, Yang Deng, Yulin Hu, Jiahe Guo, Libo Qin, Qianyun Du, Shijin Wang, Yanyan Zhao, Bing Qin, Ting Liu
Title: Chain of Strategy Optimization Makes Large Language Models Better Emotional Supporter
Abstract:
The growing emotional stress in modern society has increased the demand for Emotional Support Conversations (ESC). While Large Language Models (LLMs) show promise for ESC, they face two key challenges: (1) low strategy selection accuracy, and (2) preference bias, limiting their adaptability to emotional needs of users. Existing supervised fine-tuning (SFT) struggles to address these issues, as it rigidly trains models on single gold-standard responses without modeling nuanced strategy trade-offs. To overcome these limitations, we propose Chain-of-Strategy Optimization (CSO), a novel approach that optimizes strategy selection preferences at each dialogue turn. We first leverage Monte Carlo Tree Search to construct ESC-Pro, a high-quality preference dataset with turn-level strategy-response pairs. Training on ESC-Pro with CSO improves both strategy accuracy and bias mitigation, enabling LLMs to generate more empathetic and contextually appropriate responses. Experiments on LLaMA-3.1-8B, Gemma-2-9B, and Qwen2.5-7B demonstrate that CSO outperforms standard SFT, highlighting the efficacy of fine-grained, turn-level preference modeling in ESC.
中文: 提出的策略链优化方法通过逐轮偏好建模,提升了大型语言模型在情感支持对话中的策略选择准确性并减少偏见,在多模型实验中优于标准监督微调方法。
English: The proposed Chain-of-Strategy Optimization (CSO) method enhances emotional support in LLMs by improving strategy selection accuracy and reducing bias through turn-level preference modeling, outperforming standard supervised fine-tuning across multiple models.

Authors:Yi Su, Dian Yu, Linfeng Song, Juntao Li, Haitao Mi, Zhaopeng Tu, Min Zhang, Dong Yu
Title: Crossing the Reward Bridge: Expanding RL with Verifiable Rewards Across Diverse Domains
Abstract:
Reinforcement learning with verifiable rewards (RLVR) has demonstrated significant success in enhancing mathematical reasoning and coding performance of large language models (LLMs), especially when structured reference answers are accessible for verification. However, its extension to broader, less structured domains remains unexplored. In this work, we investigate the effectiveness and scalability of RLVR across diverse real-world domains including medicine, chemistry, psychology, economics, and education, where structured reference answers are typically unavailable. We reveal that binary verification judgments on broad-domain tasks exhibit high consistency across various LLMs provided expert-written reference answers exist. Motivated by this finding, we utilize a generative scoring technique that yields soft, model-based reward signals to overcome limitations posed by binary verifications, especially in free-form, unstructured answer scenarios. We further demonstrate the feasibility of training cross-domain generative reward models using relatively small (7B) LLMs without the need for extensive domain-specific annotation. Through comprehensive experiments, our RLVR framework establishes clear performance gains, significantly outperforming state-of-the-art open-source aligned models such as Qwen2.5-72B and DeepSeek-R1-Distill-Qwen-32B across domains in free-form settings. Our approach notably enhances the robustness, flexibility, and scalability of RLVR, representing a substantial step towards practical reinforcement learning applications in complex, noisy-label scenarios.
中文: RLVR通过生成式评分提供灵活奖励,在无结构化答案的多领域任务中显著提升大语言模型性能,超越顶尖开源模型表现。
English: RLVR enhances LLM performance across diverse domains using generative scoring for flexible rewards, achieving superior results over top models in unstructured settings.

Authors:Jialiang Tang, Shuo Chen, Chen Gong, Jing Zhang, Dacheng Tao
Title: LLM-PS: Empowering Large Language Models for Time Series Forecasting with Temporal Patterns and Semantics
Abstract:
Time Series Forecasting (TSF) is critical in many real-world domains like financial planning and health monitoring. Recent studies have revealed that Large Language Models (LLMs), with their powerful in-contextual modeling capabilities, hold significant potential for TSF. However, existing LLM-based methods usually perform suboptimally because they neglect the inherent characteristics of time series data. Unlike the textual data used in LLM pre-training, the time series data is semantically sparse and comprises distinctive temporal patterns. To address this problem, we propose LLM-PS to empower the LLM for TSF by learning the fundamental \textit{Patterns} and meaningful \textit{Semantics} from time series data. Our LLM-PS incorporates a new multi-scale convolutional neural network adept at capturing both short-term fluctuations and long-term trends within the time series. Meanwhile, we introduce a time-to-text module for extracting valuable semantics across continuous time intervals rather than isolated time points. By integrating these patterns and semantics, LLM-PS effectively models temporal dependencies, enabling a deep comprehension of time series and delivering accurate forecasts. Intensive experimental results demonstrate that LLM-PS achieves state-of-the-art performance in both short- and long-term forecasting tasks, as well as in few- and zero-shot settings.
中文: LLM-PS通过多尺度卷积网络和时间到文本转换学习时间序列的模式和语义,从而在多种预测场景中实现最先进的性能,提升了时间序列预测的准确性。
English: LLM-PS enhances time series forecasting by learning temporal patterns and semantics through multi-scale convolutional networks and time-to-text conversion, achieving state-of-the-art results across various forecasting scenarios.

Authors:Yibin Wang, Yuhang Zang, Hao Li, Cheng Jin, Jiaqi Wang
Title: Unified Reward Model for Multimodal Understanding and Generation
Abstract:
Recent advances in human preference alignment have significantly enhanced multimodal generation and understanding. A key approach is training reward models to guide preference optimization. However, existing models are often task-specific, limiting their adaptability across diverse visual applications. We also argue that jointly learning to assess multiple tasks may foster a synergistic effect, where improved image understanding enhances image generation assessment, and refined image evaluation benefits video assessment through better frame analysis. To this end, this paper proposes UnifiedReward, the first unified reward model for multimodal understanding and generation assessment, enabling both pairwise ranking and pointwise scoring, which can be employed for vision model preference alignment. Specifically, (1) we first develop UnifiedReward on our constructed large-scale human preference dataset, including both image and video generation/understanding tasks. (2) Then, it is utilized to automatically construct high-quality preference pair data based on the vision models, fine-gradually filtering their outputs through pair ranking and point sifting. (3) Finally, these data are used for their preference alignment through Direct Preference Optimization (DPO). Experimental results demonstrate that joint learning to assess diverse visual tasks can lead to substantial mutual benefits and we apply our pipeline to both image and video understanding/generation tasks, significantly improving the performance in each domain.
中文摘要:本文提出首个统一奖励模型UnifiedReward,通过联合学习多种视觉任务实现多模态理解与生成评估,实验证明该方法能在图像和视频领域产生显著协同效应并有效提升各领域性能。
English Summary: This paper introduces UnifiedReward, a unified reward model for multimodal understanding and generation assessment that enables joint learning across diverse visual tasks, demonstrating substantial mutual benefits through experimental validation and application in both image and video domains.

Authors:Ke Ji, Jiahao Xu, Tian Liang, Qiuzhi Liu, Zhiwei He, Xingyu Chen, Xiaoyuan Liu, Zhijie Wang, Junying Chen, Benyou Wang, Zhaopeng Tu, Haitao Mi, Dong Yu
Title: The First Few Tokens Are All You Need: An Efficient and Effective Unsupervised Prefix Fine-Tuning Method for Reasoning Models
Abstract:
Improving the reasoning capabilities of large language models (LLMs) typically requires supervised fine-tuning with labeled data or computationally expensive sampling. We introduce Unsupervised Prefix Fine-Tuning (UPFT), which leverages the observation of Prefix Self-Consistency -- the shared initial reasoning steps across diverse solution trajectories -- to enhance LLM reasoning efficiency. By training exclusively on the initial prefix substrings (as few as 8 tokens), UPFT removes the need for labeled data or exhaustive sampling. Experiments on reasoning benchmarks show that UPFT matches the performance of supervised methods such as Rejection Sampling Fine-Tuning, while reducing training time by 75% and sampling cost by 99%. Further analysis reveals that errors tend to appear in later stages of the reasoning process and that prefix-based training preserves the model's structural knowledge. This work demonstrates how minimal unsupervised fine-tuning can unlock substantial reasoning gains in LLMs, offering a scalable and resource-efficient alternative to conventional approaches.
中文: UPFT是一种无监督微调方法,仅通过训练初始前缀子串即可提升大语言模型的推理能力,在达到与监督方法相当性能的同时,显著降低了训练时间和采样成本。
English: UPFT is an unsupervised fine-tuning method that enhances LLM reasoning by training only on initial prefix substrings, achieving performance comparable to supervised methods while drastically reducing training time and sampling costs.

Authors:Jialin Wan, Jinglong Shen, Nan Cheng, Zhisheng Yin, Yiliang Liu, Wenchao Xu, Xuemin, Shen
Title: A Channel-Triggered Backdoor Attack on Wireless Semantic Image Reconstruction
Abstract:
This paper investigates backdoor attacks in image-oriented semantic communications. The threat of backdoor attacks on symbol reconstruction in semantic communication (SemCom) systems has received limited attention. Previous research on backdoor attacks targeting SemCom symbol reconstruction primarily focuses on input-level triggers, which are impractical in scenarios with strict input constraints. In this paper, we propose a novel channel-triggered backdoor attack (CT-BA) framework that exploits inherent wireless channel characteristics as activation triggers. Our key innovation involves utilizing fundamental channel statistics parameters, specifically channel gain with different fading distributions or channel noise with different power, as potential triggers. This approach enhances stealth by eliminating explicit input manipulation, provides flexibility through trigger selection from diverse channel conditions, and enables automatic activation via natural channel variations without adversary intervention. We extensively evaluate CT-BA across four joint source-channel coding (JSCC) communication system architectures and three benchmark datasets. Simulation results demonstrate that our attack achieves near-perfect attack success rate (ASR) while maintaining effective stealth. Finally, we discuss potential defense mechanisms against such attacks.
中文摘要:本文针对语义通信系统提出了一种新型信道触发后门攻击框架,利用无线信道固有特性作为激活触发器,无需显式输入操控即可实现高攻击成功率并保持隐蔽性。
English Summary: This paper introduces a novel channel-triggered backdoor attack framework for semantic communication systems that uses inherent wireless channel characteristics as activation triggers, achieving high attack success while maintaining stealth without explicit input manipulation.

Authors:Ruifeng Luo, Zhengjie Liu, Tianxiao Cheng, Jie Wang, Tongjie Wang, Xingguang Wei, Haomin Wang, YanPeng Li, Fu Chai, Fei Cheng, Shenglong Ye, Wenhai Wang, Yanting Zhang, Yu Qiao, Hongjie Zhang, Xianzhong Zhao
Title: ArchCAD-400K: An Open Large-Scale Architectural CAD Dataset and New Baseline for Panoptic Symbol Spotting
Abstract:
Recognizing symbols in architectural CAD drawings is critical for various advanced engineering applications. In this paper, we propose a novel CAD data annotation engine that leverages intrinsic attributes from systematically archived CAD drawings to automatically generate high-quality annotations, thus significantly reducing manual labeling efforts. Utilizing this engine, we construct ArchCAD-400K, a large-scale CAD dataset consisting of 413,062 chunks from 5538 highly standardized drawings, making it over 26 times larger than the largest existing CAD dataset. ArchCAD-400K boasts an extended drawing diversity and broader categories, offering line-grained annotations. Furthermore, we present a new baseline model for panoptic symbol spotting, termed Dual-Pathway Symbol Spotter (DPSS). It incorporates an adaptive fusion module to enhance primitive features with complementary image features, achieving state-of-the-art performance and enhanced robustness. Extensive experiments validate the effectiveness of DPSS, demonstrating the value of ArchCAD-400K and its potential to drive innovation in architectural design and construction.
中文: 本文提出了一种新型CAD标注引擎,能自动生成高质量标签并构建了比现有最大数据集大26倍的ArchCAD-400K大规模数据集,同时开发了双通路符号识别模型,在建筑图纸的全景符号识别中实现了最优性能。
English: This paper introduces a novel CAD annotation engine that automatically generates high-quality labels and creates ArchCAD-400K, a large-scale dataset 26 times bigger than existing ones, along with a Dual-Pathway Symbol Spotter model that achieves state-of-the-art performance for panoptic symbol recognition in architectural drawings.

Authors:Hongru Cai, Yongqi Li, Ruifeng Yuan, Wenjie Wang, Zhen Zhang, Wenjie Li, Tat-Seng Chua
Title: Exploring Training and Inference Scaling Laws in Generative Retrieval
Abstract:
Generative retrieval reformulates retrieval as an autoregressive generation task, where large language models (LLMs) generate target documents directly from a query. As a novel paradigm, the mechanisms that underpin its performance and scalability remain largely unexplored. We systematically investigate training and inference scaling laws in generative retrieval, exploring how model size, training data scale, and inference-time compute jointly influence performance. We propose a novel evaluation metric inspired by contrastive entropy and generation loss, providing a continuous performance signal that enables robust comparisons across diverse generative retrieval methods. Our experiments show that n-gram-based methods align strongly with training and inference scaling laws. We find that increasing model size, training data scale, and inference-time compute all contribute to improved performance, highlighting the complementary roles of these factors in enhancing generative retrieval. Across these settings, LLaMA models consistently outperform T5 models, suggesting a particular advantage for larger decoder-only models in generative retrieval. Our findings underscore that model sizes, data availability, and inference computation interact to unlock the full potential of generative retrieval, offering new insights for designing and optimizing future systems.
中文摘要:生成式检索利用大型语言模型直接从查询生成文档,其性能随模型规模、训练数据和推理计算量的增加而提升,其中LLaMA模型表现优于T5模型。
English Summary: Generative retrieval leverages large language models to directly generate documents from queries, with performance scaling positively with model size, training data, and inference compute, where LLaMA models excel over T5.

Authors:Runqi Meng, Sifan Song, Pengfei Jin, Yujin Oh, Lin Teng, Yulin Wang, Yiqun Sun, Ling Chen, Xiang Li, Quanzheng Li, Ning Guo, Dinggang Shen
Title: MAST-Pro: Dynamic Mixture-of-Experts for Adaptive Segmentation of Pan-Tumors with Knowledge-Driven Prompts
Abstract:
Accurate tumor segmentation is crucial for cancer diagnosis and treatment. While foundation models have advanced general-purpose segmentation, existing methods still struggle with: (1) limited incorporation of medical priors, (2) imbalance between generic and tumor-specific features, and (3) high computational costs for clinical adaptation. To address these challenges, we propose MAST-Pro (Mixture-of-experts for Adaptive Segmentation of pan-Tumors with knowledge-driven Prompts), a novel framework that integrates dynamic Mixture-of-Experts (D-MoE) and knowledge-driven prompts for pan-tumor segmentation. Specifically, text and anatomical prompts provide domain-specific priors, guiding tumor representation learning, while D-MoE dynamically selects experts to balance generic and tumor-specific feature learning, improving segmentation accuracy across diverse tumor types. To enhance efficiency, we employ Parameter-Efficient Fine-Tuning (PEFT), optimizing MAST-Pro with significantly reduced computational overhead. Experiments on multi-anatomical tumor datasets demonstrate that MAST-Pro outperforms state-of-the-art approaches, achieving up to a 5.20% improvement in average DSC while reducing trainable parameters by 91.04%, without compromising accuracy.
Chinese: 提出的MAST-Pro框架结合知识驱动提示与动态专家混合机制,在提升全肿瘤分割精度的同时显著降低计算成本,实现了5.20%的DSC提升和91.04%的参数削减。
English: The proposed MAST-Pro framework integrates knowledge-driven prompts and dynamic Mixture-of-Experts to enhance pan-tumor segmentation accuracy while reducing computational costs, achieving a 5.20% DSC improvement and 91.04% parameter reduction.

Authors:Daniel DeAlcala, Aythami Morales, Julian Fierrez, Gonzalo Mancera, Ruben Tolosana, Ruben Vera-Rodriguez
Title: MINT-Demo: Membership Inference Test Demonstrator
Abstract:
We present the Membership Inference Test Demonstrator, to emphasize the need for more transparent machine learning training processes. MINT is a technique for experimentally determining whether certain data has been used during the training of machine learning models. We conduct experiments with popular face recognition models and 5 public databases containing over 22M images. Promising results, up to 89% accuracy are achieved, suggesting that it is possible to recognize if an AI model has been trained with specific data. Finally, we present a MINT platform as demonstrator of this technology aimed to promote transparency in AI training.
Chinese: 我们推出了成员推断测试演示器(MINT),这是一种检测特定数据是否用于机器学习模型训练的技术,在面部识别模型和超过2200万张图像的实验中准确率高达89%,旨在促进人工智能训练过程的透明度。
English: The Membership Inference Test Demonstrator (MINT) is introduced as a technique to detect if specific data was used in training machine learning models, achieving up to 89% accuracy in experiments with face recognition models and over 22 million images, promoting transparency in AI training processes.

Authors:Gonzalo Mancera, Daniel DeAlcala, Julian Fierrez, Ruben Tolosana, Aythami Morales
Title: Is My Text in Your AI Model? Gradient-based Membership Inference Test applied to LLMs
Abstract:
This work adapts and studies the gradient-based Membership Inference Test (gMINT) to the classification of text based on LLMs. MINT is a general approach intended to determine if given data was used for training machine learning models, and this work focuses on its application to the domain of Natural Language Processing. Using gradient-based analysis, the MINT model identifies whether particular data samples were included during the language model training phase, addressing growing concerns about data privacy in machine learning. The method was evaluated in seven Transformer-based models and six datasets comprising over 2.5 million sentences, focusing on text classification tasks. Experimental results demonstrate MINTs robustness, achieving AUC scores between 85% and 99%, depending on data size and model architecture. These findings highlight MINTs potential as a scalable and reliable tool for auditing machine learning models, ensuring transparency, safeguarding sensitive data, and fostering ethical compliance in the deployment of AI/NLP technologies.
中文摘要:本研究将基于梯度的成员推理测试(gMINT)应用于大型语言模型的文本分类训练数据检测,通过在七个Transformer模型和六个数据集上的实验验证了该方法85%-99%的AUC性能表现,为机器学习模型审计和数据隐私保护提供了可靠工具。
English Summary: This study adapts the gradient-based Membership Inference Test (gMINT) to assess whether specific text data was used in training large language models, demonstrating high effectiveness with AUC scores of 85-99% across multiple models and datasets to address data privacy concerns in NLP.

Authors:Xiaolong Li, Jianhao Wei, Haidong Wang, Li Dong, Ruoyang Chen, Changyan Yi, Jun Cai, Dusit Niyato, Xuemin, Shen
Title: Towards Intelligent Transportation with Pedestrians and Vehicles In-the-Loop: A Surveillance Video-Assisted Federated Digital Twin Framework
Abstract:
In intelligent transportation systems (ITSs), incorporating pedestrians and vehicles in-the-loop is crucial for developing realistic and safe traffic management solutions. However, there is falls short of simulating complex real-world ITS scenarios, primarily due to the lack of a digital twin implementation framework for characterizing interactions between pedestrians and vehicles at different locations in different traffic environments. In this article, we propose a surveillance video assisted federated digital twin (SV-FDT) framework to empower ITSs with pedestrians and vehicles in-the-loop. Specifically, SVFDT builds comprehensive pedestrian-vehicle interaction models by leveraging multi-source traffic surveillance videos. Its architecture consists of three layers: (i) the end layer, which collects traffic surveillance videos from multiple sources; (ii) the edge layer, responsible for semantic segmentation-based visual understanding, twin agent-based interaction modeling, and local digital twin system (LDTS) creation in local regions; and (iii) the cloud layer, which integrates LDTSs across different regions to construct a global DT model in realtime. We analyze key design requirements and challenges and present core guidelines for SVFDT's system implementation. A testbed evaluation demonstrates its effectiveness in optimizing traffic management. Comparisons with traditional terminal-server frameworks highlight SV-FDT's advantages in mirroring delays, recognition accuracy, and subjective evaluation. Finally, we identify some open challenges and discuss future research directions.
Chinese: 本文提出了一种基于监控视频的联邦数字孪生(SV-FDT)框架,通过多源交通监控视频构建人车交互模型,采用三层架构实现实时全局数字孪生,并在交通管理优化中验证了其有效性。
English: The article introduces a surveillance video assisted federated digital twin (SV-FDT) framework to enhance intelligent transportation systems by modeling pedestrian-vehicle interactions using multi-source surveillance videos, with a three-layer architecture enabling real-time global digital twin construction and demonstrating effectiveness in traffic management optimization.

Authors:Yujin Oh, Robert Seifert, Yihan Cao, Christoph Clement, Justin Ferdinandus, Constantin Lapa, Alessandro Liebich, Michelle Amon, Johanna Enke, Sifan Song, Runqi Meng, Fang Zeng, Ning Guo, Xiang Li, Pedram Heidari, Axel Rominger, Kuangyu Shi, Quanzheng Li
Title: Developing a PET/CT Foundation Model for Cross-Modal Anatomical and Functional Imaging
Abstract:
In oncology, Positron Emission Tomography-Computed Tomography (PET/CT) is widely used in cancer diagnosis, staging, and treatment monitoring, as it combines anatomical details from CT with functional metabolic activity and molecular marker expression information from PET. However, existing artificial intelligence-driven PET/CT analyses rely predominantly on task-specific models trained from scratch or on limited datasets, limiting their generalizability and robustness. To address this, we propose a foundation model approach specifically designed for multimodal PET/CT imaging. We introduce the Cross-Fraternal Twin Masked Autoencoder (FratMAE), a novel framework that effectively integrates whole-body anatomical and functional or molecular information. FratMAE employs separate Vision Transformer (ViT) encoders for PET and CT scans, along with cross-attention decoders that enable synergistic interactions between modalities during masked autoencoder training. Additionally, it incorporates textual metadata to enhance PET representation learning. By pre-training on PET/CT datasets, FratMAE captures intricate cross-modal relationships and global uptake patterns, achieving superior performance on downstream tasks and demonstrating its potential as a generalizable foundation model.
中文: 本研究提出FratMAE基础模型,通过交叉注意力机制和文本元数据整合PET与CT成像,能捕捉精细的跨模态关联,在肿瘤学下游任务中展现出卓越性能。
English: The study introduces FratMAE, a foundation model that integrates PET and CT imaging through cross-attention mechanisms and text metadata, enabling superior performance in oncology tasks by capturing detailed cross-modal relationships.

Authors:Zihao Zeng, Chubo Liu, Xin He, Juan Hu, Yong Jiang, Fei Huang, Kenli Li, Wei Yang Bryan Lim
Title: AutoHete: An Automatic and Efficient Heterogeneous Training System for LLMs
Abstract:
Transformer-based large language models (LLMs) have demonstrated exceptional capabilities in sequence modeling and text generation, with improvements scaling proportionally with model size. However, the limitations of GPU memory have restricted LLM training accessibility for many researchers. Existing heterogeneous training methods significantly expand the scale of trainable models but introduce substantial communication overheads and CPU workloads. In this work, we propose AutoHete, an automatic and efficient heterogeneous training system compatible with both single-GPU and multi-GPU environments. AutoHete dynamically adjusts activation checkpointing, parameter offloading, and optimizer offloading based on the specific hardware configuration and LLM training needs. Additionally, we design a priority-based scheduling mechanism that maximizes the overlap between operations across training iterations, enhancing throughput. Compared to state-of-the-art heterogeneous training systems, AutoHete delivers a 1.32x~1.91x throughput improvement across various model sizes and training configurations.
中文:AutoHete是一种高效的异构训练系统,通过动态优化激活检查点和参数卸载,在降低通信开销的同时将大语言模型训练吞吐量提升了1.32~1.91倍。
English: AutoHete is an efficient heterogeneous training system that dynamically optimizes activation checkpointing and parameter offloading to enhance LLM training throughput by 1.32x~1.91x while reducing communication overhead.

Authors:Hongchao Gu, Dexun Li, Kuicai Dong, Hao Zhang, Hang Lv, Hao Wang, Defu Lian, Yong Liu, Enhong Chen
Title: RAPID: Efficient Retrieval-Augmented Long Text Generation with Writing Planning and Information Discovery
Abstract:
Generating knowledge-intensive and comprehensive long texts, such as encyclopedia articles, remains significant challenges for Large Language Models. It requires not only the precise integration of facts but also the maintenance of thematic coherence throughout the article. Existing methods, such as direct generation and multi-agent discussion, often struggle with issues like hallucinations, topic incoherence, and significant latency. To address these challenges, we propose RAPID, an efficient retrieval-augmented long text generation framework. RAPID consists of three main modules: (1) Retrieval-augmented preliminary outline generation to reduce hallucinations, (2) Attribute-constrained search for efficient information discovery, (3) Plan-guided article generation for enhanced coherence. Extensive experiments on our newly compiled benchmark dataset, FreshWiki-2024, demonstrate that RAPID significantly outperforms state-of-the-art methods across a wide range of evaluation metrics (e.g. long-text generation, outline quality, latency, etc). Our work provides a robust and efficient solution to the challenges of automated long-text generation.
中文: 提出的RAPID框架通过检索增强提纲生成、属性约束搜索和规划引导写作,有效解决了知识密集型长文本生成的难题,在FreshWiki-2024基准测试中各项指标均显著优于现有方法。
English: The proposed RAPID framework effectively addresses challenges in knowledge-intensive long-text generation by integrating retrieval-augmented outline generation, attribute-constrained search, and plan-guided writing, demonstrating superior performance across multiple metrics on the FreshWiki-2024 benchmark.

Authors:Shengqiong Wu, Weicai Ye, Jiahao Wang, Quande Liu, Xintao Wang, Pengfei Wan, Di Zhang, Kun Gai, Shuicheng Yan, Hao Fei, Tat-Seng Chua
Title: Any2Caption:Interpreting Any Condition to Caption for Controllable Video Generation
Abstract:
To address the bottleneck of accurate user intent interpretation within the current video generation community, we present Any2Caption, a novel framework for controllable video generation under any condition. The key idea is to decouple various condition interpretation steps from the video synthesis step. By leveraging modern multimodal large language models (MLLMs), Any2Caption interprets diverse inputs--text, images, videos, and specialized cues such as region, motion, and camera poses--into dense, structured captions that offer backbone video generators with better guidance. We also introduce Any2CapIns, a large-scale dataset with 337K instances and 407K conditions for any-condition-to-caption instruction tuning. Comprehensive evaluations demonstrate significant improvements of our system in controllability and video quality across various aspects of existing video generation models. Project Page: https://sqwu.top/Any2Cap/
中文摘要:Any2Caption是一种创新框架,通过多模态大语言模型将多样化输入解析为结构化描述,从而提升视频生成的可控性和画面质量。
English Summary: Any2Caption is a novel framework that enhances video generation by interpreting diverse inputs into structured captions using multimodal large language models, thereby improving controllability and video quality.

Authors:Zhichao Liao, Xiaokun Liu, Wenyu Qin, Qingyu Li, Qiulin Wang, Pengfei Wan, Di Zhang, Long Zeng, Pingfa Feng
Title: HumanAesExpert: Advancing a Multi-Modality Foundation Model for Human Image Aesthetic Assessment
Abstract:
Image Aesthetic Assessment (IAA) is a long-standing and challenging research task. However, its subset, Human Image Aesthetic Assessment (HIAA), has been scarcely explored. To bridge this research gap, our work pioneers a holistic implementation framework tailored for HIAA. Specifically, we introduce HumanBeauty, the first dataset purpose-built for HIAA, which comprises 108k high-quality human images with manual annotations. To achieve comprehensive and fine-grained HIAA, 50K human images are manually collected through a rigorous curation process and annotated leveraging our trailblazing 12-dimensional aesthetic standard, while the remaining 58K with overall aesthetic labels are systematically filtered from public datasets. Based on the HumanBeauty database, we propose HumanAesExpert, a powerful Vision Language Model for aesthetic evaluation of human images. We innovatively design an Expert head to incorporate human knowledge of aesthetic sub-dimensions while jointly utilizing the Language Modeling (LM) and Regression heads. This approach empowers our model to achieve superior proficiency in both overall and fine-grained HIAA. Furthermore, we introduce a MetaVoter, which aggregates scores from all three heads, to effectively balance the capabilities of each head, thereby realizing improved assessment precision. Extensive experiments demonstrate that our HumanAesExpert models deliver significantly better performance in HIAA than other state-of-the-art models. Project webpage: https://humanaesexpert.github.io/HumanAesExpert/
中文: 本研究首创了专门用于人体图像美学评估的HumanBeauty数据集,并开发了采用多头部架构的HumanAesExpert视觉语言模型,该模型在人体图像的整体与细粒度美学评估中均显著优于现有最优方法。
English: This study introduces HumanBeauty, the first specialized dataset for Human Image Aesthetic Assessment (HIAA), and proposes HumanAesExpert, a vision-language model with an innovative multi-head architecture that significantly outperforms existing methods in both overall and fine-grained aesthetic evaluation of human images.

Authors:Yifei Yang, Lu Chen, Zherui Song, Yenan Chen, Wentao Sun, Zhongxiang Zhou, Rong Xiong, Yue Wang
Title: Disambiguate Gripper State in Grasp-Based Tasks: Pseudo-Tactile as Feedback Enables Pure Simulation Learning
Abstract:
Grasp-based manipulation tasks are fundamental to robots interacting with their environments, yet gripper state ambiguity significantly reduces the robustness of imitation learning policies for these tasks. Data-driven solutions face the challenge of high real-world data costs, while simulation data, despite its low costs, is limited by the sim-to-real gap. We identify the root cause of gripper state ambiguity as the lack of tactile feedback. To address this, we propose a novel approach employing pseudo-tactile as feedback, inspired by the idea of using a force-controlled gripper as a tactile sensor. This method enhances policy robustness without additional data collection and hardware involvement, while providing a noise-free binary gripper state observation for the policy and thus facilitating pure simulation learning to unleash the power of simulation. Experimental results across three real-world grasp-based tasks demonstrate the necessity, effectiveness, and efficiency of our approach.
中文摘要:本研究针对机器人抓取操作中的夹爪状态模糊问题,提出了一种将力控夹爪作为触觉传感器的伪触觉反馈方法,无需额外数据和硬件即可实现鲁棒的仿真到现实学习。
English Summary: This study addresses gripper state ambiguity in robot manipulation by introducing a pseudo-tactile feedback method that uses force-controlled grippers as tactile sensors, enabling robust simulation-to-real learning without additional data or hardware.

Authors:Feng-Lin Liu, Hongbo Fu, Xintao Wang, Weicai Ye, Pengfei Wan, Di Zhang, Lin Gao
Title: SketchVideo: Sketch-based Video Generation and Editing
Abstract:
Video generation and editing conditioned on text prompts or images have undergone significant advancements. However, challenges remain in accurately controlling global layout and geometry details solely by texts, and supporting motion control and local modification through images. In this paper, we aim to achieve sketch-based spatial and motion control for video generation and support fine-grained editing of real or synthetic videos. Based on the DiT video generation model, we propose a memory-efficient control structure with sketch control blocks that predict residual features of skipped DiT blocks. Sketches are drawn on one or two keyframes (at arbitrary time points) for easy interaction. To propagate such temporally sparse sketch conditions across all frames, we propose an inter-frame attention mechanism to analyze the relationship between the keyframes and each video frame. For sketch-based video editing, we design an additional video insertion module that maintains consistency between the newly edited content and the original video's spatial feature and dynamic motion. During inference, we use latent fusion for the accurate preservation of unedited regions. Extensive experiments demonstrate that our SketchVideo achieves superior performance in controllable video generation and editing.
中文摘要:本文提出SketchVideo模型,通过创新的帧间注意力机制和视频插入模块,实现了基于草图的空间与运动控制的视频生成与编辑,在保持高效内存使用的同时确保编辑区域与原始视频的一致性。
English Summary: This paper introduces SketchVideo, a memory-efficient model that enables sketch-based spatial and motion control for video generation and editing through novel inter-frame attention and video insertion modules.

Authors:Nan Gao, Yihua Bao, Dongdong Weng, Jiayi Zhao, Jia Li, Yan Zhou, Pengfei Wan, Di Zhang
Title: SARGes: Semantically Aligned Reliable Gesture Generation via Intent Chain
Abstract:
Co-speech gesture generation enhances human-computer interaction realism through speech-synchronized gesture synthesis. However, generating semantically meaningful gestures remains a challenging problem. We propose SARGes, a novel framework that leverages large language models (LLMs) to parse speech content and generate reliable semantic gesture labels, which subsequently guide the synthesis of meaningful co-speech gestures.First, we constructed a comprehensive co-speech gesture ethogram and developed an LLM-based intent chain reasoning mechanism that systematically parses and decomposes gesture semantics into structured inference steps following ethogram criteria, effectively guiding LLMs to generate context-aware gesture labels. Subsequently, we constructed an intent chain-annotated text-to-gesture label dataset and trained a lightweight gesture label generation model, which then guides the generation of credible and semantically coherent co-speech gestures. Experimental results demonstrate that SARGes achieves highly semantically-aligned gesture labeling (50.2% accuracy) with efficient single-pass inference (0.4 seconds). The proposed method provides an interpretable intent reasoning pathway for semantic gesture synthesis.
Chinese: SARGes框架利用大型语言模型解析语音并生成语义手势标签,从而高效合成具有上下文感知的、意义丰富的手势,实现了高准确性和快速推理。
English: The SARGes framework utilizes large language models to parse speech and generate semantic gesture labels, enabling the synthesis of meaningful and context-aware co-speech gestures with high accuracy and efficiency.

Authors:Xuan Ju, Weicai Ye, Quande Liu, Qiulin Wang, Xintao Wang, Pengfei Wan, Di Zhang, Kun Gai, Qiang Xu
Title: FullDiT: Multi-Task Video Generative Foundation Model with Full Attention
Abstract:
Current video generative foundation models primarily focus on text-to-video tasks, providing limited control for fine-grained video content creation. Although adapter-based approaches (e.g., ControlNet) enable additional controls with minimal fine-tuning, they encounter challenges when integrating multiple conditions, including: branch conflicts between independently trained adapters, parameter redundancy leading to increased computational cost, and suboptimal performance compared to full fine-tuning. To address these challenges, we introduce FullDiT, a unified foundation model for video generation that seamlessly integrates multiple conditions via unified full-attention mechanisms. By fusing multi-task conditions into a unified sequence representation and leveraging the long-context learning ability of full self-attention to capture condition dynamics, FullDiT reduces parameter overhead, avoids conditions conflict, and shows scalability and emergent ability. We further introduce FullBench for multi-task video generation evaluation. Experiments demonstrate that FullDiT achieves state-of-the-art results, highlighting the efficacy of full-attention in complex multi-task video generation.
中文: FullDiT是一种通过全注意力机制整合多条件的统一视频生成模型,它减少了参数冗余并避免了条件冲突,在复杂多任务场景中实现了最先进的性能。
English: FullDiT is a unified video generation model that integrates multiple conditions through full-attention mechanisms, reducing parameter redundancy and avoiding conflicts while achieving state-of-the-art performance in complex multi-task scenarios.

Authors:Cong Liu, Liang Hou, Mingwu Zheng, Xin Tao, Pengfei Wan, Di Zhang, Kun Gai
Title: Boosting Resolution Generalization of Diffusion Transformers with Randomized Positional Encodings
Abstract:
Resolution generalization in image generation tasks enables the production of higher-resolution images with lower training resolution overhead. However, a significant challenge in resolution generalization, particularly in the widely used Diffusion Transformers, lies in the mismatch between the positional encodings encountered during testing and those used during training. While existing methods have employed techniques such as interpolation, extrapolation, or their combinations, none have fully resolved this issue. In this paper, we propose a novel two-dimensional randomized positional encodings (RPE-2D) framework that focuses on learning positional order of image patches instead of the specific distances between them, enabling seamless high- and low-resolution image generation without requiring high- and low-resolution image training. Specifically, RPE-2D independently selects positions over a broader range along both the horizontal and vertical axes, ensuring that all position encodings are trained during the inference phase, thus improving resolution generalization. Additionally, we propose a random data augmentation technique to enhance the modeling of position order. To address the issue of image cropping caused by the augmentation, we introduce corresponding micro-conditioning to enable the model to perceive the specific cropping patterns. On the ImageNet dataset, our proposed RPE-2D achieves state-of-the-art resolution generalization performance, outperforming existing competitive methods when trained at a resolution of $256 \times 256$ and inferred at $384 \times 384$ and $512 \times 512$, as well as when scaling from $512 \times 512$ to $768 \times 768$ and $1024 \times 1024$. And it also exhibits outstanding capabilities in low-resolution image generation, multi-stage training acceleration and multi-resolution inheritance.
中文: 本文提出了一种新颖的二维随机位置编码(RPE-2D)框架,通过专注于学习图像块的位置顺序而非具体距离,实现了无需高低分辨率训练即可无缝生成高低分辨率图像,并在ImageNet数据集上达到了最先进的分辨率泛化性能。
English: This paper introduces a novel two-dimensional randomized positional encodings (RPE-2D) framework that learns positional order of image patches rather than specific distances, enabling seamless high- and low-resolution image generation without requiring corresponding training resolutions, while achieving state-of-the-art resolution generalization performance on ImageNet.

Authors:Yuxuan Xie, Xuan Yu, Changjian Jiang, Sitong Mao, Shunbo Zhou, Rui Fan, Rong Xiong, Yue Wang
Title: PanopticSplatting: End-to-End Panoptic Gaussian Splatting
Abstract:
Open-vocabulary panoptic reconstruction is a challenging task for simultaneous scene reconstruction and understanding. Recently, methods have been proposed for 3D scene understanding based on Gaussian splatting. However, these methods are multi-staged, suffering from the accumulated errors and the dependence of hand-designed components. To streamline the pipeline and achieve global optimization, we propose PanopticSplatting, an end-to-end system for open-vocabulary panoptic reconstruction. Our method introduces query-guided Gaussian segmentation with local cross attention, lifting 2D instance masks without cross-frame association in an end-to-end way. The local cross attention within view frustum effectively reduces the training memory, making our model more accessible to large scenes with more Gaussians and objects. In addition, to address the challenge of noisy labels in 2D pseudo masks, we propose label blending to promote consistent 3D segmentation with less noisy floaters, as well as label warping on 2D predictions which enhances multi-view coherence and segmentation accuracy. Our method demonstrates strong performances in 3D scene panoptic reconstruction on the ScanNet-V2 and ScanNet++ datasets, compared with both NeRF-based and Gaussian-based panoptic reconstruction methods. Moreover, PanopticSplatting can be easily generalized to numerous variants of Gaussian splatting, and we demonstrate its robustness on different Gaussian base models.
中文摘要:PanopticSplatting是一个用于开放词汇全景重建的端到端系统,通过引入基于查询的高斯分割与局部交叉注意力机制及标签融合技术,在多个数据集上实现了优越性能,并能兼容不同高斯溅射基础模型。
English Summary: PanopticSplatting is an end-to-end system for open-vocabulary panoptic reconstruction that introduces query-guided Gaussian segmentation with local cross attention and label blending techniques, achieving superior performance on multiple datasets while being compatible with various Gaussian splatting models.

Authors:Jiwen Yu, Yiran Qin, Haoxuan Che, Quande Liu, Xintao Wang, Pengfei Wan, Di Zhang, Xihui Liu
Title: Position: Interactive Generative Video as Next-Generation Game Engine
Abstract:
Modern game development faces significant challenges in creativity and cost due to predetermined content in traditional game engines. Recent breakthroughs in video generation models, capable of synthesizing realistic and interactive virtual environments, present an opportunity to revolutionize game creation. In this position paper, we propose Interactive Generative Video (IGV) as the foundation for Generative Game Engines (GGE), enabling unlimited novel content generation in next-generation gaming. GGE leverages IGV's unique strengths in unlimited high-quality content synthesis, physics-aware world modeling, user-controlled interactivity, long-term memory capabilities, and causal reasoning. We present a comprehensive framework detailing GGE's core modules and a hierarchical maturity roadmap (L0-L4) to guide its evolution. Our work charts a new course for game development in the AI era, envisioning a future where AI-powered generative systems fundamentally reshape how games are created and experienced.
中文摘要:本文提出将交互式生成视频(IGV)作为生成式游戏引擎(GGE)的核心,通过分层成熟度路线图和系统框架,利用AI生成无限内容以突破传统游戏开发的创意限制,重塑游戏创作与体验方式。
English Summary: This paper introduces Interactive Generative Video (IGV) as the core of Generative Game Engines (GGE) to overcome creative limitations in traditional game development by enabling unlimited AI-generated content through a structured framework and maturity roadmap.

Authors:Hejia Chen, Haoxian Zhang, Shoulong Zhang, Xiaoqiang Liu, Sisi Zhuang, Yuan Zhang, Pengfei Wan, Di Zhang, Shuai Li
Title: Cafe-Talk: Generating 3D Talking Face Animation with Multimodal Coarse- and Fine-grained Control
Abstract:
Speech-driven 3D talking face method should offer both accurate lip synchronization and controllable expressions. Previous methods solely adopt discrete emotion labels to globally control expressions throughout sequences while limiting flexible fine-grained facial control within the spatiotemporal domain. We propose a diffusion-transformer-based 3D talking face generation model, Cafe-Talk, which simultaneously incorporates coarse- and fine-grained multimodal control conditions. Nevertheless, the entanglement of multiple conditions challenges achieving satisfying performance. To disentangle speech audio and fine-grained conditions, we employ a two-stage training pipeline. Specifically, Cafe-Talk is initially trained using only speech audio and coarse-grained conditions. Then, a proposed fine-grained control adapter gradually adds fine-grained instructions represented by action units (AUs), preventing unfavorable speech-lip synchronization. To disentangle coarse- and fine-grained conditions, we design a swap-label training mechanism, which enables the dominance of the fine-grained conditions. We also devise a mask-based CFG technique to regulate the occurrence and intensity of fine-grained control. In addition, a text-based detector is introduced with text-AU alignment to enable natural language user input and further support multimodal control. Extensive experimental results prove that Cafe-Talk achieves state-of-the-art lip synchronization and expressiveness performance and receives wide acceptance in fine-grained control in user studies. Project page: https://harryxd2018.github.io/cafe-talk/
中文:Cafe-Talk是一种基于扩散变换器的3D说话人脸模型,通过两阶段训练流程和多模态控制条件,实现了卓越的唇形同步与细腻表情调控,支持从粗粒度到细粒度的面部动画精确控制。
English: Cafe-Talk is a diffusion-transformer-based 3D talking face model that achieves superior lip synchronization and expressive control through a two-stage training pipeline and multimodal conditions, enabling both coarse and fine-grained facial animation adjustments.

Authors:Kechun Xu, Xunlong Xia, Kaixuan Wang, Yifei Yang, Yunxuan Mao, Bing Deng, Jieping Ye, Rong Xiong, Yue Wang
Title: Efficient Alignment of Unconditioned Action Prior for Language-conditioned Pick and Place in Clutter
Abstract:
We study the task of language-conditioned pick and place in clutter, where a robot should grasp a target object in open clutter and move it to a specified place. Some approaches learn end-to-end policies with features from vision foundation models, requiring large datasets. Others combine foundation models in a zero-shot setting, suffering from cascading errors. In addition, they primarily leverage vision and language foundation models, focusing less on action priors. In this paper, we aim to develop an effective policy by integrating foundation priors from vision, language, and action. We propose A$^2$, an action prior alignment method that aligns unconditioned action priors with 3D vision-language priors by learning one attention layer. The alignment formulation enables our policy to train with less data and preserve zero-shot generalization capabilities. We show that a shared policy for both pick and place actions enhances the performance for each task, and introduce a policy adaptation scheme to accommodate the multi-modal nature of actions. Extensive experiments in simulation and the real-world show that our policy achieves higher task success rates with fewer steps for both pick and place tasks in clutter, effectively generalizing to unseen objects and language instructions. Videos and codes are available at https://xukechun.github.io/papers/A2.
中文: 本文提出A²方法,通过整合视觉、语言和动作先验,实现了在杂乱环境中高效执行语言引导的抓取放置任务,仅需少量数据即可泛化至未见过的物体和指令。
English: This paper introduces A², an action prior alignment method that integrates vision, language, and action priors to enable efficient language-conditioned pick-and-place tasks in clutter with minimal data and strong generalization to unseen objects and instructions.

Authors:Haodong Zhang, Liang Zhang, Zhenghan Chen, Lu Chen, Yue Wang, Rong Xiong
Title: Natural Humanoid Robot Locomotion with Generative Motion Prior
Abstract:
Natural and lifelike locomotion remains a fundamental challenge for humanoid robots to interact with human society. However, previous methods either neglect motion naturalness or rely on unstable and ambiguous style rewards. In this paper, we propose a novel Generative Motion Prior (GMP) that provides fine-grained motion-level supervision for the task of natural humanoid robot locomotion. To leverage natural human motions, we first employ whole-body motion retargeting to effectively transfer them to the robot. Subsequently, we train a generative model offline to predict future natural reference motions for the robot based on a conditional variational auto-encoder. During policy training, the generative motion prior serves as a frozen online motion generator, delivering precise and comprehensive supervision at the trajectory level, including joint angles and keypoint positions. The generative motion prior significantly enhances training stability and improves interpretability by offering detailed and dense guidance throughout the learning process. Experimental results in both simulation and real-world environments demonstrate that our method achieves superior motion naturalness compared to existing approaches. Project page can be found at https://sites.google.com/view/humanoid-gmp
中文摘要:本文提出了一种生成式运动先验(GMP)方法,通过全身运动重定向和条件变分自编码器利用人类自然运动数据,为仿人机器人提供精细化的轨迹级监督,在仿真和真实环境中均实现了比现有方法更优越的运动自然度。
English Summary: This paper introduces a Generative Motion Prior (GMP) that leverages human motion data through retargeting and a conditional variational auto-encoder to provide detailed trajectory-level supervision, significantly improving the naturalness and stability of humanoid robot locomotion in both simulations and real-world tests.

Authors:Xukun Zhou, Fengxin Li, Ming Chen, Yan Zhou, Pengfei Wan, Di Zhang, Yeying Jin, Zhaoxin Fan, Hongyan Liu, Jun He
Title: ExGes: Expressive Human Motion Retrieval and Modulation for Audio-Driven Gesture Synthesis
Abstract:
Audio-driven human gesture synthesis is a crucial task with broad applications in virtual avatars, human-computer interaction, and creative content generation. Despite notable progress, existing methods often produce gestures that are coarse, lack expressiveness, and fail to fully align with audio semantics. To address these challenges, we propose ExGes, a novel retrieval-enhanced diffusion framework with three key designs: (1) a Motion Base Construction, which builds a gesture library using training dataset; (2) a Motion Retrieval Module, employing constrative learning and momentum distillation for fine-grained reference poses retreiving; and (3) a Precision Control Module, integrating partial masking and stochastic masking to enable flexible and fine-grained control. Experimental evaluations on BEAT2 demonstrate that ExGes reduces Fréchet Gesture Distance by 6.2\% and improves motion diversity by 5.3\% over EMAGE, with user studies revealing a 71.3\% preference for its naturalness and semantic relevance. Code will be released upon acceptance.
Chinese: 提出的ExGes框架通过整合动作库、检索模块和精确控制,提升了音频驱动手势合成的自然度和语义一致性,在评估中展现出更优的性能表现。
English: The proposed ExGes framework enhances audio-driven gesture synthesis by integrating a motion base, retrieval module, and precision control to generate more natural and semantically aligned gestures, achieving superior performance in evaluations.

Authors:Zhen Yang, Guibao Shen, Minyang Li, Liang Hou, Mushui Liu, Luozhou Wang, Xin Tao, Pengfei Wan, Di Zhang, Ying-Cong Chen
Title: Efficient Training-Free High-Resolution Synthesis with Energy Rectification in Diffusion Models
Abstract:
Diffusion models have achieved remarkable progress across various visual generation tasks. However, their performance significantly declines when generating content at resolutions higher than those used during training. Although numerous methods have been proposed to enable high-resolution generation, they all suffer from inefficiency. In this paper, we propose RectifiedHR, a straightforward and efficient solution for training-free high-resolution synthesis. Specifically, we propose a noise refresh strategy that unlocks the model's training-free high-resolution synthesis capability and improves efficiency. Additionally, we are the first to observe the phenomenon of energy decay, which may cause image blurriness during the high-resolution synthesis process. To address this issue, we introduce average latent energy analysis and find that tuning the classifier-free guidance hyperparameter can significantly improve generation performance. Our method is entirely training-free and demonstrates efficient performance. Furthermore, we show that RectifiedHR is compatible with various diffusion model techniques, enabling advanced features such as image editing, customized generation, and video synthesis. Extensive comparisons with numerous baseline methods validate the superior effectiveness and efficiency of RectifiedHR.
中文: 本文提出无需训练的RectifiedHR方法,通过噪声刷新策略实现高效高分辨率图像生成,并针对能量衰减现象调整分类器引导参数以提升画质,同时兼容多种扩散模型的高级应用功能。
English: This paper introduces RectifiedHR, a training-free method that enables efficient high-resolution image generation through a noise refresh strategy and addresses energy decay by adjusting classifier-free guidance, while maintaining compatibility with various diffusion model applications.

Authors:Anzhe Chen, Hongxiang Yu, Shuxin Li, Yuxi Chen, Zhongxiang Zhou, Wentao Sun, Rong Xiong, Yue Wang
Title: CNSv2: Probabilistic Correspondence Encoded Neural Image Servo
Abstract:
Visual servo based on traditional image matching methods often requires accurate keypoint correspondence for high precision control. However, keypoint detection or matching tends to fail in challenging scenarios with inconsistent illuminations or textureless objects, resulting significant performance degradation. Previous approaches, including our proposed Correspondence encoded Neural image Servo policy (CNS), attempted to alleviate these issues by integrating neural control strategies. While CNS shows certain improvement against error correspondence over conventional image-based controllers, it could not fully resolve the limitations arising from poor keypoint detection and matching. In this paper, we continue to address this problem and propose a new solution: Probabilistic Correspondence Encoded Neural Image Servo (CNSv2). CNSv2 leverages probabilistic feature matching to improve robustness in challenging scenarios. By redesigning the architecture to condition on multimodal feature matching, CNSv2 achieves high precision, improved robustness across diverse scenes and runs in real-time. We validate CNSv2 with simulations and real-world experiments, demonstrating its effectiveness in overcoming the limitations of detector-based methods in visual servo tasks.
中文: 提出的CNSv2方法通过采用概率特征匹配来克服关键点检测在复杂环境中的失效问题,实现了跨场景的鲁棒实时视觉伺服控制。
English: The proposed CNSv2 method enhances visual servo control by employing probabilistic feature matching to overcome keypoint detection failures in challenging environments, achieving robust real-time performance across diverse scenarios.

Authors:Zhecheng Li, Guoxian Song, Yujun Cai, Zhen Xiong, Junsong Yuan, Yiwei Wang
Title: Texture or Semantics? Vision-Language Models Get Lost in Font Recognition
Abstract:
Modern Vision-Language Models (VLMs) exhibit remarkable visual and linguistic capabilities, achieving impressive performance in various tasks such as image recognition and object localization. However, their effectiveness in fine-grained tasks remains an open question. In everyday scenarios, individuals encountering design materials, such as magazines, typography tutorials, research papers, or branding content, may wish to identify aesthetically pleasing fonts used in the text. Given their multimodal capabilities and free accessibility, many VLMs are often considered potential tools for font recognition. This raises a fundamental question: Do VLMs truly possess the capability to recognize fonts? To investigate this, we introduce the Font Recognition Benchmark (FRB), a compact and well-structured dataset comprising 15 commonly used fonts. FRB includes two versions: (i) an easy version, where 10 sentences are rendered in different fonts, and (ii) a hard version, where each text sample consists of the names of the 15 fonts themselves, introducing a stroop effect that challenges model perception. Through extensive evaluation of various VLMs on font recognition tasks, we arrive at the following key findings: (i) Current VLMs exhibit limited font recognition capabilities, with many state-of-the-art models failing to achieve satisfactory performance and being easily affected by the stroop effect introduced by textual information. (ii) Few-shot learning and Chain-of-Thought (CoT) prompting provide minimal benefits in improving font recognition accuracy across different VLMs. (iii) Attention analysis sheds light on the inherent limitations of VLMs in capturing semantic features.
中文: 现代视觉语言模型在字体识别方面能力有限,难以应对细粒度任务且易受文本信息干扰,结构化基准评估揭示了其固有局限性。
English: Modern Vision-Language Models show limited font recognition capabilities, struggling with fine-grained tasks and being easily disrupted by textual interference, as revealed through a structured benchmark evaluation.

Authors:Zhecheng Li, Guoxian Song, Yujun Cai, Zhen Xiong, Junsong Yuan, Yiwei Wang
Title: Texture or Semantics? Vision-Language Models Get Lost in Font Recognition
Abstract:
Modern Vision-Language Models (VLMs) exhibit remarkable visual and linguistic capabilities, achieving impressive performance in various tasks such as image recognition and object localization. However, their effectiveness in fine-grained tasks remains an open question. In everyday scenarios, individuals encountering design materials, such as magazines, typography tutorials, research papers, or branding content, may wish to identify aesthetically pleasing fonts used in the text. Given their multimodal capabilities and free accessibility, many VLMs are often considered potential tools for font recognition. This raises a fundamental question: Do VLMs truly possess the capability to recognize fonts? To investigate this, we introduce the Font Recognition Benchmark (FRB), a compact and well-structured dataset comprising 15 commonly used fonts. FRB includes two versions: (i) an easy version, where 10 sentences are rendered in different fonts, and (ii) a hard version, where each text sample consists of the names of the 15 fonts themselves, introducing a stroop effect that challenges model perception. Through extensive evaluation of various VLMs on font recognition tasks, we arrive at the following key findings: (i) Current VLMs exhibit limited font recognition capabilities, with many state-of-the-art models failing to achieve satisfactory performance and being easily affected by the stroop effect introduced by textual information. (ii) Few-shot learning and Chain-of-Thought (CoT) prompting provide minimal benefits in improving font recognition accuracy across different VLMs. (iii) Attention analysis sheds light on the inherent limitations of VLMs in capturing semantic features.
中文: 现代视觉语言模型在字体识别方面能力有限,难以应对细粒度任务且易受文本信息干扰,结构化基准评估揭示了其固有局限性。
English: Modern Vision-Language Models show limited font recognition capabilities, struggling with fine-grained tasks and being easily disrupted by textual interference, as revealed through a structured benchmark evaluation.

Authors:Cheng Wang, Yiwei Wang, Yujun Cai, Bryan Hooi
Title: Tricking Retrievers with Influential Tokens: An Efficient Black-Box Corpus Poisoning Attack
Abstract:
Retrieval-augmented generation (RAG) systems enhance large language models by incorporating external knowledge, addressing issues like outdated internal knowledge and hallucination. However, their reliance on external knowledge bases makes them vulnerable to corpus poisoning attacks, where adversarial passages can be injected to manipulate retrieval results. Existing methods for crafting such passages, such as random token replacement or training inversion models, are often slow and computationally expensive, requiring either access to retriever's gradients or large computational resources. To address these limitations, we propose Dynamic Importance-Guided Genetic Algorithm (DIGA), an efficient black-box method that leverages two key properties of retrievers: insensitivity to token order and bias towards influential tokens. By focusing on these characteristics, DIGA dynamically adjusts its genetic operations to generate effective adversarial passages with significantly reduced time and memory usage. Our experimental evaluation shows that DIGA achieves superior efficiency and scalability compared to existing methods, while maintaining comparable or better attack success rates across multiple datasets.
中文: 提出的动态重要性引导遗传算法(DIGA)通过利用检索器的特性,高效生成针对检索增强生成系统的对抗性段落,在显著降低计算成本的同时保持了较高的攻击成功率。
English: The proposed Dynamic Importance-Guided Genetic Algorithm (DIGA) efficiently generates adversarial passages for retrieval-augmented generation systems by exploiting retriever characteristics, achieving high attack success with reduced computational cost.

Authors:Yan Zhuang, Minheng Chen, Chao Cao, Tong Chen, Jing Zhang, Xiaowei Yu, Yanjun Lyu, Lu Zhang, Tianming Liu, Dajiang Zhu
Title: GyralNet Subnetwork Partitioning via Differentiable Spectral Modularity Optimization
Abstract:
Understanding the structural and functional organization of the human brain requires a detailed examination of cortical folding patterns, among which the three-hinge gyrus (3HG) has been identified as a key structural landmark. GyralNet, a network representation of cortical folding, models 3HGs as nodes and gyral crests as edges, highlighting their role as critical hubs in cortico-cortical connectivity. However, existing methods for analyzing 3HGs face significant challenges, including the sub-voxel scale of 3HGs at typical neuroimaging resolutions, the computational complexity of establishing cross-subject correspondences, and the oversimplification of treating 3HGs as independent nodes without considering their community-level relationships. To address these limitations, we propose a fully differentiable subnetwork partitioning framework that employs a spectral modularity maximization optimization strategy to modularize the organization of 3HGs within GyralNet. By incorporating topological structural similarity and DTI-derived connectivity patterns as attribute features, our approach provides a biologically meaningful representation of cortical organization. Extensive experiments on the Human Connectome Project (HCP) dataset demonstrate that our method effectively partitions GyralNet at the individual level while preserving the community-level consistency of 3HGs across subjects, offering a robust foundation for understanding brain connectivity.
中文: 本研究提出了一种可微分子网络划分框架,通过谱模块度最大化优化GyralNet的模块结构,结合拓扑和连接特征,揭示了三铰回在个体间保持一致的群落水平模式。
English: This study introduces a differentiable subnetwork partitioning framework that optimizes GyralNet's modular organization using spectral modularity maximization, integrating topological and connectivity features to reveal consistent community-level patterns of three-hinge gyri across individuals.

Authors:Yu Cui, Bryan Hooi, Yujun Cai, Yiwei Wang
Title: Process or Result? Manipulated Ending Tokens Can Mislead Reasoning LLMs to Ignore the Correct Reasoning Steps
Abstract:
Recent reasoning large language models (LLMs) have demonstrated remarkable improvements in mathematical reasoning capabilities through long Chain-of-Thought. The reasoning tokens of these models enable self-correction within reasoning chains, enhancing robustness. This motivates our exploration: how vulnerable are reasoning LLMs to subtle errors in their input reasoning chains? We introduce "Compromising Thought" (CPT), a vulnerability where models presented with reasoning tokens containing manipulated calculation results tend to ignore correct reasoning steps and adopt incorrect results instead. Through systematic evaluation across multiple reasoning LLMs, we design three increasingly explicit prompting methods to measure CPT resistance, revealing that models struggle significantly to identify and correct these manipulations. Notably, contrary to existing research suggesting structural alterations affect model performance more than content modifications, we find that local ending token manipulations have greater impact on reasoning outcomes than structural changes. Moreover, we discover a security vulnerability in DeepSeek-R1 where tampered reasoning tokens can trigger complete reasoning cessation. Our work enhances understanding of reasoning robustness and highlights security considerations for reasoning-intensive applications.
中文摘要:近期推理大语言模型存在"妥协思维"漏洞,当推理链中的计算令牌被篡改时,模型会采纳错误结果甚至完全停止推理,这颠覆了以往关于结构修改比内容修改影响更大的认知。
English Summary: Recent reasoning LLMs show vulnerability to "Compromising Thought" (CPT), where manipulated calculation tokens cause models to adopt incorrect results and even trigger complete reasoning cessation in some cases, challenging previous assumptions about structural versus content modifications.

Authors:Wenhao You, Bryan Hooi, Yiwei Wang, Youke Wang, Zong Ke, Ming-Hsuan Yang, Zi Huang, Yujun Cai
Title: MIRAGE: Multimodal Immersive Reasoning and Guided Exploration for Red-Team Jailbreak Attacks
Abstract:
While safety mechanisms have significantly progressed in filtering harmful text inputs, MLLMs remain vulnerable to multimodal jailbreaks that exploit their cross-modal reasoning capabilities. We present MIRAGE, a novel multimodal jailbreak framework that exploits narrative-driven context and role immersion to circumvent safety mechanisms in Multimodal Large Language Models (MLLMs). By systematically decomposing the toxic query into environment, role, and action triplets, MIRAGE constructs a multi-turn visual storytelling sequence of images and text using Stable Diffusion, guiding the target model through an engaging detective narrative. This process progressively lowers the model's defences and subtly guides its reasoning through structured contextual cues, ultimately eliciting harmful responses. In extensive experiments on the selected datasets with six mainstream MLLMs, MIRAGE achieves state-of-the-art performance, improving attack success rates by up to 17.5% over the best baselines. Moreover, we demonstrate that role immersion and structured semantic reconstruction can activate inherent model biases, facilitating the model's spontaneous violation of ethical safeguards. These results highlight critical weaknesses in current multimodal safety mechanisms and underscore the urgent need for more robust defences against cross-modal threats.
Chinese: MIRAGE是一种新颖的多模态越狱框架,通过叙事驱动的上下文和角色沉浸绕过多模态大语言模型的安全机制,实现了最先进的攻击成功率,揭示了当前多模态安全系统的脆弱性。
English: MIRAGE is a novel multimodal jailbreak framework that uses narrative-driven context and role immersion to bypass safety mechanisms in MLLMs, achieving state-of-the-art attack success rates and exposing vulnerabilities in current multimodal safety systems.

Authors:Youyu Chen, Junjun Jiang, Kui Jiang, Xiao Tang, Zhihao Li, Xianming Liu, Yinyu Nie
Title: DashGaussian: Optimizing 3D Gaussian Splatting in 200 Seconds
Abstract:
3D Gaussian Splatting (3DGS) renders pixels by rasterizing Gaussian primitives, where the rendering resolution and the primitive number, concluded as the optimization complexity, dominate the time cost in primitive optimization. In this paper, we propose DashGaussian, a scheduling scheme over the optimization complexity of 3DGS that strips redundant complexity to accelerate 3DGS optimization. Specifically, we formulate 3DGS optimization as progressively fitting 3DGS to higher levels of frequency components in the training views, and propose a dynamic rendering resolution scheme that largely reduces the optimization complexity based on this formulation. Besides, we argue that a specific rendering resolution should cooperate with a proper primitive number for a better balance between computing redundancy and fitting quality, where we schedule the growth of the primitives to synchronize with the rendering resolution. Extensive experiments show that our method accelerates the optimization of various 3DGS backbones by 45.7% on average while preserving the rendering quality.
中文: DashGaussian是一种调度方案,通过动态降低渲染分辨率并同步原语增长来加速3D高斯泼溅优化,在保持渲染质量的同时平均减少45.7%的优化时间。
English: DashGaussian is a scheduling scheme that accelerates 3D Gaussian Splatting optimization by dynamically reducing rendering resolution and synchronizing primitive growth, cutting optimization time by 45.7% on average without compromising rendering quality.

Authors:Minheng Chen, Xiaowei Yu, Jing Zhang, Tong Chen, Chao Cao, Yan Zhuang, Yanjun Lyu, Lu Zhang, Tianming Liu, Dajiang Zhu
Title: Core-Periphery Principle Guided State Space Model for Functional Connectome Classification
Abstract:
Understanding the organization of human brain networks has become a central focus in neuroscience, particularly in the study of functional connectivity, which plays a crucial role in diagnosing neurological disorders. Advances in functional magnetic resonance imaging and machine learning techniques have significantly improved brain network analysis. However, traditional machine learning approaches struggle to capture the complex relationships between brain regions, while deep learning methods, particularly Transformer-based models, face computational challenges due to their quadratic complexity in long-sequence modeling. To address these limitations, we propose a Core-Periphery State-Space Model (CP-SSM), an innovative framework for functional connectome classification. Specifically, we introduce Mamba, a selective state-space model with linear complexity, to effectively capture long-range dependencies in functional brain networks. Furthermore, inspired by the core-periphery (CP) organization, a fundamental characteristic of brain networks that enhances efficient information transmission, we design CP-MoE, a CP-guided Mixture-of-Experts that improves the representation learning of brain connectivity patterns. We evaluate CP-SSM on two benchmark fMRI datasets: ABIDE and ADNI. Experimental results demonstrate that CP-SSM surpasses Transformer-based models in classification performance while significantly reducing computational complexity. These findings highlight the effectiveness and efficiency of CP-SSM in modeling brain functional connectivity, offering a promising direction for neuroimaging-based neurological disease diagnosis.
中文: 提出的核心-外围状态空间模型(CP-SSM)结合线性复杂度的Mamba架构与大脑核心-外围组织特性,在功能连接组分类中超越Transformer模型,同时显著降低计算复杂度。
English: The proposed Core-Periphery State-Space Model (CP-SSM) combines a linear-complexity Mamba architecture with brain-inspired core-periphery organization to outperform Transformer models in functional connectome classification while drastically reducing computational costs.

Authors:Shuyang Hao, Yiwei Wang, Bryan Hooi, Jun Liu, Muhao Chen, Zi Huang, Yujun Cai
Title: Making Every Step Effective: Jailbreaking Large Vision-Language Models Through Hierarchical KV Equalization
Abstract:
In the realm of large vision-language models (LVLMs), adversarial jailbreak attacks serve as a red-teaming approach to identify safety vulnerabilities of these models and their associated defense mechanisms. However, we identify a critical limitation: not every adversarial optimization step leads to a positive outcome, and indiscriminately accepting optimization results at each step may reduce the overall attack success rate. To address this challenge, we introduce HKVE (Hierarchical Key-Value Equalization), an innovative jailbreaking framework that selectively accepts gradient optimization results based on the distribution of attention scores across different layers, ensuring that every optimization step positively contributes to the attack. Extensive experiments demonstrate HKVE's significant effectiveness, achieving attack success rates of 75.08% on MiniGPT4, 85.84% on LLaVA and 81.00% on Qwen-VL, substantially outperforming existing methods by margins of 20.43\%, 21.01\% and 26.43\% respectively. Furthermore, making every step effective not only leads to an increase in attack success rate but also allows for a reduction in the number of iterations, thereby lowering computational costs. Warning: This paper contains potentially harmful example data.
中文: HKVE框架根据注意力分数分布选择性接受梯度优化结果,确保每一步都有效增强对抗性越狱攻击,在多个模型上显著提高成功率的同时降低计算成本。
English: The HKVE framework selectively accepts gradient optimization results based on attention score distributions to ensure each step enhances adversarial jailbreak attacks, achieving significantly higher success rates across multiple models while reducing computational costs.

Authors:Shuyang Hao, Yiwei Wang, Bryan Hooi, Ming-Hsuan Yang, Jun Liu, Chengcheng Tang, Zi Huang, Yujun Cai
Title: Tit-for-Tat: Safeguarding Large Vision-Language Models Against Jailbreak Attacks via Adversarial Defense
Abstract:
Deploying large vision-language models (LVLMs) introduces a unique vulnerability: susceptibility to malicious attacks via visual inputs. However, existing defense methods suffer from two key limitations: (1) They solely focus on textual defenses, fail to directly address threats in the visual domain where attacks originate, and (2) the additional processing steps often incur significant computational overhead or compromise model performance on benign tasks. Building on these insights, we propose ESIII (Embedding Security Instructions Into Images), a novel methodology for transforming the visual space from a source of vulnerability into an active defense mechanism. Initially, we embed security instructions into defensive images through gradient-based optimization, obtaining security instructions in the visual dimension. Subsequently, we integrate security instructions from visual and textual dimensions with the input query. The collaboration between security instructions from different dimensions ensures comprehensive security protection. Extensive experiments demonstrate that our approach effectively fortifies the robustness of LVLMs against such attacks while preserving their performance on standard benign tasks and incurring an imperceptible increase in time costs.
中文摘要:ESIII是一种创新防御方法,通过将安全指令嵌入图像,将视觉空间转化为主动防御机制,有效增强大视觉语言模型对抗攻击的鲁棒性,同时保持良性任务性能且几乎不增加时间成本。
English Summary: ESIII is a novel defense method that embeds security instructions into images to protect large vision-language models from visual attacks, effectively enhancing robustness without compromising performance or adding significant computational costs.

Authors:Jing Zhang, Xiaowei Yu, Tong Chen, Chao Cao, Mingheng Chen, Yan Zhuang, Yanjun Lyu, Lu Zhang, Li Su, Tianming Liu, Dajiang Zhu
Title: BrainNet-MoE: Brain-Inspired Mixture-of-Experts Learning for Neurological Disease Identification
Abstract:
The Lewy body dementia (LBD) is the second most common neurodegenerative dementia after Alzheimer's disease (AD). Early differentiation between AD and LBD is crucial because they require different treatment approaches, but this is challenging due to significant clinical overlap, heterogeneity, complex pathogenesis, and the rarity of LBD. While recent advances in artificial intelligence (AI) demonstrate powerful learning capabilities and offer new hope for accurate diagnosis, existing methods primarily focus on designing "neural-level networks". Our work represents a pioneering effort in modeling system-level artificial neural network called BrainNet-MoE for brain modeling and diagnosing. Inspired by the brain's hierarchical organization of bottom-up sensory integration and top-down control, we design a set of disease-specific expert groups to process brain sub-network under different condition, A disease gate mechanism guides the specializa-tion of expert groups, while a transformer layer enables communication be-tween all sub-networks, generating a comprehensive whole-brain represen-tation for downstream disease classification. Experimental results show superior classification accuracy with interpretable insights into how brain sub-networks contribute to different neurodegenerative conditions.
Chinese: 本研究首创了BrainNet-MoE系统级人工神经网络,通过疾病特异性专家组处理脑亚网络并利用转换器实现交互,在区分路易体痴呆与阿尔茨海默病方面实现了高精度分类,并提供了可解释的脑网络贡献机制。
English: The study introduces BrainNet-MoE, a system-level artificial neural network that models brain sub-networks with specialized expert groups and a transformer for communication, achieving high accuracy and interpretable insights in differentiating Lewy body dementia from Alzheimer's disease.

Authors:Wenyu Wang, Mengqi Zhang, Xiaotian Ye, Zhaochun Ren, Zhumin Chen, Pengjie Ren
Title: UIPE: Enhancing LLM Unlearning by Removing Knowledge Related to Forgetting Targets
Abstract:
Large Language Models (LLMs) inevitably acquire harmful information during training on massive datasets. LLM unlearning aims to eliminate the influence of such harmful information while maintaining the model's overall performance. Existing unlearning methods, represented by gradient ascent-based approaches, primarily focus on forgetting target data while overlooking the crucial impact of logically related knowledge on the effectiveness of unlearning. In this paper, through both theoretical and experimental analyses, we first demonstrate that a key reason for the suboptimal unlearning performance is that models can reconstruct the target content through reasoning with logically related knowledge. To address this issue, we propose Unlearning Improvement via Parameter Extrapolation (UIPE), a method that removes knowledge highly correlated with the forgetting targets. Experimental results show that UIPE significantly enhances the performance of various mainstream LLM unlearning methods on the TOFU benchmark.
中文摘要:大型语言模型在遗忘有害信息时效果不佳,因为它们能通过逻辑相关知识重建目标内容,而本文提出的UIPE方法通过消除高度关联知识,显著提升了主流遗忘方法的性能。
English Summary: Large Language Models (LLMs) struggle to effectively forget harmful information because they can reconstruct it through logically related knowledge, but the proposed UIPE method addresses this by removing correlated knowledge and significantly improves unlearning performance.

Authors:Guangfu Guo, Kai Zhang, Bryan Hoo, Yujun Cai, Xiaoqian Lu, Nanyun Peng, Yiwei Wang
Title: Structured Outputs Enable General-Purpose LLMs to be Medical Experts
Abstract:
Medical question-answering (QA) is a critical task for evaluating how effectively large language models (LLMs) encode clinical knowledge and assessing their potential applications in medicine. Despite showing promise on multiple-choice tests, LLMs frequently struggle with open-ended medical questions, producing responses with dangerous hallucinations or lacking comprehensive coverage of critical aspects. Existing approaches attempt to address these challenges through domain-specific fine-tuning, but this proves resource-intensive and difficult to scale across models. To improve the comprehensiveness and factuality of medical responses, we propose a novel approach utilizing structured medical reasoning. Our method guides LLMs through an seven-step cognitive process inspired by clinical diagnosis, enabling more accurate and complete answers without additional training. Experiments on the MedLFQA benchmark demonstrate that our approach achieves the highest Factuality Score of 85.8, surpassing fine-tuned models. Notably, this improvement transfers to smaller models, highlighting the method's efficiency and scalability. Our code and datasets are available.
中文: 本研究提出了一种结构化医学推理方法,通过引导大语言模型遵循七步认知流程,无需额外训练即可显著提升答案的事实性和全面性,在MedLFQA基准测试中以85.8的事实性得分达到最优表现。
English: This study introduces a structured medical reasoning method that guides large language models through a seven-step cognitive process, significantly enhancing answer factuality and comprehensiveness without requiring additional training, as evidenced by achieving the highest Factuality Score of 85.8 on the MedLFQA benchmark.

Authors:Xiangyu Xi, Deyang Kong, Jian Yang, Jiawei Yang, Zhengyu Chen, Wei Wang, Jingang Wang, Xunliang Cai, Shikun Zhang, Wei Ye
Title: SampleMix: A Sample-wise Pre-training Data Mixing Strategey by Coordinating Data Quality and Diversity
Abstract:
Existing pretraining data mixing methods for large language models (LLMs) typically follow a domain-wise methodology, a top-down process that first determines domain weights and then performs uniform data sampling across each domain. However, these approaches neglect significant inter-domain overlaps and commonalities, failing to control the global diversity of the constructed training dataset. Further, uniform sampling within domains ignores fine-grained sample-specific features, potentially leading to suboptimal data distribution. To address these shortcomings, we propose a novel sample-wise data mixture approach based on a bottom-up paradigm. This method performs global cross-domain sampling by systematically evaluating the quality and diversity of each sample, thereby dynamically determining the optimal domain distribution. Comprehensive experiments across multiple downstream tasks and perplexity assessments demonstrate that SampleMix surpasses existing domain-based methods. Meanwhile, SampleMix requires 1.4x to 2.1x training steps to achieves the baselines' performance, highlighting the substantial potential of SampleMix to optimize pre-training data.
中文:现有基于领域的预训练方法忽视了领域间重叠和样本特性,导致数据多样性和分布欠佳,而提出的SampleMix采用自下而上的样本级策略,通过质量和多样性评估动态优化领域分布,虽需更多训练步数,但在多任务中表现更优。
English: Existing domain-wise pretraining methods for LLMs overlook inter-domain overlaps and sample-specific features, leading to suboptimal data diversity and distribution, but the proposed SampleMix approach uses a bottom-up, sample-wise strategy to dynamically optimize domain distribution through quality and diversity evaluation, achieving superior performance across tasks despite requiring more training steps.

Authors:Boyi Ma, Yanguang Zhao, Jie Wang, Guankun Wang, Kun Yuan, Tong Chen, Long Bai, Hongliang Ren
Title: Can DeepSeek Reason Like a Surgeon? An Empirical Evaluation for Vision-Language Understanding in Robotic-Assisted Surgery
Abstract:
The DeepSeek models have shown exceptional performance in general scene understanding, question-answering (QA), and text generation tasks, owing to their efficient training paradigm and strong reasoning capabilities. In this study, we investigate the dialogue capabilities of the DeepSeek model in robotic surgery scenarios, focusing on tasks such as Single Phrase QA, Visual QA, and Detailed Description. The Single Phrase QA tasks further include sub-tasks such as surgical instrument recognition, action understanding, and spatial position analysis. We conduct extensive evaluations using publicly available datasets, including EndoVis18 and CholecT50, along with their corresponding dialogue data. Our empirical study shows that, compared to existing general-purpose multimodal large language models, DeepSeek-VL2 performs better on complex understanding tasks in surgical scenes. Additionally, although DeepSeek-V3 is purely a language model, we find that when image tokens are directly inputted, the model demonstrates better performance on single-sentence QA tasks. However, overall, the DeepSeek models still fall short of meeting the clinical requirements for understanding surgical scenes. Under general prompts, DeepSeek models lack the ability to effectively analyze global surgical concepts and fail to provide detailed insights into surgical scenarios. Based on our observations, we argue that the DeepSeek models are not ready for vision-language tasks in surgical contexts without fine-tuning on surgery-specific datasets.
中文: DeepSeek模型在外科场景理解任务中相比通用多模态模型表现更优,但仍未达到临床要求,需要针对外科数据进行专门优化才能有效应用于医疗领域。
English: DeepSeek models demonstrate superior performance in surgical scene understanding tasks compared to general multimodal models, yet they still fall short of clinical requirements and require surgery-specific fine-tuning for effective application in medical contexts.

Authors:Beilei Cui, Long Bai, Mobarakol Islam, An Wang, Zhiqi Ma, Yiming Huang, Feng Li, Zhen Chen, Zhongliang Jiang, Nassir Navab, Hongliang Ren
Title: Learning to Efficiently Adapt Foundation Models for Self-Supervised Endoscopic 3D Scene Reconstruction from Any Cameras
Abstract:
Accurate 3D scene reconstruction is essential for numerous medical tasks. Given the challenges in obtaining ground truth data, there has been an increasing focus on self-supervised learning (SSL) for endoscopic depth estimation as a basis for scene reconstruction. While foundation models have shown remarkable progress in visual tasks, their direct application to the medical domain often leads to suboptimal results. However, the visual features from these models can still enhance endoscopic tasks, emphasizing the need for efficient adaptation strategies, which still lack exploration currently. In this paper, we introduce Endo3DAC, a unified framework for endoscopic scene reconstruction that efficiently adapts foundation models. We design an integrated network capable of simultaneously estimating depth maps, relative poses, and camera intrinsic parameters. By freezing the backbone foundation model and training only the specially designed Gated Dynamic Vector-Based Low-Rank Adaptation (GDV-LoRA) with separate decoder heads, Endo3DAC achieves superior depth and pose estimation while maintaining training efficiency. Additionally, we propose a 3D scene reconstruction pipeline that optimizes depth maps' scales, shifts, and a few parameters based on our integrated network. Extensive experiments across four endoscopic datasets demonstrate that Endo3DAC significantly outperforms other state-of-the-art methods while requiring fewer trainable parameters. To our knowledge, we are the first to utilize a single network that only requires surgical videos to perform both SSL depth estimation and scene reconstruction tasks. The code will be released upon acceptance.
中文: 本文提出的Endo3DAC框架通过高效适配基础模型,实现了自监督内窥镜深度估计和三维场景重建的统一解决方案,在多个数据集上以更少的可训练参数取得了最优性能。
English: This paper introduces Endo3DAC, a unified framework that efficiently adapts foundation models for self-supervised endoscopic depth estimation and 3D scene reconstruction, achieving state-of-the-art performance with fewer trainable parameters across multiple datasets.

Authors:Bo Chen, Xiaoyu Li, Yekun Ke, Yingyu Liang, Zhenmei Shi, Zhao Song
Title: Exploring the Limits of KV Cache Compression in Visual Autoregressive Transformers
Abstract:
A fundamental challenge in Visual Autoregressive models is the substantial memory overhead required during inference to store previously generated representations. Despite various attempts to mitigate this issue through compression techniques, prior works have not explicitly formalized the problem of KV-cache compression in this context. In this work, we take the first step in formally defining the KV-cache compression problem for Visual Autoregressive transformers. We then establish a fundamental negative result, proving that any mechanism for sequential visual token generation under attention-based architectures must use at least $Ω(n^2 d)$ memory, when $d = Ω(\log n)$, where $n$ is the number of tokens generated and $d$ is the embedding dimensionality. This result demonstrates that achieving truly sub-quadratic memory usage is impossible without additional structural constraints. Our proof is constructed via a reduction from a computational lower bound problem, leveraging randomized embedding techniques inspired by dimensionality reduction principles. Finally, we discuss how sparsity priors on visual representations can influence memory efficiency, presenting both impossibility results and potential directions for mitigating memory overhead.
中文: 视觉自回归模型在推理时面临内存开销的根本性挑战,研究证明基于注意力的架构在生成视觉标记时至少需要Ω(n²d)的内存,若不具备结构性约束则无法实现次平方级的内存压缩。
English: Visual autoregressive models face a fundamental memory bottleneck during inference, with a proven lower bound of Ω(n²d) memory required for sequential token generation under attention-based architectures, making sub-quadratic compression impossible without structural constraints.

Authors:Yifang Chen, Xiaoyu Li, Yingyu Liang, Zhenmei Shi, Zhao Song, Yu Tian
Title: Time and Memory Trade-off of KV-Cache Compression in Tensor Transformer Decoding
Abstract:
The key-value (KV) cache in the tensor version of transformers presents a significant bottleneck during inference. While previous work analyzes the fundamental space complexity barriers in standard attention mechanisms [Haris and Onak, 2025], our work generalizes the space complexity barriers result to tensor attention version. Our theoretical contributions rely on a reduction from communication complexity and deduce the memory lower bound for tensor-structured attention mechanisms when $d = Ω(\log n)$. Furthermore, we introduce two types of tensor attention cache and present a trade-off between time and memory for two scenarios. Overall, our work provides a theoretical foundation for us to understand the time-memory tradeoff of KV-Cache compression in tensor attention decoding and offers more perspectives in developing more memory-efficient tensor attention Transformer architectures.
中文: 本研究将空间复杂度障碍推广至张量注意力机制,通过通信复杂度建立内存下界,并探索KV缓存压缩中的时空权衡,为开发高效Transformer架构提供理论依据。
English: This study extends the space complexity barriers to tensor attention mechanisms, establishes memory lower bounds via communication complexity, and explores time-memory trade-offs in KV-cache compression for developing efficient Transformer architectures.

Authors:Pingrui Zhang, Xianqiang Gao, Yuhan Wu, Kehui Liu, Dong Wang, Zhigang Wang, Bin Zhao, Yan Ding, Xuelong Li
Title: MoMa-Kitchen: A 100K+ Benchmark for Affordance-Grounded Last-Mile Navigation in Mobile Manipulation
Abstract:
In mobile manipulation, navigation and manipulation are often treated as separate problems, resulting in a significant gap between merely approaching an object and engaging with it effectively. Many navigation approaches primarily define success by proximity to the target, often overlooking the necessity for optimal positioning that facilitates subsequent manipulation. To address this, we introduce MoMa-Kitchen, a benchmark dataset comprising over 100k samples that provide training data for models to learn optimal final navigation positions for seamless transition to manipulation. Our dataset includes affordance-grounded floor labels collected from diverse kitchen environments, in which robotic mobile manipulators of different models attempt to grasp target objects amidst clutter. Using a fully automated pipeline, we simulate diverse real-world scenarios and generate affordance labels for optimal manipulation positions. Visual data are collected from RGB-D inputs captured by a first-person view camera mounted on the robotic arm, ensuring consistency in viewpoint during data collection. We also develop a lightweight baseline model, NavAff, for navigation affordance grounding that demonstrates promising performance on the MoMa-Kitchen benchmark. Our approach enables models to learn affordance-based final positioning that accommodates different arm types and platform heights, thereby paving the way for more robust and generalizable integration of navigation and manipulation in embodied AI. Project page: \href{https://momakitchen.github.io/}{https://momakitchen.github.io/}.
Chinese: MoMa-Kitchen基准数据集通过提供超过10万个样本,训练模型学习最优导航位置以实现与操作的顺畅衔接,采用自动化功能标签生成和NavAff基线模型,解决了导航与操作之间的脱节问题。
English: The MoMa-Kitchen benchmark dataset addresses the disconnect between navigation and manipulation by providing over 100,000 samples to train models for optimal positioning, enabling seamless transitions to object interaction through automated affordance labeling and a baseline model called NavAff.

Authors:Chengyue Gong, Xiaoyu Li, Yingyu Liang, Jiangxuan Long, Zhenmei Shi, Zhao Song, Yu Tian
Title: Theoretical Guarantees for High Order Trajectory Refinement in Generative Flows
Abstract:
Flow matching has emerged as a powerful framework for generative modeling, offering computational advantages over diffusion models by leveraging deterministic Ordinary Differential Equations (ODEs) instead of stochastic dynamics. While prior work established the worst case optimality of standard flow matching under Wasserstein distances, the theoretical guarantees for higher-order flow matching - which incorporates acceleration terms to refine sample trajectories - remain unexplored. In this paper, we bridge this gap by proving that higher-order flow matching preserves worst case optimality as a distribution estimator. We derive upper bounds on the estimation error for second-order flow matching, demonstrating that the convergence rates depend polynomially on the smoothness of the target distribution (quantified via Besov spaces) and key parameters of the ODE dynamics. Our analysis employs neural network approximations with carefully controlled depth, width, and sparsity to bound acceleration errors across both small and large time intervals, ultimately unifying these results into a general worst case optimal bound for all time steps.
中文摘要:高阶流匹配作为分布估计器保持了最坏情况下的最优性,其收敛速度与目标分布的平滑度及常微分方程参数呈多项式关系。
English Summary: Higher-order flow matching preserves worst-case optimality as a distribution estimator with convergence rates depending polynomially on target distribution smoothness and ODE parameters.

Authors:Yingyu Liang, Zhizhou Sha, Zhenmei Shi, Zhao Song, Mingda Wan
Title: HOFAR: High-Order Augmentation of Flow Autoregressive Transformers
Abstract:
Flow Matching and Transformer architectures have demonstrated remarkable performance in image generation tasks, with recent work FlowAR [Ren et al., 2024] synergistically integrating both paradigms to advance synthesis fidelity. However, current FlowAR implementations remain constrained by first-order trajectory modeling during the generation process. This paper introduces a novel framework that systematically enhances flow autoregressive transformers through high-order supervision. We provide theoretical analysis and empirical evaluation showing that our High-Order FlowAR (HOFAR) demonstrates measurable improvements in generation quality compared to baseline models. The proposed approach advances the understanding of flow-based autoregressive modeling by introducing a systematic framework for analyzing trajectory dynamics through high-order expansion.
中文摘要:本文提出了一种名为HOFAR的高阶监督框架,通过理论和实验验证,该框架能有效提升流自回归变换器的生成质量。
English Summary: This paper introduces a high-order supervised framework called HOFAR that enhances flow autoregressive transformers, demonstrating improved generation quality through theoretical and empirical analysis.

Authors:Haoyu Zheng, Qifan Yu, Binghe Yu, Yang Dai, Wenqiao Zhang, Juncheng Li, Siliang Tang, Yueting Zhuang
Title: SOYO: A Tuning-Free Approach for Video Style Morphing via Style-Adaptive Interpolation in Diffusion Models
Abstract:
Diffusion models have achieved remarkable progress in image and video stylization. However, most existing methods focus on single-style transfer, while video stylization involving multiple styles necessitates seamless transitions between them. We refer to this smooth style transition between video frames as video style morphing. Current approaches often generate stylized video frames with discontinuous structures and abrupt style changes when handling such transitions. To address these limitations, we introduce SOYO, a novel diffusion-based framework for video style morphing. Our method employs a pre-trained text-to-image diffusion model without fine-tuning, combining attention injection and AdaIN to preserve structural consistency and enable smooth style transitions across video frames. Moreover, we notice that applying linear equidistant interpolation directly induces imbalanced style morphing. To harmonize across video frames, we propose a novel adaptive sampling scheduler operating between two style images. Extensive experiments demonstrate that SOYO outperforms existing methods in open-domain video style morphing, better preserving the structural coherence of video frames while achieving stable and smooth style transitions.
中文摘要:SOYO是一种新颖的基于扩散模型的视频风格渐变框架,通过结合注意力注入和AdaIN技术无需微调即可保持视频帧结构连贯性,实现流畅自然的风格过渡。
English Summary: SOYO is a novel diffusion-based framework that enables smooth video style morphing by combining attention injection and AdaIN without fine-tuning, effectively preserving structural coherence and achieving seamless style transitions.

Authors:Yuefan Cao, Xuyang Guo, Jiayan Huo, Yingyu Liang, Zhenmei Shi, Zhao Song, Jiahao Zhang, Zhen Zhuang
Title: Text-to-Image Diffusion Models Cannot Count, and Prompt Refinement Cannot Help
Abstract:
Generative modeling is widely regarded as one of the most essential problems in today's AI community, with text-to-image generation having gained unprecedented real-world impacts. Among various approaches, diffusion models have achieved remarkable success and have become the de facto solution for text-to-image generation. However, despite their impressive performance, these models exhibit fundamental limitations in adhering to numerical constraints in user instructions, frequently generating images with an incorrect number of objects. While several prior works have mentioned this issue, a comprehensive and rigorous evaluation of this limitation remains lacking. To address this gap, we introduce T2ICountBench, a novel benchmark designed to rigorously evaluate the counting ability of state-of-the-art text-to-image diffusion models. Our benchmark encompasses a diverse set of generative models, including both open-source and private systems. It explicitly isolates counting performance from other capabilities, provides structured difficulty levels, and incorporates human evaluations to ensure high reliability. Extensive evaluations with T2ICountBench reveal that all state-of-the-art diffusion models fail to generate the correct number of objects, with accuracy dropping significantly as the number of objects increases. Additionally, an exploratory study on prompt refinement demonstrates that such simple interventions generally do not improve counting accuracy. Our findings highlight the inherent challenges in numerical understanding within diffusion models and point to promising directions for future improvements.
中文摘要:研究人员开发了T2ICountBench基准测试,发现当前最先进的文生图扩散模型普遍无法准确生成指定数量的物体,且数值复杂度增加时性能显著下降。
English Summary: Researchers have developed T2ICountBench, a benchmark revealing that state-of-the-art text-to-image diffusion models consistently fail to generate accurate object counts, with performance declining as numerical complexity increases.

Authors:Xiangnan Chen, Yuancheng Fang, Qian Xiao, Juncheng Li, Jun Lin, Siliang Tang, Yi Yang, Yueting Zhuang
Title: Chart-HQA: A Benchmark for Hypothetical Question Answering in Charts
Abstract:
Multimodal Large Language Models (MLLMs) have garnered significant attention for their strong visual-semantic understanding. Most existing chart benchmarks evaluate MLLMs' ability to parse information from charts to answer questions. However, they overlook the inherent output biases of MLLMs, where models rely on their parametric memory to answer questions rather than genuinely understanding the chart content. To address this limitation, we introduce a novel Chart Hypothetical Question Answering (HQA) task, which imposes assumptions on the same question to compel models to engage in counterfactual reasoning based on the chart content. Furthermore, we introduce HAI, a human-AI interactive data synthesis approach that leverages the efficient text-editing capabilities of LLMs alongside human expert knowledge to generate diverse and high-quality HQA data at a low cost. Using HAI, we construct Chart-HQA, a challenging benchmark synthesized from publicly available data sources. Evaluation results on 18 MLLMs of varying model sizes reveal that current models face significant generalization challenges and exhibit imbalanced reasoning performance on the HQA task.
中文: 本文提出创新的图表假设问答任务和HAI数据合成方法,以解决多模态大语言模型的输出偏差问题,并通过Chart-HQA基准测试发现现有模型存在泛化能力不足和推理性能不均衡的挑战。
English: This paper introduces a novel Chart Hypothetical Question Answering (HQA) task and the HAI data synthesis method to address MLLMs' output biases, revealing through the Chart-HQA benchmark that current models struggle with generalization and balanced reasoning.

Authors:Yifang Chen, Xuyang Guo, Xiaoyu Li, Yingyu Liang, Zhenmei Shi, Zhao Song
Title: Scaling Law Phenomena Across Regression Paradigms: Multiple and Kernel Approaches
Abstract:
Recently, Large Language Models (LLMs) have achieved remarkable success. A key factor behind this success is the scaling law observed by OpenAI. Specifically, for models with Transformer architecture, the test loss exhibits a power-law relationship with model size, dataset size, and the amount of computation used in training, demonstrating trends that span more than seven orders of magnitude. This scaling law challenges traditional machine learning wisdom, notably the Oscar Scissors principle, which suggests that an overparametrized algorithm will overfit the training datasets, resulting in poor test performance. Recent research has also identified the scaling law in simpler machine learning contexts, such as linear regression. However, fully explaining the scaling law in large practical models remains an elusive goal. In this work, we advance our understanding by demonstrating that the scaling law phenomenon extends to multiple regression and kernel regression settings, which are significantly more expressive and powerful than linear methods. Our analysis provides deeper insights into the scaling law, potentially enhancing our understanding of LLMs.
Chinese: 扩展定律不仅适用于线性回归,还延伸至更复杂的多元回归和核回归,揭示了模型性能与规模、数据等因素之间的幂律关系,有助于深入理解大型语言模型的成功机制。
English: The scaling law, which shows a power-law relationship between model performance and factors like size and data, extends beyond linear regression to more complex multiple and kernel regression, offering deeper insights into the success of large language models.

Authors:Jiazhi Guan, Kaisiyuan Wang, Zhiliang Xu, Quanwei Yang, Yasheng Sun, Shengyi He, Borong Liang, Yukang Cao, Yingying Li, Haocheng Feng, Errui Ding, Jingdong Wang, Youjian Zhao, Hang Zhou, Ziwei Liu
Title: AudCast: Audio-Driven Human Video Generation by Cascaded Diffusion Transformers
Abstract:
Despite the recent progress of audio-driven video generation, existing methods mostly focus on driving facial movements, leading to non-coherent head and body dynamics. Moving forward, it is desirable yet challenging to generate holistic human videos with both accurate lip-sync and delicate co-speech gestures w.r.t. given audio. In this work, we propose AudCast, a generalized audio-driven human video generation framework adopting a cascade Diffusion-Transformers (DiTs) paradigm, which synthesizes holistic human videos based on a reference image and a given audio. 1) Firstly, an audio-conditioned Holistic Human DiT architecture is proposed to directly drive the movements of any human body with vivid gesture dynamics. 2) Then to enhance hand and face details that are well-knownly difficult to handle, a Regional Refinement DiT leverages regional 3D fitting as the bridge to reform the signals, producing the final results. Extensive experiments demonstrate that our framework generates high-fidelity audio-driven holistic human videos with temporal coherence and fine facial and hand details. Resources can be found at https://guanjz20.github.io/projects/AudCast.
中文摘要:本文提出AudCast框架,通过级联扩散变换器实现从音频和参考图像生成具有准确口型同步和自然身体动作的整体人体视频。
English Summary: This paper introduces AudCast, a novel framework using cascade Diffusion-Transformers to generate holistic human videos with synchronized lip movements and natural body gestures from audio inputs and reference images.

Authors:Hao Ni, Lianli Gao, Pengpeng Zeng, Heng Tao Shen, Jingkuan Song
Title: CFReID: Continual Few-shot Person Re-Identification
Abstract:
Real-world surveillance systems are dynamically evolving, requiring a person Re-identification model to continuously handle newly incoming data from various domains. To cope with these dynamics, Lifelong ReID (LReID) has been proposed to learn and accumulate knowledge across multiple domains incrementally. However, LReID models need to be trained on large-scale labeled data for each unseen domain, which are typically inaccessible due to privacy and cost concerns. In this paper, we propose a new paradigm called Continual Few-shot ReID (CFReID), which requires models to be incrementally trained using few-shot data and tested on all seen domains. Under few-shot conditions, CFREID faces two core challenges: 1) learning knowledge from few-shot data of unseen domain, and 2) avoiding catastrophic forgetting of seen domains. To tackle these two challenges, we propose a Stable Distribution Alignment (SDA) framework from feature distribution perspective. Specifically, our SDA is composed of two modules, i.e., Meta Distribution Alignment (MDA) and Prototype-based Few-shot Adaptation (PFA). To support the study of CFReID, we establish an evaluation benchmark for CFReID on five publicly available ReID datasets. Extensive experiments demonstrate that our SDA can enhance the few-shot learning and anti-forgetting capabilities under few-shot conditions. Notably, our approach, using only 5\% of the data, i.e., 32 IDs, significantly outperforms LReID's state-of-the-art performance, which requires 700 to 1,000 IDs.
中文: 本文提出持续小样本行人重识别(CFReID)新范式,通过稳定分布对齐框架解决动态监控系统中少样本学习与抗遗忘两大核心挑战,仅用5%数据量即显著超越现有最优方法。
English: The paper introduces Continual Few-shot ReID (CFReID), a new paradigm that addresses the challenges of learning from limited data and preventing knowledge loss in evolving surveillance systems through a Stable Distribution Alignment framework.

Authors:Yingying Fan, Quanwei Yang, Kaisiyuan Wang, Hang Zhou, Yingying Li, Haocheng Feng, Errui Ding, Yu Wu, Jingdong Wang
Title: Re-HOLD: Video Hand Object Interaction Reenactment via adaptive Layout-instructed Diffusion Model
Abstract:
Current digital human studies focusing on lip-syncing and body movement are no longer sufficient to meet the growing industrial demand, while human video generation techniques that support interacting with real-world environments (e.g., objects) have not been well investigated. Despite human hand synthesis already being an intricate problem, generating objects in contact with hands and their interactions presents an even more challenging task, especially when the objects exhibit obvious variations in size and shape. To tackle these issues, we present a novel video Reenactment framework focusing on Human-Object Interaction (HOI) via an adaptive Layout-instructed Diffusion model (Re-HOLD). Our key insight is to employ specialized layout representation for hands and objects, respectively. Such representations enable effective disentanglement of hand modeling and object adaptation to diverse motion sequences. To further improve the generation quality of HOI, we design an interactive textural enhancement module for both hands and objects by introducing two independent memory banks. We also propose a layout adjustment strategy for the cross-object reenactment scenario to adaptively adjust unreasonable layouts caused by diverse object sizes during inference. Comprehensive qualitative and quantitative evaluations demonstrate that our proposed framework significantly outperforms existing methods. Project page: https://fyycs.github.io/Re-HOLD.
中文:提出的Re-HOLD框架通过自适应布局表征和记忆增强纹理模块,有效解决了人-物交互视频生成的难题,在性能上显著超越现有方法。
English: The proposed Re-HOLD framework addresses the underexplored challenge of generating realistic human-object interaction videos by employing adaptive layout representations and memory-enhanced texture modules, significantly outperforming existing methods.

Authors:Beitao Chen, Xinyu Lyu, Lianli Gao, Jingkuan Song, Heng Tao Shen
Title: Attention Hijackers: Detect and Disentangle Attention Hijacking in LVLMs for Hallucination Mitigation
Abstract:
Despite their success, Large Vision-Language Models (LVLMs) remain vulnerable to hallucinations. While existing studies attribute the cause of hallucinations to insufficient visual attention to image tokens, our findings indicate that hallucinations also arise from interference from instruction tokens during decoding. Intuitively, certain instruction tokens continuously distort LVLMs' visual perception during decoding, hijacking their visual attention toward less discriminative visual regions. This distortion prevents them integrating broader contextual information from images, ultimately leading to hallucinations. We term this phenomenon 'Attention Hijacking', where disruptive instruction tokens act as 'Attention Hijackers'. To address this, we propose a novel, training-free strategy namely Attention HIjackers Detection and Disentanglement (AID), designed to isolate the influence of Hijackers, enabling LVLMs to rely on their context-aware intrinsic attention map. Specifically, AID consists of three components: First, Attention Hijackers Detection identifies Attention Hijackers by calculating instruction-driven visual salience. Next, Attention Disentanglement mechanism is proposed to mask the visual attention of these identified Hijackers, and thereby mitigate their disruptive influence on subsequent tokens. Finally, Re-Disentanglement recalculates the balance between instruction-driven and image-driven visual salience to avoid over-masking effects. Extensive experiments demonstrate that AID significantly reduces hallucination across various LVLMs on several benchmarks.
中文: 大型视觉语言模型因受指令标记“注意力劫持”而产生幻觉,本文提出的免训练AID方法通过检测并分离劫持因子来恢复模型的视觉注意力,从而有效缓解该问题。
English: Large Vision-Language Models suffer from hallucinations due to 'Attention Hijacking' by disruptive instruction tokens, which is effectively mitigated by the proposed training-free AID method that detects and disentangles these hijackers to restore proper visual attention.

Authors:Xuanhan Wang, Huimin Deng, Lianli Gao, Jingkuan Song
Title: Scale-Aware Pre-Training for Human-Centric Visual Perception: Enabling Lightweight and Generalizable Models
Abstract:
Human-centric visual perception (HVP) has recently achieved remarkable progress due to advancements in large-scale self-supervised pretraining (SSP). However, existing HVP models face limitations in adapting to real-world applications, which require general visual patterns for downstream tasks while maintaining computationally sustainable costs to ensure compatibility with edge devices. These limitations primarily arise from two issues: 1) the pretraining objectives focus solely on specific visual patterns, limiting the generalizability of the learned patterns for diverse downstream tasks; and 2) HVP models often exhibit excessively large model sizes, making them incompatible with real-world applications.To address these limitations, we introduce Scale-Aware Image Pretraining (SAIP), a novel SSP framework pretraining lightweight vision models to acquire general patterns for HVP. Specifically, SAIP incorporates three learning objectives based on the principle of cross-scale consistency: 1) Cross-scale Matching (CSM) which contrastively learns image-level invariant patterns from multi-scale single-person images; 2) Cross-scale Reconstruction (CSR) which learns pixel-level consistent visual structures from multi-scale masked single-person images; and 3) Cross-scale Search (CSS) which learns to capture diverse patterns from multi-scale multi-person images. Three objectives complement one another, enabling lightweight models to learn multi-scale generalizable patterns essential for HVP downstream tasks.Extensive experiments conducted across 12 HVP datasets demonstrate that SAIP exhibits remarkable generalization capabilities across 9 human-centric vision tasks. Moreover, it achieves significant performance improvements over existing methods, with gains of 3%-13% in single-person discrimination tasks, 1%-11% in dense prediction tasks, and 1%-6% in multi-person visual understanding tasks.
中文: 尺度感知图像预训练(SAIP)框架通过跨尺度一致性目标预训练轻量模型,解决了以人为中心的视觉感知在泛化性和计算成本方面的局限,并在多类任务中实现显著性能提升。
English: The Scale-Aware Image Pretraining (SAIP) framework addresses limitations in human-centric visual perception by pretraining lightweight models with cross-scale consistency objectives, achieving significant performance gains across diverse tasks.

Authors:Kaiyue Feng, Yilun Zhao, Yixin Liu, Tianyu Yang, Chen Zhao, John Sous, Arman Cohan
Title: PHYSICS: Benchmarking Foundation Models on University-Level Physics Problem Solving
Abstract:
We introduce PHYSICS, a comprehensive benchmark for university-level physics problem solving. It contains 1297 expert-annotated problems covering six core areas: classical mechanics, quantum mechanics, thermodynamics and statistical mechanics, electromagnetism, atomic physics, and optics. Each problem requires advanced physics knowledge and mathematical reasoning. We develop a robust automated evaluation system for precise and reliable validation. Our evaluation of leading foundation models reveals substantial limitations. Even the most advanced model, o3-mini, achieves only 59.9% accuracy, highlighting significant challenges in solving high-level scientific problems. Through comprehensive error analysis, exploration of diverse prompting strategies, and Retrieval-Augmented Generation (RAG)-based knowledge augmentation, we identify key areas for improvement, laying the foundation for future advancements.
Chinese: 我们推出了PHYSICS,一个大学物理综合基准测试集,用于评估高阶问题解决能力,并揭示当前基础模型的显著局限性,最优模型准确率仅为59.9%。
English: We introduce PHYSICS, a comprehensive benchmark for university-level physics that evaluates advanced problem-solving skills and reveals significant limitations in current foundation models, with the best achieving only 59.9% accuracy.

Authors:Yunhai Hu, Yilun Zhao, Chen Zhao, Arman Cohan
Title: MCTS-RAG: Enhancing Retrieval-Augmented Generation with Monte Carlo Tree Search
Abstract:
We introduce MCTS-RAG, a novel approach that enhances the reasoning capabilities of small language models on knowledge-intensive tasks by leveraging retrieval-augmented generation (RAG) to provide relevant context and Monte Carlo Tree Search (MCTS) to refine reasoning paths. MCTS-RAG dynamically integrates retrieval and reasoning through an iterative decision-making process. Unlike standard RAG methods, which typically retrieve information independently from reasoning and thus integrate knowledge suboptimally, or conventional MCTS reasoning, which depends solely on internal model knowledge without external facts, MCTS-RAG combines structured reasoning with adaptive retrieval. This integrated approach enhances decision-making, reduces hallucinations, and ensures improved factual accuracy and response consistency. The experimental results on multiple reasoning and knowledge-intensive datasets datasets (i.e., ComplexWebQA, GPQA, and FoolMeTwice) show that our method enables small-scale LMs to achieve performance comparable to frontier LLMs like GPT-4o by effectively scaling inference-time compute, setting a new standard for reasoning in small-scale models.
中文: MCTS-RAG方法通过将检索增强生成与蒙特卡洛树搜索相结合,动态优化推理路径与外部知识整合,使小型语言模型在知识密集型任务中达到与GPT-4o等前沿模型相媲美的性能表现。
English: MCTS-RAG enhances small language models' reasoning on knowledge tasks by integrating retrieval-augmented generation with Monte Carlo Tree Search, enabling dynamic context refinement and achieving performance comparable to advanced models like GPT-4o.

Authors:Yunhai Hu, Yilun Zhao, Chen Zhao, Arman Cohan
Title: MCTS-RAG: Enhancing Retrieval-Augmented Generation with Monte Carlo Tree Search
Abstract:
We introduce MCTS-RAG, a novel approach that enhances the reasoning capabilities of small language models on knowledge-intensive tasks by leveraging retrieval-augmented generation (RAG) to provide relevant context and Monte Carlo Tree Search (MCTS) to refine reasoning paths. MCTS-RAG dynamically integrates retrieval and reasoning through an iterative decision-making process. Unlike standard RAG methods, which typically retrieve information independently from reasoning and thus integrate knowledge suboptimally, or conventional MCTS reasoning, which depends solely on internal model knowledge without external facts, MCTS-RAG combines structured reasoning with adaptive retrieval. This integrated approach enhances decision-making, reduces hallucinations, and ensures improved factual accuracy and response consistency. The experimental results on multiple reasoning and knowledge-intensive datasets datasets (i.e., ComplexWebQA, GPQA, and FoolMeTwice) show that our method enables small-scale LMs to achieve performance comparable to frontier LLMs like GPT-4o by effectively scaling inference-time compute, setting a new standard for reasoning in small-scale models.
中文: MCTS-RAG方法通过将检索增强生成与蒙特卡洛树搜索相结合,动态优化推理路径与外部知识整合,使小型语言模型在知识密集型任务中达到与GPT-4o等前沿模型相媲美的性能表现。
English: MCTS-RAG enhances small language models' reasoning on knowledge tasks by integrating retrieval-augmented generation with Monte Carlo Tree Search, enabling dynamic context refinement and achieving performance comparable to advanced models like GPT-4o.

Authors:Yuxuan Lu, Jing Huang, Yan Han, Bingsheng Yao, Sisong Bei, Jiri Gesi, Yaochen Xie, Zheshen, Wang, Qi He, Dakuo Wang
Title: Prompting is Not All You Need! Evaluating LLM Agent Simulation Methodologies with Real-World Online Customer Behavior Data
Abstract:
Recent research shows that LLMs can simulate ``believable'' human behaviors to power LLM agents via prompt-only methods. In this work, we focus on evaluating LLM's objective ``accuracy'' rather than the subjective ``believability'' in simulating human behavior, leveraging a large-scale, real-world dataset collected from customers' online shopping actions. We present the first comprehensive evaluation of state-of-the-art LLMs (e.g., DeepSeek-R1, Llama, and Claude) on the task of web shopping action generation. Our results show that out-of-the-box LLM-generated actions are often misaligned with actual human behavior, whereas fine-tuning LLMs on real-world behavioral data substantially improves their ability to generate accurate actions compared to prompt-only methods. Furthermore, incorporating synthesized reasonings into model training leads to additional performance gains, demonstrating the value of explicit rationale in behavior modeling. This work evaluates state-of-the-art LLMs in behavior simulation and provides actionable insights into how real-world action data can enhance the fidelity of LLM agents.
中文: 本研究首次建立了评估大语言模型模拟人类购物行为的严格基准,发现基于提示的方法仅实现11.86%的行为生成准确率,而通过真实人类数据微调的模型可将性能显著提升5.4-13.85%。
English: This study establishes the first rigorous benchmark for evaluating LLM agents' ability to simulate human shopping behavior, revealing that prompt-based methods achieve only 11.86% accuracy while demonstrating that fine-tuning with real human data can significantly improve performance by 5.4-13.85%.

Authors:Yuxuan Lu, Jing Huang, Yan Han, Bingsheng Yao, Sisong Bei, Jiri Gesi, Yaochen Xie, Zheshen, Wang, Qi He, Dakuo Wang
Title: Can LLM Agents Simulate Multi-Turn Human Behavior? Evidence from Real Online Customer Behavior Data
Abstract:
Recent research shows that LLM Agents can generate ``believable'' human behaviors via prompt-only methods, and such agents have been increasingly adopted in downstream applications. However, existing evaluation of these agents only focuses on qualitative believability (whether human raters think they are accurate), leaving open questions of whether LLM agents can accurately generate step-by-step actions mimicking a particular human's behavior in a multi-turn interaction task. In this work, we take shopping as a case study and present the first large-scale quantitative evaluation of state-of-the-art LLMs' ability to accurately simulate human behavior. Using real-world data from 31,865 online shopping sessions containing 230,965 user actions, our evaluation reveals that prompt-based LLMs (DeepSeek-R1, Llama, Claude) achieve only 11.86% accuracy in generating human actions, highlighting a substantial gap in actual behavioral accuracy. Through experiments, we also showcase that strategies as simple as fine-tuning LLMs on real human click-through data augmented with synthesized reasoning traces can greatly enhance models' performance. The fine-tuned Qwen2.5-7B achieves 17.26% action generation accuracy and 33.86% F1 score on final purchase prediction, representing substantial improvements of 5.4% and 13.85% over prompt-only baselines. This work establishes the first rigorous benchmark for human behavior simulation and provides actionable insights for developing more accurate LLM agents for future downstream applications.
中文: 本研究首次建立了评估大语言模型模拟人类购物行为的严格基准,发现基于提示的方法仅实现11.86%的行为生成准确率,而通过真实人类数据微调的模型可将性能显著提升5.4-13.85%。
English: This study establishes the first rigorous benchmark for evaluating LLM agents' ability to simulate human shopping behavior, revealing that prompt-based methods achieve only 11.86% accuracy while demonstrating that fine-tuning with real human data can significantly improve performance by 5.4-13.85%.

Authors:Jiaqi Liao, Yuwei Niu, Fanqing Meng, Hao Li, Changyao Tian, Yinuo Du, Yuwen Xiong, Dianqi Li, Xizhou Zhu, Li Yuan, Jifeng Dai, Yu Cheng
Title: LangBridge: Interpreting Image as a Combination of Language Embeddings
Abstract:
Recent years have witnessed remarkable advances in Large Vision-Language Models (LVLMs), which have achieved human-level performance across various complex vision-language tasks. Following LLaVA's paradigm, mainstream LVLMs typically employ a shallow MLP for visual-language alignment through a two-stage training process: pretraining for cross-modal alignment followed by instruction tuning. While this approach has proven effective, the underlying mechanisms of how MLPs bridge the modality gap remain poorly understood. Although some research has explored how LLMs process transformed visual tokens, few studies have investigated the fundamental alignment mechanism. Furthermore, the MLP adapter requires retraining whenever switching LLM backbones. To address these limitations, we first investigate the working principles of MLP adapters and discover that they learn to project visual embeddings into subspaces spanned by corresponding text embeddings progressively. Based on this insight, we propose LangBridge, a novel adapter that explicitly maps visual tokens to linear combinations of LLM vocabulary embeddings. This innovative design enables pretraining-free adapter transfer across different LLMs while maintaining performance. Our experimental results demonstrate that a LangBridge adapter pre-trained on Qwen2-0.5B can be directly applied to larger models such as LLaMA3-8B or Qwen2.5-14B while maintaining competitive performance. Overall, LangBridge enables interpretable vision-language alignment by grounding visual representations in LLM vocab embedding, while its plug-and-play design ensures efficient reuse across multiple LLMs with nearly no performance degradation. See our project page at https://jiaqiliao77.github.io/LangBridge.github.io/
中文摘要:近年来大型视觉语言模型虽取得显著进展,但其跨模态对齐机制尚不明确且需针对不同骨干网络重新训练,为此提出的LangBridge适配器通过将视觉表征映射至语言模型词嵌入空间,实现了无需预训练的跨模型迁移,在保持性能的同时增强了可解释性。
English Summary: Recent advances in Large Vision-Language Models (LVLMs) have achieved human-level performance, but their underlying alignment mechanisms remain poorly understood and require retraining for different backbones, leading to the development of LangBridge, a novel adapter that enables pretraining-free transfer across LLMs while maintaining performance through interpretable visual-to-text embedding mapping.

Authors:Kangwei Liu, Mengru Wang, Yujie Luo, Yuan Lin, Mengshu Sun, Lei Liang, Zhiqiang Zhang, Jun Zhou, Bryan Hooi, Shumin Deng
Title: LookAhead Tuning: Safer Language Models via Partial Answer Previews
Abstract:
Fine-tuning enables large language models (LLMs) to adapt to specific domains, but often compromises their previously established safety alignment. To mitigate the degradation of model safety during fine-tuning, we introduce LookAhead Tuning, a lightweight and effective data-driven approach that preserves safety during fine-tuning. The method introduces two simple strategies that modify training data by previewing partial answer prefixes, thereby minimizing perturbations to the model's initial token distributions and maintaining its built-in safety mechanisms. Comprehensive experiments demonstrate that LookAhead Tuning effectively maintains model safety without sacrificing robust performance on downstream tasks. Our findings position LookAhead Tuning as a reliable and efficient solution for the safe and effective adaptation of LLMs.
中文: LookAhead Tuning是一种轻量级的数据驱动方法,通过在训练数据中引入部分答案前缀来保持大型语言模型在微调过程中的安全对齐,确保在不牺牲安全性的前提下维持强大的下游任务性能。
English: LookAhead Tuning is a lightweight, data-driven method that preserves the safety alignment of large language models during fine-tuning by modifying training data with partial answer prefixes, ensuring robust performance without compromising safety.

Authors:Huiqiang Chen, Tianqing Zhu, Linlin Wang, Xin Yu, Longxiang Gao, Wanlei Zhou
Title: Safe and Reliable Diffusion Models via Subspace Projection
Abstract:
Large-scale text-to-image (T2I) diffusion models have revolutionized image generation, enabling the synthesis of highly detailed visuals from textual descriptions. However, these models may inadvertently generate inappropriate content, such as copyrighted works or offensive images. While existing methods attempt to eliminate specific unwanted concepts, they often fail to ensure complete removal, allowing the concept to reappear in subtle forms. For instance, a model may successfully avoid generating images in Van Gogh's style when explicitly prompted with 'Van Gogh', yet still reproduce his signature artwork when given the prompt 'Starry Night'. In this paper, we propose SAFER, a novel and efficient approach for thoroughly removing target concepts from diffusion models. At a high level, SAFER is inspired by the observed low-dimensional structure of the text embedding space. The method first identifies a concept-specific subspace $S_c$ associated with the target concept c. It then projects the prompt embeddings onto the complementary subspace of $S_c$, effectively erasing the concept from the generated images. Since concepts can be abstract and difficult to fully capture using natural language alone, we employ textual inversion to learn an optimized embedding of the target concept from a reference image. This enables more precise subspace estimation and enhances removal performance. Furthermore, we introduce a subspace expansion strategy to ensure comprehensive and robust concept erasure. Extensive experiments demonstrate that SAFER consistently and effectively erases unwanted concepts from diffusion models while preserving generation quality.
中文摘要:SAFER是一种创新方法,通过识别目标概念的子空间并将提示嵌入投影至其补空间,有效彻底地从文本到图像扩散模型中移除不良概念,同时保持生成图像质量。
English Summary: SAFER is a novel approach that effectively removes unwanted concepts from text-to-image diffusion models by identifying and projecting prompt embeddings away from the concept's subspace, ensuring thorough erasure while maintaining image quality.

Authors:Asaf Yehudai, Lilach Eden, Alan Li, Guy Uziel, Yilun Zhao, Roy Bar-Haim, Arman Cohan, Michal Shmueli-Scheuer
Title: Survey on Evaluation of LLM-based Agents
Abstract:
The emergence of LLM-based agents represents a paradigm shift in AI, enabling autonomous systems to plan, reason, use tools, and maintain memory while interacting with dynamic environments. This paper provides the first comprehensive survey of evaluation methodologies for these increasingly capable agents. We systematically analyze evaluation benchmarks and frameworks across four critical dimensions: (1) fundamental agent capabilities, including planning, tool use, self-reflection, and memory; (2) application-specific benchmarks for web, software engineering, scientific, and conversational agents; (3) benchmarks for generalist agents; and (4) frameworks for evaluating agents. Our analysis reveals emerging trends, including a shift toward more realistic, challenging evaluations with continuously updated benchmarks. We also identify critical gaps that future research must address-particularly in assessing cost-efficiency, safety, and robustness, and in developing fine-grained, and scalable evaluation methods. This survey maps the rapidly evolving landscape of agent evaluation, reveals the emerging trends in the field, identifies current limitations, and proposes directions for future research.
中文: 本文首次系统综述了大语言模型智能体的评估方法,从核心能力、应用场景和评估框架等维度分析评测基准,揭示了评估向真实化发展的趋势,并指出在安全性、鲁棒性等维度存在的不足。
English: This paper presents the first comprehensive survey of evaluation methodologies for LLM-based agents, analyzing benchmarks across core capabilities, applications, and frameworks while identifying trends toward realistic testing and gaps in safety and scalability.

Authors:Jiazheng Li, Lu Yu, Qing Cui, Zhiqiang Zhang, Jun Zhou, Yanfang Ye, Chuxu Zhang
Title: MASS: Mathematical Data Selection via Skill Graphs for Pretraining Large Language Models
Abstract:
High-quality data plays a critical role in the pretraining and fine-tuning of large language models (LLMs), even determining their performance ceiling to some degree. Consequently, numerous data selection methods have been proposed to identify subsets of data that can effectively and efficiently enhance model performance. However, most of these methods focus on general data selection and tend to overlook the specific nuances of domain-related data. In this paper, we introduce MASS, a \textbf{MA}thematical data \textbf{S}election framework using the \textbf{S}kill graph for pretraining LLMs in the mathematical reasoning domain. By taking into account the unique characteristics of mathematics and reasoning, we construct a skill graph that captures the mathematical skills and their interrelations from a reference dataset. This skill graph guides us in assigning quality scores to the target dataset, enabling us to select the top-ranked subset which is further used to pretrain LLMs. Experimental results demonstrate the efficiency and effectiveness of MASS across different model sizes (1B and 7B) and pretraining datasets (web data and synthetic data). Specifically, in terms of efficiency, models trained on subsets selected by MASS can achieve similar performance to models trained on the original datasets, with a significant reduction in the number of trained tokens - ranging from 50\% to 70\% fewer tokens. In terms of effectiveness, when trained on the same amount of tokens, models trained on the data selected by MASS outperform those trained on the original datasets by 3.3\% to 5.9\%. These results underscore the potential of MASS to improve both the efficiency and effectiveness of pretraining LLMs.
高质量数据对训练大型语言模型至关重要,而MASS框架能高效筛选数学推理数据子集,在减少50-70%训练数据量的同时提升模型性能3.3-5.9%。
High-quality data is crucial for training large language models, and the MASS framework efficiently selects optimal mathematical reasoning subsets to enhance model performance while reducing token usage by 50-70%.

Authors:Zhaopan Xu, Pengfei Zhou, Weidong Tang, Jiaxin Ai, Wangbo Zhao, Kai Wang, Xiaojiang Peng, Wenqi Shao, Hongxun Yao, Kaipeng Zhang
Title: PEBench: A Fictitious Dataset to Benchmark Machine Unlearning for Multimodal Large Language Models
Abstract:
Multimodal large language models (MLLMs) have achieved remarkable success in vision-language tasks, but their reliance on vast, internet-sourced data raises significant privacy and security concerns. Machine unlearning (MU) has emerged as a critical technique to address these issues, enabling the selective removal of targeted information from pre-trained models without costly retraining. However, the evaluation of MU for MLLMs remains inadequate. Existing benchmarks often lack a comprehensive scope, focusing narrowly on entities while overlooking the unlearning of broader visual concepts and the inherent semantic coupling between them. To bridge this gap, we introduce, PEBench, a novel benchmark designed to facilitate a thorough assessment of MU in MLLMs. PEBench features a fictitious dataset of personal entities and corresponding event scenes to evaluate unlearning across these distinct yet entangled concepts. We leverage this benchmark to evaluate five MU methods, revealing their unique strengths and weaknesses. Our findings show that unlearning one concept can unintentionally degrade performance on related concepts within the same image, a challenge we term cross-concept interference. Furthermore, we demonstrate the difficulty of unlearning person and event concepts simultaneously and propose an effective method to mitigate these conflicting objectives. The source code and benchmark are publicly available at https://pebench.github.io.
Chinese: 多模态大语言模型依赖网络数据引发隐私担忧,机器遗忘技术应运而生,但现有评估体系不完善,为此我们推出PEBench基准,全面评估概念遗忘效果,揭示了跨概念干扰等难题,并提出解决方案。
English: Multimodal large language models face privacy risks from web data, prompting the development of machine unlearning, yet current benchmarks inadequately assess its effectiveness, leading to the creation of PEBench to evaluate unlearning across interconnected concepts and identify challenges like cross-concept interference.

Authors:Zhaopan Xu, Pengfei Zhou, Jiaxin Ai, Wangbo Zhao, Kai Wang, Xiaojiang Peng, Wenqi Shao, Hongxun Yao, Kaipeng Zhang
Title: MPBench: A Comprehensive Multimodal Reasoning Benchmark for Process Errors Identification
Abstract:
Reasoning is an essential capacity for large language models (LLMs) to address complex tasks, where the identification of process errors is vital for improving this ability. Recently, process-level reward models (PRMs) were proposed to provide step-wise rewards that facilitate reinforcement learning and data production during training and guide LLMs toward correct steps during inference, thereby improving reasoning accuracy. However, existing benchmarks of PRMs are text-based and focus on error detection, neglecting other scenarios like reasoning search. To address this gap, we introduce MPBench, a comprehensive, multi-task, multimodal benchmark designed to systematically assess the effectiveness of PRMs in diverse scenarios. MPBench employs three evaluation paradigms, each targeting a specific role of PRMs in the reasoning process: (1) Step Correctness, which assesses the correctness of each intermediate reasoning step; (2) Answer Aggregation, which aggregates multiple solutions and selects the best one; and (3) Reasoning Process Search, which guides the search for optimal reasoning steps during inference. Through these paradigms, MPBench makes comprehensive evaluations and provides insights into the development of multimodal PRMs.
Chinese Summary: MPBench是一个全面的多模态基准,通过三个评估范式——步骤正确性、答案聚合和推理过程搜索——来系统评估过程级奖励模型在不同场景下的有效性,以提升大型语言模型的推理能力。
English Summary: MPBench is a comprehensive multimodal benchmark designed to evaluate process-level reward models (PRMs) across three evaluation paradigms—step correctness, answer aggregation, and reasoning process search—to enhance reasoning accuracy in large language models.

Authors:Lexin Zhou, Lorenzo Pacchiardi, Fernando Martínez-Plumed, Katherine M. Collins, Yael Moros-Daval, Seraphina Zhang, Qinlin Zhao, Yitian Huang, Luning Sun, Jonathan E. Prunty, Zongqian Li, Pablo Sánchez-García, Kexin Jiang Chen, Pablo A. M. Casares, Jiyun Zu, John Burden, Behzad Mehrbakhsh, David Stillwell, Manuel Cebrian, Jindong Wang, Peter Henderson, Sherry Tongshuang Wu, Patrick C. Kyllonen, Lucy Cheke, Xing Xie, José Hernández-Orallo
Title: General Scales Unlock AI Evaluation with Explanatory and Predictive Power
Abstract:
Ensuring safe and effective use of AI requires understanding and anticipating its performance on novel tasks, from advanced scientific challenges to transformed workplace activities. So far, benchmarking has guided progress in AI, but it has offered limited explanatory and predictive power for general-purpose AI systems, given the low transferability across diverse tasks. In this paper, we introduce general scales for AI evaluation that can explain what common AI benchmarks really measure, extract ability profiles of AI systems, and predict their performance for new task instances, in- and out-of-distribution. Our fully-automated methodology builds on 18 newly-crafted rubrics that place instance demands on general scales that do not saturate. Illustrated for 15 large language models and 63 tasks, high explanatory power is unleashed from inspecting the demand and ability profiles, bringing insights on the sensitivity and specificity exhibited by different benchmarks, and how knowledge, metacognition and reasoning are affected by model size, chain-of-thought and distillation. Surprisingly, high predictive power at the instance level becomes possible using these demand levels, providing superior estimates over black-box baseline predictors based on embeddings or finetuning, especially in out-of-distribution settings (new tasks and new benchmarks). The scales, rubrics, battery, techniques and results presented here represent a major step for AI evaluation, underpinning the reliable deployment of AI in the years ahead. (Collaborative platform: https://kinds-of-intelligence-cfi.github.io/ADELE.)
中文: 本文提出了通用的人工智能评估量表,通过分析需求和能力概况,增强了评估的解释和预测能力,利用自动化量规和优化的性能预测,特别是在分布外场景中,确保人工智能的可靠部署。
English: This paper introduces general scales for AI evaluation that enhance explanatory and predictive power by analyzing demand and ability profiles, enabling reliable deployment through automated rubrics and improved performance forecasting, especially in out-of-distribution scenarios.

Authors:Huan Tian, Guangsheng Zhang, Bo Liu, Tianqing Zhu, Ming Ding, Wanlei Zhou
Title: Do Fairness Interventions Come at the Cost of Privacy: Evaluations for Binary Classifiers
Abstract:
While in-processing fairness approaches show promise in mitigating biased predictions, their potential impact on privacy leakage remains under-explored. We aim to address this gap by assessing the privacy risks of fairness-enhanced binary classifiers via membership inference attacks (MIAs) and attribute inference attacks (AIAs). Surprisingly, our results reveal that enhancing fairness does not necessarily lead to privacy compromises. For example, these fairness interventions exhibit increased resilience against MIAs and AIAs. This is because fairness interventions tend to remove sensitive information among extracted features and reduce confidence scores for the majority of training data for fairer predictions. However, during the evaluations, we uncover a potential threat mechanism that exploits prediction discrepancies between fair and biased models, leading to advanced attack results for both MIAs and AIAs. This mechanism reveals potent vulnerabilities of fair models and poses significant privacy risks of current fairness methods. Extensive experiments across multiple datasets, attack methods, and representative fairness approaches confirm our findings and demonstrate the efficacy of the uncovered mechanism. Our study exposes the under-explored privacy threats in fairness studies, advocating for thorough evaluations of potential security vulnerabilities before model deployments.
中文: 增强二元分类器的公平性虽能通过去除敏感特征和降低置信度意外提升对隐私攻击的抵御力,但也因公平与偏见模型间的预测差异而催生新的威胁机制,暴露出严重的隐私漏洞。
English: Enhancing fairness in binary classifiers can unexpectedly improve resilience against privacy attacks by removing sensitive features and reducing confidence scores, yet it also introduces a new threat mechanism exploiting prediction discrepancies between fair and biased models, revealing significant vulnerabilities.

Authors:Ling Team, Binwei Zeng, Chao Huang, Chao Zhang, Changxin Tian, Cong Chen, Dingnan Jin, Feng Yu, Feng Zhu, Feng Yuan, Fakang Wang, Gangshan Wang, Guangyao Zhai, Haitao Zhang, Huizhong Li, Jun Zhou, Jia Liu, Junpeng Fang, Junjie Ou, Jun Hu, Ji Luo, Ji Zhang, Jian Liu, Jian Sha, Jianxue Qian, Jiewei Wu, Junping Zhao, Jianguo Li, Jubao Feng, Jingchao Di, Junming Xu, Jinghua Yao, Kuan Xu, Kewei Du, Longfei Li, Lei Liang, Lu Yu, Li Tang, Lin Ju, Peng Xu, Qing Cui, Song Liu, Shicheng Li, Shun Song, Song Yan, Tengwei Cai, Tianyi Chen, Ting Guo, Ting Huang, Tao Feng, Tao Wu, Wei Wu, Xiaolu Zhang, Xueming Yang, Xin Zhao, Xiaobo Hu, Xin Lin, Yao Zhao, Yilong Wang, Yongzhen Guo, Yuanyuan Wang, Yue Yang, Yang Cao, Yuhao Fu, Yi Xiong, Yanzhe Li, Zhe Li, Zhiqiang Zhang, Ziqi Liu, Zhaoxin Huan, Zujie Wen, Zhenhang Sun, Zhuoxuan Du, Zhengyu He
Title: Every FLOP Counts: Scaling a 300B Mixture-of-Experts LING LLM without Premium GPUs
Abstract:
In this technical report, we tackle the challenges of training large-scale Mixture of Experts (MoE) models, focusing on overcoming cost inefficiency and resource limitations prevalent in such systems. To address these issues, we present two differently sized MoE large language models (LLMs), namely Ling-Lite and Ling-Plus (referred to as "Bailing" in Chinese, spelled Bǎilíng in Pinyin). Ling-Lite contains 16.8 billion parameters with 2.75 billion activated parameters, while Ling-Plus boasts 290 billion parameters with 28.8 billion activated parameters. Both models exhibit comparable performance to leading industry benchmarks. This report offers actionable insights to improve the efficiency and accessibility of AI development in resource-constrained settings, promoting more scalable and sustainable technologies. Specifically, to reduce training costs for large-scale MoE models, we propose innovative methods for (1) optimization of model architecture and training processes, (2) refinement of training anomaly handling, and (3) enhancement of model evaluation efficiency. Additionally, leveraging high-quality data generated from knowledge graphs, our models demonstrate superior capabilities in tool use compared to other models. Ultimately, our experimental findings demonstrate that a 300B MoE LLM can be effectively trained on lower-performance devices while achieving comparable performance to models of a similar scale, including dense and MoE models. Compared to high-performance devices, utilizing a lower-specification hardware system during the pre-training phase demonstrates significant cost savings, reducing computing costs by approximately 20%. The models can be accessed at https://huggingface.co/inclusionAI.
本报告介绍了两种高效能的MoE大语言模型——灵珑轻量版和增强版,它们通过优化架构与训练流程,在保证业界领先性能的同时显著降低了计算成本,为资源受限环境提供了更可及的AI开发方案。
This report introduces two cost-efficient MoE large language models, Ling-Lite and Ling-Plus, which achieve competitive performance while reducing training expenses through optimized architecture and processes, making advanced AI more accessible in resource-limited environments.

Authors:Hongzhi Luan, Changxin Tian, Zhaoxin Huan, Xiaolu Zhang, Kunlong Chen, Zhiqiang Zhang, Jun Zhou
Title: BOSE: A Systematic Evaluation Method Optimized for Base Models
Abstract:
This paper poses two critical issues in evaluating base models (without post-training): (1) Unstable evaluation during training: in the early stages of pre-training, the models lack the capability to answer questions as required, leading to unstable evaluation results. This instability makes it difficult to provide solid conclusions to guide the training, especially for key experiments such as data ablation and scaling law. (2) Inconsistency between base and instruct models: base models generally exhibit poorer evaluation performance compared to corresponding instruct models. This gap poses a challenge for assessing whether a base model with better evaluation can truly lead to a better instruct model. To address these issues, we propose Base model Oriented Systematic Evaluation (BOSE), a method specifically designed to optimize the evaluation of base models. Specifically, BOSE introduces two key innovations: In-Context Light-instruction Prompt (ICLiP) for open-ended tasks and Blank-ppl for multi-choice tasks with candidate options, which transforms the standard perplexity (ppl) metric into a fill-in-the-blank format to mitigate early-stage evaluation fluctuations. Furthermore, we are the first to propose Kendall's rank correlation to quantitatively measure the evaluation stability and consistency. Experimental results demonstrate that BOSE significantly enhances both the stability of evaluations during pre-training and the consistency between base and instruct models, thereby providing more reliable guidance for the LLMs' training.
中文摘要:本文提出BOSE方法,通过引入ICLiP和Blank-ppl等创新指标,有效解决了基础模型训练中评估结果不稳定以及与指令模型性能不一致的问题,显著提升了评估的稳定性和一致性。
English Summary: This paper introduces BOSE, a method designed to address unstable evaluation during base model training and the performance gap between base and instruct models, using innovative metrics like ICLiP and Blank-ppl to improve stability and consistency.

Authors:Jiaju Chen, Minglong Tang, Yuxuan Lu, Bingsheng Yao, Elissa Fan, Xiaojuan Ma, Ying Xu, Dakuo Wang, Yuling Sun, Liang He
Title: Characterizing LLM-Empowered Personalized Story-Reading and Interaction for Children: Insights from Multi-Stakeholder Perspectives
Abstract:
Personalized interaction is highly valued by parents in their story-reading activities with children. While AI-empowered story-reading tools have been increasingly used, their abilities to support personalized interaction with children are still limited. Recent advances in large language models (LLMs) show promise in facilitating personalized interactions, but little is known about how to effectively and appropriately use LLMs to enhance children's personalized story-reading experiences. This work explores this question through a design-based study. Drawing on a formative study, we designed and developed StoryMate, an LLM-empowered personalized interactive story-reading tool for children, following an empirical study with children, parents, and education experts. Our participants valued the personalized features in StoryMate, and also highlighted the need to support personalized content, guiding mechanisms, reading context variations, and interactive interfaces. Based on these findings, we propose a series of design recommendations for better using LLMs to empower children's personalized story reading and interaction.
中文: 本研究探讨如何有效利用大语言模型提升儿童个性化故事阅读互动,开发了StoryMate原型,并针对内容定制、引导机制、阅读情境和交互界面提出了关键设计建议。
English: This study explores the effective use of large language models (LLMs) to enhance personalized story-reading interactions for children, developing StoryMate as a prototype and identifying key design recommendations for content, guidance, context, and interface customization.

Authors:Shitian Zhao, Qilong Wu, Xinyue Li, Bo Zhang, Ming Li, Qi Qin, Dongyang Liu, Kaipeng Zhang, Hongsheng Li, Yu Qiao, Peng Gao, Bin Fu, Zhen Li
Title: LeX-Art: Rethinking Text Generation via Scalable High-Quality Data Synthesis
Abstract:
We introduce LeX-Art, a comprehensive suite for high-quality text-image synthesis that systematically bridges the gap between prompt expressiveness and text rendering fidelity. Our approach follows a data-centric paradigm, constructing a high-quality data synthesis pipeline based on Deepseek-R1 to curate LeX-10K, a dataset of 10K high-resolution, aesthetically refined 1024$\times$1024 images. Beyond dataset construction, we develop LeX-Enhancer, a robust prompt enrichment model, and train two text-to-image models, LeX-FLUX and LeX-Lumina, achieving state-of-the-art text rendering performance. To systematically evaluate visual text generation, we introduce LeX-Bench, a benchmark that assesses fidelity, aesthetics, and alignment, complemented by Pairwise Normalized Edit Distance (PNED), a novel metric for robust text accuracy evaluation. Experiments demonstrate significant improvements, with LeX-Lumina achieving a 79.81% PNED gain on CreateBench, and LeX-FLUX outperforming baselines in color (+3.18%), positional (+4.45%), and font accuracy (+3.81%). Our codes, models, datasets, and demo are publicly available.
中文: LeX-Art提出了一套全面的文本到图像合成方案,通过构建高质量数据集、开发提示增强模型和训练先进生成模型,在文本渲染保真度和准确性上取得显著提升,并建立了系统化评估基准。
English: LeX-Art introduces a comprehensive text-image synthesis suite featuring high-quality dataset construction, prompt enhancement models, and state-of-the-art text-to-image models, achieving significant improvements in text rendering fidelity and accuracy through systematic evaluation benchmarks.

Authors:Mingyang Chen, Linzhuang Sun, Tianpeng Li, Haoze Sun, Yijie Zhou, Chenzheng Zhu, Haofen Wang, Jeff Z. Pan, Wen Zhang, Huajun Chen, Fan Yang, Zenan Zhou, Weipeng Chen
Title: ReSearch: Learning to Reason with Search for LLMs via Reinforcement Learning
Abstract:
Large Language Models (LLMs) have shown remarkable capabilities in reasoning, exemplified by the success of OpenAI-o1 and DeepSeek-R1. However, integrating reasoning with external search processes remains challenging, especially for complex multi-hop questions requiring multiple retrieval steps. We propose ReSearch, a novel framework that trains LLMs to Reason with Search via reinforcement learning without using any supervised data on reasoning steps. Our approach treats search operations as integral components of the reasoning chain, where when and how to perform searches is guided by text-based thinking, and search results subsequently influence further reasoning. We train ReSearch on Qwen2.5-7B(-Instruct) and Qwen2.5-32B(-Instruct) models and conduct extensive experiments. Despite being trained on only one dataset, our models demonstrate strong generalizability across various benchmarks. Analysis reveals that ReSearch naturally elicits advanced reasoning capabilities such as reflection and self-correction during the reinforcement learning process.
中文摘要:ReSearch是一种创新的强化学习框架,通过将搜索操作融入推理链来训练大语言模型,无需监督数据即可激发反思与自我修正等高级推理能力。
English Summary: ReSearch is a novel reinforcement learning framework that trains Large Language Models to integrate search operations into reasoning chains, enabling advanced capabilities like reflection and self-correction without supervised data.

Authors:Xuan Wang, Siyuan Liang, Dongping Liao, Han Fang, Aishan Liu, Xiaochun Cao, Yu-liang Lu, Ee-Chien Chang, Xitong Gao
Title: Lie Detector: Unified Backdoor Detection via Cross-Examination Framework
Abstract:
Institutions with limited data and computing resources often outsource model training to third-party providers in a semi-honest setting, assuming adherence to prescribed training protocols with pre-defined learning paradigm (e.g., supervised or semi-supervised learning). However, this practice can introduce severe security risks, as adversaries may poison the training data to embed backdoors into the resulting model. Existing detection approaches predominantly rely on statistical analyses, which often fail to maintain universally accurate detection accuracy across different learning paradigms. To address this challenge, we propose a unified backdoor detection framework in the semi-honest setting that exploits cross-examination of model inconsistencies between two independent service providers. Specifically, we integrate central kernel alignment to enable robust feature similarity measurements across different model architectures and learning paradigms, thereby facilitating precise recovery and identification of backdoor triggers. We further introduce backdoor fine-tuning sensitivity analysis to distinguish backdoor triggers from adversarial perturbations, substantially reducing false positives. Extensive experiments demonstrate that our method achieves superior detection performance, improving accuracy by 5.4%, 1.6%, and 11.9% over SoTA baselines across supervised, semi-supervised, and autoregressive learning tasks, respectively. Notably, it is the first to effectively detect backdoors in multimodal large language models, further highlighting its broad applicability and advancing secure deep learning.
中文: 本文提出了一种统一的后门检测框架,通过交叉检验模型不一致性和中心核对齐技术,能在不同学习范式中精准识别后门触发器,其检测准确率显著优于现有方法,并首次在多模态大语言模型中实现有效检测。
English: This paper introduces a unified backdoor detection framework that leverages cross-examination of model inconsistencies and central kernel alignment to accurately identify backdoor triggers across various learning paradigms, significantly outperforming existing methods in detection accuracy and demonstrating effectiveness even in multimodal large language models.

Authors:Yichen Huang, Zachary Novack, Koichi Saito, Jiatong Shi, Shinji Watanabe, Yuki Mitsufuji, John Thickstun, Chris Donahue
Title: Aligning Text-to-Music Evaluation with Human Preferences
Abstract:
Despite significant recent advances in generative acoustic text-to-music (TTM) modeling, robust evaluation of these models lags behind, relying in particular on the popular Fréchet Audio Distance (FAD). In this work, we rigorously study the design space of reference-based divergence metrics for evaluating TTM models through (1) designing four synthetic meta-evaluations to measure sensitivity to particular musical desiderata, and (2) collecting and evaluating on MusicPrefs, the first open-source dataset of human preferences for TTM systems. We find that not only is the standard FAD setup inconsistent on both synthetic and human preference data, but that nearly all existing metrics fail to effectively capture desiderata, and are only weakly correlated with human perception. We propose a new metric, the MAUVE Audio Divergence (MAD), computed on representations from a self-supervised audio embedding model. We find that this metric effectively captures diverse musical desiderata (average rank correlation 0.84 for MAD vs. 0.49 for FAD and also correlates more strongly with MusicPrefs (0.62 vs. 0.14).
中文摘要:本研究批判了当前文本到音乐模型的评估指标,揭示其与人类偏好和音乐品质的匹配度较差,并提出名为MAD的新指标,在合成测试和人类相关性方面均显著优于现有方法。
English Summary: This study critiques current evaluation metrics for text-to-music models, revealing their poor alignment with human preferences and musical qualities, and introduces a new metric called MAD that significantly outperforms existing methods in both synthetic tests and human correlation.

Authors:Zeyue Tian, Yizhu Jin, Zhaoyang Liu, Ruibin Yuan, Xu Tan, Qifeng Chen, Wei Xue, Yike Guo
Title: AudioX: Diffusion Transformer for Anything-to-Audio Generation
Abstract:
Audio and music generation have emerged as crucial tasks in many applications, yet existing approaches face significant limitations: they operate in isolation without unified capabilities across modalities, suffer from scarce high-quality, multi-modal training data, and struggle to effectively integrate diverse inputs. In this work, we propose AudioX, a unified Diffusion Transformer model for Anything-to-Audio and Music Generation. Unlike previous domain-specific models, AudioX can generate both general audio and music with high quality, while offering flexible natural language control and seamless processing of various modalities including text, video, image, music, and audio. Its key innovation is a multi-modal masked training strategy that masks inputs across modalities and forces the model to learn from masked inputs, yielding robust and unified cross-modal representations. To address data scarcity, we curate two comprehensive datasets: vggsound-caps with 190K audio captions based on the VGGSound dataset, and V2M-caps with 6 million music captions derived from the V2M dataset. Extensive experiments demonstrate that AudioX not only matches or outperforms state-of-the-art specialized models, but also offers remarkable versatility in handling diverse input modalities and generation tasks within a unified architecture. The code and datasets will be available at https://zeyuet.github.io/AudioX/
中文: AudioX作为一种统一的扩散变换器模型,通过多模态掩码训练策略和构建大规模数据集,实现了任意内容到音频的高质量生成,既能处理文本、视频等多种输入,又超越了专业模型的性能表现。
English: AudioX is a unified Diffusion Transformer model that overcomes limitations in audio and music generation by enabling anything-to-audio conversion with flexible natural language control and robust cross-modal learning, while newly curated datasets address data scarcity issues.

Authors:Bokai Xu, Jiayi Zhang, Zhongtao Chen, Bingyang Cheng, Ziheng Liu, Yik-Chung Wu, Bo Ai
Title: Channel Estimation for Rydberg Atomic Receivers
Abstract:
The rapid development of the quantum technology presents huge opportunities for 6G communications. Leveraging the quantum properties of highly excited Rydberg atoms, Rydberg atom-based antennas present distinct advantages, such as high sensitivity, broad frequency range, and compact size, over traditional antennas. To realize efficient precoding, accurate channel state information is essential. However, due to the distinct characteristics of atomic receivers, traditional channel estimation algorithms developed for conventional receivers are no longer applicable. To this end, we propose a novel channel estimation algorithm based on projection gradient descent (PGD), which is applicable to both one-dimensional (1D) and twodimensional (2D) arrays. Simulation results are provided to show the effectiveness of our proposed channel estimation method.
中文摘要:量子技术的快速发展为6G通信带来巨大机遇,基于里德堡原子的天线凭借高灵敏度和紧凑结构优于传统天线,为此我们提出适用于原子接收器的投影梯度下降信道估计算法,有效解决了传统方法不适用的问题。
English Summary: Quantum technology advancements offer significant potential for 6G communications, where Rydberg atom-based antennas provide superior sensitivity and compactness over traditional designs, necessitating a novel projection gradient descent algorithm for effective channel estimation in these systems.

Authors:Siddhant Arora, Yifan Peng, Jiatong Shi, Jinchuan Tian, William Chen, Shikhar Bharadwaj, Hayato Futami, Yosuke Kashiwagi, Emiru Tsunoo, Shuichiro Shimizu, Vaibhav Srivastav, Shinji Watanabe
Title: ESPnet-SDS: Unified Toolkit and Demo for Spoken Dialogue Systems
Abstract:
Advancements in audio foundation models (FMs) have fueled interest in end-to-end (E2E) spoken dialogue systems, but different web interfaces for each system makes it challenging to compare and contrast them effectively. Motivated by this, we introduce an open-source, user-friendly toolkit designed to build unified web interfaces for various cascaded and E2E spoken dialogue systems. Our demo further provides users with the option to get on-the-fly automated evaluation metrics such as (1) latency, (2) ability to understand user input, (3) coherence, diversity, and relevance of system response, and (4) intelligibility and audio quality of system output. Using the evaluation metrics, we compare various cascaded and E2E spoken dialogue systems with a human-human conversation dataset as a proxy. Our analysis demonstrates that the toolkit allows researchers to effortlessly compare and contrast different technologies, providing valuable insights such as current E2E systems having poorer audio quality and less diverse responses. An example demo produced using our toolkit is publicly available here: https://huggingface.co/spaces/Siddhant/Voice_Assistant_Demo.
中文: 该工具包为级联和端到端口语对话系统提供了统一的网络界面进行评估和比较,揭示出现有端到端系统在音频质量和响应多样性方面表现较差。
English: The toolkit provides a unified web interface for evaluating and comparing cascaded and end-to-end spoken dialogue systems, revealing that current E2E systems exhibit poorer audio quality and less diverse responses.

Authors:Yizhuo Li, Jiakang Zheng, Bokai Xu, Yiyang Zhu, Jiayi Zhang, Dusit Niyato, Bo Ai
Title: Beamforming Design for Beyond Diagonal RIS-Aided Cell-Free Massive MIMO Systems
Abstract:
Reconfigurable intelligent surface (RIS)-aided cell-free (CF) massive multiple-input multiple-output (mMIMO) is a promising architecture for further improving spectral efficiency (SE) with low cost and power consumption. However, conventional RIS has inevitable limitations due to its capability of only reflecting signals. In contrast, beyond-diagonal RIS (BD-RIS), with its ability to both reflect and transmit signals, has gained great attention. This correspondence focuses on using BD-RIS to improve the sum SE of CF mMIMO systems. This requires completing the beamforming design under the transmit power constraints and unitary constraints of the BD-RIS, by optimizing active and passive beamformer simultaneously. To tackle this issue, we introduce an alternating optimization algorithm that decomposes it using fractional programming and solves the subproblems alternatively. Moreover, to address the challenge introduced by the unitary constraint on the beamforming matrix of the BD-RIS, a manifold optimization algorithm is proposed to solve the problem optimally. Simulation results show that BD-RISs outperform RISs comprehensively, especially in the case of the full connected architecture which achieves the best performance, enhancing the sum SE by around 40% compared to ideal RISs.
中文: 超对角可重构智能表面通过同时反射和传输信号,结合交替优化算法,将无蜂窝大规模MIMO系统的总频谱效率提升约40%,显著优于传统智能表面。
English: BD-RIS enhances cell-free massive MIMO systems by enabling simultaneous reflection and transmission, using an alternating optimization algorithm to boost spectral efficiency by up to 40% compared to conventional RIS.

Authors:Victor Shea-Jay Huang, Le Zhuo, Yi Xin, Zhaokai Wang, Fu-Yun Wang, Yuchi Wang, Renrui Zhang, Peng Gao, Hongsheng Li
Title: TIDE : Temporal-Aware Sparse Autoencoders for Interpretable Diffusion Transformers in Image Generation
Abstract:
Diffusion Transformers (DiTs) are a powerful yet underexplored class of generative models compared to U-Net-based diffusion architectures. We propose TIDE-Temporal-aware sparse autoencoders for Interpretable Diffusion transformErs-a framework designed to extract sparse, interpretable activation features across timesteps in DiTs. TIDE effectively captures temporally-varying representations and reveals that DiTs naturally learn hierarchical semantics (e.g., 3D structure, object class, and fine-grained concepts) during large-scale pretraining. Experiments show that TIDE enhances interpretability and controllability while maintaining reasonable generation quality, enabling applications such as safe image editing and style transfer.
中文:TIDE提出了一种时序感知的稀疏自编码器框架,用于提取扩散变换器中的可解释特征,揭示了其分层语义学习能力,并在保持生成质量的同时实现了可控图像编辑。
English: TIDE introduces a temporal-aware sparse autoencoder framework to extract interpretable features in Diffusion Transformers, revealing their hierarchical semantic learning and enabling controllable image editing without compromising generation quality.

Authors:Liming Lu, Shuchao Pang, Siyuan Liang, Haotian Zhu, Xiyu Zeng, Aishan Liu, Yunhuai Liu, Yongbin Zhou
Title: Adversarial Training for Multimodal Large Language Models against Jailbreak Attacks
Abstract:
Multimodal large language models (MLLMs) have made remarkable strides in cross-modal comprehension and generation tasks. However, they remain vulnerable to jailbreak attacks, where crafted perturbations bypass security guardrails and elicit harmful outputs. In this paper, we present the first adversarial training (AT) paradigm tailored to defend against jailbreak attacks during the MLLM training phase. Extending traditional AT to this domain poses two critical challenges: efficiently tuning massive parameters and ensuring robustness against attacks across multiple modalities. To address these challenges, we introduce Projection Layer Against Adversarial Training (ProEAT), an end-to-end AT framework. ProEAT incorporates a projector-based adversarial training architecture that efficiently handles large-scale parameters while maintaining computational feasibility by focusing adversarial training on a lightweight projector layer instead of the entire model; additionally, we design a dynamic weight adjustment mechanism that optimizes the loss function's weight allocation based on task demands, streamlining the tuning process. To enhance defense performance, we propose a joint optimization strategy across visual and textual modalities, ensuring robust resistance to jailbreak attacks originating from either modality. Extensive experiments conducted on five major jailbreak attack methods across three mainstream MLLMs demonstrate the effectiveness of our approach. ProEAT achieves state-of-the-art defense performance, outperforming existing baselines by an average margin of +34% across text and image modalities, while incurring only a 1% reduction in clean accuracy. Furthermore, evaluations on real-world embodied intelligent systems highlight the practical applicability of our framework, paving the way for the development of more secure and reliable multimodal systems.
中文: 本文提出的ProEAT框架通过轻量级投影层对抗训练和跨模态联合优化,有效防御多模态大语言模型的越狱攻击,在保持清洁准确率仅下降1%的同时,将防御性能平均提升34%。
English: This paper introduces ProEAT, an adversarial training framework that effectively defends multimodal large language models against jailbreak attacks by focusing on lightweight projector layers and joint cross-modal optimization, achieving a 34% average improvement in defense performance with minimal impact on clean accuracy.

Authors:Fei Wei, Yaliang Li, Bolin Ding
Title: Towards Anthropomorphic Conversational AI Part I: A Practical Framework
Abstract:
Large language models (LLMs), due to their advanced natural language capabilities, have seen significant success in applications where the user interface is usually a conversational artificial intelligence (AI) agent and engages the user through multi-round conversations. However, many scenarios require the agents to exhibit stronger social and conversational intelligence and demonstrate more human-like (anthropomorphic) reactions. This is an aspect that foundational LLMs have yet to fully address such that a single call of foundational models might be insufficient. To bridge this gap, we propose a two-stage solution. In this work, we focus on the first stage, introducing a multi-module framework designed to replicate the key aspects of human intelligence involved in conversations. This framework comprises thinking modules for reasoning, resource modules for managing knowledge and external information, and response modules for generating contextually appropriate interactions. With all the modules cooperating, the framework would empower the agents to provide a better human-like conversation experience. In the second stage of our approach, these conversational data, after filtering and labeling, can serve as training and testing data for reinforcement learning, enabling AI to better capture human preferences. This stage is left for future work. In our experiments, volunteers engaged in over 3000 rounds of conversation with the same AI character powered by a standalone LLM and our framework which integrates the same LLM. A separate group of evaluators rated the conversation samples, revealing that our framework significantly enhanced the social and conversational intelligence, even without fine-tuning the LLM.
中文: 本研究提出一个多模块框架,通过模拟人类对话中的关键智能要素,在不精调大语言模型的情况下显著提升了AI代理的社交与对话智能水平。
English: To enhance the social and conversational intelligence of AI agents, this study introduces a multi-module framework that simulates human cognitive processes, significantly improving interaction quality without requiring LLM fine-tuning.

Authors:Zhe Wang, Jiayi Zhang, Hao Lei, Dusit Niyato, Bo Ai
Title: Optimal Bilinear Equalizer Beamforming Design for Cell-Free Massive MIMO Networks with Arbitrary Channel Estimators
Abstract:
This paper studies the distributed optimal bilinear equalizer (OBE) beamforming design for both the uplink and downlink cell-free massive multiple-input multiple-output networks. We consider arbitrary statistics-based channel estimators over spatially correlated Rician fading channels. In the uplink, we derive the achievable spectral efficiency (SE) performance and OBE combining schemes with arbitrary statistics-based channel estimators and compute their respective closed-form expressions. It is insightful to explore that the achievable SE performance is not dependent on the choice of channel estimator when OBE combining schemes are applied over Rayleigh channels. In the downlink, we derive the achievable SE performance expressions with BE precoding schemes and arbitrary statistics-based channel estimators utilized and compute them in closed form. Then, we obtain the OBE precoding scheme leveraging insights from uplink OBE combining schemes.
中文: 本文针对无蜂窝大规模MIMO网络设计了分布式最优双线性均衡器波束成形,推导了上行链路组合和下行链路预编码方案的闭式频谱效率表达式,这些方案适用于莱斯衰落信道上的任意统计信道估计器。
English: This paper designs distributed optimal bilinear equalizer beamforming for cell-free massive MIMO networks, deriving closed-form spectral efficiency expressions for uplink combining and downlink precoding schemes that work with arbitrary channel estimators over Rician fading channels.

Authors:Linghao Feng, Dongcheng Zhao, Sicheng Shen, Yi Zeng
Title: Biologically Inspired Spiking Diffusion Model with Adaptive Lateral Selection Mechanism
Abstract:
Lateral connection is a fundamental feature of biological neural circuits, facilitating local information processing and adaptive learning. In this work, we integrate lateral connections with a substructure selection network to develop a novel diffusion model based on spiking neural networks (SNNs). Unlike conventional artificial neural networks, SNNs employ an intrinsic spiking inner loop to process sequential binary spikes. We leverage this spiking inner loop alongside a lateral connection mechanism to iteratively refine the substructure selection network, enhancing model adaptability and expressivity. Specifically, we design a lateral connection framework comprising a learnable lateral matrix and a lateral mapping function, both implemented using spiking neurons, to dynamically update lateral connections. Through mathematical modeling, we establish that the proposed lateral update mechanism, under a well-defined local objective, aligns with biologically plausible synaptic plasticity principles. Extensive experiments validate the effectiveness of our approach, analyzing the role of substructure selection and lateral connection during training. Furthermore, quantitative comparisons demonstrate that our model consistently surpasses state-of-the-art SNN-based generative models across multiple benchmark datasets.
中文摘要:本研究提出了一种基于脉冲神经网络的创新扩散模型,通过结合侧向连接和子结构选择机制,在多个基准数据集上超越了现有最先进的脉冲神经网络生成模型。
English summary: This study introduces a novel diffusion model using spiking neural networks enhanced with lateral connections and a substructure selection mechanism, which demonstrates superior performance over existing SNN-based generative models across multiple benchmarks.

Authors:Zhenyu Tang, Chaoran Feng, Xinhua Cheng, Wangbo Yu, Junwu Zhang, Yuan Liu, Xiaoxiao Long, Wenping Wang, Li Yuan
Title: NeuralGS: Bridging Neural Fields and 3D Gaussian Splatting for Compact 3D Representations
Abstract:
3D Gaussian Splatting (3DGS) achieves impressive quality and rendering speed, but with millions of 3D Gaussians and significant storage and transmission costs. In this paper, we aim to develop a simple yet effective method called NeuralGS that compresses the original 3DGS into a compact representation. Our observation is that neural fields like NeRF can represent complex 3D scenes with Multi-Layer Perceptron (MLP) neural networks using only a few megabytes. Thus, NeuralGS effectively adopts the neural field representation to encode the attributes of 3D Gaussians with MLPs, only requiring a small storage size even for a large-scale scene. To achieve this, we adopt a clustering strategy and fit the Gaussians within each cluster using different tiny MLPs, based on importance scores of Gaussians as fitting weights. We experiment on multiple datasets, achieving a 91-times average model size reduction without harming the visual quality.
中文:NeuralGS通过MLP和聚类策略将3D高斯泼溅压缩为紧凑的神经表示,在保持视觉质量的同时实现了91倍的模型大小缩减。
English: NeuralGS compresses 3D Gaussian Splatting into a compact neural representation using MLPs and clustering, achieving 91-times size reduction while preserving visual quality.

Authors:Jingye Chen, Yuzhong Zhao, Yupan Huang, Lei Cui, Li Dong, Tengchao Lv, Qifeng Chen, Furu Wei
Title: Model as a Game: On Numerical and Spatial Consistency for Generative Games
Abstract:
Recent advances in generative models have significantly impacted game generation. However, despite producing high-quality graphics and adequately receiving player input, existing models often fail to maintain fundamental game properties such as numerical and spatial consistency. Numerical consistency ensures gameplay mechanics correctly reflect score changes and other quantitative elements, while spatial consistency prevents jarring scene transitions, providing seamless player experiences. In this paper, we revisit the paradigm of generative games to explore what truly constitutes a Model as a Game (MaaG) with a well-developed mechanism. We begin with an empirical study on ``Traveler'', a 2D game created by an LLM featuring minimalist rules yet challenging generative models in maintaining consistency. Based on the DiT architecture, we design two specialized modules: (1) a numerical module that integrates a LogicNet to determine event triggers, with calculations processed externally as conditions for image generation; and (2) a spatial module that maintains a map of explored areas, retrieving location-specific information during generation and linking new observations to ensure continuity. Experiments across three games demonstrate that our integrated modules significantly enhance performance on consistency metrics compared to baselines, while incurring minimal time overhead during inference.
中文摘要:针对现有游戏生成模型常缺乏数值与空间一致性的问题,本文提出两个专用模块,在保持较低计算开销的同时显著提升了游戏机制的连贯性。
English Summary: Recent generative models for games often lack numerical and spatial consistency, so this paper introduces specialized modules that significantly improve these aspects with minimal computational overhead.

Authors:Jiepeng Wang, Zhaoqing Wang, Hao Pan, Yuan Liu, Dongdong Yu, Changhu Wang, Wenping Wang
Title: MMGen: Unified Multi-modal Image Generation and Understanding in One Go
Abstract:
A unified diffusion framework for multi-modal generation and understanding has the transformative potential to achieve seamless and controllable image diffusion and other cross-modal tasks. In this paper, we introduce MMGen, a unified framework that integrates multiple generative tasks into a single diffusion model. This includes: (1) multi-modal category-conditioned generation, where multi-modal outputs are generated simultaneously through a single inference process, given category information; (2) multi-modal visual understanding, which accurately predicts depth, surface normals, and segmentation maps from RGB images; and (3) multi-modal conditioned generation, which produces corresponding RGB images based on specific modality conditions and other aligned modalities. Our approach develops a novel diffusion transformer that flexibly supports multi-modal output, along with a simple modality-decoupling strategy to unify various tasks. Extensive experiments and applications demonstrate the effectiveness and superiority of MMGen across diverse tasks and conditions, highlighting its potential for applications that require simultaneous generation and understanding.
Chinese: MMGen是一个统一的扩散框架,集成了多模态生成与理解任务,通过创新的扩散变换器和模态解耦策略,实现了无缝可控的跨模态应用。
English: MMGen is a unified diffusion framework that integrates multi-modal generation and understanding tasks, enabling seamless and controllable cross-modal applications through a novel diffusion transformer and modality-decoupling strategy.

Authors:Zeyu Qin, Qingxiu Dong, Xingxing Zhang, Li Dong, Xiaolong Huang, Ziyi Yang, Mahmoud Khademi, Dongdong Zhang, Hany Hassan Awadalla, Yi R. Fung, Weizhu Chen, Minhao Cheng, Furu Wei
Title: Scaling Laws of Synthetic Data for Language Models
Abstract:
Large language models (LLMs) achieve strong performance across diverse tasks, largely driven by high-quality web data used in pre-training. However, recent studies indicate this data source is rapidly depleting. Synthetic data emerges as a promising alternative, but it remains unclear whether synthetic datasets exhibit predictable scalability comparable to raw pre-training data. In this work, we systematically investigate the scaling laws of synthetic data by introducing SynthLLM, a scalable framework that transforms pre-training corpora into diverse, high-quality synthetic datasets. Our approach achieves this by automatically extracting and recombining high-level concepts across multiple documents using a graph algorithm. Key findings from our extensive mathematical experiments on SynthLLM include: (1) SynthLLM generates synthetic data that reliably adheres to the rectified scaling law across various model sizes; (2) Performance improvements plateau near 300B tokens; and (3) Larger models approach optimal performance with fewer training tokens. For instance, an 8B model peaks at 1T tokens, while a 3B model requires 4T. Moreover, comparisons with existing synthetic data generation and augmentation methods demonstrate that SynthLLM achieves superior performance and scalability. Our findings highlight synthetic data as a scalable and reliable alternative to organic pre-training corpora, offering a viable path toward continued improvement in model performance.
中文: 大语言模型面临高质量网络数据枯竭的问题,但SynthLLM框架证明合成数据可作为可扩展的有效替代方案,遵循可预测的缩放规律,为模型性能的持续提升提供了可行路径。
English: Large language models face a depletion of high-quality web data, but the SynthLLM framework demonstrates that synthetic data can serve as a scalable and effective alternative, adhering to predictable scaling laws and enabling continued performance improvements.

Authors:Zeyu Qin, Qingxiu Dong, Xingxing Zhang, Li Dong, Xiaolong Huang, Ziyi Yang, Mahmoud Khademi, Dongdong Zhang, Hany Hassan Awadalla, Yi R. Fung, Weizhu Chen, Minhao Cheng, Furu Wei
Title: Scaling Laws of Synthetic Data for Language Models
Abstract:
Large language models (LLMs) achieve strong performance across diverse tasks, largely driven by high-quality web data used in pre-training. However, recent studies indicate this data source is rapidly depleting. Synthetic data emerges as a promising alternative, but it remains unclear whether synthetic datasets exhibit predictable scalability comparable to raw pre-training data. In this work, we systematically investigate the scaling laws of synthetic data by introducing SynthLLM, a scalable framework that transforms pre-training corpora into diverse, high-quality synthetic datasets. Our approach achieves this by automatically extracting and recombining high-level concepts across multiple documents using a graph algorithm. Key findings from our extensive mathematical experiments on SynthLLM include: (1) SynthLLM generates synthetic data that reliably adheres to the rectified scaling law across various model sizes; (2) Performance improvements plateau near 300B tokens; and (3) Larger models approach optimal performance with fewer training tokens. For instance, an 8B model peaks at 1T tokens, while a 3B model requires 4T. Moreover, comparisons with existing synthetic data generation and augmentation methods demonstrate that SynthLLM achieves superior performance and scalability. Our findings highlight synthetic data as a scalable and reliable alternative to organic pre-training corpora, offering a viable path toward continued improvement in model performance.
中文: 大语言模型面临高质量网络数据枯竭的问题,但SynthLLM框架证明合成数据可作为可扩展的有效替代方案,遵循可预测的缩放规律,为模型性能的持续提升提供了可行路径。
English: Large language models face a depletion of high-quality web data, but the SynthLLM framework demonstrates that synthetic data can serve as a scalable and effective alternative, adhering to predictable scaling laws and enabling continued performance improvements.

Authors:Ke Niu, Yuwen Chen, Haiyang Yu, Zhuofan Chen, Xianghui Que, Bin Li, Xiangyang Xue
Title: PHT-CAD: Efficient CAD Parametric Primitive Analysis with Progressive Hierarchical Tuning
Abstract:
Computer-Aided Design (CAD) plays a pivotal role in industrial manufacturing, yet 2D Parametric Primitive Analysis (PPA) remains underexplored due to two key challenges: structural constraint reasoning and advanced semantic understanding. To tackle these challenges, we first propose an Efficient Hybrid Parametrization (EHP) for better representing 2D engineering drawings. EHP contains four types of atomic component i.e., point, line, circle, and arc). Additionally, we propose PHT-CAD, a novel 2D PPA framework that harnesses the modality alignment and reasoning capabilities of Vision-Language Models (VLMs) for precise engineering drawing analysis. In PHT-CAD, we introduce four dedicated regression heads to predict corresponding atomic components. To train PHT-CAD, a three-stage training paradigm Progressive Hierarchical Tuning (PHT) is proposed to progressively enhance PHT-CAD's capability to perceive individual primitives, infer structural constraints, and align annotation layers with their corresponding geometric representations. Considering that existing datasets lack complete annotation layers and real-world engineering drawings, we introduce ParaCAD, the first large-scale benchmark that explicitly integrates both the geometric and annotation layers. ParaCAD comprises over 10 million annotated drawings for training and 3,000 real-world industrial drawings with complex topological structures and physical constraints for test. Extensive experiments demonstrate the effectiveness of PHT-CAD and highlight the practical significance of ParaCAD in advancing 2D PPA research.
中文摘要:本文提出PHT-CAD框架,利用视觉语言模型解决二维参数化基元分析中的结构约束推理和语义理解难题,并建立包含超千万标注图纸的ParaCAD基准数据集,推动该领域研究发展。
English Summary: This paper introduces PHT-CAD, a novel 2D parametric primitive analysis framework leveraging vision-language models to address structural constraint reasoning and semantic understanding challenges, and presents ParaCAD, a comprehensive benchmark with over 10 million annotated drawings to advance the field.

Authors:Zhengsheng Guo, Linwei Zheng, Xinyang Chen, Xuefeng Bai, Kehai Chen, Min Zhang
Title: MoK-RAG: Mixture of Knowledge Paths Enhanced Retrieval-Augmented Generation for Embodied AI Environments
Abstract:
While human cognition inherently retrieves information from diverse and specialized knowledge sources during decision-making processes, current Retrieval-Augmented Generation (RAG) systems typically operate through single-source knowledge retrieval, leading to a cognitive-algorithmic discrepancy. To bridge this gap, we introduce MoK-RAG, a novel multi-source RAG framework that implements a mixture of knowledge paths enhanced retrieval mechanism through functional partitioning of a large language model (LLM) corpus into distinct sections, enabling retrieval from multiple specialized knowledge paths. Applied to the generation of 3D simulated environments, our proposed MoK-RAG3D enhances this paradigm by partitioning 3D assets into distinct sections and organizing them based on a hierarchical knowledge tree structure. Different from previous methods that only use manual evaluation, we pioneered the introduction of automated evaluation methods for 3D scenes. Both automatic and human evaluations in our experiments demonstrate that MoK-RAG3D can assist Embodied AI agents in generating diverse scenes.
中文:现有RAG系统的单源知识检索造成了认知与算法间的差距,MoK-RAG通过多源检索和LLM语料库功能分区解决了这一问题,MoK-RAG3D进一步将该框架应用于三维环境生成,采用分层知识树结构和自动化评估方法,实验证明能有效辅助具身AI生成多样化场景。
English: Current RAG systems' single-source knowledge retrieval creates a cognitive-algorithmic gap, which MoK-RAG bridges through multi-source retrieval and functional partitioning of LLM corpora, with MoK-RAG3D extending this to 3D environment generation using hierarchical knowledge trees and automated evaluation methods.

Authors:Jiaming Liu, Hao Chen, Pengju An, Zhuoyang Liu, Renrui Zhang, Chenyang Gu, Xiaoqi Li, Ziyu Guo, Sixiang Chen, Mengzhen Liu, Chengkai Hou, Mengdi Zhao, KC alex Zhou, Pheng-Ann Heng, Shanghang Zhang
Title: HybridVLA: Collaborative Diffusion and Autoregression in a Unified Vision-Language-Action Model
Abstract:
A fundamental objective of manipulation policy design is to endow robots to comprehend human instructions, reason about scene cues, and execute generalized actions in dynamic environments. Recent autoregressive vision-language-action (VLA) methods inherit common-sense reasoning capabilities from vision-language models (VLMs) for next action-token prediction. However, these methods quantize actions into discrete bins, which disrupts the continuity required for precise control. In contrast, existing diffusion-based VLA methods incorporate an additional diffusion head to predict continuous actions solely conditioned on feature representations extracted by the VLM, without fully leveraging the VLM's pretrained reasoning capabilities through token-level generation. To address these limitations, we introduce HybridVLA, a unified framework that absorbs the continuous nature of diffusion-based actions and the contextual reasoning of autoregression within a single large language model. To mitigate interference between the two generation paradigms, we propose a collaborative training recipe that seamlessly incorporates diffusion denoising into the next-token prediction process. With this recipe, we find these two action prediction methods not only reinforce each other but also exhibit varying strength across different tasks. Therefore, we design a collaborative action ensemble mechanism that adaptively fuses both predictions, leading to more robust control. HybridVLA outperforms previous state-of-the-art VLA methods by 14\% and 19\% in mean success rate on simulation and real-world tasks, respectively, while demonstrating stable manipulation in unseen configurations.
Chinese: HybridVLA是一个统一框架,将扩散模型的连续动作生成与自回归方法的上下文推理能力结合在单一大型语言模型中,在机器人操作任务中实现了卓越性能。
English: HybridVLA is a unified framework that combines the continuous action generation of diffusion models with the contextual reasoning of autoregressive methods within a single large language model, achieving superior performance in robotic manipulation tasks.

Authors:Henglyu Liu, Andong Chen, Kehai Chen, Xuefeng Bai, Meizhi Zhong, Yuan Qiu, Min Zhang
Title: Adaptive Inner Speech-Text Alignment for LLM-based Speech Translation
Abstract:
Recent advancement of large language models (LLMs) has led to significant breakthroughs across various tasks, laying the foundation for the development of LLM-based speech translation systems. Existing methods primarily focus on aligning inputs and outputs across modalities while overlooking deeper semantic alignment within model representations. To address this limitation, we propose an Adaptive Inner Speech-Text Alignment (AI-STA) method to bridge the modality gap by explicitly aligning speech and text representations at selected layers within LLMs. To achieve this, we leverage the optimal transport (OT) theory to quantify fine-grained representation discrepancies between speech and text. Furthermore, we utilize the cross-modal retrieval technique to identify the layers that are best suited for alignment and perform joint training on these layers. Experimental results on speech translation (ST) tasks demonstrate that AI-STA significantly improves the translation performance of large speech-text models (LSMs), outperforming previous state-of-the-art approaches. Our findings highlight the importance of inner-layer speech-text alignment in LLMs and provide new insights into enhancing cross-modal learning.
大语言模型通过提出一种自适应方法,利用最优传输理论在特定层对齐语音与文本表征,显著提升了语音翻译性能,超越了现有最佳方法。
Large language models are advancing speech translation by introducing an adaptive method that aligns speech and text representations at specific layers using optimal transport theory, significantly improving performance over existing approaches.

Authors:Qiyuan Deng, Xuefeng Bai, Kehai Chen, Yaowei Wang, Liqiang Nie, Min Zhang
Title: Efficient Safety Alignment of Large Language Models via Preference Re-ranking and Representation-based Reward Modeling
Abstract:
Reinforcement Learning (RL) algorithms for safety alignment of Large Language Models (LLMs), such as Direct Preference Optimization (DPO), encounter the challenge of distribution shift. Current approaches typically address this issue through online sampling from the target policy, which requires significant computational resources. In this paper, we hypothesize that during off-policy training, while the ranking order of output generated by policy changes, their overall distribution remains relatively stable. This stability allows the conversion of the sampling process from the target policy into a computationally efficient re-ranking of preference data. Building on this hypothesis, we propose a new framework that leverages the model's intrinsic safety judgment capability to extract reward signals, which are then used to calculate label confidence for preference reordering. Extensive experiments and theoretical analysis demonstrate that the proposed method effectively addresses the distribution shift issue, remarkably enhancing the safety performance while avoiding about 300x computational overheads.
中文: 该框架通过将在线采样转化为基于模型内在安全判断的偏好重排序,有效解决了强化学习安全对齐中的分布偏移问题,在显著提升安全性能的同时减少了300倍计算开销。
English: The proposed framework addresses distribution shift in RL safety alignment by converting online sampling into efficient preference re-ranking using the model's intrinsic safety judgments, significantly improving safety performance while reducing computational costs by 300 times.

Authors:Zhenyu Li, Kehai Chen, Yunfei Long, Xuefeng Bai, Yaoyin Zhang, Xuchen Wei, Juntao Li, Min Zhang
Title: XIFBench: Evaluating Large Language Models on Multilingual Instruction Following
Abstract:
Large Language Models (LLMs) have demonstrated remarkable instruction-following capabilities across various applications. However, their performance in multilingual settings remains poorly understood, as existing evaluations lack fine-grained constraint analysis. We introduce XIFBench, a comprehensive constraint-based benchmark for assessing multilingual instruction-following abilities of LLMs, featuring a novel taxonomy of five constraint categories and 465 parallel instructions across six languages spanning different resource levels. To ensure consistent cross-lingual evaluation, we develop a requirement-based protocol that leverages English requirements as semantic anchors. These requirements are then used to validate the translations across languages. Extensive experiments with various LLMs reveal notable variations in instruction-following performance across resource levels, identifying key influencing factors such as constraint categories, instruction complexity, and cultural specificity.
中文: XIFBench作为基于约束的多语言指令遵循基准,揭示了大型语言模型在不同语言中受约束类型、复杂性和文化因素影响的性能差异。
English: XIFBench is introduced as a constraint-based benchmark to evaluate multilingual instruction-following in LLMs, revealing performance variations influenced by constraint types, complexity, and cultural factors across languages.

Authors:Mufan Xu, Gewen Liang, Kehai Chen, Wei Wang, Xun Zhou, Muyun Yang, Tiejun Zhao, Min Zhang
Title: Memory-augmented Query Reconstruction for LLM-based Knowledge Graph Reasoning
Abstract:
Large language models (LLMs) have achieved remarkable performance on knowledge graph question answering (KGQA) tasks by planning and interacting with knowledge graphs. However, existing methods often confuse tool utilization with knowledge reasoning, harming readability of model outputs and giving rise to hallucinatory tool invocations, which hinder the advancement of KGQA. To address this issue, we propose Memory-augmented Query Reconstruction for LLM-based Knowledge Graph Reasoning (MemQ) to decouple LLM from tool invocation tasks using LLM-built query memory. By establishing a memory module with explicit descriptions of query statements, the proposed MemQ facilitates the KGQA process with natural language reasoning and memory-augmented query reconstruction. Meanwhile, we design an effective and readable reasoning to enhance the LLM's reasoning capability in KGQA. Experimental results that MemQ achieves state-of-the-art performance on widely used benchmarks WebQSP and CWQ.
Chinese: 大语言模型在知识图谱问答中表现出色,但常混淆工具使用与知识推理,导致幻觉问题;提出的MemQ方法通过查询记忆解耦这两者,在WebQSP和CWQ基准测试中取得了最优性能。
English: Large language models (LLMs) excel in knowledge graph question answering but often confuse tool use with reasoning, leading to hallucinations; the proposed Memory-augmented Query Reconstruction (MemQ) method decouples these processes using query memory to achieve state-of-the-art results on benchmarks.

Authors:Xingzuo Li, Kehai Chen, Yunfei Long, Xuefeng Bai, Yong Xu, Min Zhang
Title: Generator-Assistant Stepwise Rollback Framework for Large Language Model Agent
Abstract:
Large language model (LLM) agents typically adopt a step-by-step reasoning framework, in which they interleave the processes of thinking and acting to accomplish the given task. However, this paradigm faces a deep-rooted one-pass issue whereby each generated intermediate thought is plugged into the trajectory regardless of its correctness, which can cause irreversible error propagation. To address the issue, this paper proposes a novel framework called Generator-Assistant Stepwise Rollback (GA-Rollback) to induce better decision-making for LLM agents. Particularly, GA-Rollback utilizes a generator to interact with the environment and an assistant to examine each action produced by the generator, where the assistant triggers a rollback operation upon detection of incorrect actions. Moreover, we introduce two additional strategies tailored for the rollback scenario to further improve its effectiveness. Extensive experiments show that GA-Rollback achieves significant improvements over several strong baselines on three widely used benchmarks. Our analysis further reveals that GA-Rollback can function as a robust plug-and-play module, integrating seamlessly with other methods.
中文: 本文提出GA-Rollback框架,通过生成器与环境交互、助手检测动作的协作机制,在发现错误时触发回滚操作,显著提升大语言模型智能体在多项基准测试中的表现,并能作为即插即用模块灵活集成。
English: This paper introduces GA-Rollback, a framework that enables large language model agents to detect and rollback incorrect actions through generator-assistant collaboration, significantly improving performance across multiple benchmarks while serving as a plug-and-play module.

Authors:Xingzuo Li, Kehai Chen, Yunfei Long, Xuefeng Bai, Yong Xu, Min Zhang
Title: Generator-Assistant Stepwise Rollback Framework for Large Language Model Agent
Abstract:
Large language model (LLM) agents typically adopt a step-by-step reasoning framework, in which they interleave the processes of thinking and acting to accomplish the given task. However, this paradigm faces a deep-rooted one-pass issue whereby each generated intermediate thought is plugged into the trajectory regardless of its correctness, which can cause irreversible error propagation. To address the issue, this paper proposes a novel framework called Generator-Assistant Stepwise Rollback (GA-Rollback) to induce better decision-making for LLM agents. Particularly, GA-Rollback utilizes a generator to interact with the environment and an assistant to examine each action produced by the generator, where the assistant triggers a rollback operation upon detection of incorrect actions. Moreover, we introduce two additional strategies tailored for the rollback scenario to further improve its effectiveness. Extensive experiments show that GA-Rollback achieves significant improvements over several strong baselines on three widely used benchmarks. Our analysis further reveals that GA-Rollback can function as a robust plug-and-play module, integrating seamlessly with other methods.
中文: 本文提出GA-Rollback框架,通过生成器与环境交互、助手检测动作的协作机制,在发现错误时触发回滚操作,显著提升大语言模型智能体在多项基准测试中的表现,并能作为即插即用模块灵活集成。
English: This paper introduces GA-Rollback, a framework that enables large language model agents to detect and rollback incorrect actions through generator-assistant collaboration, significantly improving performance across multiple benchmarks while serving as a plug-and-play module.

Authors:Yuhao Zhou, Sirui Song, Boyang Liu, Zhiheng Xi, Senjie Jin, Xiaoran Fan, Zhihao Zhang, Wei Li, Xuanjing Huang
Title: EliteKV: Scalable KV Cache Compression via RoPE Frequency Selection and Joint Low-Rank Projection
Abstract:
Rotary Position Embedding (RoPE) enables each attention head to capture multi-frequency information along the sequence dimension and is widely applied in foundation models. However, the nonlinearity introduced by RoPE complicates optimization of the key state in the Key-Value (KV) cache for RoPE-based attention. Existing KV cache compression methods typically store key state before rotation and apply the transformation during decoding, introducing additional computational overhead. This paper introduces EliteKV, a flexible modification framework for RoPE-based models supporting variable KV cache compression ratios. EliteKV first identifies the intrinsic frequency preference of each head using RoPElite, selectively restoring linearity to certain dimensions of key within attention computation. Building on this, joint low-rank compression of key and value enables partial cache sharing. Experimental results show that with minimal uptraining on only $0.6\%$ of the original training data, RoPE-based models achieve a $75\%$ reduction in KV cache size while preserving performance within a negligible margin. Furthermore, EliteKV consistently performs well across models of different scales within the same family.
中文摘要:EliteKV是一种灵活的框架,通过选择性恢复线性度和联合低秩压缩,使基于RoPE的模型在仅需少量再训练的情况下,将KV缓存大小减少75%同时保持性能基本不变。
English Summary: EliteKV is a flexible framework that reduces the KV cache size by 75% in RoPE-based models through selective linearity restoration and joint low-rank compression, maintaining performance with minimal retraining.

Authors:Jingyi Zhou, Peng Ye, Haoyu Zhang, Jiakang Yuan, Rao Qiang, Liu YangChenXu, Wu Cailin, Feng Xu, Tao Chen
Title: Consistency-aware Self-Training for Iterative-based Stereo Matching
Abstract:
Iterative-based methods have become mainstream in stereo matching due to their high performance. However, these methods heavily rely on labeled data and face challenges with unlabeled real-world data. To this end, we propose a consistency-aware self-training framework for iterative-based stereo matching for the first time, leveraging real-world unlabeled data in a teacher-student manner. We first observe that regions with larger errors tend to exhibit more pronounced oscillation characteristics during model prediction.Based on this, we introduce a novel consistency-aware soft filtering module to evaluate the reliability of teacher-predicted pseudo-labels, which consists of a multi-resolution prediction consistency filter and an iterative prediction consistency filter to assess the prediction fluctuations of multiple resolutions and iterative optimization respectively. Further, we introduce a consistency-aware soft-weighted loss to adjust the weight of pseudo-labels accordingly, relieving the error accumulation and performance degradation problem due to incorrect pseudo-labels. Extensive experiments demonstrate that our method can improve the performance of various iterative-based stereo matching approaches in various scenarios. In particular, our method can achieve further enhancements over the current SOTA methods on several benchmark datasets.
中文: 提出的感知一致性自训练框架通过软过滤模块和加权损失提升伪标签可靠性并减轻误差累积,从而在各种场景和基准数据集上显著优化了基于迭代的立体匹配方法性能。
English: The proposed consistency-aware self-training framework enhances iterative-based stereo matching by using a soft filtering module and weighted loss to improve pseudo-label reliability and mitigate error accumulation, achieving superior performance across various scenarios and benchmarks.

Authors:Peter Schafhalter, Alexander Krentsel, Joseph E. Gonzalez, Sylvia Ratnasamy, Scott Shenker, Ion Stoica
Title: Bandwidth Allocation for Cloud-Augmented Autonomous Driving
Abstract:
Autonomous vehicle (AV) control systems increasingly rely on ML models for tasks such as perception and planning. Current practice is to run these models on the car's local hardware due to real-time latency constraints and reliability concerns, which limits model size and thus accuracy. Prior work has observed that we could augment current systems by running larger models in the cloud, relying on faster cloud runtimes to offset the cellular network latency. However, prior work does not account for an important practical constraint: limited cellular bandwidth. We show that, for typical bandwidth levels, proposed techniques for cloud-augmented AV models take too long to transfer data, thus mostly falling back to the on-car models and resulting in no accuracy improvement. In this work, we show that realizing cloud-augmented AV models requires intelligent use of this scarce bandwidth, i.e. carefully allocating bandwidth across tasks and providing multiple data compression and model options. We formulate this as a resource allocation problem to maximize car utility, and present our system \sysname which achieves an increase in average model accuracy by up to 15 percentage points on driving scenarios from the Waymo Open Dataset.
中文摘要:通过智能分配有限带宽并采用数据压缩技术,自动驾驶车辆可利用云端模型提升精度,在Waymo数据集场景中实现高达15%的性能提升。
English Summary: Autonomous vehicles can enhance model accuracy by using cloud-based systems with intelligent bandwidth allocation and data compression, overcoming cellular limitations to improve performance by up to 15%.

Authors:Xiran Wang, Jian Zhang, Lei Qi, Yinghuan Shi
Title: Balanced Direction from Multifarious Choices: Arithmetic Meta-Learning for Domain Generalization
Abstract:
Domain generalization is proposed to address distribution shift, arising from statistical disparities between training source and unseen target domains. The widely used first-order meta-learning algorithms demonstrate strong performance for domain generalization by leveraging the gradient matching theory, which aims to establish balanced parameters across source domains to reduce overfitting to any particular domain. However, our analysis reveals that there are actually numerous directions to achieve gradient matching, with current methods representing just one possible path. These methods actually overlook another critical factor that the balanced parameters should be close to the centroid of optimal parameters of each source domain. To address this, we propose a simple yet effective arithmetic meta-learning with arithmetic-weighted gradients. This approach, while adhering to the principles of gradient matching, promotes a more precise balance by estimating the centroid between domain-specific optimal parameters. Experimental results validate the effectiveness of our strategy.
中文: 现有领域泛化的元学习方法虽实现梯度匹配,却忽略了平衡参数应接近各源域最优参数质心,因此提出一种简单的算术元学习方法,通过算术加权梯度更精确地估计质心以提升平衡性。
English: Current meta-learning methods for domain generalization achieve gradient matching but overlook the need for balanced parameters near the centroid of optimal domain-specific parameters, prompting the proposal of a simple arithmetic meta-learning approach that enhances balance through arithmetic-weighted gradients.

Authors:Chongjun Tu, Lin Zhang, Pengtao Chen, Peng Ye, Xianfang Zeng, Wei Cheng, Gang Yu, Tao Chen
Title: FAVOR-Bench: A Comprehensive Benchmark for Fine-Grained Video Motion Understanding
Abstract:
Multimodal Large Language Models (MLLMs) have shown remarkable capabilities in video content understanding but still struggle with fine-grained motion comprehension. To comprehensively assess the motion understanding ability of existing MLLMs, we introduce FAVOR-Bench, comprising 1,776 videos with structured manual annotations of various motions. Our benchmark includes both close-ended and open-ended tasks. For close-ended evaluation, we carefully design 8,184 multiple-choice question-answer pairs spanning six distinct sub-tasks. For open-ended evaluation, we develop both a novel cost-efficient LLM-free and a GPT-assisted caption assessment method, where the former can enhance benchmarking interpretability and reproducibility. Comprehensive experiments with 21 state-of-the-art MLLMs reveal significant limitations in their ability to comprehend and describe detailed temporal dynamics in video motions. To alleviate this limitation, we further build FAVOR-Train, a dataset consisting of 17,152 videos with fine-grained motion annotations. The results of finetuning Qwen2.5-VL on FAVOR-Train yield consistent improvements on motion-related tasks of TVBench, MotionBench and our FAVOR-Bench. Comprehensive assessment results demonstrate that the proposed FAVOR-Bench and FAVOR-Train provide valuable tools to the community for developing more powerful video understanding models. Project page: \href{https://favor-bench.github.io/}{https://favor-bench.github.io/}.
中文: 研究者提出了FAVOR-Bench基准,揭示了多模态大语言模型在动作理解上的局限,并通过FAVOR-Train数据集进行微调,有效提升了模型在动作相关任务上的表现。
English: Researchers introduce FAVOR-Bench, a benchmark revealing limitations in multimodal large language models' motion understanding, and FAVOR-Train, a dataset that improves model performance when fine-tuned.

Authors:Mert Cemri, Melissa Z. Pan, Shuyi Yang, Lakshya A. Agrawal, Bhavya Chopra, Rishabh Tiwari, Kurt Keutzer, Aditya Parameswaran, Dan Klein, Kannan Ramchandran, Matei Zaharia, Joseph E. Gonzalez, Ion Stoica
Title: Why Do Multi-Agent LLM Systems Fail?
Abstract:
Despite growing enthusiasm for Multi-Agent LLM Systems (MAS), their performance gains on popular benchmarks often remain minimal compared with single-agent frameworks. This gap highlights the need to systematically analyze the challenges hindering MAS effectiveness. We present MAST (Multi-Agent System Failure Taxonomy), the first empirically grounded taxonomy designed to understand MAS failures. We analyze seven popular MAS frameworks across over 200 tasks, involving six expert human annotators. Through this process, we identify 14 unique failure modes, organized into 3 overarching categories, (i) specification issues, (ii) inter-agent misalignment, and (iii) task verification. MAST emerges iteratively from rigorous inter-annotator agreement studies, achieving a Cohen's Kappa score of 0.88. To support scalable evaluation, we develop a validated LLM-as-a-Judge pipeline integrated with MAST. We leverage two case studies to demonstrate MAST's practical utility in analyzing failures and guiding MAS development. Our findings reveal that identified failures require more complex solutions, highlighting a clear roadmap for future research. We open source our comprehensive dataset and LLM annotator to facilitate further development of MAS.
中文摘要:多智能体大语言模型系统相比单智能体框架性能提升有限,为此我们提出首个实证驱动的MAST失败分类法,通过分析200多项任务识别出三大类14种失败模式,为系统改进提供明确路线图。
English Summary: Multi-agent LLM systems show limited performance improvements over single-agent approaches, prompting the development of MAST—an empirical failure taxonomy identifying 14 failure modes across three categories to guide future system enhancements.

Authors:Zijing Hu, Fengda Zhang, Long Chen, Kun Kuang, Jiahui Li, Kaifeng Gao, Jun Xiao, Xin Wang, Wenwu Zhu
Title: Towards Better Alignment: Training Diffusion Models with Reinforcement Learning Against Sparse Rewards
Abstract:
Diffusion models have achieved remarkable success in text-to-image generation. However, their practical applications are hindered by the misalignment between generated images and corresponding text prompts. To tackle this issue, reinforcement learning (RL) has been considered for diffusion model fine-tuning. Yet, RL's effectiveness is limited by the challenge of sparse reward, where feedback is only available at the end of the generation process. This makes it difficult to identify which actions during the denoising process contribute positively to the final generated image, potentially leading to ineffective or unnecessary denoising policies. To this end, this paper presents a novel RL-based framework that addresses the sparse reward problem when training diffusion models. Our framework, named $\text{B}^2\text{-DiffuRL}$, employs two strategies: \textbf{B}ackward progressive training and \textbf{B}ranch-based sampling. For one thing, backward progressive training focuses initially on the final timesteps of denoising process and gradually extends the training interval to earlier timesteps, easing the learning difficulty from sparse rewards. For another, we perform branch-based sampling for each training interval. By comparing the samples within the same branch, we can identify how much the policies of the current training interval contribute to the final image, which helps to learn effective policies instead of unnecessary ones. $\text{B}^2\text{-DiffuRL}$ is compatible with existing optimization algorithms. Extensive experiments demonstrate the effectiveness of $\text{B}^2\text{-DiffuRL}$ in improving prompt-image alignment and maintaining diversity in generated images. The code for this work is available.
中文: 本文提出B²-DiffuRL强化学习框架,通过逆向渐进训练和分支采样策略解决扩散模型训练中的稀疏奖励问题,有效提升文本-图像对齐质量并保持生成多样性。
English: This paper introduces B²-DiffuRL, a reinforcement learning framework that overcomes sparse reward limitations in diffusion model fine-tuning through backward progressive training and branch-based sampling to enhance text-image alignment while preserving diversity.

Authors:Yuanpeng Zheng, Tiankui Zhang, Xidong Mu, Yuanwei Liu, Rong Huang
Title: Joint Semantic Transmission and Resource Allocation for Intelligent Computation Task Offloading in MEC Systems
Abstract:
Mobile edge computing (MEC) enables the provision of high-reliability and low-latency applications by offering computation and storage resources in close proximity to end-users. Different from traditional computation task offloading in MEC systems, the large data volume and complex task computation of artificial intelligence involved intelligent computation task offloading have increased greatly. To address this challenge, we propose a MEC system for multiple base stations and multiple terminals, which exploits semantic transmission and early exit of inference. Based on this, we investigate a joint semantic transmission and resource allocation problem for maximizing system reward combined with analysis of semantic transmission and intelligent computation process. To solve the formulated problem, we decompose it into communication resource allocation subproblem, semantic transmission subproblem, and computation capacity allocation subproblem. Then, we use 3D matching and convex optimization method to solve subproblems based on the block coordinate descent (BCD) framework. The optimized feasible solutions are derived from an efficient BCD based joint semantic transmission and resource allocation algorithm in MEC systems. Our simulation demonstrates that: 1) The proposed algorithm significantly improves the delay performance for MEC systems compared with benchmarks; 2) The design of transmission mode and early exit of inference greatly increases system reward during offloading; and 3) Our proposed system achieves efficient utilization of resources from the perspective of system reward in the intelligent scenario.
中文: 该系统通过语义传输和推理提前退出机制,结合基于BCD框架的联合优化算法,显著提升了边缘计算的延迟性能和系统奖励,实现了智能场景下的高效资源利用。
English: The proposed MEC system utilizes semantic transmission and early inference exit to optimize joint semantic transmission and resource allocation, significantly improving delay performance and system reward through an efficient BCD-based algorithm.

Authors:Zhangyu Lai, Yilin Lu, Xinyang Li, Jianghang Lin, Yansong Qu, Liujuan Cao, Ming Li, Rongrong Ji
Title: AnomalyPainter: Vision-Language-Diffusion Synergy for Zero-Shot Realistic and Diverse Industrial Anomaly Synthesis
Abstract:
While existing anomaly synthesis methods have made remarkable progress, achieving both realism and diversity in synthesis remains a major obstacle. To address this, we propose AnomalyPainter, a zero-shot framework that breaks the diversity-realism trade-off dilemma through synergizing Vision Language Large Model (VLLM), Latent Diffusion Model (LDM), and our newly introduced texture library Tex-9K. Tex-9K is a professional texture library containing 75 categories and 8,792 texture assets crafted for diverse anomaly synthesis. Leveraging VLLM's general knowledge, reasonable anomaly text descriptions are generated for each industrial object and matched with relevant diverse textures from Tex-9K. These textures then guide the LDM via ControlNet to paint on normal images. Furthermore, we introduce Texture-Aware Latent Init to stabilize the natural-image-trained ControlNet for industrial images. Extensive experiments show that AnomalyPainter outperforms existing methods in realism, diversity, and generalization, achieving superior downstream performance.
中文: AnomalyPainter是一种零样本框架,通过融合视觉语言大模型、潜在扩散模型和Tex-9K纹理库,突破了异常合成中多样性与真实性的权衡困境,实验证明其在真实感、多样性和泛化性方面均优于现有方法。
English: AnomalyPainter is a zero-shot framework that overcomes the diversity-realism trade-off in anomaly synthesis by integrating Vision Language Large Models, Latent Diffusion Models, and the Tex-9K texture library to generate realistic and diverse anomalies, demonstrating superior performance in experiments.

Authors:Quanjian Song, Zhihang Lin, Zhanpeng Zeng, Ziyue Zhang, Liujuan Cao, Rongrong Ji
Title: LightMotion: A Light and Tuning-free Method for Simulating Camera Motion in Video Generation
Abstract:
Existing camera motion-controlled video generation methods face computational bottlenecks in fine-tuning and inference. This paper proposes LightMotion, a light and tuning-free method for simulating camera motion in video generation. Operating in the latent space, it eliminates additional fine-tuning, inpainting, and depth estimation, making it more streamlined than existing methods. The endeavors of this paper comprise: (i) The latent space permutation operation effectively simulates various camera motions like panning, zooming, and rotation. (ii) The latent space resampling strategy combines background-aware sampling and cross-frame alignment to accurately fill new perspectives while maintaining coherence across frames. (iii) Our in-depth analysis shows that the permutation and resampling cause an SNR shift in latent space, leading to poor-quality generation. To address this, we propose latent space correction, which reintroduces noise during denoising to mitigate SNR shift and enhance video generation quality. Exhaustive experiments show that our LightMotion outperforms existing methods, both quantitatively and qualitatively.
中文:LightMotion提出了一种无需调优的潜在空间相机运动视频生成方法,通过置换、重采样和噪声校正来提升质量,并在实验中优于现有技术。
English: LightMotion introduces a tuning-free method for camera motion video generation in latent space, utilizing permutation and resampling with noise correction to enhance quality and outperform existing approaches.

Authors:Eren Erogullari, Sebastian Lapuschkin, Wojciech Samek, Frederik Pahde
Title: Post-Hoc Concept Disentanglement: From Correlated to Isolated Concept Representations
Abstract:
Concept Activation Vectors (CAVs) are widely used to model human-understandable concepts as directions within the latent space of neural networks. They are trained by identifying directions from the activations of concept samples to those of non-concept samples. However, this method often produces similar, non-orthogonal directions for correlated concepts, such as "beard" and "necktie" within the CelebA dataset, which frequently co-occur in images of men. This entanglement complicates the interpretation of concepts in isolation and can lead to undesired effects in CAV applications, such as activation steering. To address this issue, we introduce a post-hoc concept disentanglement method that employs a non-orthogonality loss, facilitating the identification of orthogonal concept directions while preserving directional correctness. We evaluate our approach with real-world and controlled correlated concepts in CelebA and a synthetic FunnyBirds dataset with VGG16 and ResNet18 architectures. We further demonstrate the superiority of orthogonalized concept representations in activation steering tasks, allowing (1) the insertion of isolated concepts into input images through generative models and (2) the removal of concepts for effective shortcut suppression with reduced impact on correlated concepts in comparison to baseline CAVs.
中文: 本文提出了一种后处理概念解耦方法,通过非正交性损失识别神经网络中的正交概念方向,从而在激活引导中实现孤立概念的精准插入和移除,同时减少对相关概念的影响。
English: This paper introduces a post-hoc concept disentanglement method that uses a non-orthogonality loss to identify orthogonal concept directions in neural networks, improving activation steering by enabling isolated concept insertion and removal with minimal impact on correlated concepts.

Authors:Weicong Qin, Yi Xu, Weijie Yu, Chenglei Shen, Ming He, Jianping Fan, Xiao Zhang, Jun Xu
Title: MAPS: Motivation-Aware Personalized Search via LLM-Driven Consultation Alignment
Abstract:
Personalized product search aims to retrieve and rank items that match users' preferences and search intent. Despite their effectiveness, existing approaches typically assume that users' query fully captures their real motivation. However, our analysis of a real-world e-commerce platform reveals that users often engage in relevant consultations before searching, indicating they refine intents through consultations based on motivation and need. The implied motivation in consultations is a key enhancing factor for personalized search. This unexplored area comes with new challenges including aligning contextual motivations with concise queries, bridging the category-text gap, and filtering noise within sequence history. To address these, we propose a Motivation-Aware Personalized Search (MAPS) method. It embeds queries and consultations into a unified semantic space via LLMs, utilizes a Mixture of Attention Experts (MoAE) to prioritize critical semantics, and introduces dual alignment: (1) contrastive learning aligns consultations, reviews, and product features; (2) bidirectional attention integrates motivation-aware embeddings with user preferences. Extensive experiments on real and synthetic data show MAPS outperforms existing methods in both retrieval and ranking tasks.
中文: 该摘要提出MAPS方法,通过分析用户咨询推断潜在动机,利用大语言模型和混合注意力专家对齐语义,提升个性化搜索效果,实验证明其在检索和排序任务中优于现有方法。
English: The abstract introduces MAPS, a novel personalized search method that leverages user consultations to infer underlying motivations, using LLMs and a Mixture of Attention Experts to align semantics and enhance search relevance, demonstrating superior performance in experiments.

Authors:Yongqi Huang, Peng Ye, Chenyu Huang, Jianjian Cao, Lin Zhang, Baopu Li, Gang Yu, Tao Chen
Title: DeRS: Towards Extremely Efficient Upcycled Mixture-of-Experts Models
Abstract:
Upcycled Mixture-of-Experts (MoE) models have shown great potential in various tasks by converting the original Feed-Forward Network (FFN) layers in pre-trained dense models into MoE layers. However, these models still suffer from significant parameter inefficiency due to the introduction of multiple experts. In this work, we propose a novel DeRS (Decompose, Replace, and Synthesis) paradigm to overcome this shortcoming, which is motivated by our observations about the unique redundancy mechanisms of upcycled MoE experts. Specifically, DeRS decomposes the experts into one expert-shared base weight and multiple expert-specific delta weights, and subsequently represents these delta weights in lightweight forms. Our proposed DeRS paradigm can be applied to enhance parameter efficiency in two different scenarios, including: 1) DeRS Compression for inference stage, using sparsification or quantization to compress vanilla upcycled MoE models; and 2) DeRS Upcycling for training stage, employing lightweight sparse or low-rank matrixes to efficiently upcycle dense models into MoE models. Extensive experiments across three different tasks show that the proposed methods can achieve extreme parameter efficiency while maintaining the performance for both training and compression of upcycled MoE models.
中文摘要:DeRS范式通过将专家分解为共享基础权重和轻量级增量权重,显著提升了升级版混合专家模型的参数效率,在保持性能的同时实现了训练与推理阶段的极致压缩。
English Summary: The DeRS paradigm enhances parameter efficiency in upcycled Mixture-of-Experts models by decomposing experts into shared base weights and lightweight delta weights, achieving extreme compression while maintaining performance across training and inference scenarios.

Authors:Dominique Nshimyimana, Vitor Fortes Rey, Sungho Suh, Bo Zhou, Paul Lukowicz
Title: PIM: Physics-Informed Multi-task Pre-training for Improving Inertial Sensor-Based Human Activity Recognition
Abstract:
Human activity recognition (HAR) with deep learning models relies on large amounts of labeled data, often challenging to obtain due to associated cost, time, and labor. Self-supervised learning (SSL) has emerged as an effective approach to leverage unlabeled data through pretext tasks, such as masked reconstruction and multitask learning with signal processing-based data augmentations, to pre-train encoder models. However, such methods are often derived from computer vision approaches that disregard physical mechanisms and constraints that govern wearable sensor data and the phenomena they reflect. In this paper, we propose a physics-informed multi-task pre-training (PIM) framework for IMU-based HAR. PIM generates pre-text tasks based on the understanding of basic physical aspects of human motion: including movement speed, angles of movement, and symmetry between sensor placements. Given a sensor signal, we calculate corresponding features using physics-based equations and use them as pretext tasks for SSL. This enables the model to capture fundamental physical characteristics of human activities, which is especially relevant for multi-sensor systems. Experimental evaluations on four HAR benchmark datasets demonstrate that the proposed method outperforms existing state-of-the-art methods, including data augmentation and masked reconstruction, in terms of accuracy and F1 score. We have observed gains of almost 10\% in macro f1 score and accuracy with only 2 to 8 labeled examples per class and up to 3% when there is no reduction in the amount of training data.
Chinese: 本文提出了一种基于物理信息的多任务预训练(PIM)框架,通过将人体运动的物理特性融入自监督学习来提升活动识别性能,在多个基准数据集上实现了优于现有方法的准确率和F1分数。
English: The paper introduces a physics-informed multi-task pre-training (PIM) framework that enhances human activity recognition by incorporating physical aspects of motion into self-supervised learning, achieving superior accuracy and F1 scores on benchmark datasets.

Authors:Sicheng Zhou, Zhuozhao Li, Valérie Hayot-Sasson, Haochen Pan, Maxime Gonthier, J. Gregory Pauloski, Ryan Chard, Kyle Chard, Ian Foster
Title: WRATH: Workload Resilience Across Task Hierarchies in Task-based Parallel Programming Frameworks
Abstract:
Failures in Task-based Parallel Programming (TBPP) can severely degrade performance and result in incomplete or incorrect outcomes. Existing failure-handling approaches, including reactive, proactive, and resilient methods such as retry and checkpointing mechanisms, often apply uniform retry mechanisms regardless of the root cause of failures, failing to account for the unique characteristics of TBPP frameworks such as heterogeneous resource availability and task-level failures. To address these limitations, we propose WRATH, a novel systematic approach that categorizes failures based on the unique layered structure of TBPP frameworks and defines specific responses to address failures at different layers. WRATH combines a distributed monitoring system and a resilient module to collaboratively address different types of failures in real time. The monitoring system captures execution and resource information, reports failures, and profiles tasks across different layers of TBPP frameworks. The resilient module then categorizes failures and responds with appropriate actions, such as hierarchically retrying failed tasks on suitable resources. Evaluations demonstrate that WRATH significantly improves TBPP robustness, tripling the task success rate and maintaining an application success rate of over 90% for resolvable failures. Additionally, WRATH can reduce the time to failure by 20%-50%, allowing tasks that are destined to fail to be identified and fail more quickly.
中文: WRATH针对基于任务的并行编程提出了一种系统性故障处理方法,通过分层分类故障并利用实时监控与定制响应,显著提高了任务成功率并缩短了故障时间。
English: WRATH introduces a systematic failure-handling approach for Task-based Parallel Programming by categorizing failures across framework layers and employing real-time monitoring with tailored responses, significantly boosting task success rates and reducing failure times.

Authors:Mingyue Cheng, Yucong Luo, Jie Ouyang, Qi Liu, Huijie Liu, Li Li, Shuo Yu, Bohou Zhang, Jiawei Cao, Jie Ma, Daoyu Wang, Enhong Chen
Title: A Survey on Knowledge-Oriented Retrieval-Augmented Generation
Abstract:
Retrieval-Augmented Generation (RAG) has gained significant attention in recent years for its potential to enhance natural language understanding and generation by combining large-scale retrieval systems with generative models. RAG leverages external knowledge sources, such as documents, databases, or structured data, to improve model performance and generate more accurate and contextually relevant outputs. This survey aims to provide a comprehensive overview of RAG by examining its fundamental components, including retrieval mechanisms, generation processes, and the integration between the two. We discuss the key characteristics of RAG, such as its ability to augment generative models with dynamic external knowledge, and the challenges associated with aligning retrieved information with generative objectives. We also present a taxonomy that categorizes RAG methods, ranging from basic retrieval-augmented approaches to more advanced models incorporating multi-modal data and reasoning capabilities. Additionally, we review the evaluation benchmarks and datasets commonly used to assess RAG systems, along with a detailed exploration of its applications in fields such as question answering, summarization, and information retrieval. Finally, we highlight emerging research directions and opportunities for improving RAG systems, such as enhanced retrieval efficiency, model interpretability, and domain-specific adaptations. This paper concludes by outlining the prospects for RAG in addressing real-world challenges and its potential to drive further advancements in natural language processing.
中文: 本综述全面探讨了检索增强生成(RAG)技术,系统阐述了其核心架构、应用场景与发展挑战,并强调其通过动态外部知识融合推动自然语言处理发展的潜力。
English: This survey comprehensively examines Retrieval-Augmented Generation (RAG), detailing its components, applications, and challenges while highlighting its potential to enhance natural language processing through dynamic external knowledge integration.

Authors:Zisheng Chen, Chunwei Wang, Xiuwei Chen, Hongbin Xu, Runhui Huang, Jun Zhou, Jianhua Han, Hang Xu, Xiaodan Liang
Title: SemHiTok: A Unified Image Tokenizer via Semantic-Guided Hierarchical Codebook for Multimodal Understanding and Generation
Abstract:
In this paper, we introduce SemHiTok, a unified image Tokenizer via Semantic-Guided Hierarchical codebook that provides consistent discrete representations for multimodal understanding and generation. Recently, unified image tokenizers have sparked exploration within research community, which is designed to capture high-level semantic features for understanding and retaining low-level pixel features for generation. Previous works attempt to train a unified image tokenizer by combining loss for semantic distillation and pixel reconstruction. However, due to the differing levels of features prioritized by multimodal understanding and generation, joint training methods face significant challenges in achieving a good trade-off. SemHiTok addresses this challenge through a novel semantic-guided hierarchical codebook, which builds pixel sub-codebooks on a pretrained semantic codebook. This design decouples semantic and pixel both in terms of structure and training strategy, enabling the tokenizer to capture pixel features while retaining its ability to comprehend high-level semantic information. Our experiments demonstrate that SemHiTok achieves SOTA performance in image reconstruction and multimodal understanding under LLaVA-v1.5 setting. Further, we develop a unified MLLM with SemHiTok, which exhibits superior performance across multimodal understanding and generation tasks. For understanding, SemHiTok achieves impressive performance on most benchmarks. For generation, our model achieves SOTA performance on MJHQ30K in unified MLLMs.
中文: 本文提出SemHiTok,一种基于语义引导分层码表的统一图像分词器,能有效兼顾高级语义理解与低级像素特征,在多模态任务中表现卓越。
English: This paper presents SemHiTok, a unified image tokenizer using a semantic-guided hierarchical codebook that effectively balances high-level semantic understanding and low-level pixel features for superior multimodal tasks.

Authors:Yi-Kai Zhang, Jin Wang, Xu-Xiang Zhong, De-Chuan Zhan, Han-Jia Ye
Title: Model Assembly Learning with Heterogeneous Layer Weight Merging
Abstract:
Model merging acquires general capabilities without extra data or training by combining multiple models' parameters. Previous approaches achieve linear mode connectivity by aligning parameters into the same loss basin using permutation invariance. In this paper, we introduce Model Assembly Learning (MAL), a novel paradigm for model merging that iteratively integrates parameters from diverse models in an open-ended model zoo to enhance the base model's capabilities. Unlike previous works that require identical architectures, MAL allows the merging of heterogeneous architectures and selective parameters across layers. Specifically, the base model can incorporate parameters from different layers of multiple pre-trained models. We systematically investigate the conditions and fundamental settings of heterogeneous parameter merging, addressing all possible mismatches in layer widths between the base and target models. Furthermore, we establish key laws and provide practical guidelines for effectively implementing MAL.
中文摘要:模型装配学习(MAL)提出了一种新颖的模型融合范式,通过迭代整合来自不同预训练模型的异构架构参数,并建立了系统化的实施准则。
English Summary: Model Assembly Learning (MAL) introduces a novel paradigm for merging heterogeneous model architectures by iteratively integrating selective parameters from diverse pre-trained models, establishing systematic guidelines for effective implementation.

Authors:Xiaoye Qu, Yafu Li, Zhaochen Su, Weigao Sun, Jianhao Yan, Dongrui Liu, Ganqu Cui, Daizong Liu, Shuxian Liang, Junxian He, Peng Li, Wei Wei, Jing Shao, Chaochao Lu, Yue Zhang, Xian-Sheng Hua, Bowen Zhou, Yu Cheng
Title: A Survey of Efficient Reasoning for Large Reasoning Models: Language, Multimodality, and Beyond
Abstract:
Recent Large Reasoning Models (LRMs), such as DeepSeek-R1 and OpenAI o1, have demonstrated strong performance gains by scaling up the length of Chain-of-Thought (CoT) reasoning during inference. However, a growing concern lies in their tendency to produce excessively long reasoning traces, which are often filled with redundant content (e.g., repeated definitions), over-analysis of simple problems, and superficial exploration of multiple reasoning paths for harder tasks. This inefficiency introduces significant challenges for training, inference, and real-world deployment (e.g., in agent-based systems), where token economy is critical. In this survey, we provide a comprehensive overview of recent efforts aimed at improving reasoning efficiency in LRMs, with a particular focus on the unique challenges that arise in this new paradigm. We identify common patterns of inefficiency, examine methods proposed across the LRM lifecycle, i.e., from pretraining to inference, and discuss promising future directions for research. To support ongoing development, we also maintain a real-time GitHub repository tracking recent progress in the field. We hope this survey serves as a foundation for further exploration and inspires innovation in this rapidly evolving area.
中文摘要:近期大型推理模型虽表现强劲,但其推理过程冗长低效,存在内容重复和过度分析等问题,给实际部署带来挑战,本综述系统梳理了提升推理效率的方法并展望未来研究方向。
English Summary: Recent Large Reasoning Models show strong performance but suffer from inefficient, overly lengthy reasoning traces filled with redundancies, posing challenges for deployment where token economy is crucial, prompting this survey to examine efficiency improvements across the model lifecycle.

Authors:Sicong Liu, Yang Shu, Chenjuan Guo, Bin Yang
Title: Learning Generalizable Skills from Offline Multi-Task Data for Multi-Agent Cooperation
Abstract:
Learning cooperative multi-agent policy from offline multi-task data that can generalize to unseen tasks with varying numbers of agents and targets is an attractive problem in many scenarios. Although aggregating general behavior patterns among multiple tasks as skills to improve policy transfer is a promising approach, two primary challenges hinder the further advancement of skill learning in offline multi-task MARL. Firstly, extracting general cooperative behaviors from various action sequences as common skills lacks bringing cooperative temporal knowledge into them. Secondly, existing works only involve common skills and can not adaptively choose independent knowledge as task-specific skills in each task for fine-grained action execution. To tackle these challenges, we propose Hierarchical and Separate Skill Discovery (HiSSD), a novel approach for generalizable offline multi-task MARL through skill learning. HiSSD leverages a hierarchical framework that jointly learns common and task-specific skills. The common skills learn cooperative temporal knowledge and enable in-sample exploitation for offline multi-task MARL. The task-specific skills represent the priors of each task and achieve a task-guided fine-grained action execution. To verify the advancement of our method, we conduct experiments on multi-agent MuJoCo and SMAC benchmarks. After training the policy using HiSSD on offline multi-task data, the empirical results show that HiSSD assigns effective cooperative behaviors and obtains superior performance in unseen tasks.
中文: 本文提出HiSSD分层框架,通过联合学习通用技能与任务特定技能,解决离线多任务多智能体强化学习中的协作行为提取与精细动作执行问题,在基准测试中展现出对未知任务的优越泛化性能。
English: This abstract introduces HiSSD, a hierarchical framework for offline multi-task multi-agent reinforcement learning that jointly learns common and task-specific skills to enhance cooperative behavior and task adaptation, demonstrating superior performance in unseen scenarios on benchmarks.

Authors:Yijia Luo, Yulin Song, Xingyao Zhang, Jiaheng Liu, Weixun Wang, GengRu Chen, Wenbo Su, Bo Zheng
Title: Deconstructing Long Chain-of-Thought: A Structured Reasoning Optimization Framework for Long CoT Distillation
Abstract:
Recent advancements in large language models (LLMs) have demonstrated remarkable reasoning capabilities through long chain-of-thought (CoT) reasoning. The R1 distillation scheme has emerged as a promising approach for training cost-effective models with enhanced reasoning abilities. However, the underlying mechanisms driving its effectiveness remain unclear. This study examines the universality of distillation data and identifies key components that enable the efficient transfer of long-chain reasoning capabilities in LLM distillation. Our findings reveal that the effectiveness of long CoT reasoning distillation from teacher models like Qwen-QwQ degrades significantly on nonhomologous models, challenging the assumed universality of current distillation methods. To gain deeper insights into the structure and patterns of long CoT reasoning, we propose DLCoT (Deconstructing Long Chain-of-Thought), a distillation data enhancement framework. DLCoT consists of three key steps: (1) data segmentation to decompose complex long CoT structures, (2) simplification by eliminating unsolvable and redundant solutions, and (3) optimization of intermediate error states. Our approach significantly improves model performance and token efficiency, facilitating the development of high-performance LLMs.
中文摘要:本研究质疑了现有蒸馏方法中数据通用性的假设,提出DLCoT框架通过数据分割、简化和优化来增强长思维链推理的蒸馏效果,显著提升了模型性能与效率。
English Summary: This study challenges the assumed universality of distillation data in transferring long chain-of-thought reasoning capabilities between models and proposes the DLCoT framework to enhance distillation through data segmentation, simplification, and optimization, significantly improving model performance and efficiency.

Authors:Feiyang Li, Yingjian Chen, Haoran Liu, Rui Yang, Han Yuan, Yuang Jiang, Tianxiao Li, Edison Marrese Taylor, Hossein Rouhizadeh, Yusuke Iwasawa, Douglas Teodoro, Yutaka Matsuo, Irene Li
Title: MKG-Rank: Enhancing Large Language Models with Knowledge Graph for Multilingual Medical Question Answering
Abstract:
Large Language Models (LLMs) have shown remarkable progress in medical question answering (QA), yet their effectiveness remains predominantly limited to English due to imbalanced multilingual training data and scarce medical resources for low-resource languages. To address this critical language gap in medical QA, we propose Multilingual Knowledge Graph-based Retrieval Ranking (MKG-Rank), a knowledge graph-enhanced framework that enables English-centric LLMs to perform multilingual medical QA. Through a word-level translation mechanism, our framework efficiently integrates comprehensive English-centric medical knowledge graphs into LLM reasoning at a low cost, mitigating cross-lingual semantic distortion and achieving precise medical QA across language barriers. To enhance efficiency, we introduce caching and multi-angle ranking strategies to optimize the retrieval process, significantly reducing response times and prioritizing relevant medical knowledge. Extensive evaluations on multilingual medical QA benchmarks across Chinese, Japanese, Korean, and Swahili demonstrate that MKG-Rank consistently outperforms zero-shot LLMs, achieving maximum 35.03% increase in accuracy, while maintaining an average retrieval time of only 0.0009 seconds.
Chinese: 为解决大语言模型在跨语言医疗问答中的局限,我们提出MKG-Rank框架,通过知识图谱增强和翻译机制整合英语医疗知识,在中文、日文、韩文和斯瓦希里语的测试中准确率最高提升35.03%,且平均检索时间仅0.0009秒。
English: To overcome the language limitations of English-centric LLMs in multilingual medical QA, we developed MKG-Rank, a knowledge graph-enhanced framework that integrates English medical knowledge via translation and optimized retrieval, achieving up to 35.03% higher accuracy with minimal response time across multiple languages.

Authors:Langming Liu, Haibin Chen, Yuhao Wang, Yujin Yuan, Shilei Liu, Wenbo Su, Xiangyu Zhao, Bo Zheng
Title: ECKGBench: Benchmarking Large Language Models in E-commerce Leveraging Knowledge Graph
Abstract:
Large language models (LLMs) have demonstrated their capabilities across various NLP tasks. Their potential in e-commerce is also substantial, evidenced by practical implementations such as platform search, personalized recommendations, and customer service. One primary concern associated with LLMs is their factuality (e.g., hallucination), which is urgent in e-commerce due to its significant impact on user experience and revenue. Despite some methods proposed to evaluate LLMs' factuality, issues such as lack of reliability, high consumption, and lack of domain expertise leave a gap between effective assessment in e-commerce. To bridge the evaluation gap, we propose ECKGBench, a dataset specifically designed to evaluate the capacities of LLMs in e-commerce knowledge. Specifically, we adopt a standardized workflow to automatically generate questions based on a large-scale knowledge graph, guaranteeing sufficient reliability. We employ the simple question-answering paradigm, substantially improving the evaluation efficiency by the least input and output tokens. Furthermore, we inject abundant e-commerce expertise in each evaluation stage, including human annotation, prompt design, negative sampling, and verification. Besides, we explore the LLMs' knowledge boundaries in e-commerce from a novel perspective. Through comprehensive evaluations of several advanced LLMs on ECKGBench, we provide meticulous analysis and insights into leveraging LLMs for e-commerce.
中文摘要:大语言模型在电商领域潜力巨大,但存在事实性不足的问题,为此我们提出ECKGBench数据集,通过自动化生成问题和电商专业知识注入,高效评估模型的知识能力并探索其边界。
English Summary: Large language models show great promise in e-commerce applications but face challenges with factuality, leading to the development of ECKGBench, a specialized dataset for evaluating their knowledge in this domain through efficient question-answering and expert-driven methods.

Authors:Lixing Xiao, Shunlin Lu, Huaijin Pi, Ke Fan, Liang Pan, Yueer Zhou, Ziyong Feng, Xiaowei Zhou, Sida Peng, Jingbo Wang
Title: MotionStreamer: Streaming Motion Generation via Diffusion-based Autoregressive Model in Causal Latent Space
Abstract:
This paper addresses the challenge of text-conditioned streaming motion generation, which requires us to predict the next-step human pose based on variable-length historical motions and incoming texts. Existing methods struggle to achieve streaming motion generation, e.g., diffusion models are constrained by pre-defined motion lengths, while GPT-based methods suffer from delayed response and error accumulation problem due to discretized non-causal tokenization. To solve these problems, we propose MotionStreamer, a novel framework that incorporates a continuous causal latent space into a probabilistic autoregressive model. The continuous latents mitigate information loss caused by discretization and effectively reduce error accumulation during long-term autoregressive generation. In addition, by establishing temporal causal dependencies between current and historical motion latents, our model fully utilizes the available information to achieve accurate online motion decoding. Experiments show that our method outperforms existing approaches while offering more applications, including multi-round generation, long-term generation, and dynamic motion composition. Project Page: https://zju3dv.github.io/MotionStreamer/
Chinese: 本文提出MotionStreamer框架,通过将连续因果潜在空间融入概率自回归模型,解决了现有方法在文本条件流式动作生成中的延迟响应和误差累积问题,实现了精确的在线动作解码。
English: This paper introduces MotionStreamer, a novel framework that overcomes limitations of existing methods by using a continuous causal latent space in a probabilistic autoregressive model to enable accurate, real-time text-conditioned streaming motion generation without error accumulation.

Authors:Ghadir Alselwi, Hao Xue, Shoaib Jameel, Basem Suleiman, Hakim Hacid, Flora D. Salim, Imran Razzak
Title: Long Context Modeling with Ranked Memory-Augmented Retrieval
Abstract:
Effective long-term memory management is crucial for language models handling extended contexts. We introduce a novel framework that dynamically ranks memory entries based on relevance. Unlike previous works, our model introduces a novel relevance scoring and a pointwise re-ranking model for key-value embeddings, inspired by learning-to-rank techniques in information retrieval. Enhanced Ranked Memory Augmented Retrieval ERMAR achieves state-of-the-art results on standard benchmarks.
中文:提出的增强排序记忆检索框架通过采用新颖的相关性评分和重排序技术,动态管理语言模型的长期记忆,在标准基准测试中取得了最优性能。
English: The proposed ERMAR framework enhances long-term memory management in language models by dynamically ranking memory entries with novel relevance scoring and re-ranking techniques, achieving state-of-the-art performance.

Authors:Haoyu Guo, He Zhu, Sida Peng, Haotong Lin, Yunzhi Yan, Tao Xie, Wenguan Wang, Xiaowei Zhou, Hujun Bao
Title: Multi-view Reconstruction via SfM-guided Monocular Depth Estimation
Abstract:
In this paper, we present a new method for multi-view geometric reconstruction. In recent years, large vision models have rapidly developed, performing excellently across various tasks and demonstrating remarkable generalization capabilities. Some works use large vision models for monocular depth estimation, which have been applied to facilitate multi-view reconstruction tasks in an indirect manner. Due to the ambiguity of the monocular depth estimation task, the estimated depth values are usually not accurate enough, limiting their utility in aiding multi-view reconstruction. We propose to incorporate SfM information, a strong multi-view prior, into the depth estimation process, thus enhancing the quality of depth prediction and enabling their direct application in multi-view geometric reconstruction. Experimental results on public real-world datasets show that our method significantly improves the quality of depth estimation compared to previous monocular depth estimation works. Additionally, we evaluate the reconstruction quality of our approach in various types of scenes including indoor, streetscape, and aerial views, surpassing state-of-the-art MVS methods. The code and supplementary materials are available at https://zju3dv.github.io/murre/ .
中文: 本文提出一种将运动恢复结构信息融入深度估计的新方法,显著提升多视角几何重建质量,在多种场景下超越现有先进技术。
English: This paper introduces a method that enhances multi-view geometric reconstruction by integrating Structure from Motion (SfM) information into depth estimation, improving accuracy and outperforming existing techniques across diverse scenes.

Authors:Ruiyi Yang, Hao Xue, Imran Razzak, Hakim Hacid, Flora D. Salim
Title: Beyond Single Pass, Looping Through Time: KG-IRAG with Iterative Knowledge Retrieval
Abstract:
Graph Retrieval-Augmented Generation (GraphRAG) has proven highly effective in enhancing the performance of Large Language Models (LLMs) on tasks that require external knowledge. By leveraging Knowledge Graphs (KGs), GraphRAG improves information retrieval for complex reasoning tasks, providing more precise and comprehensive retrieval and generating more accurate responses to QAs. However, most RAG methods fall short in addressing multi-step reasoning, particularly when both information extraction and inference are necessary. To address this limitation, this paper presents Knowledge Graph-Based Iterative Retrieval-Augmented Generation (KG-IRAG), a novel framework that integrates KGs with iterative reasoning to improve LLMs' ability to handle queries involving temporal and logical dependencies. Through iterative retrieval steps, KG-IRAG incrementally gathers relevant data from external KGs, enabling step-by-step reasoning. The proposed approach is particularly suited for scenarios where reasoning is required alongside dynamic temporal data extraction, such as determining optimal travel times based on weather conditions or traffic patterns. Experimental results show that KG-IRAG improves accuracy in complex reasoning tasks by effectively integrating external knowledge with iterative, logic-based retrieval. Additionally, three new datasets: weatherQA-Irish, weatherQA-Sydney, and trafficQA-TFNSW, are formed to evaluate KG-IRAG's performance, demonstrating its potential beyond traditional RAG applications.
中文: GraphRAG通过知识图谱提升大语言模型的信息检索能力,但在多步推理方面存在不足,因此提出KG-IRAG框架,结合迭代推理处理时序和逻辑依赖问题,在天气和交通等复杂任务中提高准确性,并通过新数据集验证其效果。
English: GraphRAG enhances LLMs by using knowledge graphs for better information retrieval, but struggles with multi-step reasoning, leading to the development of KG-IRAG, which integrates iterative reasoning to improve accuracy in complex tasks like weather and traffic analysis, as validated by new datasets.

Authors:Dehai Zhao, Zhenchang Xing, Qinghua Lu, Xiwei Xu, Liming Zhu
Title: SeeAction: Towards Reverse Engineering How-What-Where of HCI Actions from Screencasts for UI Automation
Abstract:
UI automation is a useful technique for UI testing, bug reproduction, and robotic process automation. Recording user actions with an application assists rapid development of UI automation scripts, but existing recording techniques are intrusive, rely on OS or GUI framework accessibility support, or assume specific app implementations. Reverse engineering user actions from screencasts is non-intrusive, but a key reverse-engineering step is currently missing - recognizing human-understandable structured user actions ([command] [widget] [location]) from action screencasts. To fill the gap, we propose a deep learning-based computer vision model that can recognize 11 commands and 11 widgets, and generate location phrases from action screencasts, through joint learning and multi-task learning. We label a large dataset with 7260 video-action pairs, which record user interactions with Word, Zoom, Firefox, Photoshop, and Windows 10 Settings. Through extensive experiments, we confirm the effectiveness and generality of our model, and demonstrate the usefulness of a screencast-to-action-script tool built upon our model for bug reproduction.
中文: 提出了一种基于深度学习的计算机视觉模型,通过联合学习从屏幕录像中识别结构化用户操作,填补了非侵入式UI自动化的空白,能够辨识命令、控件和位置,并在大规模数据集上验证了其有效性,适用于错误复现等场景。
English: A deep learning-based computer vision model is proposed to recognize structured user actions from screencasts, addressing the gap in non-intrusive UI automation by identifying commands, widgets, and locations through joint learning on a large dataset, with proven effectiveness for applications like bug reproduction.

Authors:Mingyang Song, Xiaoye Qu, Jiawei Zhou, Yu Cheng
Title: From Head to Tail: Towards Balanced Representation in Large Vision-Language Models through Adaptive Data Calibration
Abstract:
Large Vision-Language Models (LVLMs) have achieved significant progress in combining visual comprehension with language generation. Despite this success, the training data of LVLMs still suffers from Long-Tail (LT) problems, where the data distribution is highly imbalanced. Previous works have mainly focused on traditional VLM architectures, i.e., CLIP or ViT, and specific tasks such as recognition and classification. Nevertheless, the exploration of LVLM (e.g. LLaVA) and more general tasks (e.g. Visual Question Answering and Visual Reasoning) remains under-explored. In this paper, we first conduct an in-depth analysis of the LT issues in LVLMs and identify two core causes: the overrepresentation of head concepts and the underrepresentation of tail concepts. Based on the above observation, we propose an $\textbf{A}$daptive $\textbf{D}$ata $\textbf{R}$efinement Framework ($\textbf{ADR}$), which consists of two stages: $\textbf{D}$ata $\textbf{R}$ebalancing ($\textbf{DR}$) and $\textbf{D}$ata $\textbf{S}$ynthesis ($\textbf{DS}$). In the DR stage, we adaptively rebalance the redundant data based on entity distributions, while in the DS stage, we leverage Denoising Diffusion Probabilistic Models (DDPMs) and scarce images to supplement underrepresented portions. Through comprehensive evaluations across eleven benchmarks, our proposed ADR effectively mitigates the long-tail problem in the training data, improving the average performance of LLaVA 1.5 relatively by 4.36%, without increasing the training data volume.
中文: 大型视觉语言模型存在长尾数据不平衡问题,本文提出的自适应数据优化框架通过数据重平衡与合成策略,在不增加数据量的情况下有效提升了模型性能。
English: Large Vision-Language Models face long-tail data imbalance issues, which are addressed by the proposed Adaptive Data Refinement framework through data rebalancing and synthesis to enhance performance without increasing data volume.

Authors:Abhilasha Ravichander, Jillian Fisher, Taylor Sorensen, Ximing Lu, Yuchen Lin, Maria Antoniak, Niloofar Mireshghallah, Chandra Bhagavatula, Yejin Choi
Title: Information-Guided Identification of Training Data Imprint in (Proprietary) Large Language Models
Abstract:
High-quality training data has proven crucial for developing performant large language models (LLMs). However, commercial LLM providers disclose few, if any, details about the data used for training. This lack of transparency creates multiple challenges: it limits external oversight and inspection of LLMs for issues such as copyright infringement, it undermines the agency of data authors, and it hinders scientific research on critical issues such as data contamination and data selection. How can we recover what training data is known to LLMs? In this work, we demonstrate a new method to identify training data known to proprietary LLMs like GPT-4 without requiring any access to model weights or token probabilities, by using information-guided probes. Our work builds on a key observation: text passages with high surprisal are good search material for memorization probes. By evaluating a model's ability to successfully reconstruct high-surprisal tokens in text, we can identify a surprising number of texts memorized by LLMs.
中文摘要:本研究提出一种无需模型访问权限的新方法,通过信息引导探针利用高惊异度文本来识别GPT-4等专有大语言模型中记忆的训练数据,成功检测出大量被记忆的文本内容。
English Summary: A novel method using information-guided probes is introduced to identify training data memorized by proprietary LLMs like GPT-4 without requiring model access, by leveraging high-surprisal text passages to detect memorized content.

Authors:Weihao Xuan, Rui Yang, Heli Qi, Qingcheng Zeng, Yunze Xiao, Aosong Feng, Dairui Liu, Yun Xing, Junjue Wang, Fan Gao, Jinghui Lu, Yuang Jiang, Huitao Li, Xin Li, Kunyu Yu, Ruihai Dong, Shangding Gu, Yuekang Li, Xiaofei Xie, Felix Juefei-Xu, Foutse Khomh, Osamu Yoshie, Qingyu Chen, Douglas Teodoro, Nan Liu, Randy Goebel, Lei Ma, Edison Marrese-Taylor, Shijian Lu, Yusuke Iwasawa, Yutaka Matsuo, Irene Li
Title: MMLU-ProX: A Multilingual Benchmark for Advanced Large Language Model Evaluation
Abstract:
Existing large language model (LLM) evaluation benchmarks primarily focus on English, while current multilingual tasks lack parallel questions that specifically assess cross-linguistic reasoning abilities. This dual limitation makes it challenging to comprehensively assess LLMs' performance in the multilingual setting. To fill this gap, we introduce MMLU-ProX, a comprehensive benchmark covering 29 languages, built on an English benchmark. Each language version consists of 11,829 identical questions, enabling direct cross-linguistic comparisons. Additionally, to meet efficient evaluation needs, we provide a lite version containing 658 questions per language. To ensure the high quality of MMLU-ProX, we employ a rigorous development process that involves multiple powerful LLMs for translation, followed by expert review to ensure accurate expression, consistent terminology, and cultural relevance. Building on this, we systematically evaluate 36 state-of-the-art LLMs, including reasoning-enhanced and multilingual-optimized LLMs. The results reveal significant disparities in the multilingual capabilities of LLMs: While they perform well in high-resource languages, their performance declines markedly in low-resource languages, with gaps of up to 24.3%. Through MMLU-ProX, we aim to advance the development of more inclusive AI systems and promote equitable access to technology across global contexts.
中文摘要:MMLU-ProX是一个涵盖29种语言的多语言评估基准,通过平行问题设计揭示了大型语言模型在不同语言间的显著能力差异,特别是在低资源语言中表现下降高达24.3%,旨在推动更具包容性的人工智能系统发展。
English Summary: MMLU-ProX is a new multilingual benchmark with parallel questions across 29 languages that reveals significant performance gaps in large language models, particularly showing up to 24.3% lower accuracy in low-resource languages compared to high-resource ones.

Authors:Bin Yang, Yuxuan Liang, Chenjuan Guo, Christian S. Jensen
Title: Data Driven Decision Making with Time Series and Spatio-temporal Data
Abstract:
Time series data captures properties that change over time. Such data occurs widely, ranging from the scientific and medical domains to the industrial and environmental domains. When the properties in time series exhibit spatial variations, we often call the data spatio-temporal. As part of the continued digitalization of processes throughout society, increasingly large volumes of time series and spatio-temporal data are available. In this tutorial, we focus on data-driven decision making with such data, e.g., enabling greener and more efficient transportation based on traffic time series forecasting. The tutorial adopts the holistic paradigm of ``data-governance-analytics-decision.'' We first introduce the data foundation of time series and spatio-temporal data, which is often heterogeneous. Next, we discuss data governance methods that aim to improve data quality. We then cover data analytics, focusing on the ``AGREE'' principles: Automation, Generalization, Robustness, Explainability, and Efficiency. We finally cover data-driven decision making strategies and briefly discuss promising research directions. We hope that the tutorial will serve as a primary resource for researchers and practitioners who are interested in value creation from time series and spatio-temporal data.
中文摘要:本教程围绕时序和时空数据,系统介绍了从数据治理到分析应用(遵循AGREE原则)的完整决策框架,旨在为跨领域的数据价值挖掘提供方法论支持。
English Summary: This tutorial provides a comprehensive framework for data-driven decision-making using time series and spatio-temporal data, covering data governance, analytics following the AGREE principles, and implementation strategies to create value across various domains.

Authors:Zhen Tan, Jun Yan, I-Hung Hsu, Rujun Han, Zifeng Wang, Long T. Le, Yiwen Song, Yanfei Chen, Hamid Palangi, George Lee, Anand Iyer, Tianlong Chen, Huan Liu, Chen-Yu Lee, Tomas Pfister
Title: In Prospect and Retrospect: Reflective Memory Management for Long-term Personalized Dialogue Agents
Abstract:
Large Language Models (LLMs) have made significant progress in open-ended dialogue, yet their inability to retain and retrieve relevant information from long-term interactions limits their effectiveness in applications requiring sustained personalization. External memory mechanisms have been proposed to address this limitation, enabling LLMs to maintain conversational continuity. However, existing approaches struggle with two key challenges. First, rigid memory granularity fails to capture the natural semantic structure of conversations, leading to fragmented and incomplete representations. Second, fixed retrieval mechanisms cannot adapt to diverse dialogue contexts and user interaction patterns. In this work, we propose Reflective Memory Management (RMM), a novel mechanism for long-term dialogue agents, integrating forward- and backward-looking reflections: (1) Prospective Reflection, which dynamically summarizes interactions across granularities-utterances, turns, and sessions-into a personalized memory bank for effective future retrieval, and (2) Retrospective Reflection, which iteratively refines the retrieval in an online reinforcement learning (RL) manner based on LLMs' cited evidence. Experiments show that RMM demonstrates consistent improvement across various metrics and benchmarks. For example, RMM shows more than 10% accuracy improvement over the baseline without memory management on the LongMemEval dataset.
Large Language Models face challenges in retaining long-term dialogue information due to rigid memory granularity and fixed retrieval mechanisms, but the proposed Reflective Memory Management (RMM) addresses these by dynamically summarizing interactions and refining retrieval through reinforcement learning, achieving over 10% accuracy improvement in benchmarks.
English Summary:

Authors:Xinjie Zhao, Fan Gao, Xingyu Song, Yingjian Chen, Rui Yang, Yanran Fu, Yuyang Wang, Yusuke Iwasawa, Yutaka Matsuo, Irene Li
Title: ReAgent: Reversible Multi-Agent Reasoning for Knowledge-Enhanced Multi-Hop QA
Abstract:
Recent advances in large language models (LLMs) have significantly improved multi-hop question answering (QA) through direct Chain-of-Thought (CoT) reasoning. However, the irreversible nature of CoT leads to error accumulation, making it challenging to correct mistakes in multi-hop reasoning. This paper introduces ReAgent: a Reversible multi-Agent collaborative framework augmented with explicit backtracking mechanisms, enabling reversible multi-hop reasoning. By incorporating text-based retrieval, information aggregation and validation, our system can detect and correct errors mid-reasoning, leading to more robust and interpretable QA outcomes. The framework and experiments serve as a foundation for future work on error-tolerant QA systems. Empirical evaluations across three benchmarks indicate ReAgent's efficacy, yielding average about 6\% improvements against baseline models.
中文: 本文提出ReAgent,一种具备回溯机制的可逆多智能体框架,能在多步推理中检测并纠正错误,在三个基准测试上相比基线模型平均提升约6%。
English: This paper introduces ReAgent, a reversible multi-agent framework with backtracking that detects and corrects errors during multi-hop reasoning, achieving about 6% average improvement over baselines on three benchmarks.

Authors:Bo Yuan, Yulin Chen, Zhen Tan, Wang Jinyan, Huan Liu, Yin Zhang
Title: Label Distribution Learning-Enhanced Dual-KNN for Text Classification
Abstract:
Many text classification methods usually introduce external information (e.g., label descriptions and knowledge bases) to improve the classification performance. Compared to external information, some internal information generated by the model itself during training, like text embeddings and predicted label probability distributions, are exploited poorly when predicting the outcomes of some texts. In this paper, we focus on leveraging this internal information, proposing a dual $k$ nearest neighbor (D$k$NN) framework with two $k$NN modules, to retrieve several neighbors from the training set and augment the distribution of labels. For the $k$NN module, it is easily confused and may cause incorrect predictions when retrieving some nearest neighbors from noisy datasets (datasets with labeling errors) or similar datasets (datasets with similar labels). To address this issue, we also introduce a label distribution learning module that can learn label similarity, and generate a better label distribution to help models distinguish texts more effectively. This module eases model overfitting and improves final classification performance, hence enhancing the quality of the retrieved neighbors by $k$NN modules during inference. Extensive experiments on the benchmark datasets verify the effectiveness of our method.
中文摘要:本文提出了一种双k近邻(DkNN)框架,利用模型内部信息(如文本嵌入和标签分布)改进文本分类,并通过标签分布学习模块处理噪声数据,有效提升分类性能。
English Summary: This paper introduces a dual k-nearest neighbor (DkNN) framework that leverages internal model information like embeddings and label distributions to improve text classification, supplemented by a label distribution learning module to handle noisy data and enhance prediction accuracy.

Authors:Biao Ouyang, Yingying Zhang, Hanyin Cheng, Yang Shu, Chenjuan Guo, Bin Yang, Qingsong Wen, Lunting Fan, Christian S. Jensen
Title: RCRank: Multimodal Ranking of Root Causes of Slow Queries in Cloud Database Systems
Abstract:
With the continued migration of storage to cloud database systems,the impact of slow queries in such systems on services and user experience is increasing. Root-cause diagnosis plays an indispensable role in facilitating slow-query detection and revision. This paper proposes a method capable of both identifying possible root cause types for slow queries and ranking these according to their potential for accelerating slow queries. This enables prioritizing root causes with the highest impact, in turn improving slow-query revision effectiveness. To enable more accurate and detailed diagnoses, we propose the multimodal Ranking for the Root Causes of slow queries (RCRank) framework, which formulates root cause analysis as a multimodal machine learning problem and leverages multimodal information from query statements, execution plans, execution logs, and key performance indicators. To obtain expressive embeddings from its heterogeneous multimodal input, RCRank integrates self-supervised pre-training that enhances cross-modal alignment and task relevance. Next, the framework integrates root-cause-adaptive cross Transformers that enable adaptive fusion of multimodal features with varying characteristics. Finally, the framework offers a unified model that features an impact-aware training objective for identifying and ranking root causes. We report on experiments on real and synthetic datasets, finding that RCRank is capable of consistently outperforming the state-of-the-art methods at root cause identification and ranking according to a range of metrics.
中文: 本文提出RCRank多模态机器学习框架,通过融合查询语句、执行计划等多源数据识别并排序慢查询根因,在诊断准确性和影响优先级评估方面显著优于现有方法。
English: This paper introduces RCRank, a multimodal machine learning framework that identifies and ranks root causes of slow queries by integrating diverse data sources, significantly outperforming existing methods in diagnostic accuracy and impact prioritization.

Authors:Junhao Shi, Qinyuan Cheng, Zhaoye Fei, Yining Zheng, Qipeng Guo, Xipeng Qiu
Title: How to Mitigate Overfitting in Weak-to-strong Generalization?
Abstract:
Aligning powerful AI models on tasks that surpass human evaluation capabilities is the central problem of \textbf{superalignment}. To address this problem, weak-to-strong generalization aims to elicit the capabilities of strong models through weak supervisors and ensure that the behavior of strong models aligns with the intentions of weak supervisors without unsafe behaviors such as deception. Although weak-to-strong generalization exhibiting certain generalization capabilities, strong models exhibit significant overfitting in weak-to-strong generalization: Due to the strong fit ability of strong models, erroneous labels from weak supervisors may lead to overfitting in strong models. In addition, simply filtering out incorrect labels may lead to a degeneration in question quality, resulting in a weak generalization ability of strong models on hard questions. To mitigate overfitting in weak-to-strong generalization, we propose a two-stage framework that simultaneously improves the quality of supervision signals and the quality of input questions. Experimental results in three series of large language models and two mathematical benchmarks demonstrate that our framework significantly improves PGR compared to naive weak-to-strong generalization, even achieving up to 100\% PGR on some models.
中文摘要:本研究针对弱监督到强泛化中的过拟合问题,提出一个双阶段框架,通过同时提升监督信号和输入问题的质量,在多个大语言模型和数学基准测试中实现了显著的性能提升。
English Summary: The study addresses overfitting in weak-to-strong generalization for AI superalignment by proposing a two-stage framework that enhances both supervision signals and input questions, achieving significant performance gains across multiple language models and mathematical benchmarks.

Authors:Zihao Zhao, Chenxiao Fan, Junlong Liu, Zheng Wang, Xiangnan He, Chongming Gao, Juan Li, Fuli Feng
Title: Fine-grained Alignment of Large Language Models for General Medication Recommendation without Overprescription
Abstract:
Large language models (LLMs) holds significant promise in achieving general medication recommendation systems owing to their comprehensive interpretation of clinical notes and flexibility to medication encoding. We evaluated both general-purpose and medical-specific LLMs for medication recommendations, showing their unsatisfactory precision and severe overprescription. To address this, we introduce Language-Assisted Medication Recommendation, which tailors LLMs for medication recommendation in a medication-aware manner, improving the usage of clinical notes. Fine-tuning LLMs with this framework can outperform existing methods by more than 10% in internal validation and generalize across temporal and external validations. Furthermore, the model maintains high accuracy when encountering out-of-distribution medication.
大型语言模型在药物推荐方面潜力巨大但存在精度不足的问题,新提出的框架通过优化临床记录利用显著提升性能,并在各类验证中展现出优越的泛化能力。
Large language models show promise for medication recommendations but face precision issues, which are addressed by a new framework that significantly improves performance and generalizability across validations.

Authors:Zhumei Wang, Zechen Hu, Ruoxi Guo, Huaijin Pi, Ziyong Feng, Sida Peng, Xiaowei Zhou, Mingtao Pei, Siyuan Huang
Title: Mocap-2-to-3: Multi-view Lifting for Monocular Motion Recovery with 2D Pretraining
Abstract:
Recovering absolute human motion from monocular inputs is challenging due to two main issues. First, existing methods depend on 3D training data collected from limited environments, constraining out-of-distribution generalization. The second issue is the difficulty of estimating metric-scale poses from monocular input. To address these challenges, we introduce Mocap-2-to-3, a novel framework that performs multi-view lifting from monocular input by leveraging 2D data pre-training, enabling the reconstruction of metrically accurate 3D motions with absolute positions. To leverage abundant 2D data, we decompose complex 3D motion into multi-view syntheses. We first pretrain a single-view diffusion model on extensive 2D datasets, then fine-tune a multi-view model using public 3D data to enable view-consistent motion generation from monocular input, allowing the model to acquire action priors and diversity through 2D data. Furthermore, to recover absolute poses, we propose a novel human motion representation that decouples the learning of local pose and global movements, while encoding geometric priors of the ground to accelerate convergence. This enables progressive recovery of motion in absolute space during inference. Experimental results on in-the-wild benchmarks demonstrate that our method surpasses state-of-the-art approaches in both camera-space motion realism and world-grounded human positioning, while exhibiting superior generalization capability. Our code will be made publicly available.
中文: Mocap-2-to-3框架通过利用2D数据预训练和创新的运动表示方法,解决了单目输入恢复绝对人体运动的难题,实现了高精度三维运动重建,并展现出卓越的泛化能力和真实感。
English: The Mocap-2-to-3 framework overcomes monocular motion recovery challenges by leveraging 2D data pre-training and a novel human motion representation to reconstruct metrically accurate 3D motions with superior generalization and realism.

Authors:Gaozheng Pei, Shaojie Lyu, Gong Chen, Ke Ma, Qianqian Xu, Yingfei Sun, Qingming Huang
Title: Divide and Conquer: Heterogeneous Noise Integration for Diffusion-based Adversarial Purification
Abstract:
Existing diffusion-based purification methods aim to disrupt adversarial perturbations by introducing a certain amount of noise through a forward diffusion process, followed by a reverse process to recover clean examples. However, this approach is fundamentally flawed: the uniform operation of the forward process across all pixels compromises normal pixels while attempting to combat adversarial perturbations, resulting in the target model producing incorrect predictions. Simply relying on low-intensity noise is insufficient for effective defense. To address this critical issue, we implement a heterogeneous purification strategy grounded in the interpretability of neural networks. Our method decisively applies higher-intensity noise to specific pixels that the target model focuses on while the remaining pixels are subjected to only low-intensity noise. This requirement motivates us to redesign the sampling process of the diffusion model, allowing for the effective removal of varying noise levels. Furthermore, to evaluate our method against strong adaptative attack, our proposed method sharply reduces time cost and memory usage through a single-step resampling. The empirical evidence from extensive experiments across three datasets demonstrates that our method outperforms most current adversarial training and purification techniques by a substantial margin.
中文: 本文提出了一种基于神经网络可解释性的异质净化策略,对关键像素施加高强度噪声而其他像素仅用低强度噪声,在显著降低计算成本的同时大幅超越了现有防御方法的性能。
English: This paper introduces a heterogeneous purification strategy that applies varying noise intensities to different pixels based on neural network interpretability, significantly outperforming existing methods in defense effectiveness while reducing computational costs.

Authors:Jie Tian, Xiaoye Qu, Zhenyi Lu, Wei Wei, Sichen Liu, Yu Cheng
Title: Extrapolating and Decoupling Image-to-Video Generation Models: Motion Modeling is Easier Than You Think
Abstract:
Image-to-Video (I2V) generation aims to synthesize a video clip according to a given image and condition (e.g., text). The key challenge of this task lies in simultaneously generating natural motions while preserving the original appearance of the images. However, current I2V diffusion models (I2V-DMs) often produce videos with limited motion degrees or exhibit uncontrollable motion that conflicts with the textual condition. To address these limitations, we propose a novel Extrapolating and Decoupling framework, which introduces model merging techniques to the I2V domain for the first time. Specifically, our framework consists of three separate stages: (1) Starting with a base I2V-DM, we explicitly inject the textual condition into the temporal module using a lightweight, learnable adapter and fine-tune the integrated model to improve motion controllability. (2) We introduce a training-free extrapolation strategy to amplify the dynamic range of the motion, effectively reversing the fine-tuning process to enhance the motion degree significantly. (3) With the above two-stage models excelling in motion controllability and degree, we decouple the relevant parameters associated with each type of motion ability and inject them into the base I2V-DM. Since the I2V-DM handles different levels of motion controllability and dynamics at various denoising time steps, we adjust the motion-aware parameters accordingly over time. Extensive qualitative and quantitative experiments have been conducted to demonstrate the superiority of our framework over existing methods.
中文: 本文提出了一种新颖的外推与解耦框架,通过三阶段模型融合策略有效提升了图像到视频生成中的运动可控性和动态范围,显著优于现有方法。
English: This paper introduces a novel Extrapolating and Decoupling framework to address the challenges in Image-to-Video generation, enhancing motion controllability and dynamics through a three-stage model merging approach that outperforms existing methods.

Authors:Hongming Zhang, Ruixin Hong, Dong Yu
Title: Streaming Looking Ahead with Token-level Self-reward
Abstract:
Autoregressive decoding algorithms that use only past information often cannot guarantee the best performance. Recently, people discovered that looking-ahead algorithms such as Monte Carlo Tree Search (MCTS) with external reward models (RMs) can significantly improve models' output by allowing them to think ahead and leverage future outputs and associated rewards to guide the current generation. Such techniques can help the reinforcement fine-tuning phase by sampling better trajectories and the inference phase by selecting the better output. However, their high computational cost limits their applications, especially in streaming scenarios. To address this issue, we propose equipping the policy model with token-level self-reward modeling (TRM) capability to eliminate the need for external models and extra communication. We name the new architecture as Reward Transformer. In addition, we propose a streaming-looking-ahead (SLA) algorithm to further boost search efficiency with better parallelization. Experiments show that SLA achieves an overall win rate of 79.7\% against the baseline greedy decoding algorithm on three general-domain datasets with a frozen policy model while maintaining streaming efficiency. If we combine SLA with reinforcement fine-tuning techniques such as DPO, SLA achieves an overall win rate of 89.4\%.
中文:提出的奖励变换器与流式前瞻算法相结合,无需依赖外部模型即可显著超越基线方法,并在保持流式效率的同时,与强化微调技术结合时获得更高胜率。
English: The proposed Reward Transformer with streaming-looking-ahead algorithm eliminates dependency on external models while achieving significantly higher win rates against baseline methods, maintaining streaming efficiency even when combined with reinforcement fine-tuning techniques.

Authors:Jingxian Xu, Mengyu Zhou, Weichang Liu, Hanbing Liu, Shi Han, Dongmei Zhang
Title: TwT: Thinking without Tokens by Habitual Reasoning Distillation with Multi-Teachers' Guidance
Abstract:
Large Language Models (LLMs) have made significant strides in problem-solving by incorporating reasoning processes. However, this enhanced reasoning capability results in an increased number of output tokens during inference, leading to higher computational costs. To address this challenge, we propose TwT (Thinking without Tokens), a method that reduces inference-time costs through habitual reasoning distillation with multi-teachers' guidance, while maintaining high performance. Our approach introduces a Habitual Reasoning Distillation method, which internalizes explicit reasoning into the model's habitual behavior through a Teacher-Guided compression strategy inspired by human cognition. Additionally, we propose Dual-Criteria Rejection Sampling (DCRS), a technique that generates a high-quality and diverse distillation dataset using multiple teacher models, making our method suitable for unsupervised scenarios. Experimental results demonstrate that TwT effectively reduces inference costs while preserving superior performance, achieving up to a 13.6% improvement in accuracy with fewer output tokens compared to other distillation methods, offering a highly practical solution for efficient LLM deployment.
中文总结:TwT方法通过多教师引导的习惯性推理蒸馏将显式推理内化,在减少推理令牌的同时保持高性能,实现了高达13.6%的准确率提升,为LLM高效部署提供了实用方案。
English Summary: The proposed TwT method reduces LLM inference costs by internalizing reasoning through habitual distillation and multi-teacher guidance, achieving higher accuracy with fewer tokens while maintaining performance.

Authors:Hanling Zhang, Rundong Su, Zhihang Yuan, Pengtao Chen, Mingzhu Shen Yibo Fan, Shengen Yan, Guohao Dai, Yu Wang
Title: DiTFastAttnV2: Head-wise Attention Compression for Multi-Modality Diffusion Transformers
Abstract:
Text-to-image generation models, especially Multimodal Diffusion Transformers (MMDiT), have shown remarkable progress in generating high-quality images. However, these models often face significant computational bottlenecks, particularly in attention mechanisms, which hinder their scalability and efficiency. In this paper, we introduce DiTFastAttnV2, a post-training compression method designed to accelerate attention in MMDiT. Through an in-depth analysis of MMDiT's attention patterns, we identify key differences from prior DiT-based methods and propose head-wise arrow attention and caching mechanisms to dynamically adjust attention heads, effectively bridging this gap. We also design an Efficient Fused Kernel for further acceleration. By leveraging local metric methods and optimization techniques, our approach significantly reduces the search time for optimal compression schemes to just minutes while maintaining generation quality. Furthermore, with the customized kernel, DiTFastAttnV2 achieves a 68% reduction in attention FLOPs and 1.5x end-to-end speedup on 2K image generation without compromising visual fidelity.
中文: DiTFastAttnV2提出了一种训练后压缩方法,采用头向箭头注意力和高效融合内核,在保持图像质量的同时,将注意力计算量降低68%并实现2K图像生成1.5倍加速。
English: DiTFastAttnV2 introduces a post-training compression method with head-wise arrow attention and an efficient fused kernel, achieving a 68% reduction in attention FLOPs and 1.5x speedup for 2K image generation while preserving quality.

Authors:Yuto Nishida, Makoto Morishita, Hiroyuki Deguchi, Hidetaka Kamigaito, Taro Watanabe
Title: Long-Tail Crisis in Nearest Neighbor Language Models
Abstract:
The $k$-nearest-neighbor language model ($k$NN-LM), one of the retrieval-augmented language models, improves the perplexity for given text by directly accessing a large datastore built from any text data during inference. A widely held hypothesis for the success of $k$NN-LM is that its explicit memory, i.e., the datastore, enhances predictions for long-tail phenomena. However, prior works have primarily shown its ability to retrieve long-tail contexts, leaving the model's performance remain underexplored in estimating the probabilities of long-tail target tokens during inference. In this paper, we investigate the behavior of $k$NN-LM on low-frequency tokens, examining prediction probability, retrieval accuracy, token distribution in the datastore, and approximation error of the product quantization. Our experimental results reveal that $k$NN-LM does not improve prediction performance for low-frequency tokens but mainly benefits high-frequency tokens regardless of long-tail contexts in the datastore.
Chinese: $k$NN-LM模型主要提升高频词汇的预测性能,而非改善低频词汇的预测效果,尽管其数据存储包含长尾语境。
English: The $k$NN-LM model primarily enhances predictions for high-frequency tokens rather than improving performance on low-frequency tokens, despite its datastore containing long-tail contexts.

Authors:Le Qiu, Zelai Xu, Qixin Tan, Wenhao Tang, Chao Yu, Yu Wang
Title: AED: Automatic Discovery of Effective and Diverse Vulnerabilities for Autonomous Driving Policy with Large Language Models
Abstract:
Assessing the safety of autonomous driving policy is of great importance, and reinforcement learning (RL) has emerged as a powerful method for discovering critical vulnerabilities in driving policies. However, existing RL-based approaches often struggle to identify vulnerabilities that are both effective-meaning the autonomous vehicle is genuinely responsible for the accidents-and diverse-meaning they span various failure types. To address these challenges, we propose AED, a framework that uses large language models (LLMs) to automatically discover effective and diverse vulnerabilities in autonomous driving policies. We first utilize an LLM to automatically design reward functions for RL training. Then we let the LLM consider a diverse set of accident types and train adversarial policies for different accident types in parallel. Finally, we use preference-based learning to filter ineffective accidents and enhance the effectiveness of each vulnerability. Experiments across multiple simulated traffic scenarios and tested policies show that AED uncovers a broader range of vulnerabilities and achieves higher attack success rates compared with expert-designed rewards, thereby reducing the need for manual reward engineering and improving the diversity and effectiveness of vulnerability discovery.
中文摘要:提出的AED框架利用大语言模型,通过奖励函数设计、并行对抗训练和偏好学习筛选,自动发现自动驾驶策略中既有效又多样化的安全漏洞,在漏洞覆盖范围和攻击成功率上均优于人工设计方法。
English Summary: The proposed AED framework leverages large language models to automatically generate effective and diverse vulnerabilities in autonomous driving policies through reward function design, parallel adversarial training, and preference-based filtering, outperforming manual methods in both scope and success rate.

Authors:Rakesh Nadig, Vamanan Arulchelvan, Rahul Bera, Taha Shahroodi, Gagandeep Singh, Andreas Kakolyris, Mohammad Sadrosadati, Jisung Park, Onur Mutlu
Title: Harmonia: A Multi-Agent Reinforcement Learning Approach to Data Placement and Migration in Hybrid Storage Systems
Abstract:
Hybrid storage systems (HSS) integrate multiple storage devices with diverse characteristics to deliver high performance and capacity at low cost. The performance of an HSS highly depends on the effectiveness of two key policies: (1) the data-placement policy, which determines the best-fit storage device for incoming data, and (2) the data-migration policy, which dynamically rearranges stored data (i.e., prefetches hot data and evicts cold data) across the devices to sustain high HSS performance. Prior works optimize either data placement or data migration in isolation, which leads to suboptimal HSS performance. Unfortunately, no prior work tries to optimize both policies together. Our goal is to design a holistic data-management technique that optimizes both data-placement and data-migration policies to fully exploit the potential of an HSS, and thus significantly improve system performance. We propose Harmonia, a multi-agent reinforcement learning (RL)-based data-management technique that employs two lightweight autonomous RL agents, a data-placement agent and a data-migration agent, that adapt their policies for the current workload and HSS configuration while coordinating with each other to improve overall HSS performance. We evaluate Harmonia on real HSS configurations with up to four heterogeneous storage devices and seventeen data-intensive workloads. On performance-optimized (cost-optimized) HSS with two storage devices, Harmonia outperforms the best-performing prior approach by 49.5% (31.7%) on average. On an HSS with three (four) devices, Harmonia outperforms the best-performing prior work by 37.0% (42.0%) on average. Harmonia's performance benefits come with low latency (240ns for inference) and storage overheads (206 KiB in DRAM for both RL agents combined). We will open-source Harmonia's implementation to aid future research on HSS.
中文: Harmonia是一种基于多智能体强化学习的数据管理技术,通过协同优化混合存储系统中的数据放置与迁移策略,以低开销显著提升系统性能。
English: Harmonia is a multi-agent reinforcement learning-based data-management technique that simultaneously optimizes data placement and migration policies in hybrid storage systems, significantly improving performance with low overhead.

Authors:Navami Kairanda, Marc Habermann, Shanthika Naik, Christian Theobalt, Vladislav Golyanik
Title: Thin-Shell-SfT: Fine-Grained Monocular Non-rigid 3D Surface Tracking with Neural Deformation Fields
Abstract:
3D reconstruction of highly deformable surfaces (e.g. cloths) from monocular RGB videos is a challenging problem, and no solution provides a consistent and accurate recovery of fine-grained surface details. To account for the ill-posed nature of the setting, existing methods use deformation models with statistical, neural, or physical priors. They also predominantly rely on nonadaptive discrete surface representations (e.g. polygonal meshes), perform frame-by-frame optimisation leading to error propagation, and suffer from poor gradients of the mesh-based differentiable renderers. Consequently, fine surface details such as cloth wrinkles are often not recovered with the desired accuracy. In response to these limitations, we propose ThinShell-SfT, a new method for non-rigid 3D tracking that represents a surface as an implicit and continuous spatiotemporal neural field. We incorporate continuous thin shell physics prior based on the Kirchhoff-Love model for spatial regularisation, which starkly contrasts the discretised alternatives of earlier works. Lastly, we leverage 3D Gaussian splatting to differentiably render the surface into image space and optimise the deformations based on analysis-bysynthesis principles. Our Thin-Shell-SfT outperforms prior works qualitatively and quantitatively thanks to our continuous surface formulation in conjunction with a specially tailored simulation prior and surface-induced 3D Gaussians. See our project page at https://4dqv.mpiinf.mpg.de/ThinShellSfT.
中文: ThinShell-SfT提出了一种新颖的非刚性三维跟踪方法,通过隐式时空神经场表示表面,结合连续薄壳物理先验和三维高斯溅射技术,在恢复布料褶皱等细微表面细节方面显著优于现有方法。
English: ThinShell-SfT introduces a novel method for 3D reconstruction of deformable surfaces using an implicit spatiotemporal neural field, enhanced by continuous thin shell physics and 3D Gaussian splatting, outperforming prior approaches in recovering fine details like cloth wrinkles.

Authors:F. Nisa Bostancı, Oğuzhan Canpolat, Ataberk Olgun, İsmail Emir Yüksel, Konstantinos Kanellopoulos, Mohammad Sadrosadati, A. Giray Yağlıkçı, Onur Mutlu
Title: Understanding and Mitigating Side and Covert Channel Vulnerabilities Introduced by RowHammer Defenses
Abstract:
DRAM chips are vulnerable to read disturbance phenomena (e.g., RowHammer and RowPress), where repeatedly accessing or keeping open a DRAM row causes bitflips in nearby rows. Attackers leverage RowHammer bitflips in real systems to take over systems and leak data. Consequently, many prior works propose mitigations, including recent DDR specifications introducing new mitigations (e.g., PRAC and RFM). For robust operation, it is critical to analyze other security implications of RowHammer mitigations. Unfortunately, no prior work analyzes the timing covert and side channel vulnerabilities introduced by RowHammer mitigations. This paper presents the first analysis and evaluation of timing covert and side channel vulnerabilities introduced by state-of-the-art RowHammer mitigations. We demonstrate that RowHammer mitigations' preventive actions have two fundamental features that enable timing channels. First, preventive actions reduce DRAM bandwidth availability, resulting in longer memory latencies. Second, preventive actions can be triggered on demand depending on memory access patterns. We introduce LeakyHammer, a new class of attacks that leverage the RowHammer mitigation-induced memory latency differences to establish communication channels and leak secrets. First, we build two covert channel attacks exploiting two state-of-the-art RowHammer mitigations, achieving 38.6 Kbps and 48.6 Kbps channel capacity. Second, we demonstrate a website fingerprinting attack that identifies visited websites based on the RowHammer-preventive actions they cause. We propose and evaluate two countermeasures against LeakyHammer and show that fundamentally mitigating LeakyHammer induces large overheads in highly RowHammer-vulnerable systems. We believe and hope our work can enable and aid future work on designing robust systems against RowHammer mitigation-based side and covert channels.
中文摘要:本文首次提出LeakyHammer攻击,揭示了RowHammer防护机制中的时序隐蔽与侧信道漏洞,通过其预防性操作可建立高带宽通信信道并实现网站指纹识别。
English Summary: This paper introduces LeakyHammer, the first analysis of timing covert and side channel vulnerabilities in RowHammer mitigations, demonstrating how their preventive actions enable high-capacity communication channels and website fingerprinting attacks.

Authors:Da Ma, Gonghu Shang, Zhi Chen, Libo Qin, Yijie Luo, Lei Pan, Shuai Fan, Lu Chen, Kai Yu
Title: Task-Specific Data Selection for Instruction Tuning via Monosemantic Neuronal Activations
Abstract:
Instruction tuning improves the ability of large language models (LLMs) to follow diverse human instructions, but achieving strong performance on specific target tasks remains challenging. A critical bottleneck is selecting the most relevant data to maximize task-specific performance. Existing data selection approaches include unstable influence-based methods and more stable distribution alignment methods, the latter of which critically rely on the underlying sample representation. In practice, most distribution alignment methods, from shallow features (e.g., BM25) to neural embeddings (e.g., BGE, LLM2Vec), may fail to capture how the model internally processes samples. To bridge this gap, we adopt a model-centric strategy in which each sample is represented by its neuronal activation pattern in the model, directly reflecting internal computation. However, directly using raw neuron activations leads to spurious similarity between unrelated samples due to neuron polysemanticity, where a single neuron may respond to multiple, unrelated concepts. To address this, we employ sparse autoencoders to disentangle polysemantic activations into sparse, monosemantic representations, and introduce a dedicated similarity metric for this space to better identify task-relevant data. Comprehensive experiments across multiple instruction datasets, models, tasks, and selection ratios show that our approach consistently outperforms existing data selection baselines in both stability and task-specific performance.
Chinese: 指令调优提升了大型语言模型遵循人类指令的能力,但针对特定任务选择最相关数据仍是挑战;我们的方法通过稀疏自编码器解析神经元激活模式,在稳定性和任务性能上均优于现有数据选择基准。
English: Instruction tuning enhances large language models' ability to follow human instructions, but selecting the most relevant data for specific tasks remains challenging, which our approach addresses by using sparse autoencoders to create neuron activation-based representations that outperform existing methods in stability and performance.

Authors:Deyin Yi, Yihao Liu, Lang Cao, Mengyu Zhou, Haoyu Dong, Shi Han, Dongmei Zhang
Title: TablePilot: Recommending Human-Preferred Tabular Data Analysis with Large Language Models
Abstract:
Tabular data analysis is crucial in many scenarios, yet efficiently identifying the most relevant data analysis queries and results for a new table remains a significant challenge. The complexity of tabular data, diverse analytical operations, and the demand for high-quality analysis make the process tedious. To address these challenges, we aim to recommend query-code-result triplets tailored for new tables in tabular data analysis workflows. In this paper, we present TablePilot, a pioneering tabular data analysis framework leveraging large language models to autonomously generate comprehensive and superior analytical results without relying on user profiles or prior interactions. The framework incorporates key designs in analysis preparation and analysis optimization to enhance accuracy. Additionally, we propose Rec-Align, a novel method to further improve recommendation quality and better align with human preferences. Experiments on DART, a dataset specifically designed for comprehensive tabular data analysis recommendation, demonstrate the effectiveness of our framework. Based on GPT-4o, the tuned TablePilot achieves 77.0% top-5 recommendation recall. Human evaluations further highlight its effectiveness in optimizing tabular data analysis workflows.
Chinese: TablePilot是一种创新框架,利用大语言模型为表格数据分析自主生成高质量的查询-代码-结果三元组,在DART数据集上实现77.0%的Top-5召回率,并通过人工评估验证了其优化分析流程的有效性。
English: TablePilot is a novel framework that employs large language models to autonomously generate high-quality query-code-result triplets for tabular data analysis, achieving 77.0% top-5 recall on the DART dataset and demonstrating effectiveness through human evaluations.

Authors:Hiroyasu Akada, Jian Wang, Vladislav Golyanik, Christian Theobalt
Title: Bring Your Rear Cameras for Egocentric 3D Human Pose Estimation
Abstract:
Egocentric 3D human pose estimation has been actively studied using cameras installed in front of a head-mounted device (HMD). While frontal placement is the optimal and the only option for some tasks, such as hand tracking, it remains unclear if the same holds for full-body tracking due to self-occlusion and limited field-of-view coverage. Notably, even the state-of-the-art methods often fail to estimate accurate 3D poses in many scenarios, such as when HMD users tilt their heads upward -- a common motion in human activities. A key limitation of existing HMD designs is their neglect of the back of the body, despite its potential to provide crucial 3D reconstruction cues. Hence, this paper investigates the usefulness of rear cameras for full-body tracking. We also show that simply adding rear views to the frontal inputs is not optimal for existing methods due to their dependence on individual 2D joint detectors without effective multi-view integration. To address this issue, we propose a new transformer-based method that refines 2D joint heatmap estimation with multi-view information and heatmap uncertainty, thereby improving 3D pose tracking. Also, we introduce two new large-scale datasets, Ego4View-Syn and Ego4View-RW, for a rear-view evaluation. Our experiments show that the new camera configurations with back views provide superior support for 3D pose tracking compared to only frontal placements. The proposed method achieves significant improvement over the current state of the art (>10% on MPJPE). The source code, trained models, and datasets are available on our project page at https://4dqv.mpi-inf.mpg.de/EgoRear/.
中文: 本文研究表明,在头戴设备上增加后置摄像头能通过克服仅用前置视角的局限显著提升三维全身姿态追踪效果,并提出一种基于Transformer的新方法及新数据集,使追踪精度提升超过10%。
English: This paper demonstrates that adding rear cameras to head-mounted devices significantly improves 3D full-body pose tracking by overcoming limitations of frontal-only views, and introduces a transformer-based method with new datasets that achieves over 10% improvement in accuracy.

Authors:Ziqin Zhou, Yifan Yang, Yuqing Yang, Tianyu He, Houwen Peng, Kai Qiu, Qi Dai, Lili Qiu, Chong Luo, Lingqiao Liu
Title: HiTVideo: Hierarchical Tokenizers for Enhancing Text-to-Video Generation with Autoregressive Large Language Models
Abstract:
Text-to-video generation poses significant challenges due to the inherent complexity of video data, which spans both temporal and spatial dimensions. It introduces additional redundancy, abrupt variations, and a domain gap between language and vision tokens while generation. Addressing these challenges requires an effective video tokenizer that can efficiently encode video data while preserving essential semantic and spatiotemporal information, serving as a critical bridge between text and vision. Inspired by the observation in VQ-VAE-2 and workflows of traditional animation, we propose HiTVideo for text-to-video generation with hierarchical tokenizers. It utilizes a 3D causal VAE with a multi-layer discrete token framework, encoding video content into hierarchically structured codebooks. Higher layers capture semantic information with higher compression, while lower layers focus on fine-grained spatiotemporal details, striking a balance between compression efficiency and reconstruction quality. Our approach efficiently encodes longer video sequences (e.g., 8 seconds, 64 frames), reducing bits per pixel (bpp) by approximately 70\% compared to baseline tokenizers, while maintaining competitive reconstruction quality. We explore the trade-offs between compression and reconstruction, while emphasizing the advantages of high-compressed semantic tokens in text-to-video tasks. HiTVideo aims to address the potential limitations of existing video tokenizers in text-to-video generation tasks, striving for higher compression ratios and simplify LLMs modeling under language guidance, offering a scalable and promising framework for advancing text to video generation. Demo page: https://ziqinzhou66.github.io/project/HiTVideo.
中文: HiTVideo采用分层标记器和3D因果VAE,将视频高效编码为结构化码本,在保持质量的同时显著降低比特率约70%,有效弥合文本与视觉的鸿沟,推动文本到视频生成的发展。
English: HiTVideo introduces a hierarchical tokenizer using 3D causal VAE to efficiently encode videos into structured codebooks, achieving 70% lower bpp while maintaining quality and bridging text-vision gaps in generation.

Authors:Pengfei Luo, Jingbo Zhou, Tong Xu, Yuan Xia, Linli Xu, Enhong Chen
Title: ImageScope: Unifying Language-Guided Image Retrieval via Large Multimodal Model Collective Reasoning
Abstract:
With the proliferation of images in online content, language-guided image retrieval (LGIR) has emerged as a research hotspot over the past decade, encompassing a variety of subtasks with diverse input forms. While the development of large multimodal models (LMMs) has significantly facilitated these tasks, existing approaches often address them in isolation, requiring the construction of separate systems for each task. This not only increases system complexity and maintenance costs, but also exacerbates challenges stemming from language ambiguity and complex image content, making it difficult for retrieval systems to provide accurate and reliable results. To this end, we propose ImageScope, a training-free, three-stage framework that leverages collective reasoning to unify LGIR tasks. The key insight behind the unification lies in the compositional nature of language, which transforms diverse LGIR tasks into a generalized text-to-image retrieval process, along with the reasoning of LMMs serving as a universal verification to refine the results. To be specific, in the first stage, we improve the robustness of the framework by synthesizing search intents across varying levels of semantic granularity using chain-of-thought (CoT) reasoning. In the second and third stages, we then reflect on retrieval results by verifying predicate propositions locally, and performing pairwise evaluations globally. Experiments conducted on six LGIR datasets demonstrate that ImageScope outperforms competitive baselines. Comprehensive evaluations and ablation studies further confirm the effectiveness of our design.
Chinese: ImageScope 是一种免训练的三阶段框架,通过集体推理统一语言引导的图像检索任务,将多样化任务转化为广义的文本到图像检索过程,并在多个数据集上超越现有方法。
English: ImageScope is a training-free, three-stage framework that unifies language-guided image retrieval tasks through collective reasoning, transforming diverse tasks into a generalized text-to-image retrieval process and outperforming existing methods on multiple datasets.

Authors:Jihao Zhao, Zhiyuan Ji, Zhaoxin Fan, Hanyu Wang, Simin Niu, Bo Tang, Feiyu Xiong, Zhiyu Li
Title: MoC: Mixtures of Text Chunking Learners for Retrieval-Augmented Generation System
Abstract:
Retrieval-Augmented Generation (RAG), while serving as a viable complement to large language models (LLMs), often overlooks the crucial aspect of text chunking within its pipeline. This paper initially introduces a dual-metric evaluation method, comprising Boundary Clarity and Chunk Stickiness, to enable the direct quantification of chunking quality. Leveraging this assessment method, we highlight the inherent limitations of traditional and semantic chunking in handling complex contextual nuances, thereby substantiating the necessity of integrating LLMs into chunking process. To address the inherent trade-off between computational efficiency and chunking precision in LLM-based approaches, we devise the granularity-aware Mixture-of-Chunkers (MoC) framework, which consists of a three-stage processing mechanism. Notably, our objective is to guide the chunker towards generating a structured list of chunking regular expressions, which are subsequently employed to extract chunks from the original text. Extensive experiments demonstrate that both our proposed metrics and the MoC framework effectively settle challenges of the chunking task, revealing the chunking kernel while enhancing the performance of the RAG system.
中文: 本文提出了一种评估文本分块质量的双指标方法,并设计了混合分块器框架来解决传统方法的局限性,通过结构化分块提取显著提升了检索增强生成系统的性能。
English: This paper introduces a dual-metric evaluation method for text chunking quality and proposes the Mixture-of-Chunkers framework to address limitations in traditional methods, ultimately enhancing Retrieval-Augmented Generation system performance through structured chunk extraction.

Authors:Lijie Hu, Junchi Liao, Weimin Lyu, Shaopeng Fu, Tianhao Huang, Shu Yang, Guimin Hu, Di Wang
Title: C^2 ATTACK: Towards Representation Backdoor on CLIP via Concept Confusion
Abstract:
Backdoor attacks pose a significant threat to deep learning models, enabling adversaries to embed hidden triggers that manipulate the behavior of the model during inference. Traditional backdoor attacks typically rely on inserting explicit triggers (e.g., external patches, or perturbations) into input data, but they often struggle to evade existing defense mechanisms. To address this limitation, we investigate backdoor attacks through the lens of the reasoning process in deep learning systems, drawing insights from interpretable AI. We conceptualize backdoor activation as the manipulation of learned concepts within the model's latent representations. Thus, existing attacks can be seen as implicit manipulations of these activated concepts during inference. This raises interesting questions: why not manipulate the concepts explicitly? This idea leads to our novel backdoor attack framework, Concept Confusion Attack (C^2 ATTACK), which leverages internal concepts in the model's reasoning as "triggers" without introducing explicit external modifications. By avoiding the use of real triggers and directly activating or deactivating specific concepts in latent spaces, our approach enhances stealth, making detection by existing defenses significantly harder. Using CLIP as a case study, experimental results demonstrate the effectiveness of C^2 ATTACK, achieving high attack success rates while maintaining robustness against advanced defenses.
Chinese: 概念混淆攻击(C^2 ATTACK)提出了一种新颖的后门框架,通过操纵模型推理过程中的内部概念作为触发器,无需外部修改即可增强隐蔽性并有效规避现有防御机制。
English: The Concept Confusion Attack (C^2 ATTACK) introduces a novel backdoor framework that manipulates internal concepts in a model's reasoning process as triggers, enhancing stealth and bypassing existing defenses without external modifications.

Authors:Mayank Kabra, Rakesh Nadig, Harshita Gupta, Rahul Bera, Manos Frouzakis, Vamanan Arulchelvan, Yu Liang, Haiyu Mao, Mohammad Sadrosadati, Onur Mutlu
Title: CIPHERMATCH: Accelerating Homomorphic Encryption-Based String Matching via Memory-Efficient Data Packing and In-Flash Processing
Abstract:
Homomorphic encryption (HE) allows secure computation on encrypted data without revealing the original data, providing significant benefits for privacy-sensitive applications. Many cloud computing applications (e.g., DNA read mapping, biometric matching, web search) use exact string matching as a key operation. However, prior string matching algorithms that use homomorphic encryption are limited by high computational latency caused by the use of complex operations and data movement bottlenecks due to the large encrypted data size. In this work, we provide an efficient algorithm-hardware codesign to accelerate HE-based secure exact string matching. We propose CIPHERMATCH, which (i) reduces the increase in memory footprint after encryption using an optimized software-based data packing scheme, (ii) eliminates the use of costly homomorphic operations (e.g., multiplication and rotation), and (iii) reduces data movement by designing a new in-flash processing (IFP) architecture. We demonstrate the benefits of CIPHERMATCH using two case studies: (1) Exact DNA string matching and (2) encrypted database search. Our pure software-based CIPHERMATCH implementation that uses our memory-efficient data packing scheme improves performance and reduces energy consumption by 42.9X and 17.6X, respectively, compared to the state-of-the-art software baseline. Integrating CIPHERMATCH with IFP improves performance and reduces energy consumption by 136.9X and 256.4X, respectively, compared to the software-based CIPHERMATCH implementation.
中文: CIPHERMATCH提出了一种高效的算法-硬件协同设计,通过优化数据打包和闪存内处理架构,显著降低了基于同态加密的精确字符串匹配的计算延迟和能耗。
English: CIPHERMATCH introduces an efficient algorithm-hardware codesign for homomorphic encryption-based exact string matching, reducing computational latency and energy consumption through optimized data packing and in-flash processing architecture.

Authors:Xun Liang, Hanyu Wang, Huayi Lai, Simin Niu, Shichao Song, Jiawei Yang, Jihao Zhao, Feiyu Xiong, Bo Tang, Zhiyu Li
Title: SEAP: Training-free Sparse Expert Activation Pruning Unlock the Brainpower of Large Language Models
Abstract:
Large Language Models have achieved remarkable success across various natural language processing tasks, yet their high computational cost during inference remains a major bottleneck. This paper introduces Sparse Expert Activation Pruning (SEAP), a training-free pruning method that selectively retains task-relevant parameters to reduce inference overhead. Inspired by the clustering patterns of hidden states and activations in LLMs, SEAP identifies task-specific expert activation patterns and prunes the model while preserving task performance and enhancing computational efficiency. Experimental results demonstrate that SEAP significantly reduces computational overhead while maintaining competitive accuracy. Notably, at 50% pruning, SEAP surpasses both WandA and FLAP by over 20%, and at 20% pruning, it incurs only a 2.2% performance drop compared to the dense model. These findings highlight SEAP's scalability and effectiveness, making it a promising approach for optimizing large-scale LLMs.
中文: 本文提出稀疏专家激活剪枝方法SEAP,这种免训练技术通过选择性保留任务相关参数,在保持大语言模型性能的同时显著降低计算开销,实验表明其在50%剪枝率下比现有方法性能提升超20%,仅带来2.2%的性能损失。
English: This paper introduces Sparse Expert Activation Pruning (SEAP), a training-free method that selectively prunes task-irrelevant parameters in Large Language Models to significantly reduce computational costs while maintaining competitive performance, with experimental results showing over 20% improvement over existing methods at 50% pruning.

Authors:Chuqi Wang, Chao Yu, Xin Xu, Yuman Gao, Xinyi Yang, Wenhao Tang, Shu'ang Yu, Yinuo Chen, Feng Gao, ZhuoZhu Jian, Xinlei Chen, Fei Gao, Boyu Zhou, Yu Wang
Title: Multi-Robot System for Cooperative Exploration in Unknown Environments: A Survey
Abstract:
With the real need of field exploration in large-scale and extreme outdoor environments, cooperative exploration tasks have garnered increasing attention. This paper presents a comprehensive review of multi-robot cooperative exploration systems. First, we review the evolution of robotic exploration and introduce a modular research framework tailored for multi-robot cooperative exploration. Based on this framework, we systematically categorize and summarize key system components. As a foundational module for multi-robot exploration, the localization and mapping module is primarily introduced by focusing on global and relative pose estimation, as well as multi-robot map merging techniques. The cooperative motion module is further divided into learning-based approaches and multi-stage planning, with the latter encompassing target generation, task allocation, and motion planning strategies. Given the communication constraints of real-world environments, we also analyze the communication module, emphasizing how robots exchange information within local communication ranges and under limited transmission capabilities. In addition, we introduce the actual application of multi-robot cooperative exploration systems in DARPA SubT Challenge. Finally, we discuss the challenges and future research directions for multi-robot cooperative exploration in light of real-world trends. This review aims to serve as a valuable reference for researchers and practitioners in the field.
中文摘要:本文系统综述了多机器人协同探索系统,涵盖定位建图、协同运动和通信等核心模块,结合实际应用探讨了未来挑战与发展方向。
English Summary: This paper provides a comprehensive review of multi-robot cooperative exploration systems, detailing key components like localization, motion planning, and communication, while discussing real-world applications and future challenges.

Authors:Hongshen Xu, Zixv yang, Zichen Zhu, Kunyao Lan, Zihan Wang, Mengyue Wu, Ziwei Ji, Lu Chen, Pascale Fung, Kai Yu
Title: Delusions of Large Language Models
Abstract:
Large Language Models often generate factually incorrect but plausible outputs, known as hallucinations. We identify a more insidious phenomenon, LLM delusion, defined as high belief hallucinations, incorrect outputs with abnormally high confidence, making them harder to detect and mitigate. Unlike ordinary hallucinations, delusions persist with low uncertainty, posing significant challenges to model reliability. Through empirical analysis across different model families and sizes on several Question Answering tasks, we show that delusions are prevalent and distinct from hallucinations. LLMs exhibit lower honesty with delusions, which are harder to override via finetuning or self reflection. We link delusion formation with training dynamics and dataset noise and explore mitigation strategies such as retrieval augmented generation and multi agent debating to mitigate delusions. By systematically investigating the nature, prevalence, and mitigation of LLM delusions, our study provides insights into the underlying causes of this phenomenon and outlines future directions for improving model reliability.
中文摘要:大语言模型存在一种称为“妄想”的危险现象,即模型以异常高的置信度生成事实错误的输出,比普通幻觉更难检测和纠正,对模型可靠性构成严重威胁。
English Summary: Large Language Models exhibit a dangerous phenomenon called "delusion," where they produce factually incorrect outputs with abnormally high confidence that are harder to detect and mitigate than ordinary hallucinations, posing significant reliability challenges.

Authors:Hongshen Xu, Zihan Wang, Zichen Zhu, Lei Pan, Xingyu Chen, Lu Chen, Kai Yu
Title: Alignment for Efficient Tool Calling of Large Language Models
Abstract:
Recent advancements in tool learning have enabled large language models (LLMs) to integrate external tools, enhancing their task performance by expanding their knowledge boundaries. However, relying on tools often introduces tradeoffs between performance, speed, and cost, with LLMs sometimes exhibiting overreliance and overconfidence in tool usage. This paper addresses the challenge of aligning LLMs with their knowledge boundaries to make more intelligent decisions about tool invocation. We propose a multi objective alignment framework that combines probabilistic knowledge boundary estimation with dynamic decision making, allowing LLMs to better assess when to invoke tools based on their confidence. Our framework includes two methods for knowledge boundary estimation, consistency based and absolute estimation, and two training strategies for integrating these estimates into the model decision making process. Experimental results on various tool invocation scenarios demonstrate the effectiveness of our framework, showing significant improvements in tool efficiency by reducing unnecessary tool usage.
Chinese: 本文提出一种多目标对齐框架,通过结合概率知识边界估计与动态决策机制,使大语言模型能更智能地判断何时调用工具,显著减少不必要的工具使用并提升效率。
English: This paper introduces a multi-objective alignment framework that enhances large language models' ability to intelligently decide when to invoke tools by combining probabilistic knowledge boundary estimation with dynamic decision-making, significantly improving tool efficiency by reducing unnecessary usage.

Authors:Xinyi He, Yihao Liu, Mengyu Zhou, Yeye He, Haoyu Dong, Shi Han, Zejian Yuan, Dongmei Zhang
Title: TableLoRA: Low-rank Adaptation on Table Structure Understanding for Large Language Models
Abstract:
Tabular data are crucial in many fields and their understanding by large language models (LLMs) under high parameter efficiency paradigm is important. However, directly applying parameter-efficient fine-tuning (PEFT) techniques to tabular tasks presents significant challenges, particularly in terms of better table serialization and the representation of two-dimensional structured information within a one-dimensional sequence. To address this, we propose TableLoRA, a module designed to improve LLMs' understanding of table structure during PEFT. It incorporates special tokens for serializing tables with special token encoder and uses 2D LoRA to encode low-rank information on cell positions. Experiments on four tabular-related datasets demonstrate that TableLoRA consistently outperforms vanilla LoRA and surpasses various table encoding methods tested in control experiments. These findings reveal that TableLoRA, as a table-specific LoRA, enhances the ability of LLMs to process tabular data effectively, especially in low-parameter settings, demonstrating its potential as a robust solution for handling table-related tasks.
中文: TableLoRA通过引入特殊标记和二维位置编码,有效提升了大型语言模型在参数高效微调中对表格结构的理解能力,在多项实验中表现优于传统方法。
English: TableLoRA enhances LLMs' understanding of tabular data by incorporating specialized tokens and 2D encoding, outperforming existing methods in parameter-efficient fine-tuning across multiple datasets.

Authors:Hang Zheng, Hongshen Xu, Yuncong Liu, Lu Chen, Pascale Fung, Kai Yu
Title: Enhancing LLM Reliability via Explicit Knowledge Boundary Modeling
Abstract:
Large language models (LLMs) are prone to hallucination stemming from misaligned self-awareness, particularly when processing queries exceeding their knowledge boundaries. While existing mitigation strategies employ uncertainty estimation or query rejection mechanisms, they suffer from computational efficiency and sacrificed helpfulness. To address these issues, we propose the Explicit Knowledge Boundary Modeling (EKBM) framework, integrating fast and slow reasoning systems to harmonize reliability and usability. The framework first employs a fast-thinking model to generate confidence-labeled responses, enabling immediate utilization of high-confidence outputs, whereas uncertain predictions trigger a slow refinement model for accuracy improvement. To align model behavior with our proposed object, we propose a hybrid training pipeline, enhancing self-awareness without degrading task performance. Evaluations on dialogue state tracking tasks demonstrate that EKBM achieves superior model reliability over uncertainty-based baselines. Further analysis reveals that refinement substantially boosts accuracy while maintaining low computational overhead. The framework establishes a scalable paradigm for deploying reliable LLMs in error-sensitive applications, effectively balancing accuracy and practical utility.
中文摘要:提出的显式知识边界建模(EKBM)框架通过整合快慢推理系统,在保持计算效率的同时提升大语言模型的自我认知能力,有效减少幻觉并增强可靠性。
English Summary: The proposed Explicit Knowledge Boundary Modeling (EKBM) framework combines fast and slow reasoning systems to enhance large language model reliability by improving self-awareness and reducing hallucinations while maintaining computational efficiency.

Authors:Hang Zheng, Hongshen Xu, Yuncong Liu, Lu Chen, Pascale Fung, Kai Yu
Title: Enhancing LLM Reliability via Explicit Knowledge Boundary Modeling
Abstract:
Large language models (LLMs) are prone to hallucination stemming from misaligned self-awareness, particularly when processing queries exceeding their knowledge boundaries. While existing mitigation strategies employ uncertainty estimation or query rejection mechanisms, they suffer from computational efficiency and sacrificed helpfulness. To address these issues, we propose the Explicit Knowledge Boundary Modeling (EKBM) framework, integrating fast and slow reasoning systems to harmonize reliability and usability. The framework first employs a fast-thinking model to generate confidence-labeled responses, enabling immediate utilization of high-confidence outputs, whereas uncertain predictions trigger a slow refinement model for accuracy improvement. To align model behavior with our proposed object, we propose a hybrid training pipeline, enhancing self-awareness without degrading task performance. Evaluations on dialogue state tracking tasks demonstrate that EKBM achieves superior model reliability over uncertainty-based baselines. Further analysis reveals that refinement substantially boosts accuracy while maintaining low computational overhead. The framework establishes a scalable paradigm for deploying reliable LLMs in error-sensitive applications, effectively balancing accuracy and practical utility.
中文摘要:提出的显式知识边界建模(EKBM)框架通过整合快慢推理系统,在保持计算效率的同时提升大语言模型的自我认知能力,有效减少幻觉并增强可靠性。
English Summary: The proposed Explicit Knowledge Boundary Modeling (EKBM) framework combines fast and slow reasoning systems to enhance large language model reliability by improving self-awareness and reducing hallucinations while maintaining computational efficiency.

Authors:Yusheng Zhao, Junyu Luo, Xiao Luo, Jinsheng Huang, Jingyang Yuan, Zhiping Xiao, Ming Zhang
Title: Attention Bootstrapping for Multi-Modal Test-Time Adaptation
Abstract:
Test-time adaptation aims to adapt a well-trained model to potential distribution shifts at test time using only unlabeled test data, without access to the original training data. While previous efforts mainly focus on a single modality, test-time distribution shift in the multi-modal setting is more complex and calls for new solutions. This paper tackles the problem of multi-modal test-time adaptation by proposing a novel method named Attention Bootstrapping with Principal Entropy Minimization (ABPEM). We observe that test-time distribution shift causes misalignment across modalities, leading to a large gap between intra-modality discrepancies (measured by self-attention) and inter-modality discrepancies (measured by cross-attention). We name this the attention gap. This attention gap widens with more severe distribution shifts, hindering effective modality fusion. To mitigate this attention gap and encourage better modality fusion, we propose attention bootstrapping that promotes cross-attention with the guidance of self-attention. Moreover, to reduce the gradient noise in the commonly-used entropy minimization, we adopt principal entropy minimization, a refinement of entropy minimization that reduces gradient noise by focusing on the principal parts of entropy, excluding less reliable gradient information. Extensive experiments on the benchmarks validate the effectiveness of the proposed ABPEM in comparison with competing baselines.
中文: 本文提出ABPEM方法,通过注意力引导和主熵最小化解决多模态测试时适应中的分布偏移问题,实验证明其优于现有基线。
English: This paper introduces ABPEM, a novel method for multi-modal test-time adaptation that addresses distribution shifts by reducing the attention gap through attention bootstrapping and principal entropy minimization, validated by extensive experiments.

Authors:Yuxin Wang, Botian Jiang, Yiran Guo, Quan Gan, David Wipf, Xuanjing Huang, Xipeng Qiu
Title: Prior-Fitted Networks Scale to Larger Datasets When Treated as Weak Learners
Abstract:
Prior-Fitted Networks (PFNs) have recently been proposed to efficiently perform tabular classification tasks. Although they achieve good performance on small datasets, they encounter limitations with larger datasets. These limitations include significant memory consumption and increased computational complexity, primarily due to the impracticality of incorporating all training samples as inputs within these networks. To address these challenges, we investigate the fitting assumption for PFNs and input samples. Building on this understanding, we propose \textit{BoostPFN} designed to enhance the performance of these networks, especially for large-scale datasets. We also theoretically validate the convergence of BoostPFN and our empirical results demonstrate that the BoostPFN method can outperform standard PFNs with the same size of training samples in large datasets and achieve a significant acceleration in training times compared to other established baselines in the field, including widely-used Gradient Boosting Decision Trees (GBDTs), deep learning methods and AutoML systems. High performance is maintained for up to 50x of the pre-training size of PFNs, substantially extending the limit of training samples. Through this work, we address the challenges of efficiently handling large datasets via PFN-based models, paving the way for faster and more effective tabular data classification training and prediction process. Code is available at Github.
中文: 提出的BoostPFN方法解决了先验拟合网络在处理大规模数据集时的内存和计算限制,相比标准PFN及其他基线方法,在保持高性能的同时实现了更快的训练速度,并能有效处理更多训练样本。
English: The proposed BoostPFN method overcomes the memory and computational limitations of Prior-Fitted Networks (PFNs) on large datasets, achieving superior performance and faster training times compared to standard PFNs and other baselines while maintaining high performance with significantly more training samples.

Authors:Siddhant Arora, Zhiyun Lu, Chung-Cheng Chiu, Ruoming Pang, Shinji Watanabe
Title: Talking Turns: Benchmarking Audio Foundation Models on Turn-Taking Dynamics
Abstract:
The recent wave of audio foundation models (FMs) could provide new capabilities for conversational modeling. However, there have been limited efforts to evaluate these audio FMs comprehensively on their ability to have natural and interactive conversations. To engage in meaningful conversation with the end user, we would want the FMs to additionally perform a fluent succession of turns without too much overlapping speech or long stretches of silence. Inspired by this, we ask whether the recently proposed audio FMs can understand, predict, and perform turn-taking events? To answer this, we propose a novel evaluation protocol that can assess spoken dialog system's turn-taking capabilities using a supervised model as a judge that has been trained to predict turn-taking events in human-human conversations. Using this protocol, we present the first comprehensive user study that evaluates existing spoken dialogue systems on their ability to perform turn-taking events and reveal many interesting insights, such as they sometimes do not understand when to speak up, can interrupt too aggressively and rarely backchannel. We further evaluate multiple open-source and proprietary audio FMs accessible through APIs on carefully curated test benchmarks from Switchboard to measure their ability to understand and predict turn-taking events and identify significant room for improvement. We will open source our evaluation platform to promote the development of advanced conversational AI systems.
中文摘要:本研究提出了一种新颖的评估方法,用于检测音频基础模型在对话中的轮换能力,发现这些模型在管理语音重叠和沉默方面存在明显不足。
English Summary: This study introduces a novel evaluation protocol to assess audio foundation models' turn-taking capabilities in conversations, revealing significant shortcomings in their ability to manage speech overlaps and silences effectively.

Authors:Zirui Chen, Xing Hu, Puhua Sun, Xin Xia, Xiaohu Yang
Title: Generating Mitigations for Downstream Projects to Neutralize Upstream Library Vulnerability
Abstract:
Third-party libraries are essential in software development as they prevent the need for developers to recreate existing functionalities. However, vulnerabilities within these libraries pose significant risks to dependent projects. Upgrading dependencies to secure versions is not feasible to neutralize vulnerabilities without patches or in projects with specific version requirements. Moreover, repairing the vulnerability proves challenging when the source code of the library is inaccessible. Both the state-of-the-art automatic vulnerability repair and automatic program repair methods fail to address this issue. Therefore, mitigating library vulnerabilities without source code and available patches is crucial for a swift response to potential security attacks. Existing tools encounter challenges concerning generalizability and functional security. In this study, we introduce LUMEN to mitigate library vulnerabilities in impacted projects. Upon disclosing a vulnerability, we retrieve existing workarounds to gather a resembling mitigation strategy. In cases where a resembling strategy is absent, we propose type-based strategies based on the vulnerability reproducing behavior and extract essential information from the vulnerability report to guide mitigation generation. Our assessment of LUMEN spans 121 impacted functions of 40 vulnerabilities, successfully mitigating 70.2% of the functions, which substantially outperforms our baseline in neutralizing vulnerabilities without functionality loss. Additionally, we conduct an ablation study to validate the rationale behind our resembling strategies and type-based strategies.
中文摘要:第三方库漏洞对软件项目构成重大风险,而LUMEN通过采用相似缓解策略和基于类型的策略,无需源代码或补丁即可有效修复这些漏洞,成功解决了70.2%的受影响功能。
English Summary: Third-party library vulnerabilities pose significant risks to software projects, and LUMEN effectively mitigates these vulnerabilities without requiring source code or patches by employing resembling and type-based strategies, successfully addressing 70.2% of impacted functions.

Authors:Yuming Chen, Jiangyan Feng, Haodong Zhang, Lijun Gong, Feng Zhu, Rui Zhao, Qibin Hou, Ming-Ming Cheng, Yibing Song
Title: Re-Aligning Language to Visual Objects with an Agentic Workflow
Abstract:
Language-based object detection (LOD) aims to align visual objects with language expressions. A large amount of paired data is utilized to improve LOD model generalizations. During the training process, recent studies leverage vision-language models (VLMs) to automatically generate human-like expressions for visual objects, facilitating training data scaling up. In this process, we observe that VLM hallucinations bring inaccurate object descriptions (e.g., object name, color, and shape) to deteriorate VL alignment quality. To reduce VLM hallucinations, we propose an agentic workflow controlled by an LLM to re-align language to visual objects via adaptively adjusting image and text prompts. We name this workflow Real-LOD, which includes planning, tool use, and reflection steps. Given an image with detected objects and VLM raw language expressions, Real-LOD reasons its state automatically and arranges action based on our neural symbolic designs (i.e., planning). The action will adaptively adjust the image and text prompts and send them to VLMs for object re-description (i.e., tool use). Then, we use another LLM to analyze these refined expressions for feedback (i.e., reflection). These steps are conducted in a cyclic form to gradually improve language descriptions for re-aligning to visual objects. We construct a dataset that contains a tiny amount of 0.18M images with re-aligned language expression and train a prevalent LOD model to surpass existing LOD methods by around 50% on the standard benchmarks. Our Real-LOD workflow, with automatic VL refinement, reveals a potential to preserve data quality along with scaling up data quantity, which further improves LOD performance from a data-alignment perspective.
中文: 为解决视觉语言模型幻觉导致语言与物体对齐质量下降的问题,我们提出Real-LOD工作流,通过大语言模型控制的规划、工具调用和反思步骤循环优化描述,仅用少量重对齐数据即实现检测性能的显著提升。
English: To address VLM hallucinations that degrade language-object alignment in language-based object detection, we propose Real-LOD, an LLM-controlled workflow that cyclically refines descriptions through planning, tool use, and reflection, enabling superior performance with a small dataset of re-aligned expressions.

Authors:Zhihao Yuan, Yibo Peng, Jinke Ren, Yinghong Liao, Yatong Han, Chun-Mei Feng, Hengshuang Zhao, Guanbin Li, Shuguang Cui, Zhen Li
Title: Empowering Large Language Models with 3D Situation Awareness
Abstract:
Driven by the great success of Large Language Models (LLMs) in the 2D image domain, their applications in 3D scene understanding has emerged as a new trend. A key difference between 3D and 2D is that the situation of an egocentric observer in 3D scenes can change, resulting in different descriptions (e.g., ''left" or ''right"). However, current LLM-based methods overlook the egocentric perspective and simply use datasets from a global viewpoint. To address this issue, we propose a novel approach to automatically generate a situation-aware dataset by leveraging the scanning trajectory during data collection and utilizing Vision-Language Models (VLMs) to produce high-quality captions and question-answer pairs. Furthermore, we introduce a situation grounding module to explicitly predict the position and orientation of observer's viewpoint, thereby enabling LLMs to ground situation description in 3D scenes. We evaluate our approach on several benchmarks, demonstrating that our method effectively enhances the 3D situational awareness of LLMs while significantly expanding existing datasets and reducing manual effort.
中文: 本研究提出了一种创新方法,通过视觉语言模型和情境定位模块自动生成视角感知数据集,有效增强大语言模型在三维场景中的情境感知能力,在多个基准测试中显著提升性能并大幅减少人工标注成本。
English: The study introduces a novel approach to enhance large language models' 3D situational awareness by automatically generating perspective-aware datasets through vision-language models and a situation grounding module, significantly improving performance on benchmarks while reducing manual annotation.

Authors:Yuan Meng, Xiangtong Yao, Kejia Chen, Yansong Wu, Liding Zhang, Zhenshan Bing, Alois Knoll
Title: Pretrained Bayesian Non-parametric Knowledge Prior in Robotic Long-Horizon Reinforcement Learning
Abstract:
Reinforcement learning (RL) methods typically learn new tasks from scratch, often disregarding prior knowledge that could accelerate the learning process. While some methods incorporate previously learned skills, they usually rely on a fixed structure, such as a single Gaussian distribution, to define skill priors. This rigid assumption can restrict the diversity and flexibility of skills, particularly in complex, long-horizon tasks. In this work, we introduce a method that models potential primitive skill motions as having non-parametric properties with an unknown number of underlying features. We utilize a Bayesian non-parametric model, specifically Dirichlet Process Mixtures, enhanced with birth and merge heuristics, to pre-train a skill prior that effectively captures the diverse nature of skills. Additionally, the learned skills are explicitly trackable within the prior space, enhancing interpretability and control. By integrating this flexible skill prior into an RL framework, our approach surpasses existing methods in long-horizon manipulation tasks, enabling more efficient skill transfer and task success in complex environments. Our findings show that a richer, non-parametric representation of skill priors significantly improves both the learning and execution of challenging robotic tasks. All data, code, and videos are available at https://ghiara.github.io/HELIOS/.
中文: 本研究采用带生成合并启发式的狄利克雷过程混合模型构建非参数化技能先验,通过增强技能多样性和可追踪性,在复杂机器人任务中实现了优于现有方法的强化学习性能。
English: This study introduces a non-parametric skill prior using Dirichlet Process Mixtures with birth and merge heuristics, which enhances reinforcement learning by enabling diverse skill representation and explicit tracking, outperforming existing methods in complex robotic tasks.

Authors:Yuan Meng, Xiangtong Yao, Haihui Ye, Yirui Zhou, Shengqiang Zhang, Zhenguo Sun, Xukun Li, Zhenshan Bing, Alois Knoll
Title: Embodied Long Horizon Manipulation with Closed-loop Code Generation and Incremental Few-shot Adaptation
Abstract:
Embodied long-horizon manipulation requires robotic systems to process multimodal inputs-such as vision and natural language-and translate them into executable actions. However, existing learning-based approaches often depend on large, task-specific datasets and struggle to generalize to unseen scenarios. Recent methods have explored using large language models (LLMs) as high-level planners that decompose tasks into subtasks using natural language and guide pretrained low-level controllers. Yet, these approaches assume perfect execution from low-level policies, which is unrealistic in real-world environments with noise or suboptimal behaviors. To overcome this, we fully discard the pretrained low-level policy and instead use the LLM to directly generate executable code plans within a closed-loop framework. Our planner employs chain-of-thought (CoT)-guided few-shot learning with incrementally structured examples to produce robust and generalizable task plans. Complementing this, a reporter evaluates outcomes using RGB-D and delivers structured feedback, enabling recovery from misalignment and replanning under partial observability. This design eliminates per-step inference, reduces computational overhead, and limits error accumulation that was observed in previous methods. Our framework achieves state-of-the-art performance on 30+ diverse seen and unseen long-horizon tasks across LoHoRavens, CALVIN, Franka Kitchen, and cluttered real-world settings.
中文: 我们的框架摒弃了传统底层控制器,转而利用大语言模型在闭环系统中直接生成可执行代码计划,通过思维链引导学习和结构化反馈,在30多项多样化长程任务中实现了最先进的性能。
English: Our framework replaces traditional low-level controllers by using a large language model to directly generate executable code plans within a closed-loop system, achieving state-of-the-art performance on over 30 diverse long-horizon tasks through chain-of-thought-guided learning and structured feedback.

Authors:Jianhang Xiang, Zhipeng Gao, Lingfeng Bao, Xing Hu, Jiayuan Chen, Xin Xia
Title: Automating Comment Generation for Smart Contract from Bytecode
Abstract:
Recently, smart contracts have played a vital role in automatic financial and business transactions. To help end users without programming background to better understand the logic of smart contracts, previous studies have proposed models for automatically translating smart contract source code into their corresponding code summaries. However, in practice, only 13% of smart contracts deployed on the Ethereum blockchain are associated with source code. The practical usage of these existing tools is significantly restricted. Considering that bytecode is always necessary when deploying smart contracts, in this paper, we first introduce the task of automatically generating smart contract code summaries from bytecode. We propose a novel approach, named SmartBT (Smart contract Bytecode Translator) for automatically translating smart contract bytecode into fine-grained natural language description directly. Two key challenges are posed for this task: structural code logic hidden in bytecode and the huge semantic gap between bytecode and natural language descriptions. To address the first challenge, we transform bytecode into CFG (Control-Flow Graph) to learn code structural and logic details. Regarding the second challenge, we introduce an information retrieval component to fetch similar comments for filling the semantic gap. Then the structural input and semantic input are used to build an attentional sequence-to-sequence neural network model. The copy mechanism is employed to copy rare words directly from similar comments and the coverage mechanism is employed to eliminate repetitive outputs. The automatic evaluation results show that SmartBT outperforms a set of baselines by a large margin, and the human evaluation results show the effectiveness and potential of SmartBT in producing meaningful and accurate comments for smart contract code from bytecode directly.
中文摘要:本文提出SmartBT方法,通过将字节码转化为控制流图并引入信息检索组件来填补语义鸿沟,直接从智能合约字节码生成自然语言描述,评估显示其显著优于现有基线方法。
English Summary: This paper introduces SmartBT, a novel approach that generates smart contract summaries directly from bytecode by transforming it into control-flow graphs and using information retrieval to bridge semantic gaps, outperforming existing methods in evaluations.

Authors:Dingning Liu, Cheng Wang, Peng Gao, Renrui Zhang, Xinzhu Ma, Yuan Meng, Zhihui Wang
Title: 3DAxisPrompt: Promoting the 3D Grounding and Reasoning in GPT-4o
Abstract:
Multimodal Large Language Models (MLLMs) exhibit impressive capabilities across a variety of tasks, especially when equipped with carefully designed visual prompts. However, existing studies primarily focus on logical reasoning and visual understanding, while the capability of MLLMs to operate effectively in 3D vision remains an ongoing area of exploration. In this paper, we introduce a novel visual prompting method, called 3DAxisPrompt, to elicit the 3D understanding capabilities of MLLMs in real-world scenes. More specifically, our method leverages the 3D coordinate axis and masks generated from the Segment Anything Model (SAM) to provide explicit geometric priors to MLLMs and then extend their impressive 2D grounding and reasoning ability to real-world 3D scenarios. Besides, we first provide a thorough investigation of the potential visual prompting formats and conclude our findings to reveal the potential and limits of 3D understanding capabilities in GPT-4o, as a representative of MLLMs. Finally, we build evaluation environments with four datasets, i.e., ScanRefer, ScanNet, FMB, and nuScene datasets, covering various 3D tasks. Based on this, we conduct extensive quantitative and qualitative experiments, which demonstrate the effectiveness of the proposed method. Overall, our study reveals that MLLMs, with the help of 3DAxisPrompt, can effectively perceive an object's 3D position in real-world scenarios. Nevertheless, a single prompt engineering approach does not consistently achieve the best outcomes for all 3D tasks. This study highlights the feasibility of leveraging MLLMs for 3D vision grounding/reasoning with prompt engineering techniques.
中文: 本文提出了一种名为3DAxisPrompt的新型视觉提示方法,通过整合几何先验知识增强多模态大语言模型在真实场景中的三维理解能力,但其在不同三维任务中的效果存在差异。
English: This paper introduces 3DAxisPrompt, a novel visual prompting method that enhances Multimodal Large Language Models' 3D understanding in real-world scenarios by integrating geometric priors, though its effectiveness varies across different 3D tasks.

Authors:Yansong Wu, Xiao Chen, Yu Chen, Hamid Sadeghian, Fan Wu, Zhenshan Bing, Sami Haddadin, Alexander König, Alois Knoll
Title: SharedAssembly: A Data Collection Approach via Shared Tele-Assembly
Abstract:
Assembly is a fundamental skill for robots in both modern manufacturing and service robotics. Existing datasets aim to address the data bottleneck in training general-purpose robot models, falling short of capturing contact-rich assembly tasks. To bridge this gap, we introduce SharedAssembly, a novel bilateral teleoperation approach with shared autonomy for scalable assembly execution and data collection. User studies demonstrate that the proposed approach enhances both success rates and efficiency, achieving a 97.0% success rate across various sub-millimeter-level assembly tasks. Notably, novice and intermediate users achieve performance comparable to experts using baseline teleoperation methods, significantly enhancing large-scale data collection.
中文: SharedAssembly是一种具有共享自主性的双边遥操作系统,显著提高了接触密集型机器人装配任务的成功率和效率,使新手用户能够达到专家水平,并促进大规模数据收集。
English: SharedAssembly is a bilateral teleoperation system with shared autonomy that significantly improves success rates and efficiency in contact-rich robotic assembly tasks, enabling novice users to perform at expert levels and facilitating large-scale data collection.

Authors:Xinhang Liu, Yu-Wing Tai, Chi-Keung Tang
Title: Multimodal Generation of Animatable 3D Human Models with AvatarForge
Abstract:
We introduce AvatarForge, a framework for generating animatable 3D human avatars from text or image inputs using AI-driven procedural generation. While diffusion-based methods have made strides in general 3D object generation, they struggle with high-quality, customizable human avatars due to the complexity and diversity of human body shapes, poses, exacerbated by the scarcity of high-quality data. Additionally, animating these avatars remains a significant challenge for existing methods. AvatarForge overcomes these limitations by combining LLM-based commonsense reasoning with off-the-shelf 3D human generators, enabling fine-grained control over body and facial details. Unlike diffusion models which often rely on pre-trained datasets lacking precise control over individual human features, AvatarForge offers a more flexible approach, bringing humans into the iterative design and modeling loop, with its auto-verification system allowing for continuous refinement of the generated avatars, and thus promoting high accuracy and customization. Our evaluations show that AvatarForge outperforms state-of-the-art methods in both text- and image-to-avatar generation, making it a versatile tool for artistic creation and animation.
Chinese: AvatarForge 是一个基于人工智能的框架,能够从文本或图像生成可动画化的3D人体虚拟形象,通过精细控制和迭代优化克服了现有方法的局限,实现了更高的准确性和多功能性。
English: AvatarForge is an AI-driven framework that generates customizable and animatable 3D human avatars from text or images, overcoming limitations of existing methods through fine-grained control and iterative refinement for superior accuracy and versatility.

Authors:Thanh Linh Nguyen, Dinh Thai Hoang, Diep N. Nguyen, Quoc-Viet Pham
Title: Right Reward Right Time for Federated Learning
Abstract:
Critical learning periods (CLPs) in federated learning (FL) refer to early stages during which low-quality contributions (e.g., sparse training data availability) can permanently impair the learning performance of the global model owned by the model owner (i.e., the cloud server). However, strategies to motivate clients with high-quality contributions to join the FL training process and share trained model updates during CLPs remain underexplored. Additionally, existing incentive mechanisms in FL treat all training periods equally, which consequently fails to motivate clients to participate early. Compounding this challenge is the cloud's limited knowledge of client training capabilities due to privacy regulations, leading to information asymmetry. Therefore, in this article, we propose a time-aware incentive mechanism, called Right Reward Right Time (R3T), to encourage client involvement, especially during CLPs, to maximize the utility of the cloud in FL. Specifically, the cloud utility function captures the trade-off between the achieved model performance and payments allocated for clients' contributions, while accounting for clients' time and system capabilities, efforts, joining time, and rewards. Then, we analytically derive the optimal contract for the cloud and devise a CLP-aware mechanism to incentivize early participation and efforts while maximizing cloud utility, even under information asymmetry. By providing the right reward at the right time, our approach can attract the highest-quality contributions during CLPs. Simulation and proof-of-concept studies show that R3T increases cloud utility and is more economically effective than benchmarks. Notably, our proof-of-concept results show up to a 47.6% reduction in the total number of clients and up to a 300% improvement in convergence time while reaching competitive test accuracies compared with incentive mechanism benchmarks.
中文摘要:联邦学习存在关键学习期,早期低质量贡献会永久损害模型性能,而现有激励机制未能优先考虑这些阶段,因此提出R3T机制通过适时奖励吸引高质量早期参与,从而最大化云端效用。
English Summary: Federated learning faces critical learning periods where early low-quality contributions can permanently harm model performance, yet existing incentives fail to prioritize these phases, prompting the proposed R3T mechanism that offers timely rewards to maximize cloud utility by attracting high-quality early participation.

Authors:Shiu-hong Kao, Yu-Wing Tai, Chi-Keung Tang
Title: Think Before You Segment: High-Quality Reasoning Segmentation with GPT Chain of Thoughts
Abstract:
Reasoning segmentation is a challenging vision-language task that aims to output the segmentation mask with respect to a complex, implicit, and even non-visual query text. Previous works incorporated multimodal Large Language Models (MLLMs) with segmentation models to approach the difficult problem. However, their segmentation quality often falls short in complex cases, particularly when dealing with out-of-domain objects with intricate structures, blurry boundaries, occlusions, or high similarity with surroundings. In this paper, we introduce ThinkFirst, a training-free reasoning segmentation framework that leverages GPT's chain of thought to address these challenging cases. Our approach allows GPT-4o or other powerful MLLMs to generate a detailed, chain-of-thought description of an image. This summarized description is then passed to a language-instructed segmentation assistant to aid the segmentation process. Our framework allows users to easily interact with the segmentation agent using multimodal inputs, such as easy text and image scribbles, for successive refinement or communication. We evaluate the performance of ThinkFirst on diverse objects. Extensive experiments show that, this zero-shot-CoT approach significantly improves the vanilla reasoning segmentation agent, both qualitatively and quantitatively, while being less sensitive or critical to user-supplied prompts after Thinking First.
中文:ThinkFirst是一种无需训练的分割推理框架,通过利用GPT的思维链生成详细图像描述,再由分割助手处理,有效提升复杂场景下的分割精度。
English: ThinkFirst is a training-free reasoning segmentation framework that enhances segmentation accuracy in complex scenarios by using GPT's chain of thought to generate detailed image descriptions, which are then processed by a segmentation assistant.

Authors:Yubo Zhao, Qi Wu, Yifan Wang, Yu-Wing Tai, Chi-Keung Tang
Title: Navigating Motion Agents in Dynamic and Cluttered Environments through LLM Reasoning
Abstract:
This paper advances motion agents empowered by large language models (LLMs) toward autonomous navigation in dynamic and cluttered environments, significantly surpassing first and recent seminal but limited studies on LLM's spatial reasoning, where movements are restricted in four directions in simple, static environments in the presence of only single agents much less multiple agents. Specifically, we investigate LLMs as spatial reasoners to overcome these limitations by uniformly encoding environments (e.g., real indoor floorplans), agents which can be dynamic obstacles and their paths as discrete tokens akin to language tokens. Our training-free framework supports multi-agent coordination, closed-loop replanning, and dynamic obstacle avoidance without retraining or fine-tuning. We show that LLMs can generalize across agents, tasks, and environments using only text-based interactions, opening new possibilities for semantically grounded, interactive navigation in both simulation and embodied systems.
中文: 本研究利用大型语言模型提升运动智能体在复杂动态环境中的自主导航能力,通过将环境和智能体编码为类语言标记,无需重新训练即可实现多智能体协调和动态避障。
English: This study enhances motion agents with large language models for autonomous navigation in complex, dynamic settings, overcoming prior limitations by encoding environments and agents as language-like tokens to enable multi-agent coordination and obstacle avoidance without retraining.

Authors:Zixuan Wang, Chi-Keung Tang, Yu-Wing Tai
Title: ReelWave: Multi-Agentic Movie Sound Generation through Multimodal LLM Conversation
Abstract:
Current audio generation conditioned by text or video focuses on aligning audio with text/video modalities. Despite excellent alignment results, these multimodal frameworks still cannot be directly applied to compelling movie storytelling involving multiple scenes, where "on-screen" sounds require temporally-aligned audio generation, while "off-screen" sounds contribute to appropriate environment sounds accompanied by background music when applicable. Inspired by professional movie production, this paper proposes a multi-agentic framework for audio generation supervised by an autonomous Sound Director agent, engaging multi-turn conversations with other agents for on-screen and off-screen sound generation through multimodal LLM. To address on-screen sound generation, after detecting any talking humans in videos, we capture semantically and temporally synchronized sound by training a prediction model that forecasts interpretable, time-varying audio control signals: loudness, pitch, and timbre, which are used by a Foley Artist agent to condition a cross-attention module in the sound generation. The Foley Artist works cooperatively with the Composer and Voice Actor agents, and together they autonomously generate off-screen sound to complement the overall production. Each agent takes on specific roles similar to those of a movie production team. To temporally ground audio language models, in ReelWave, text/video conditions are decomposed into atomic, specific sound generation instructions synchronized with visuals when applicable. Consequently, our framework can generate rich and relevant audio content conditioned on video clips extracted from movies.
中文: 本文提出了一种由声音导演监督的多智能体框架,通过多模态大语言模型和专门角色代理,生成与视频同步的屏幕内声音及补充性的屏幕外环境音效,以增强电影叙事的音频表现力。
English: This paper introduces a multi-agent framework supervised by a Sound Director for generating synchronized on-screen and off-screen audio in movies, utilizing multimodal LLM and specialized agents to produce temporally aligned sounds and complementary background audio.

Authors:Fei Tang, Yongliang Shen, Hang Zhang, Siqi Chen, Guiyang Hou, Wenqi Zhang, Wenqiao Zhang, Kaitao Song, Weiming Lu, Yueting Zhuang
Title: Think Twice, Click Once: Enhancing GUI Grounding via Fast and Slow Systems
Abstract:
Humans can flexibly switch between different modes of thinking based on task complexity: from rapid intuitive judgments to in-depth analytical understanding. However, current Graphical User Interface (GUI) grounding systems which locate interface elements based on natural language instructions rely solely on immediate prediction without reasoning, struggling to understand complex interface layouts with nested structures and hierarchical relationships, limiting their effectiveness on complex interfaces. Inspired by human dual-system cognition, we present Focus, a novel GUI grounding framework that combines fast prediction with systematic analysis. The framework dynamically switches between rapid and deliberate processing through an adaptive system switching based on task complexity, optimizing both efficiency and accuracy. Focus decomposes grounding into progressive stages: interface summarization, visual focused analysis, and precise coordinate prediction. This structured decomposition enables systematic understanding of both interface layouts and visual relationships. Extensive experiments show that Focus achieves state-of-the-art performance using only 300K of the training data with a 2B parameter model compared to existing approaches. Focus demonstrates superior performance particularly in complex GUI scenarios, achieving 77.4% average accuracy on ScreenSpot and 13.3% on the more challenging ScreenSpot-Pro. Our analysis reveals the effectiveness of this dual-system approach while demonstrating its potential for improving complex GUI interaction scenarios.
中文:受人类双系统认知启发,Focus是一种新型GUI定位框架,结合快速预测与系统分析,根据任务复杂度动态切换处理模式,在复杂界面场景中实现了最优性能。
English: Inspired by human dual-system cognition, Focus is a novel GUI grounding framework that combines rapid prediction with systematic analysis, dynamically switching between processing modes based on task complexity to achieve state-of-the-art performance in complex interface scenarios.

Authors:Justin Chih-Yao Chen, Sukwon Yun, Elias Stengel-Eskin, Tianlong Chen, Mohit Bansal
Title: Symbolic Mixture-of-Experts: Adaptive Skill-based Routing for Heterogeneous Reasoning
Abstract:
Combining existing pre-trained expert LLMs is a promising avenue for scalably tackling large-scale and diverse tasks. However, selecting task-level experts is often too coarse-grained, as heterogeneous tasks may require different expertise per instance. To enable adaptive instance-level mixing of pre-trained LLM experts, we propose Symbolic-MoE, a symbolic, text-based, and gradient-free Mixture-of-Experts framework. Symbolic-MoE takes a fine-grained approach to selection by emphasizing skills, e.g., algebra in math or molecular biology in biomedical reasoning. We propose a skill-based recruiting strategy that dynamically selects the most relevant set of expert LLMs for diverse reasoning tasks based on their strengths. Each selected expert then generates its own reasoning, resulting in k outputs from k experts, which are then synthesized into a final high-quality response by an aggregator chosen based on its ability to integrate diverse reasoning outputs. We show that Symbolic-MoE's instance-level expert selection improves performance by a large margin but -- when implemented naively -- can introduce a high computational overhead due to the need for constant model loading and offloading. To address this, we implement a batch strategy that groups instances based on their assigned experts, loading each model only once. This allows us to integrate 16 expert models on 1 GPU with a time cost comparable to or better than prior multi-agent baselines using 4 GPUs. Through extensive evaluations on diverse benchmarks (MMLU-Pro, GPQA, AIME, and MedMCQA), we show that Symbolic-MoE beats strong LLMs like GPT4o-mini, as well as multi-agent approaches, with an absolute avg. gain of 8.15% over the best multi-agent baseline. Moreover, Symbolic-MoE generalizes well to unseen tasks and removes the need for expensive multi-round discussions, outperforming discussion baselines with less computation.
Chinese: Symbolic-MoE是一种基于符号的、无需梯度的专家混合框架,通过实例级技能化筛选和整合预训练大语言模型专家,在多样化推理任务中实现了卓越性能,同时保持了高效的计算资源利用率。
English: Symbolic-MoE is a symbolic, gradient-free Mixture-of-Experts framework that dynamically selects and combines pre-trained LLM experts at the instance level based on specific skills, achieving superior performance on diverse reasoning tasks with efficient computational resource usage.

Authors:Haoyuan Ma, Yongliang Shen, Hengwei Liu, Wenqi Zhang, Haolei Xu, Qiuying Peng, Jun Wang, Weiming Lu
Title: DB-Explore: Automated Database Exploration and Instruction Synthesis for Text-to-SQL
Abstract:
Recent text-to-SQL systems powered by large language models (LLMs) have demonstrated remarkable performance in translating natural language queries into SQL. However, these systems often struggle with complex database structures and domain-specific queries, as they primarily focus on enhancing logical reasoning and SQL syntax while overlooking the critical need for comprehensive database understanding. To address this limitation, we propose DB-Explore, a novel framework that systematically aligns LLMs with database knowledge through automated exploration and instruction synthesis. DB-Explore constructs database graphs to capture complex relational schemas, leverages GPT-4 to systematically mine structural patterns and semantic knowledge, and synthesizes instructions to distill this knowledge for efficient fine-tuning of LLMs. Our framework enables comprehensive database understanding through diverse sampling strategies and automated instruction generation, bridging the gap between database structures and language models. Experiments conducted on the SPIDER and BIRD benchmarks validate the effectiveness of DB-Explore, achieving an execution accuracy of 67.0% on BIRD and 87.8% on SPIDER. Notably, our open-source implementation based on Qwen2.5-Coder-7B achieves state-of-the-art results at minimal computational cost, outperforming several GPT-4-driven Text-to-SQL systems.
中文摘要:DB-Explore框架通过自动化探索和指令合成,将大型语言模型与数据库知识系统对齐,在SPIDER和BIRD基准测试中以最小计算成本实现了最先进的性能。
English Summary: The DB-Explore framework enhances text-to-SQL systems by systematically aligning large language models with comprehensive database knowledge through automated exploration and instruction synthesis, achieving state-of-the-art results on SPIDER and BIRD benchmarks with minimal computational cost.

Authors:Nguyen Quang Hieu, Dinh Thai Hoang, Diep N. Nguyen, Mohammad Abu Alsheikh, Carlos C. N. Kuhn, Yibeltal F. Alem, Ibrahim Radwan
Title: End-to-End Human Pose Reconstruction from Wearable Sensors for 6G Extended Reality Systems
Abstract:
Full 3D human pose reconstruction is a critical enabler for extended reality (XR) applications in future sixth generation (6G) networks, supporting immersive interactions in gaming, virtual meetings, and remote collaboration. However, achieving accurate pose reconstruction over wireless networks remains challenging due to channel impairments, bit errors, and quantization effects. Existing approaches often assume error-free transmission in indoor settings, limiting their applicability to real-world scenarios. To address these challenges, we propose a novel deep learning-based framework for human pose reconstruction over orthogonal frequency-division multiplexing (OFDM) systems. The framework introduces a two-stage deep learning receiver: the first stage jointly estimates the wireless channel and decodes OFDM symbols, and the second stage maps the received sensor signals to full 3D body poses. Simulation results demonstrate that the proposed neural receiver reduces bit error rate (BER), thus gaining a 5 dB gap at $10^{-4}$ BER, compared to the baseline method that employs separate signal detection steps, i.e., least squares channel estimation and linear minimum mean square error equalization. Additionally, our empirical findings show that 8-bit quantization is sufficient for accurate pose reconstruction, achieving a mean squared error of $5\times10^{-4}$ for reconstructed sensor signals, and reducing joint angular error by 37\% for the reconstructed human poses compared to the baseline.
中文: 针对OFDM系统提出的深度学习框架采用两阶段接收器,联合处理信道估计和姿态映射,相比传统方法显著降低了误码率并提升了三维人体姿态重建精度。
English: The proposed deep learning framework for 3D human pose reconstruction over OFDM systems employs a two-stage receiver that jointly handles channel estimation and pose mapping, significantly reducing bit error rates and improving reconstruction accuracy compared to conventional methods.

Authors:Marco Giberna, Muhammad Shaheer, Hriday Bavle, Jose Andres Millan-Romera, Jose Luis Sanchez-Lopez, Holger Voos
Title: Constraint-Based Modeling of Dynamic Entities in 3D Scene Graphs for Robust SLAM
Abstract:
Autonomous robots depend crucially on their ability to perceive and process information from dynamic, ever-changing environments. Traditional simultaneous localization and mapping (SLAM) approaches struggle to maintain consistent scene representations because of numerous moving objects, often treating dynamic elements as outliers rather than explicitly modeling them in the scene representation. In this paper, we present a novel hierarchical 3D scene graph-based SLAM framework that addresses the challenge of modeling and estimating the pose of dynamic objects and agents. We use fiducial markers to detect dynamic entities and to extract their attributes while improving keyframe selection and implementing new capabilities for dynamic entity mapping. We maintain a hierarchical representation where dynamic objects are registered in the SLAM graph and are constrained with robot keyframes and the floor level of the building with our novel entity-keyframe constraints and intra-entity constraints. By combining semantic and geometric constraints between dynamic entities and the environment, our system jointly optimizes the SLAM graph to estimate the pose of the robot and various dynamic agents and objects while maintaining an accurate map. Experimental evaluation demonstrates that our approach achieves a 27.57% reduction in pose estimation error compared to traditional methods and enables higher-level reasoning about scene dynamics.
Chinese: 本文提出了一种基于分层三维场景图的SLAM框架,通过使用基准标记和新颖约束条件显式建模动态对象,相比传统方法将位姿估计误差降低了27.57%。
English: This paper introduces a hierarchical 3D scene graph SLAM framework that explicitly models dynamic objects using fiducial markers and novel constraints, achieving a 27.57% reduction in pose estimation error compared to traditional methods.

Authors:Xuan Zhang, Yongliang Shen, Zhe Zheng, Linjuan Wu, Wenqi Zhang, Yuchen Yan, Qiuying Peng, Jun Wang, Weiming Lu
Title: AskToAct: Enhancing LLMs Tool Use via Self-Correcting Clarification
Abstract:
Large language models (LLMs) have demonstrated remarkable capabilities in tool learning. In real-world scenarios, user queries are often ambiguous and incomplete, requiring effective clarification. However, existing interactive clarification approaches face two critical limitations: reliance on manually constructed datasets, which inherently constrains training data scale and diversity, and lack of error correction mechanisms during multi-turn clarification, leading to error accumulation that compromises both accuracy and efficiency. We present AskToAct, which addresses these challenges by exploiting the structural mapping between queries and their tool invocation solutions. Our key insight is that tool parameters naturally represent explicit user intents. By systematically removing key parameters from queries while retaining them as ground truth, we enable automated construction of high-quality training data. We further enhance model robustness through error-correction pairs and selective masking, enabling dynamic error detection during clarification interactions. Comprehensive experiments demonstrate that AskToAct significantly outperforms existing approaches, achieving above 57% accuracy in recovering critical unspecified intents and enhancing clarification efficiency by an average of 10.46% while maintaining high accuracy in tool invocation. Our framework exhibits robust performance across different model architectures and successfully generalizes to entirely unseen APIs without additional training, achieving performance comparable to GPT-4o with substantially fewer computational resources.
中文摘要:AskToAct是一种创新框架,通过参数移除自动构建训练数据并集成纠错机制,显著提升大语言模型在工具学习中的性能,能够在保持高精度的同时有效澄清模糊用户查询,且计算资源需求大幅降低。
English Summary: AskToAct is a novel framework that enhances tool learning in large language models by automatically generating training data through parameter removal and incorporating error-correction mechanisms, achieving superior accuracy and efficiency in clarifying ambiguous user queries while requiring fewer computational resources.

Authors:Cheng Tan, Yijie Zhang, Zhangyang Gao, Yufei Huang, Haitao Lin, Lirong Wu, Fandi Wu, Mathieu Blanchette, Stan. Z. Li
Title: dyAb: Flow Matching for Flexible Antibody Design with AlphaFold-driven Pre-binding Antigen
Abstract:
The development of therapeutic antibodies heavily relies on accurate predictions of how antigens will interact with antibodies. Existing computational methods in antibody design often overlook crucial conformational changes that antigens undergo during the binding process, significantly impacting the reliability of the resulting antibodies. To bridge this gap, we introduce dyAb, a flexible framework that incorporates AlphaFold2-driven predictions to model pre-binding antigen structures and specifically addresses the dynamic nature of antigen conformation changes. Our dyAb model leverages a unique combination of coarse-grained interface alignment and fine-grained flow matching techniques to simulate the interaction dynamics and structural evolution of the antigen-antibody complex, providing a realistic representation of the binding process. Extensive experiments show that dyAb significantly outperforms existing models in antibody design involving changing antigen conformations. These results highlight dyAb's potential to streamline the design process for therapeutic antibodies, promising more efficient development cycles and improved outcomes in clinical applications.
中文:dyAb框架通过整合AlphaFold2预测来模拟动态抗原构象变化,显著提升了治疗性抗体设计的准确性,在实验中超越现有模型并有望优化临床开发流程。
English: The dyAb framework enhances therapeutic antibody design by integrating AlphaFold2 predictions to model dynamic antigen conformational changes, outperforming existing methods and improving clinical development efficiency.

Authors:Ali Tourani, Saad Ejaz, Hriday Bavle, David Morilla-Cabello, Jose Luis Sanchez-Lopez, Holger Voos
Title: vS-Graphs: Integrating Visual SLAM and Situational Graphs through Multi-level Scene Understanding
Abstract:
Current Visual Simultaneous Localization and Mapping (VSLAM) systems often struggle to create maps that are both semantically rich and easily interpretable. While incorporating semantic scene knowledge aids in building richer maps with contextual associations among mapped objects, representing them in structured formats like scene graphs has not been widely addressed, encountering complex map comprehension and limited scalability. This paper introduces visual S-Graphs (vS-Graphs), a novel real-time VSLAM framework that integrates vision-based scene understanding with map reconstruction and comprehensible graph-based representation. The framework infers structural elements (i.e., rooms and corridors) from detected building components (i.e., walls and ground surfaces) and incorporates them into optimizable 3D scene graphs. This solution enhances the reconstructed map's semantic richness, comprehensibility, and localization accuracy. Extensive experiments on standard benchmarks and real-world datasets demonstrate that vS-Graphs outperforms state-of-the-art VSLAM methods, reducing trajectory error by an average of 3.38% and up to 9.58% on real-world data. Furthermore, the proposed framework achieves environment-driven semantic entity detection accuracy comparable to precise LiDAR-based frameworks using only visual features. A web page containing more media and evaluation outcomes is available on https://snt-arg.github.io/vsgraphs-results/.
中文: 本文提出vS-Graphs实时视觉SLAM框架,通过结合场景理解与三维场景图构建,提升了地图语义丰富度、可理解性和定位精度,在轨迹误差减少方面优于现有方法。
English: This paper introduces vS-Graphs, a real-time VSLAM framework that integrates scene understanding with 3D scene graph construction to enhance semantic richness, map comprehensibility, and localization accuracy, outperforming existing methods in trajectory error reduction.

Authors:Mingcong Lei, Ge Wang, Yiming Zhao, Zhixin Mai, Qing Zhao, Yao Guo, Zhen Li, Shuguang Cui, Yatong Han, Jinke Ren
Title: CLEA: Closed-Loop Embodied Agent for Enhancing Task Execution in Dynamic Environments
Abstract:
Large Language Models (LLMs) exhibit remarkable capabilities in the hierarchical decomposition of complex tasks through semantic reasoning. However, their application in embodied systems faces challenges in ensuring reliable execution of subtask sequences and achieving one-shot success in long-term task completion. To address these limitations in dynamic environments, we propose Closed-Loop Embodied Agent (CLEA) -- a novel architecture incorporating four specialized open-source LLMs with functional decoupling for closed-loop task management. The framework features two core innovations: (1) Interactive task planner that dynamically generates executable subtasks based on the environmental memory, and (2) Multimodal execution critic employing an evaluation framework to conduct a probabilistic assessment of action feasibility, triggering hierarchical re-planning mechanisms when environmental perturbations exceed preset thresholds. To validate CLEA's effectiveness, we conduct experiments in a real environment with manipulable objects, using two heterogeneous robots for object search, manipulation, and search-manipulation integration tasks. Across 12 task trials, CLEA outperforms the baseline model, achieving a 67.3% improvement in success rate and a 52.8% increase in task completion rate. These results demonstrate that CLEA significantly enhances the robustness of task planning and execution in dynamic environments.
中文:提出的闭环具身智能体(CLEA)框架集成四个功能解耦的专用大语言模型,通过动态任务规划和多模态执行评估机制,在动态环境中显著提升了任务成功率与完成率。
English: The proposed Closed-Loop Embodied Agent (CLEA) framework integrates four specialized LLMs with functional decoupling to enhance task planning and execution robustness in dynamic environments, achieving significant improvements in success and completion rates over baseline models.

Authors:Liangbo Ning, Ziran Liang, Zhuohang Jiang, Haohao Qu, Yujuan Ding, Wenqi Fan, Xiao-yong Wei, Shanru Lin, Hui Liu, Philip S. Yu, Qing Li
Title: A Survey of WebAgents: Towards Next-Generation AI Agents for Web Automation with Large Foundation Models
Abstract:
With the advancement of web techniques, they have significantly revolutionized various aspects of people's lives. Despite the importance of the web, many tasks performed on it are repetitive and time-consuming, negatively impacting overall quality of life. To efficiently handle these tedious daily tasks, one of the most promising approaches is to advance autonomous agents based on Artificial Intelligence (AI) techniques, referred to as AI Agents, as they can operate continuously without fatigue or performance degradation. In the context of the web, leveraging AI Agents -- termed WebAgents -- to automatically assist people in handling tedious daily tasks can dramatically enhance productivity and efficiency. Recently, Large Foundation Models (LFMs) containing billions of parameters have exhibited human-like language understanding and reasoning capabilities, showing proficiency in performing various complex tasks. This naturally raises the question: `Can LFMs be utilized to develop powerful AI Agents that automatically handle web tasks, providing significant convenience to users?' To fully explore the potential of LFMs, extensive research has emerged on WebAgents designed to complete daily web tasks according to user instructions, significantly enhancing the convenience of daily human life. In this survey, we comprehensively review existing research studies on WebAgents across three key aspects: architectures, training, and trustworthiness. Additionally, several promising directions for future research are explored to provide deeper insights.
中文: 随着网络技术的发展,基于人工智能的WebAgent应运而生,利用大型基础模型自动处理繁琐的在线任务,显著提升效率与便利性;本综述从架构、训练和可信度三方面回顾现有研究,并探讨了未来研究方向。
English: Web advancements have spurred the development of AI-driven WebAgents that automate tedious online tasks, leveraging large foundation models to enhance efficiency and convenience, with this survey reviewing their architectures, training, and trustworthiness while suggesting future research directions.

Authors:Shibo Jie, Yehui Tang, Kai Han, Yitong Li, Duyu Tang, Zhi-Hong Deng, Yunhe Wang
Title: Mixture of Lookup Experts
Abstract:
Mixture-of-Experts (MoE) activates only a subset of experts during inference, allowing the model to maintain low inference FLOPs and latency even as the parameter count scales up. However, since MoE dynamically selects the experts, all the experts need to be loaded into VRAM. Their large parameter size still limits deployment, and offloading, which load experts into VRAM only when needed, significantly increase inference latency. To address this, we propose Mixture of Lookup Experts (MoLE), a new MoE architecture that is efficient in both communication and VRAM usage. In MoLE, the experts are Feed-Forward Networks (FFNs) during training, taking the output of the embedding layer as input. Before inference, these experts can be re-parameterized as lookup tables (LUTs) that retrieves expert outputs based on input ids, and offloaded to storage devices. Therefore, we do not need to perform expert computations during inference. Instead, we directly retrieve the expert's computation results based on input ids and load them into VRAM, and thus the resulting communication overhead is negligible. Experiments show that, with the same FLOPs and VRAM usage, MoLE achieves inference speeds comparable to dense models and significantly faster than MoE with experts offloading, while maintaining performance on par with MoE.
中文:MoLE(专家查找混合)将专家转换为查找表,在推理时直接检索输出结果,以可忽略的通信开销实现了与密集模型相当的速度,并显著优于需卸载专家的MoE模型性能。
English: Mixture of Lookup Experts (MoLE) transforms experts into lookup tables for direct output retrieval during inference, achieving comparable speed to dense models and superior performance to offloaded MoE with minimal communication overhead.

Authors:Hengrui Kang, Siwei Wen, Zichen Wen, Junyan Ye, Weijia Li, Peilin Feng, Baichuan Zhou, Bin Wang, Dahua Lin, Linfeng Zhang, Conghui He
Title: LEGION: Learning to Ground and Explain for Synthetic Image Detection
Abstract:
The rapid advancements in generative technology have emerged as a double-edged sword. While offering powerful tools that enhance convenience, they also pose significant social concerns. As defenders, current synthetic image detection methods often lack artifact-level textual interpretability and are overly focused on image manipulation detection, and current datasets usually suffer from outdated generators and a lack of fine-grained annotations. In this paper, we introduce SynthScars, a high-quality and diverse dataset consisting of 12,236 fully synthetic images with human-expert annotations. It features 4 distinct image content types, 3 categories of artifacts, and fine-grained annotations covering pixel-level segmentation, detailed textual explanations, and artifact category labels. Furthermore, we propose LEGION (LEarning to Ground and explain for Synthetic Image detectiON), a multimodal large language model (MLLM)-based image forgery analysis framework that integrates artifact detection, segmentation, and explanation. Building upon this capability, we further explore LEGION as a controller, integrating it into image refinement pipelines to guide the generation of higher-quality and more realistic images. Extensive experiments show that LEGION outperforms existing methods across multiple benchmarks, particularly surpassing the second-best traditional expert on SynthScars by 3.31% in mIoU and 7.75% in F1 score. Moreover, the refined images generated under its guidance exhibit stronger alignment with human preferences. The code, model, and dataset will be released.
中文: 本文提出了包含精细标注的合成图像数据集SynthScars,并开发了多模态框架LEGION,该框架不仅能出色地检测和解释图像伪影,还能通过优化流程指导生成更逼真的图像。
English: This paper introduces SynthScars, a comprehensive dataset with fine-grained annotations of synthetic images, and proposes LEGION, a multimodal framework that excels in detecting and explaining image artifacts while also guiding the generation of more realistic images through refinement pipelines.

Authors:Hao Cheng, Erjia Xiao, Yichi Wang, Lingfeng Zhang, Qiang Zhang, Jiahang Cao, Kaidi Xu, Mengshu Sun, Xiaoshuai Hao, Jindong Gu, Renjing Xu
Title: Exploring Typographic Visual Prompts Injection Threats in Cross-Modality Generation Models
Abstract:
Current Cross-Modality Generation Models (GMs) demonstrate remarkable capabilities in various generative tasks. Given the ubiquity and information richness of vision modality inputs in real-world scenarios, Cross-Vision tasks, encompassing Vision-Language Perception (VLP) and Image-to-Image (I2I), have attracted significant attention. Large Vision Language Models (LVLMs) and I2I Generation Models (GMs) are employed to handle VLP and I2I tasks, respectively. Previous research indicates that printing typographic words into input images significantly induces LVLMs and I2I GMs to produce disruptive outputs that are semantically aligned with those words. Additionally, visual prompts, as a more sophisticated form of typography, are also revealed to pose security risks to various applications of cross-vision tasks. However, the specific characteristics of the threats posed by visual prompts remain underexplored. In this paper, to comprehensively investigate the performance impact induced by Typographic Visual Prompt Injection (TVPI) in various LVLMs and I2I GMs, we propose the Typographic Visual Prompts Injection Dataset and thoroughly evaluate the TVPI security risks on various open-source and closed-source LVLMs and I2I GMs under visual prompts with different target semantics, deepening the understanding of TVPI threats.
中文: 本文通过构建排版视觉提示注入数据集,全面评估了其对多种大型视觉语言模型和图像生成模型的安全威胁,揭示了这些模型在面对语义对齐的视觉提示时易产生破坏性输出的风险。
English: This paper investigates the security risks of Typographic Visual Prompt Injection (TVPI) by creating a dataset and evaluating its impact on various Large Vision Language Models and Image-to-Image Generation Models, revealing their vulnerability to disruptive outputs aligned with injected visual prompts.

Authors:Yi Zhang, Qiang Zhang, Xiaozhu Ju, Zhaoyang Liu, Jilei Mao, Jingkai Sun, Jintao Wu, Shixiong Gao, Shihan Cai, Zhiyuan Qin, Linkai Liang, Jiaxu Wang, Yiqun Duan, Jiahang Cao, Renjing Xu, Jian Tang
Title: EmbodiedVSR: Dynamic Scene Graph-Guided Chain-of-Thought Reasoning for Visual Spatial Tasks
Abstract:
While multimodal large language models (MLLMs) have made groundbreaking progress in embodied intelligence, they still face significant challenges in spatial reasoning for complex long-horizon tasks. To address this gap, we propose EmbodiedVSR (Embodied Visual Spatial Reasoning), a novel framework that integrates dynamic scene graph-guided Chain-of-Thought (CoT) reasoning to enhance spatial understanding for embodied agents. By explicitly constructing structured knowledge representations through dynamic scene graphs, our method enables zero-shot spatial reasoning without task-specific fine-tuning. This approach not only disentangles intricate spatial relationships but also aligns reasoning steps with actionable environmental dynamics. To rigorously evaluate performance, we introduce the eSpatial-Benchmark, a comprehensive dataset including real-world embodied scenarios with fine-grained spatial annotations and adaptive task difficulty levels. Experiments demonstrate that our framework significantly outperforms existing MLLM-based methods in accuracy and reasoning coherence, particularly in long-horizon tasks requiring iterative environment interaction. The results reveal the untapped potential of MLLMs for embodied intelligence when equipped with structured, explainable reasoning mechanisms, paving the way for more reliable deployment in real-world spatial applications. The codes and datasets will be released soon.
中文: 提出的EmbodiedVSR框架通过动态场景图与思维链推理的结合,增强了多模态大语言模型的空间推理能力,在无需任务特定训练的情况下,于eSpatial基准测试中实现了卓越性能。
English: The proposed EmbodiedVSR framework enhances spatial reasoning in multimodal large language models by integrating dynamic scene graphs with Chain-of-Thought reasoning, achieving superior performance on the eSpatial-Benchmark without task-specific training.

Authors:Qiang Zhang, Jiahang Cao, Jingkai Sun, Yecheng Shao, Gang Han, Wen Zhao, Yijie Guo, Renjing Xu
Title: ES-Parkour: Advanced Robot Parkour with Bio-inspired Event Camera and Spiking Neural Network
Abstract:
In recent years, quadruped robotics has advanced significantly, particularly in perception and motion control via reinforcement learning, enabling complex motions in challenging environments. Visual sensors like depth cameras enhance stability and robustness but face limitations, such as low operating frequencies relative to joint control and sensitivity to lighting, which hinder outdoor deployment. Additionally, deep neural networks in sensor and control systems increase computational demands. To address these issues, we introduce spiking neural networks (SNNs) and event cameras to perform a challenging quadruped parkour task. Event cameras capture dynamic visual data, while SNNs efficiently process spike sequences, mimicking biological perception. Experimental results demonstrate that this approach significantly outperforms traditional models, achieving excellent parkour performance with just 11.7% of the energy consumption of an artificial neural network (ANN)-based model, yielding an 88.3% energy reduction. By integrating event cameras with SNNs, our work advances robotic reinforcement learning and opens new possibilities for applications in demanding environments.
中文总结:本研究采用脉冲神经网络和事件相机解决四足机器人技术瓶颈,在完成高难度跑酷任务时,能耗比传统模型降低88.3%,性能表现显著提升。
English Summary: This study introduces spiking neural networks and event cameras to overcome limitations in quadruped robotics, achieving superior parkour performance with 88.3% lower energy consumption than traditional models.

Authors:Qiang Zhang, Zhang Zhang, Wei Cui, Jingkai Sun, Jiahang Cao, Yijie Guo, Gang Han, Wen Zhao, Jiaxu Wang, Chenghao Sun, Lingfeng Zhang, Hao Cheng, Yujie Chen, Lin Wang, Jian Tang, Renjing Xu
Title: HumanoidPano: Hybrid Spherical Panoramic-LiDAR Cross-Modal Perception for Humanoid Robots
Abstract:
The perceptual system design for humanoid robots poses unique challenges due to inherent structural constraints that cause severe self-occlusion and limited field-of-view (FOV). We present HumanoidPano, a novel hybrid cross-modal perception framework that synergistically integrates panoramic vision and LiDAR sensing to overcome these limitations. Unlike conventional robot perception systems that rely on monocular cameras or standard multi-sensor configurations, our method establishes geometrically-aware modality alignment through a spherical vision transformer, enabling seamless fusion of 360 visual context with LiDAR's precise depth measurements. First, Spherical Geometry-aware Constraints (SGC) leverage panoramic camera ray properties to guide distortion-regularized sampling offsets for geometric alignment. Second, Spatial Deformable Attention (SDA) aggregates hierarchical 3D features via spherical offsets, enabling efficient 360°-to-BEV fusion with geometrically complete object representations. Third, Panoramic Augmentation (AUG) combines cross-view transformations and semantic alignment to enhance BEV-panoramic feature consistency during data augmentation. Extensive evaluations demonstrate state-of-the-art performance on the 360BEV-Matterport benchmark. Real-world deployment on humanoid platforms validates the system's capability to generate accurate BEV segmentation maps through panoramic-LiDAR co-perception, directly enabling downstream navigation tasks in complex environments. Our work establishes a new paradigm for embodied perception in humanoid robotics.
中文摘要:HumanoidPano通过全景视觉与激光雷达的跨模态融合,结合球形变换器解决了仿人机器人固有的自遮挡和视野限制问题,实现了卓越的360度环境感知能力,直接支撑复杂场景下的导航任务。
English Summary: HumanoidPano is a hybrid perception framework that integrates panoramic vision and LiDAR through spherical transformers to overcome self-occlusion and limited field-of-view in humanoid robots, achieving state-of-the-art 360° environmental perception for navigation tasks.

Authors:Qiang Zhang, Gang Han, Jingkai Sun, Wen Zhao, Jiahang Cao, Jiaxu Wang, Hao Cheng, Lingfeng Zhang, Yijie Guo, Renjing Xu
Title: LiPS: Large-Scale Humanoid Robot Reinforcement Learning with Parallel-Series Structures
Abstract:
In recent years, research on humanoid robots has garnered significant attention, particularly in reinforcement learning based control algorithms, which have achieved major breakthroughs. Compared to traditional model-based control algorithms, reinforcement learning based algorithms demonstrate substantial advantages in handling complex tasks. Leveraging the large-scale parallel computing capabilities of GPUs, contemporary humanoid robots can undergo extensive parallel training in simulated environments. A physical simulation platform capable of large-scale parallel training is crucial for the development of humanoid robots. As one of the most complex robot forms, humanoid robots typically possess intricate mechanical structures, encompassing numerous series and parallel mechanisms. However, many reinforcement learning based humanoid robot control algorithms currently employ open-loop topologies during training, deferring the conversion to series-parallel structures until the sim2real phase. This approach is primarily due to the limitations of physics engines, as current GPU-based physics engines often only support open-loop topologies or have limited capabilities in simulating multi-rigid-body closed-loop topologies. For enabling reinforcement learning-based humanoid robot control algorithms to train in large-scale parallel environments, we propose a novel training method LiPS. By incorporating multi-rigid-body dynamics modeling in the simulation environment, we significantly reduce the sim2real gap and the difficulty of converting to parallel structures during model deployment, thereby robustly supporting large-scale reinforcement learning for humanoid robots.
中文:近年来,人形机器人强化学习控制算法虽利用GPU并行模拟取得突破,却受限于闭环拓扑模拟;我们提出的LiPS方法通过多刚体动力学建模缩小仿真与现实差距,显著提升了大规模并行训练的鲁棒性。
English: Recent advances in reinforcement learning for humanoid robot control leverage GPU-powered parallel simulation, yet face limitations with closed-loop topologies; our proposed LiPS method overcomes these by integrating multi-rigid-body dynamics to bridge the sim2real gap and enable robust large-scale training.

Authors:Jingkai Sun, Qiang Zhang, Gang Han, Wen Zhao, Zhe Yong, Yan He, Jiaxu Wang, Jiahang Cao, Yijie Guo, Renjing Xu
Title: Trinity: A Modular Humanoid Robot AI System
Abstract:
In recent years, research on humanoid robots has garnered increasing attention. With breakthroughs in various types of artificial intelligence algorithms, embodied intelligence, exemplified by humanoid robots, has been highly anticipated. The advancements in reinforcement learning (RL) algorithms have significantly improved the motion control and generalization capabilities of humanoid robots. Simultaneously, the groundbreaking progress in large language models (LLM) and visual language models (VLM) has brought more possibilities and imagination to humanoid robots. LLM enables humanoid robots to understand complex tasks from language instructions and perform long-term task planning, while VLM greatly enhances the robots' understanding and interaction with their environment. This paper introduces \textcolor{magenta}{Trinity}, a novel AI system for humanoid robots that integrates RL, LLM, and VLM. By combining these technologies, Trinity enables efficient control of humanoid robots in complex environments. This innovative approach not only enhances the capabilities but also opens new avenues for future research and applications of humanoid robotics.
中文摘要:近年来,强化学习、大语言模型和视觉语言模型的突破性进展显著提升了人形机器人的运动控制与环境交互能力,由此开发的Trinity集成AI系统通过技术融合增强了机器人在复杂环境中的操作性能。
English Summary: Recent advances in reinforcement learning, large language models, and visual language models have significantly enhanced humanoid robots' motion control and environmental interaction, leading to the development of Trinity—an integrated AI system that improves robot capabilities in complex settings.

Authors:Qiang Zhang, Gang Han, Jingkai Sun, Wen Zhao, Chenghao Sun, Jiahang Cao, Jiaxu Wang, Yijie Guo, Renjing Xu
Title: Distillation-PPO: A Novel Two-Stage Reinforcement Learning Framework for Humanoid Robot Perceptive Locomotion
Abstract:
In recent years, humanoid robots have garnered significant attention from both academia and industry due to their high adaptability to environments and human-like characteristics. With the rapid advancement of reinforcement learning, substantial progress has been made in the walking control of humanoid robots. However, existing methods still face challenges when dealing with complex environments and irregular terrains. In the field of perceptive locomotion, existing approaches are generally divided into two-stage methods and end-to-end methods. Two-stage methods first train a teacher policy in a simulated environment and then use distillation techniques, such as DAgger, to transfer the privileged information learned as latent features or actions to the student policy. End-to-end methods, on the other hand, forgo the learning of privileged information and directly learn policies from a partially observable Markov decision process (POMDP) through reinforcement learning. However, due to the lack of supervision from a teacher policy, end-to-end methods often face difficulties in training and exhibit unstable performance in real-world applications. This paper proposes an innovative two-stage perceptive locomotion framework that combines the advantages of teacher policies learned in a fully observable Markov decision process (MDP) to regularize and supervise the student policy. At the same time, it leverages the characteristics of reinforcement learning to ensure that the student policy can continue to learn in a POMDP, thereby enhancing the model's upper bound. Our experimental results demonstrate that our two-stage training framework achieves higher training efficiency and stability in simulated environments, while also exhibiting better robustness and generalization capabilities in real-world applications.
中文摘要:本文提出一种创新的两阶段感知运动框架,通过结合教师策略监督与强化学习,提升了人形机器人的行走控制能力,在仿真环境中表现出更高训练效率与稳定性,并在实际应用中具备更优的鲁棒性和泛化能力。
English Summary: This paper introduces an innovative two-stage perceptive locomotion framework that enhances humanoid robot walking control by combining teacher policy supervision with reinforcement learning, achieving superior training efficiency, stability, and real-world robustness.

Authors:Pengle Zhang, Jia Wei, Jintao Zhang, Jun Zhu, Jianfei Chen
Title: Accurate INT8 Training Through Dynamic Block-Level Fallback
Abstract:
Transformer models have achieved remarkable success across various AI applications but face significant training costs. Low-bit training, such as INT8 training, can leverage computational units with higher throughput, and has already demonstrated its effectiveness on GPT2 models with block-level quantization. However, it struggles with modern Transformer variants incorporating GLU units. This is because those variants demonstrate complex distributions of activation outliers. To address the challenge, we propose Fallback Quantization, implementing mixed-precision GEMM that dynamically falls back 8-bit to 16-bit for activation blocks containing outliers. Experiments show that our approach is robustly competent in both fine-tuning and pretraining settings. Moreover, our method achieves a 1.57x end-to-end training speedup on RTX4090 GPUs.
中文:回退量化通过动态将包含异常值的激活块从8位精度切换至16位,有效解决了现代Transformer变体中的激活异常值问题,在RTX4090显卡上实现了1.57倍的训练加速。
English: Fallback Quantization addresses activation outliers in modern Transformer variants by dynamically switching from 8-bit to 16-bit precision, achieving robust performance and a 1.57x training speedup on RTX4090 GPUs.

Authors:Sambal Shikhar, Mohammed Irfan Kurpath, Sahal Shaji Mullappilly, Jean Lahoud, Fahad Khan, Rao Muhammad Anwer, Salman Khan, Hisham Cholakkal
Title: LLMVoX: Autoregressive Streaming Text-to-Speech Model for Any LLM
Abstract:
Recent advancements in speech-to-speech dialogue systems leverage LLMs for multimodal interactions, yet they remain hindered by fine-tuning requirements, high computational overhead, and text-speech misalignment. Existing speech-enabled LLMs often degrade conversational quality by modifying the LLM, thereby compromising its linguistic capabilities. In contrast, we propose LLMVoX, a lightweight 30M-parameter, LLM-agnostic, autoregressive streaming TTS system that generates high-quality speech with low latency, while fully preserving the capabilities of the base LLM. Our approach achieves a significantly lower Word Error Rate compared to speech-enabled LLMs, while operating at comparable latency and UTMOS score. By decoupling speech synthesis from LLM processing via a multi-queue token streaming system, LLMVoX supports seamless, infinite-length dialogues. Its plug-and-play design also facilitates extension to various tasks with different backbones. Furthermore, LLMVoX generalizes to new languages with only dataset adaptation, attaining a low Character Error Rate on an Arabic speech task. Additionally, we have integrated LLMVoX with a Vision-Language Model to create an omni-model with speech, text, and vision capabilities, without requiring additional multimodal training. Our code base and project page is available at https://mbzuai-oryx.github.io/LLMVoX .
中文: LLMVoX是一种轻量级、与LLM无关的流式TTS系统,能以低延迟生成高质量语音并保持基础LLM能力,支持无缝无限对话和多语言任务,无需额外多模态训练。
English: LLMVoX is a lightweight, LLM-agnostic streaming TTS system that generates high-quality speech with low latency while preserving the base LLM's capabilities, supporting seamless infinite dialogues and multilingual tasks without additional multimodal training.

Authors:Xiaoqi Wang, Hongyang Du, Yuehong Gao, Dong In Kim
Title: AOLO: Analysis and Optimization For Low-Carbon Oriented Wireless Large Language Model Services
Abstract:
Recent advancements in large language models (LLMs) have led to their widespread adoption and large-scale deployment across various domains. However, their environmental impact, particularly during inference, has become a growing concern due to their substantial energy consumption and carbon footprint. Existing research has focused on inference computation alone, overlooking the analysis and optimization of carbon footprint in network-aided LLM service systems. To address this gap, we propose AOLO, a framework for analysis and optimization for low-carbon oriented wireless LLM services. AOLO introduces a comprehensive carbon footprint model that quantifies greenhouse gas emissions across the entire LLM service chain, including computational inference and wireless communication. Furthermore, we formulate an optimization problem aimed at minimizing the overall carbon footprint, which is solved through joint optimization of inference outputs and transmit power under quality-of-experience and system performance constraints. To achieve this joint optimization, we leverage the energy efficiency of spiking neural networks (SNNs) by adopting SNN as the actor network and propose a low-carbon-oriented optimization algorithm, i.e., SNN-based deep reinforcement learning (SDRL). Comprehensive simulations demonstrate that SDRL algorithm significantly reduces overall carbon footprint, achieving an 18.77% reduction compared to the benchmark soft actor-critic, highlighting its potential for enabling more sustainable LLM inference services.
中文: AOLO框架通过引入全面的碳足迹模型,并利用脉冲神经网络联合优化推理和传输功率,显著降低了大型语言模型服务系统的碳排放,实现了更可持续的推理服务。
English: The AOLO framework addresses the environmental impact of large language models by introducing a comprehensive carbon footprint model and optimizing emissions through joint inference and power management using spiking neural networks, achieving significant reductions in carbon output.

Authors:Shen Dong, Shaochen Xu, Pengfei He, Yige Li, Jiliang Tang, Tianming Liu, Hui Liu, Zhen Xiang
Title: A Practical Memory Injection Attack against LLM Agents
Abstract:
Agents based on large language models (LLMs) have demonstrated strong capabilities in a wide range of complex, real-world applications. However, LLM agents with a compromised memory bank may easily produce harmful outputs when the past records retrieved for demonstration are malicious. In this paper, we propose a novel Memory INJection Attack, MINJA, that enables the injection of malicious records into the memory bank by only interacting with the agent via queries and output observations. These malicious records are designed to elicit a sequence of malicious reasoning steps leading to undesirable agent actions when executing the victim user's query. Specifically, we introduce a sequence of bridging steps to link the victim query to the malicious reasoning steps. During the injection of the malicious record, we propose an indication prompt to guide the agent to autonomously generate our designed bridging steps. We also propose a progressive shortening strategy that gradually removes the indication prompt, such that the malicious record will be easily retrieved when processing the victim query comes after. Our extensive experiments across diverse agents demonstrate the effectiveness of MINJA in compromising agent memory. With minimal requirements for execution, MINJA enables any user to influence agent memory, highlighting practical risks of LLM agents.
Chinese: 本文提出MINJA记忆注入攻击方法,通过查询交互向智能体记忆库植入恶意记录,使其在处理用户请求时产生有害输出。
English: This paper introduces MINJA, a memory injection attack that compromises LLM agents by inserting malicious records through queries, leading to harmful outputs when processing user requests.

Authors:Yuezhou Hu, Weiyu Huang, Zichen Liang, Chang Chen, Jintao Zhang, Jun Zhu, Jianfei Chen
Title: Identifying Sensitive Weights via Post-quantization Integral
Abstract:
Serving Large Language Models (LLMs) is costly. However, post-training weight quantization can address this problem by both compressing their sizes for limited memory and saving bandwidth for acceleration. As not all weight dimensions are equally important, those methods typically rely on a sensitivity metric, which indicates the element-wise influence of weights on loss function and is used to preprocess original weights for better quantization. In this work, we conduct an empirical study on the accuracy of the sensitivity metric, and find that existing gradient and Hessian based metrics are very inaccurate: they underestimate quantization's impact on the loss function by orders of magnitude, mainly due to the small convergence radius of local 2nd order approximation, \ie, gradient and Hessian term in Taylor's formula. To tackle this problem, we propose Post-quantization Integral (PQI), an accurate metric to estimate posterior sensitivity in a fine-grained manner. To leverage this accurate metric, we further propose ReQuant, a simple yet powerful framework that mainly consists of two Dense-and-Sparse detach components: self-adaptive outlier selection and step-wise significant weights detach. Results show that ReQuant boosts state-of-the-art post-training quantization methods, with a pronounced improvement of 2.66 perplexity gain on Llama 3.2 1B with QTIP.
中文摘要:本文提出了后量化积分(PQI)作为更准确的大语言模型权重量化敏感度指标,并开发了ReQuant框架,通过自适应异常值选择和逐步重要权重分离技术,显著提升了现有后训练量化方法的性能。
English Summary: This paper introduces Post-quantization Integral (PQI) as a more accurate sensitivity metric for LLM weight quantization and proposes ReQuant, a framework that significantly improves existing post-training quantization methods by achieving better perplexity scores.

Authors:Junxiao Yang, Zhexin Zhang, Shiyao Cui, Hongning Wang, Minlie Huang
Title: Guiding not Forcing: Enhancing the Transferability of Jailbreaking Attacks on LLMs via Removing Superfluous Constraints
Abstract:
Jailbreaking attacks can effectively induce unsafe behaviors in Large Language Models (LLMs); however, the transferability of these attacks across different models remains limited. This study aims to understand and enhance the transferability of gradient-based jailbreaking methods, which are among the standard approaches for attacking white-box models. Through a detailed analysis of the optimization process, we introduce a novel conceptual framework to elucidate transferability and identify superfluous constraints-specifically, the response pattern constraint and the token tail constraint-as significant barriers to improved transferability. Removing these unnecessary constraints substantially enhances the transferability and controllability of gradient-based attacks. Evaluated on Llama-3-8B-Instruct as the source model, our method increases the overall Transfer Attack Success Rate (T-ASR) across a set of target models with varying safety levels from 18.4% to 50.3%, while also improving the stability and controllability of jailbreak behaviors on both source and target models.
中文摘要:本研究通过识别并消除冗余约束,显著提升了基于梯度的越狱攻击在不同大语言模型间的迁移性,将跨模型迁移攻击成功率从18.4%提升至50.3%,同时改善了攻击行为的稳定性和可控性。
English Summary: This research enhances the transferability of gradient-based jailbreaking attacks on Large Language Models by identifying and removing superfluous constraints, significantly increasing the transfer attack success rate from 18.4% to 50.3% across various target models.

Authors:Wenshuai Huo, Xiaocheng Feng, Yichong Huang, Chengpeng Fu, Baohang Li, Yangfan Ye, Zhirui Zhang, Dandan Tu, Duyu Tang, Yunfei Lu, Hui Wang, Bing Qin
Title: Enhancing Non-English Capabilities of English-Centric Large Language Models through Deep Supervision Fine-Tuning
Abstract:
Large language models (LLMs) have demonstrated significant progress in multilingual language understanding and generation. However, due to the imbalance in training data, their capabilities in non-English languages are limited. Recent studies revealed the English-pivot multilingual mechanism of LLMs, where LLMs implicitly convert non-English queries into English ones at the bottom layers and adopt English for thinking at the middle layers. However, due to the absence of explicit supervision for cross-lingual alignment in the intermediate layers of LLMs, the internal representations during these stages may become inaccurate. In this work, we introduce a deep supervision fine-tuning method (DFT) that incorporates additional supervision in the internal layers of the model to guide its workflow. Specifically, we introduce two training objectives on different layers of LLMs: one at the bottom layers to constrain the conversion of the target language into English, and another at the middle layers to constrain reasoning in English. To effectively achieve the guiding purpose, we designed two types of supervision signals: logits and feature, which represent a stricter constraint and a relatively more relaxed guidance. Our method guides the model to not only consider the final generated result when processing non-English inputs but also ensure the accuracy of internal representations. We conducted extensive experiments on typical English-centric large models, LLaMA-2 and Gemma-2, and the results on multiple multilingual datasets show that our method significantly outperforms traditional fine-tuning methods.
中文: 本研究提出一种深度监督微调方法,通过在模型内部层添加显式监督来提升大语言模型的多语言性能,显著优于传统微调方法。
English: This study introduces a deep supervision fine-tuning method that enhances multilingual performance in large language models by adding explicit supervision to internal layers, significantly outperforming traditional fine-tuning approaches.

Authors:Yushan Jiang, Wenchao Yu, Geon Lee, Dongjin Song, Kijung Shin, Wei Cheng, Yanchi Liu, Haifeng Chen
Title: Explainable Multi-modal Time Series Prediction with LLM-in-the-Loop
Abstract:
Time series analysis provides essential insights for real-world system dynamics and informs downstream decision-making, yet most existing methods often overlook the rich contextual signals present in auxiliary modalities. To bridge this gap, we introduce TimeXL, a multi-modal prediction framework that integrates a prototype-based time series encoder with three collaborating Large Language Models (LLMs) to deliver more accurate predictions and interpretable explanations. First, a multi-modal prototype-based encoder processes both time series and textual inputs to generate preliminary forecasts alongside case-based rationales. These outputs then feed into a prediction LLM, which refines the forecasts by reasoning over the encoder's predictions and explanations. Next, a reflection LLM compares the predicted values against the ground truth, identifying textual inconsistencies or noise. Guided by this feedback, a refinement LLM iteratively enhances text quality and triggers encoder retraining. This closed-loop workflow -- prediction, critique (reflect), and refinement -- continuously boosts the framework's performance and interpretability. Empirical evaluations on four real-world datasets demonstrate that TimeXL achieves up to 8.9\% improvement in AUC and produces human-centric, multi-modal explanations, highlighting the power of LLM-driven reasoning for time series prediction.
中文摘要:TimeXL是一种创新的多模态框架,通过结合时间序列数据与大型语言模型,采用预测、反思和优化的闭环工作流程,显著提升了预测精度并生成可解释的多模态分析结果。
English Summary: TimeXL is a novel multi-modal framework that integrates time series data with large language models to enhance prediction accuracy and provide interpretable explanations through a closed-loop process of prediction, reflection, and refinement.

Authors:Jing Zhu, Mingxuan Ju, Yozen Liu, Danai Koutra, Neil Shah, Tong Zhao
Title: Beyond Unimodal Boundaries: Generative Recommendation with Multimodal Semantics
Abstract:
Generative recommendation (GR) has become a powerful paradigm in recommendation systems that implicitly links modality and semantics to item representation, in contrast to previous methods that relied on non-semantic item identifiers in autoregressive models. However, previous research has predominantly treated modalities in isolation, typically assuming item content is unimodal (usually text). We argue that this is a significant limitation given the rich, multimodal nature of real-world data and the potential sensitivity of GR models to modality choices and usage. Our work aims to explore the critical problem of Multimodal Generative Recommendation (MGR), highlighting the importance of modality choices in GR nframeworks. We reveal that GR models are particularly sensitive to different modalities and examine the challenges in achieving effective GR when multiple modalities are available. By evaluating design strategies for effectively leveraging multiple modalities, we identify key challenges and introduce MGR-LF++, an enhanced late fusion framework that employs contrastive modality alignment and special tokens to denote different modalities, achieving a performance improvement of over 20% compared to single-modality alternatives.
中文摘要:生成式推荐模型对模态选择极为敏感,本研究提出的增强型后期融合框架MGR-LF++通过对比模态对齐和特殊标记有效整合多模态信息,相比单模态方法实现了超过20%的性能提升。
English Summary: Generative recommendation models are highly sensitive to modality choices, and this study introduces an enhanced late fusion framework, MGR-LF++, which leverages multiple modalities through contrastive alignment and special tokens to achieve over 20% performance gains compared to single-modality approaches.

Authors:Andrea Boscolo Camiletto, Jian Wang, Eduardo Alvarado, Rishabh Dabral, Thabo Beeler, Marc Habermann, Christian Theobalt
Title: FRAME: Floor-aligned Representation for Avatar Motion from Egocentric Video
Abstract:
Egocentric motion capture with a head-mounted body-facing stereo camera is crucial for VR and AR applications but presents significant challenges such as heavy occlusions and limited annotated real-world data. Existing methods rely on synthetic pretraining and struggle to generate smooth and accurate predictions in real-world settings, particularly for lower limbs. Our work addresses these limitations by introducing a lightweight VR-based data collection setup with on-board, real-time 6D pose tracking. Using this setup, we collected the most extensive real-world dataset for ego-facing ego-mounted cameras to date in size and motion variability. Effectively integrating this multimodal input -- device pose and camera feeds -- is challenging due to the differing characteristics of each data source. To address this, we propose FRAME, a simple yet effective architecture that combines device pose and camera feeds for state-of-the-art body pose prediction through geometrically sound multimodal integration and can run at 300 FPS on modern hardware. Lastly, we showcase a novel training strategy to enhance the model's generalization capabilities. Our approach exploits the problem's geometric properties, yielding high-quality motion capture free from common artifacts in prior works. Qualitative and quantitative evaluations, along with extensive comparisons, demonstrate the effectiveness of our method. Data, code, and CAD designs will be available at https://vcai.mpi-inf.mpg.de/projects/FRAME/
中文: 本研究提出FRAME轻量级VR采集系统与架构,通过几何合理的多模态融合整合设备位姿与相机数据,实现了实时高质量人体姿态预测,有效解决了现有方法在真实场景中的局限性。
English: Our work introduces FRAME, a lightweight VR-based setup and architecture that integrates device pose and camera feeds using geometrically sound multimodal integration to achieve real-time, high-quality body pose prediction, addressing limitations of previous methods in real-world settings.

Authors:Jie Zhang, Zheng Yuan, Zhongqi Wang, Bei Yan, Sibo Wang, Xiangkui Cao, Zonghui Guo, Shiguang Shan, Xilin Chen
Title: REVAL: A Comprehension Evaluation on Reliability and Values of Large Vision-Language Models
Abstract:
The rapid evolution of Large Vision-Language Models (LVLMs) has highlighted the necessity for comprehensive evaluation frameworks that assess these models across diverse dimensions. While existing benchmarks focus on specific aspects such as perceptual abilities, cognitive capabilities, and safety against adversarial attacks, they often lack the breadth and depth required to provide a holistic understanding of LVLMs' strengths and limitations. To address this gap, we introduce REVAL, a comprehensive benchmark designed to evaluate the \textbf{RE}liability and \textbf{VAL}ue of LVLMs. REVAL encompasses over 144K image-text Visual Question Answering (VQA) samples, structured into two primary sections: Reliability, which assesses truthfulness (\eg, perceptual accuracy and hallucination tendencies) and robustness (\eg, resilience to adversarial attacks, typographic attacks, and image corruption), and Values, which evaluates ethical concerns (\eg, bias and moral understanding), safety issues (\eg, toxicity and jailbreak vulnerabilities), and privacy problems (\eg, privacy awareness and privacy leakage). We evaluate 26 models, including mainstream open-source LVLMs and prominent closed-source models like GPT-4o and Gemini-1.5-Pro. Our findings reveal that while current LVLMs excel in perceptual tasks and toxicity avoidance, they exhibit significant vulnerabilities in adversarial scenarios, privacy preservation, and ethical reasoning. These insights underscore critical areas for future improvements, guiding the development of more secure, reliable, and ethically aligned LVLMs. REVAL provides a robust framework for researchers to systematically assess and compare LVLMs, fostering advancements in the field.
大型视觉语言模型(LVLM)需要全面评估,因此开发了REVAL基准,通过14.4万个样本评估可靠性和价值观,发现其在感知方面表现优异,但在伦理、安全和隐私方面存在显著不足。
Large Vision-Language Models (LVLMs) require comprehensive evaluation, leading to the creation of REVAL, a benchmark assessing reliability and values across 144K samples, revealing strengths in perception but vulnerabilities in ethics, safety, and privacy.

Authors:Zhengxian Yang, Shi Pan, Shengqi Wang, Haoxiang Wang, Li Lin, Guanjun Li, Zhengqi Wen, Borong Lin, Jianhua Tao, Tao Yu
Title: ImViD: Immersive Volumetric Videos for Enhanced VR Engagement
Abstract:
User engagement is greatly enhanced by fully immersive multi-modal experiences that combine visual and auditory stimuli. Consequently, the next frontier in VR/AR technologies lies in immersive volumetric videos with complete scene capture, large 6-DoF interaction space, multi-modal feedback, and high resolution & frame-rate contents. To stimulate the reconstruction of immersive volumetric videos, we introduce ImViD, a multi-view, multi-modal dataset featuring complete space-oriented data capture and various indoor/outdoor scenarios. Our capture rig supports multi-view video-audio capture while on the move, a capability absent in existing datasets, significantly enhancing the completeness, flexibility, and efficiency of data capture. The captured multi-view videos (with synchronized audios) are in 5K resolution at 60FPS, lasting from 1-5 minutes, and include rich foreground-background elements, and complex dynamics. We benchmark existing methods using our dataset and establish a base pipeline for constructing immersive volumetric videos from multi-view audiovisual inputs for 6-DoF multi-modal immersive VR experiences. The benchmark and the reconstruction and interaction results demonstrate the effectiveness of our dataset and baseline method, which we believe will stimulate future research on immersive volumetric video production.
中文:ImViD数据集通过提供高分辨率、多视角的视听完整场景移动采集,为沉浸式立体视频重建设立了新基准,推动了VR/AR技术发展。
English: The ImViD dataset advances VR/AR research by providing high-resolution, multi-view audiovisual captures with full scene coverage and mobility, establishing a benchmark for immersive volumetric video reconstruction.

Authors:Xingtai Lv, Youbang Sun, Kaiyan Zhang, Shang Qu, Xuekai Zhu, Yuchen Fan, Yi Wu, Ermo Hua, Xinwei Long, Ning Ding, Bowen Zhou
Title: Technologies on Effectiveness and Efficiency: A Survey of State Spaces Models
Abstract:
State Space Models (SSMs) have emerged as a promising alternative to the popular transformer-based models and have been increasingly gaining attention. Compared to transformers, SSMs excel at tasks with sequential data or longer contexts, demonstrating comparable performances with significant efficiency gains. In this survey, we provide a coherent and systematic overview for SSMs, including their theoretical motivations, mathematical formulations, comparison with existing model classes, and various applications. We divide the SSM series into three main sections, providing a detailed introduction to the original SSM, the structured SSM represented by S4, and the selective SSM typified by Mamba. We put an emphasis on technicality, and highlight the various key techniques introduced to address the effectiveness and efficiency of SSMs. We hope this manuscript serves as an introduction for researchers to explore the theoretical foundations of SSMs.
中文摘要:状态空间模型(SSMs)正成为替代主流Transformer的高效模型,在序列数据处理中表现优异,本文系统梳理了其理论基础、技术演进与应用场景,为研究者提供全面指引。
English Summary: State Space Models (SSMs) are gaining prominence as efficient alternatives to transformers, excelling in sequential data tasks with comparable performance, and this survey systematically details their theoretical foundations, technical evolution, and applications.

Authors:Bingchen Li, Xin Li, Yiting Lu, Zhibo Chen
Title: Hybrid Agents for Image Restoration
Abstract:
Existing Image Restoration (IR) studies typically focus on task-specific or universal modes individually, relying on the mode selection of users and lacking the cooperation between multiple task-specific/universal restoration modes. This leads to insufficient interaction for unprofessional users and limits their restoration capability for complicated real-world applications. In this work, we present HybridAgent, intending to incorporate multiple restoration modes into a unified image restoration model and achieve intelligent and efficient user interaction through our proposed hybrid agents. Concretely, we propose the hybrid rule of fast, slow, and feedback restoration agents. Here, the slow restoration agent optimizes the powerful multimodal large language model (MLLM) with our proposed instruction-tuning dataset to identify degradations within images with ambiguous user prompts and invokes proper restoration tools accordingly. The fast restoration agent is designed based on a lightweight large language model (LLM) via in-context learning to understand the user prompts with simple and clear requirements, which can obviate the unnecessary time/resource costs of MLLM. Moreover, we introduce the mixed distortion removal mode for our HybridAgents, which is crucial but not concerned in previous agent-based works. It can effectively prevent the error propagation of step-by-step image restoration and largely improve the efficiency of the agent system. We validate the effectiveness of HybridAgent with both synthetic and real-world IR tasks.
中文摘要:HybridAgent通过快速、慢速和反馈修复代理将多种图像修复模式整合到统一模型中,能够智能处理复杂用户指令和混合失真,有效提升实际应用中的效率和性能。
English Summary: HybridAgent integrates multiple image restoration modes into a unified model using fast, slow, and feedback agents to intelligently handle complex user prompts and mixed distortions, improving efficiency and performance in real-world applications.

Authors:Lehan Yang, Lu Qi, Xiangtai Li, Sheng Li, Varun Jampani, Ming-Hsuan Yang
Title: Unified Dense Prediction of Video Diffusion
Abstract:
We present a unified network for simultaneously generating videos and their corresponding entity segmentation and depth maps from text prompts. We utilize colormap to represent entity masks and depth maps, tightly integrating dense prediction with RGB video generation. Introducing dense prediction information improves video generation's consistency and motion smoothness without increasing computational costs. Incorporating learnable task embeddings brings multiple dense prediction tasks into a single model, enhancing flexibility and further boosting performance. We further propose a large-scale dense prediction video dataset~\datasetname, addressing the issue that existing datasets do not concurrently contain captions, videos, segmentation, or depth maps. Comprehensive experiments demonstrate the high efficiency of our method, surpassing the state-of-the-art in terms of video quality, consistency, and motion smoothness.
中文: 本文提出了一种统一网络,通过将密集预测任务与RGB视频生成紧密结合,从文本提示同时生成视频、实体分割和深度图,在无需增加计算成本的情况下提升了视频一致性和运动流畅性。
English: This paper introduces a unified network that generates videos, entity segmentation, and depth maps from text prompts by integrating dense prediction tasks with RGB generation, improving video consistency and motion efficiency without extra computational cost.

Authors:Weijie Zhou, Yi Peng, Manli Tao, Chaoyang Zhao, Honghui Dong, Ming Tang, Jinqiao Wang
Title: LightPlanner: Unleashing the Reasoning Capabilities of Lightweight Large Language Models in Task Planning
Abstract:
In recent years, lightweight large language models (LLMs) have garnered significant attention in the robotics field due to their low computational resource requirements and suitability for edge deployment. However, in task planning -- particularly for complex tasks that involve dynamic semantic logic reasoning -- lightweight LLMs have underperformed. To address this limitation, we propose a novel task planner, LightPlanner, which enhances the performance of lightweight LLMs in complex task planning by fully leveraging their reasoning capabilities. Unlike conventional planners that use fixed skill templates, LightPlanner controls robot actions via parameterized function calls, dynamically generating parameter values. This approach allows for fine-grained skill control and improves task planning success rates in complex scenarios. Furthermore, we introduce hierarchical deep reasoning. Before generating each action decision step, LightPlanner thoroughly considers three levels: action execution (feedback verification), semantic parsing (goal consistency verification), and parameter generation (parameter validity verification). This ensures the correctness of subsequent action controls. Additionally, we incorporate a memory module to store historical actions, thereby reducing context length and enhancing planning efficiency for long-term tasks. We train the LightPlanner-1.5B model on our LightPlan-40k dataset, which comprises 40,000 action controls across tasks with 2 to 13 action steps. Experiments demonstrate that our model achieves the highest task success rate despite having the smallest number of parameters. In tasks involving spatial semantic reasoning, the success rate exceeds that of ReAct by 14.9 percent. Moreover, we demonstrate LightPlanner's potential to operate on edge devices.
中文:LightPlanner通过参数化函数调用和分层深度推理提升轻量级大模型在复杂任务规划中的性能,以最少的参数量实现更高成功率并具备边缘部署潜力。
English: LightPlanner enhances lightweight LLMs' complex task planning through parameterized function calls and hierarchical reasoning, achieving higher success rates with minimal parameters and edge deployment capability.

Authors:Weijie Zhou, Manli Tao, Chaoyang Zhao, Haiyun Guo, Honghui Dong, Ming Tang, Jinqiao Wang
Title: PhysVLM: Enabling Visual Language Models to Understand Robotic Physical Reachability
Abstract:
Understanding the environment and a robot's physical reachability is crucial for task execution. While state-of-the-art vision-language models (VLMs) excel in environmental perception, they often generate inaccurate or impractical responses in embodied visual reasoning tasks due to a lack of understanding of robotic physical reachability. To address this issue, we propose a unified representation of physical reachability across diverse robots, i.e., Space-Physical Reachability Map (S-P Map), and PhysVLM, a vision-language model that integrates this reachability information into visual reasoning. Specifically, the S-P Map abstracts a robot's physical reachability into a generalized spatial representation, independent of specific robot configurations, allowing the model to focus on reachability features rather than robot-specific parameters. Subsequently, PhysVLM extends traditional VLM architectures by incorporating an additional feature encoder to process the S-P Map, enabling the model to reason about physical reachability without compromising its general vision-language capabilities. To train and evaluate PhysVLM, we constructed a large-scale multi-robot dataset, Phys100K, and a challenging benchmark, EQA-phys, which includes tasks for six different robots in both simulated and real-world environments. Experimental results demonstrate that PhysVLM outperforms existing models, achieving a 14\% improvement over GPT-4o on EQA-phys and surpassing advanced embodied VLMs such as RoboMamba and SpatialVLM on the RoboVQA-val and OpenEQA benchmarks. Additionally, the S-P Map shows strong compatibility with various VLMs, and its integration into GPT-4o-mini yields a 7.1\% performance improvement.
Chinese: 为解决视觉语言模型在机器人物理可达性理解上的不足,我们提出了PhysVLM模型,通过整合统一的空间物理可达性地图(S-P Map)来增强视觉推理能力,在多个基准测试中实现了显著性能提升。
English: To address the limitations of vision-language models in robotic physical reachability, we propose PhysVLM, a model that integrates a unified Space-Physical Reachability Map (S-P Map) for enhanced visual reasoning, achieving significant performance improvements across multiple benchmarks.

Authors:Yuheng Liu, Xinke Li, Yuning Zhang, Lu Qi, Xin Li, Wenping Wang, Chongshou Li, Xueting Li, Ming-Hsuan Yang
Title: Controllable 3D Outdoor Scene Generation via Scene Graphs
Abstract:
Three-dimensional scene generation is crucial in computer vision, with applications spanning autonomous driving, gaming and the metaverse. Current methods either lack user control or rely on imprecise, non-intuitive conditions. In this work, we propose a method that uses, scene graphs, an accessible, user friendly control format to generate outdoor 3D scenes. We develop an interactive system that transforms a sparse scene graph into a dense BEV (Bird's Eye View) Embedding Map, which guides a conditional diffusion model to generate 3D scenes that match the scene graph description. During inference, users can easily create or modify scene graphs to generate large-scale outdoor scenes. We create a large-scale dataset with paired scene graphs and 3D semantic scenes to train the BEV embedding and diffusion models. Experimental results show that our approach consistently produces high-quality 3D urban scenes closely aligned with the input scene graphs. To the best of our knowledge, this is the first approach to generate 3D outdoor scenes conditioned on scene graphs.
中文摘要:本研究提出了一种基于场景图控制的新型三维户外场景生成方法,通过条件扩散模型和鸟瞰图嵌入映射技术,实现了与用户输入高度匹配的高质量场景生成。
English Summary: This work introduces a novel method for generating high-quality 3D outdoor scenes using user-friendly scene graphs as control inputs, employing a conditional diffusion model guided by BEV embeddings to achieve precise alignment with user specifications.

Authors:Zhanghao Hu, Hanqi Yan, Qinglin Zhu, Zhenyi Shen, Yulan He, Lin Gui
Title: Beyond Prompting: An Efficient Embedding Framework for Open-Domain Question Answering
Abstract:
Large language models have recently pushed open domain question answering (ODQA) to new frontiers. However, prevailing retriever-reader pipelines often depend on multiple rounds of prompt level instructions, leading to high computational overhead, instability, and suboptimal retrieval coverage. In this paper, we propose EmbQA, an embedding-level framework that alleviates these shortcomings by enhancing both the retriever and the reader. Specifically, we refine query representations via lightweight linear layers under an unsupervised contrastive learning objective, thereby reordering retrieved passages to highlight those most likely to contain correct answers. Additionally, we introduce an exploratory embedding that broadens the model's latent semantic space to diversify candidate generation and employs an entropy-based selection mechanism to choose the most confident answer automatically. Extensive experiments across three open-source LLMs, three retrieval methods, and four ODQA benchmarks demonstrate that EmbQA substantially outperforms recent baselines in both accuracy and efficiency.
中文:EmbQA框架通过优化查询表示和扩展候选答案多样性,显著提升了开放域问答的准确性和效率,超越了现有方法。
English: The EmbQA framework enhances open domain question answering by improving query representations and diversifying candidate generation, significantly boosting both accuracy and efficiency over existing methods.

Authors:Huayu Zhang, Dongyue Wu, Yuanjie Shao, Nong Sang, Changxin Gao
Title: Object-Aware Video Matting with Cross-Frame Guidance
Abstract:
Recently, trimap-free methods have drawn increasing attention in human video matting due to their promising performance. Nevertheless, these methods still suffer from the lack of deterministic foreground-background cues, which impairs their ability to consistently identify and locate foreground targets over time and mine fine-grained details. In this paper, we present a trimap-free Object-Aware Video Matting (OAVM) framework, which can perceive different objects, enabling joint recognition of foreground objects and refinement of edge details. Specifically, we propose an Object-Guided Correction and Refinement (OGCR) module, which employs cross-frame guidance to aggregate object-level instance information into pixel-level detail features, thereby promoting their synergy. Furthermore, we design a Sequential Foreground Merging augmentation strategy to diversify sequential scenarios and enhance capacity of the network for object discrimination. Extensive experiments on recent widely used synthetic and real-world benchmarks demonstrate the state-of-the-art performance of our OAVM with only an initial coarse mask. The code and model will be available.
Chinese Summary: 本文提出了一种无需三分图的物体感知视频抠图(OAVM)框架,通过物体引导校正与优化模块和序列前景融合增强策略,在仅需初始粗略掩码的情况下实现了最先进的性能,有效提升了前景物体识别与边缘细节处理能力。
English Summary: The paper introduces a trimap-free Object-Aware Video Matting (OAVM) framework that enhances foreground object recognition and edge detail refinement through an Object-Guided Correction and Refinement module and a Sequential Foreground Merging augmentation strategy, achieving state-of-the-art performance with only an initial coarse mask.

Authors:Xinyi Hou, Yanjie Zhao, Shenao Wang, Haoyu Wang
Title: Model Context Protocol (MCP): Landscape, Security Threats, and Future Research Directions
Abstract:
The Model Context Protocol (MCP) is a standardized interface designed to enable seamless interaction between AI models and external tools and resources, breaking down data silos and facilitating interoperability across diverse systems. This paper provides a comprehensive overview of MCP, focusing on its core components, workflow, and the lifecycle of MCP servers, which consists of three key phases: creation, operation, and update. We analyze the security and privacy risks associated with each phase and propose strategies to mitigate potential threats. The paper also examines the current MCP landscape, including its adoption by industry leaders and various use cases, as well as the tools and platforms supporting its integration. We explore future directions for MCP, highlighting the challenges and opportunities that will influence its adoption and evolution within the broader AI ecosystem. Finally, we offer recommendations for MCP stakeholders to ensure its secure and sustainable development as the AI landscape continues to evolve.
中文: 本文从架构生命周期和安全风险角度系统研究模型上下文协议(MCP),提出针对性防护措施,并为AI系统中安全应用该协议规划未来发展路径。
English: This paper systematically examines the Model Context Protocol (MCP) by analyzing its architectural lifecycle and security risks, proposing targeted safeguards and outlining future directions for secure adoption in AI systems.

Authors:Xinyi Hou, Yanjie Zhao, Shenao Wang, Haoyu Wang
Title: Model Context Protocol (MCP): Landscape, Security Threats, and Future Research Directions
Abstract:
The Model Context Protocol (MCP) is an emerging open standard that defines a unified, bi-directional communication and dynamic discovery protocol between AI models and external tools or resources, aiming to enhance interoperability and reduce fragmentation across diverse systems. This paper presents a systematic study of MCP from both architectural and security perspectives. We first define the full lifecycle of an MCP server, comprising four phases (creation, deployment, operation, and maintenance), further decomposed into 16 key activities that capture its functional evolution. Building on this lifecycle analysis, we construct a comprehensive threat taxonomy that categorizes security and privacy risks across four major attacker types: malicious developers, external attackers, malicious users, and security flaws, encompassing 16 distinct threat scenarios. To validate these risks, we develop and analyze real-world case studies that demonstrate concrete attack surfaces and vulnerability manifestations within MCP implementations. Based on these findings, the paper proposes a set of fine-grained, actionable security safeguards tailored to each lifecycle phase and threat category, offering practical guidance for secure MCP adoption. We also analyze the current MCP landscape, covering industry adoption, integration patterns, and supporting tools, to identify its technological strengths as well as existing limitations that constrain broader deployment. Finally, we outline future research and development directions aimed at strengthening MCP's standardization, trust boundaries, and sustainable growth within the evolving ecosystem of tool-augmented AI systems.
中文: 本文从架构生命周期和安全风险角度系统研究模型上下文协议(MCP),提出针对性防护措施,并为AI系统中安全应用该协议规划未来发展路径。
English: This paper systematically examines the Model Context Protocol (MCP) by analyzing its architectural lifecycle and security risks, proposing targeted safeguards and outlining future directions for secure adoption in AI systems.

Authors:Wenchao Gu, Juntao Chen, Yanlin Wang, Tianyue Jiang, Xingzhe Li, Mingwei Liu, Xilin Liu, Yuchi Ma, Zibin Zheng
Title: What to Retrieve for Effective Retrieval-Augmented Code Generation? An Empirical Study and Beyond
Abstract:
Repository-level code generation remains challenging due to complex code dependencies and the limitations of large language models (LLMs) in processing long contexts. While retrieval-augmented generation (RAG) frameworks are widely adopted, the effectiveness of different retrieved information sources-contextual code, APIs, and similar snippets-has not been rigorously analyzed. Through an empirical study on two benchmarks, we demonstrate that in-context code and potential API information significantly enhance LLM performance, whereas retrieved similar code often introduces noise, degrading results by up to 15%. Based on the preliminary results, we propose AllianceCoder, a novel context-integrated method that employs chain-of-thought prompting to decompose user queries into implementation steps and retrieves APIs via semantic description matching. Through extensive experiments on CoderEval and RepoExec, AllianceCoder achieves state-of-the-art performance, improving Pass@1 by up to 20% over existing approaches.
中文: 仓库级代码生成中,上下文代码和API信息能显著提升大语言模型性能,而相似代码检索易引入噪声;基于此提出的AllianceCoder方法通过思维链提示分解查询并语义匹配API,在基准测试中将Pass@1指标最高提升20%。
English: Repository-level code generation is enhanced by in-context code and API information, while similar code retrieval can degrade performance, leading to the development of AllianceCoder, which improves accuracy by up to 20% through stepwise query decomposition and semantic API matching.

Authors:Zike Li, Mingwei Liu, Anji Li, Kaifeng He, Yanlin Wang, Xin Peng, Zibin Zheng
Title: A Preliminary Study on the Robustness of Code Generation by Large Language Models
Abstract:
Robustness is a critical factor for reliable code generation by large language models, yet most evaluations focus on correctness and overlook key issues such as missing input validation and inadequate error handling. In this work, we present the first empirical study of LLM-generated code robustness using the CoderEval benchmark. Evaluating four state-of-the-art code LLMs, we find that 35.2% of their outputs are less robust than human-written code, with over 90% of deficiencies caused by missing conditional checks-70% of which occur in the first line. Interestingly, in 63% of cases where a conditional statement is needed but absent, the "if" token still ranks among the top three predictions, suggesting implicit recognition of control flow. To address these issues, we propose RobGen, a model-agnostic framework that improves robustness without retraining. RobGen combines a line-level intervention checker, which decides whether to adjust logits for each generated line, with token-level conditional logit adjustments to promote essential control structures. Experiments show that RobGen reduces the proportion of less robust code by 10%, achieves the highest average Pass@1 (43.57), and adds minimal overhead (+33.4%). As a lightweight and adaptable solution, RobGen effectively enhances the reliability of LLM-generated code across diverse tasks.
中文摘要:本研究提出RobGen框架,通过解决大语言模型生成代码中缺失条件检查的问题,将低鲁棒性代码比例降低10%,且仅增加少量计算开销。
English Summary: This study introduces RobGen, a model-agnostic framework that enhances code robustness by addressing missing conditional checks in LLM-generated code, reducing less robust outputs by 10% with minimal overhead.

Authors:Baifeng Shi, Boyi Li, Han Cai, Yao Lu, Sifei Liu, Marco Pavone, Jan Kautz, Song Han, Trevor Darrell, Pavlo Molchanov, Hongxu Yin
Title: Scaling Vision Pre-Training to 4K Resolution
Abstract:
High-resolution perception of visual details is crucial for daily tasks. Current vision pre-training, however, is still limited to low resolutions (e.g., 378 x 378 pixels) due to the quadratic cost of processing larger images. We introduce PS3 that scales CLIP-style vision pre-training to 4K resolution with a near-constant cost. Instead of contrastive learning on global image representation, PS3 is pre-trained by selectively processing local regions and contrasting them with local detailed captions, enabling high-resolution representation learning with greatly reduced computational overhead. The pre-trained PS3 is able to both encode the global image at low resolution and selectively process local high-resolution regions based on their saliency or relevance to a text prompt. When applying PS3 to multi-modal LLM (MLLM), the resulting model, named VILA-HD, significantly improves high-resolution visual perception compared to baselines without high-resolution vision pre-training such as AnyRes and S^2 while using up to 4.3x fewer tokens. PS3 also unlocks appealing scaling properties of VILA-HD, including scaling up resolution for free and scaling up test-time compute for better performance. Compared to state of the arts, PS3 and VILA-HD outperform previous vision encoders (e.g., SigLIP2 and Perception Encoder) and MLLMs (e.g., NVILA and Qwen2.5-VL) respectively across multiple benchmarks and achieve better efficiency than latest token pruning approaches. Finally, we find current benchmarks do not require 4K-resolution perception, which motivates us to propose 4KPro, a new benchmark of image QA at 4K resolution, on which VILA-HD outperforms all previous MLLMs, including a 16.1% improvement over GPT-4o and a 7.5% improvement and 1.67x speedup over Qwen2.5-VL.
Chinese: PS3通过选择性处理局部区域并与详细描述对比,实现了高效的4K分辨率视觉预训练,显著提升了多模态模型VILA-HD的高清感知能力,同时大幅降低了计算开销。
English: PS3 enables efficient 4K-resolution vision pre-training by selectively processing local regions with detailed captions, significantly enhancing high-resolution perception in multi-modal models like VILA-HD while reducing computational costs.

Authors:Ge Gao, Siyue Teng, Tianhao Peng, Fan Zhang, David Bull
Title: GIViC: Generative Implicit Video Compression
Abstract:
While video compression based on implicit neural representations (INRs) has recently demonstrated great potential, existing INR-based video codecs still cannot achieve state-of-the-art (SOTA) performance compared to their conventional or autoencoder-based counterparts given the same coding configuration. In this context, we propose a Generative Implicit Video Compression framework, GIViC, aiming at advancing the performance limits of this type of coding methods. GIViC is inspired by the characteristics that INRs share with large language and diffusion models in exploiting long-term dependencies. Through the newly designed implicit diffusion process, GIViC performs diffusive sampling across coarse-to-fine spatiotemporal decompositions, gradually progressing from coarser-grained full-sequence diffusion to finer-grained per-token diffusion. A novel Hierarchical Gated Linear Attention-based transformer (HGLA), is also integrated into the framework, which dual-factorizes global dependency modeling along scale and sequential axes. The proposed GIViC model has been benchmarked against SOTA conventional and neural codecs using a Random Access (RA) configuration (YUV 4:2:0, GOPSize=32), and yields BD-rate savings of 15.94%, 22.46% and 8.52% over VVC VTM, DCVC-FM and NVRC, respectively. As far as we are aware, GIViC is the first INR-based video codec that outperforms VTM based on the RA coding configuration. The source code will be made available.
中文: GIViC框架通过分层门控线性注意力变换器和隐式扩散过程,首次实现了基于隐式神经表示的视频编解码器在随机访问配置下超越传统方法的性能突破。
English: The proposed GIViC framework advances implicit neural representation-based video compression by integrating a hierarchical transformer and implicit diffusion process, achieving state-of-the-art performance with significant BD-rate savings over existing codecs.

Authors:Haozhe Yin, Kai Wang, Wenjie Zhang, Yizhang He, Ying Zhang, Xuemin Lin
Title: Motif Counting in Complex Networks: A Comprehensive Survey
Abstract:
Motif counting plays a crucial role in understanding the structural properties of networks. By computing motif frequencies, researchers can draw key insights into the structural properties of the underlying network. As networks become increasingly complex, different graph models have been proposed, giving rise to diverse motif patterns. These variations introduce unique computational challenges that require specialized algorithms tailored to specific motifs within different graph structures. This survey provides a comprehensive and structured overview of motif counting techniques across general graphs, heterogeneous graphs, and hypergraphs. We categorize existing algorithms according to their underlying computational strategies, emphasizing key similarities and distinctions. In addition to reviewing current methodologies, we examine their strengths, limitations, and computational trade-offs. Furthermore, we explore future directions in motif counting, including scalable implementations to improve efficiency in large-scale networks, algorithmic adaptations for dynamic, temporal, and attributed graphs, and deeper integration with large language models (LLMs) and graph-based retrieval-augmented generation (GraphRAG). By offering a detailed analysis of these approaches, this survey aims to support researchers and practitioners in advancing motif counting for increasingly complex network data.
中文: 本综述系统分类并分析了各类图结构中的模体计数技术,重点探讨了计算策略、局限性及未来方向,如可扩展实现和与大语言模型的结合,以应对复杂网络的分析挑战。
English: This survey comprehensively categorizes and analyzes motif counting techniques across various graph types, highlighting computational strategies, limitations, and future directions like scalable implementations and LLM integration to address complex network challenges.

Authors:Yu-An Liu, Haya Nachimovsky, Ruqing Zhang, Oren Kurland, Jiafeng Guo, Moshe Tennenholtz
Title: Robust-IR @ SIGIR 2025: The First Workshop on Robust Information Retrieval
Abstract:
With the advancement of information retrieval (IR) technologies, robustness is increasingly attracting attention. When deploying technology into practice, we consider not only its average performance under normal conditions but, more importantly, its ability to maintain functionality across a variety of exceptional situations. In recent years, the research on IR robustness covers theory, evaluation, methodology, and application, and all of them show a growing trend. The purpose of this workshop is to systematize the latest results of each research aspect, to foster comprehensive communication within this niche domain while also bridging robust IR research with the broader community, and to promote further future development of robust IR. To avoid the one-sided talk of mini-conferences, this workshop adopts a highly interactive format, including round-table and panel discussion sessions, to encourage active participation and meaningful exchange among attendees.
中文摘要:本次研讨会旨在系统整合信息检索鲁棒性研究在理论、评估与应用方面的最新成果,通过互动讨论促进跨领域交流,推动该方向的持续发展。
English Summary: This workshop aims to consolidate recent advances in information retrieval robustness research across theory, evaluation, and applications, fostering interdisciplinary dialogue through interactive sessions to propel future developments.

Authors:Wei Huang, Hanchen Wang, Dong Wen, Wenjie Zhang, Ying Zhang, Xuemin Lin
Title: DiffGED: Computing Graph Edit Distance via Diffusion-based Graph Matching
Abstract:
The Graph Edit Distance (GED) problem, which aims to compute the minimum number of edit operations required to transform one graph into another, is a fundamental challenge in graph analysis with wide-ranging applications. However, due to its NP-hard nature, traditional A* approaches often suffer from scalability issue, making them computationally intractable for large graphs. Many recent deep learning frameworks address GED by formulating it as a regression task, which, while efficient, fails to recover the edit path -- a central interest in GED. Furthermore, recent hybrid approaches that combine deep learning with traditional methods to recover the edit path often yield poor solution quality. These methods also struggle to generate candidate solutions in parallel, resulting in increased running times.In this paper, we present a novel approach, DiffGED, that leverages generative diffusion model to solve GED and recover the corresponding edit path. Specifically, we first generate multiple diverse node matching matrices in parallel through a diffusion-based graph matching model. Next, node mappings are extracted from each generated matching matrices in parallel, and each extracted node mapping can be simply transformed into an edit path. Benefiting from the generative diversity provided by the diffusion model, DiffGED is less likely to fall into local sub-optimal solutions, thereby achieving superior overall solution quality close to the exact solution. Experimental results on real-world datasets demonstrate that DiffGED can generate multiple diverse edit paths with exceptionally high accuracy comparable to exact solutions while maintaining a running time shorter than most of hybrid approaches.
中文:DiffGED采用生成式扩散模型,通过并行生成多样化的节点匹配来求解图编辑距离并还原编辑路径,在保持比混合方法更短运行时间的同时,获得了接近精确解的高质量结果。
English: DiffGED introduces a generative diffusion model to efficiently compute Graph Edit Distance and recover edit paths by generating diverse node mappings in parallel, achieving near-exact solution quality with shorter runtime than hybrid methods.

Authors:Haocong Luo, İsmail Emir Yüksel, Ataberk Olgun, A. Giray Yağlıkçı, Onur Mutlu
Title: Revisiting DRAM Read Disturbance: Identifying Inconsistencies Between Experimental Characterization and Device-Level Studies
Abstract:
Modern DRAM is vulnerable to read disturbance (e.g., RowHammer and RowPress) that significantly undermines the robust operation of the system. Repeatedly opening and closing a DRAM row (RowHammer) or keeping a DRAM row open for a long period of time (RowPress) induces bitflips in nearby unaccessed DRAM rows. Prior works on DRAM read disturbance either 1) perform experimental characterization using commercial-off-the-shelf (COTS) DRAM chips to demonstrate the high-level characteristics of the read disturbance bitflips, or 2) perform device-level simulations to understand the low-level error mechanisms of the read disturbance bitflips. In this paper, we attempt to align and cross-validate the real-chip experimental characterization results and state-of-the-art device-level studies of DRAM read disturbance. To do so, we first identify and extract the key bitflip characteristics of RowHammer and RowPress from the device-level error mechanisms studied in prior works. Then, we perform experimental characterization on 96 COTS DDR4 DRAM chips that directly match the data and access patterns studied in the device-level works. Through our experiments, we identify fundamental inconsistencies in the RowHammer and RowPress bitflip directions and access pattern dependence between experimental characterization results and the device-level error mechanisms. Based on our results, we hypothesize that either 1) the retention failure based DRAM architecture reverse-engineering methodologies do not fully work on modern DDR4 DRAM chips, or 2) existing device-level works do not fully uncover all the major read disturbance error mechanisms. We hope our findings inspire and enable future works to build a more fundamental and comprehensive understanding of DRAM read disturbance.
Chinese: 本文揭示了DRAM读取干扰在实验表征与器件级研究之间的关键不一致性,指出当前方法存在局限性,呼吁未来研究建立更根本全面的理解。
English: This paper identifies inconsistencies between experimental and device-level studies of DRAM read disturbance, suggesting limitations in current methodologies and calling for deeper investigation into error mechanisms.

Authors:Chengkai Huang, Junda Wu, Yu Xia, Zixu Yu, Ruhan Wang, Tong Yu, Ruiyi Zhang, Ryan A. Rossi, Branislav Kveton, Dongruo Zhou, Julian McAuley, Lina Yao
Title: Towards Agentic Recommender Systems in the Era of Multimodal Large Language Models
Abstract:
Recent breakthroughs in Large Language Models (LLMs) have led to the emergence of agentic AI systems that extend beyond the capabilities of standalone models. By empowering LLMs to perceive external environments, integrate multimodal information, and interact with various tools, these agentic systems exhibit greater autonomy and adaptability across complex tasks. This evolution brings new opportunities to recommender systems (RS): LLM-based Agentic RS (LLM-ARS) can offer more interactive, context-aware, and proactive recommendations, potentially reshaping the user experience and broadening the application scope of RS. Despite promising early results, fundamental challenges remain, including how to effectively incorporate external knowledge, balance autonomy with controllability, and evaluate performance in dynamic, multimodal settings. In this perspective paper, we first present a systematic analysis of LLM-ARS: (1) clarifying core concepts and architectures; (2) highlighting how agentic capabilities -- such as planning, memory, and multimodal reasoning -- can enhance recommendation quality; and (3) outlining key research questions in areas such as safety, efficiency, and lifelong personalization. We also discuss open problems and future directions, arguing that LLM-ARS will drive the next wave of RS innovation. Ultimately, we foresee a paradigm shift toward intelligent, autonomous, and collaborative recommendation experiences that more closely align with users' evolving needs and complex decision-making processes.
中文: 基于大语言模型的智能代理系统为推荐系统带来了交互式和情境感知的新机遇,但在外部知识整合与动态评估方面仍面临挑战。
English: Recent advancements in LLM-based agentic AI systems enable interactive, context-aware recommendations, but challenges remain in external knowledge integration and performance evaluation for recommender systems.

Authors:Keyan Chen, Chenyang Liu, Bowen Chen, Wenyuan Li, Zhengxia Zou, Zhenwei Shi
Title: DynamicVis: An Efficient and General Visual Foundation Model for Remote Sensing Image Understanding
Abstract:
The advancement of remote sensing technology has improved the spatial resolution of satellite imagery, facilitating more detailed visual representations for diverse interpretations. However, existing methods exhibit limited generalization capabilities across varied applications. While some contemporary foundation models demonstrate potential, they are hindered by insufficient cross-task adaptability and primarily process low-resolution imagery of restricted sizes, thus failing to fully exploit high-resolution data or leverage comprehensive large-scene semantics. Crucially, remote sensing imagery differs fundamentally from natural images, as key foreground targets (eg., maritime objects, artificial structures) often occupy minimal spatial proportions (~1%) and exhibit sparse distributions. Efficiently modeling cross-task generalizable knowledge from lengthy 2D tokens (~100,000) poses a significant challenge yet remains critical for remote sensing image understanding. Motivated by the selective attention mechanisms inherent to the human visual system, we propose DynamicVis, a dynamic visual perception foundation model for remote sensing imagery. The framework integrates a novel dynamic region perception backbone based on the selective state space model, which strategically balances localized detail extraction with global contextual integration, enabling computationally efficient encoding of large-scale data while maintaining architectural scalability. To enhance cross-task knowledge transferring, we introduce a multi-instance learning paradigm utilizing meta-embedding representations, trained on million-scale region-level annotations. Evaluations across nine downstream tasks demonstrate the model's versatility. DynamicVis achieves multi-level feature modeling with exceptional efficiency, processing (2048x2048) pixels with 97 ms latency (6% of ViT's) and 833 MB GPU memory (3% of ViT's).
中文摘要:提出的DynamicVis模型通过动态区域感知框架解决了遥感图像分析中的局限性,能够高效处理高分辨率数据,并通过元嵌入训练增强跨任务泛化能力。
English Summary: The proposed DynamicVis model addresses limitations in remote sensing image analysis by introducing a dynamic region perception framework that efficiently processes high-resolution data while enhancing cross-task generalization through meta-embedding training.

Authors:Baiqin Wang, Xiangyu Zhu, Fan Shen, Hao Xu, Zhen Lei
Title: PC-Talk: Precise Facial Animation Control for Audio-Driven Talking Face Generation
Abstract:
Recent advancements in audio-driven talking face generation have made great progress in lip synchronization. However, current methods often lack sufficient control over facial animation such as speaking style and emotional expression, resulting in uniform outputs. In this paper, we focus on improving two key factors: lip-audio alignment and emotion control, to enhance the diversity and user-friendliness of talking videos. Lip-audio alignment control focuses on elements like speaking style and the scale of lip movements, whereas emotion control is centered on generating realistic emotional expressions, allowing for modifications in multiple attributes such as intensity. To achieve precise control of facial animation, we propose a novel framework, PC-Talk, which enables lip-audio alignment and emotion control through implicit keypoint deformations. First, our lip-audio alignment control module facilitates precise editing of speaking styles at the word level and adjusts lip movement scales to simulate varying vocal loudness levels, maintaining lip synchronization with the audio. Second, our emotion control module generates vivid emotional facial features with pure emotional deformation. This module also enables the fine modification of intensity and the combination of multiple emotions across different facial regions. Our method demonstrates outstanding control capabilities and achieves state-of-the-art performance on both HDTF and MEAD datasets in extensive experiments.
中文摘要:当前音频驱动说话人脸生成技术虽在唇部同步上取得进展,但对面部动画如说话风格和情感表达的控制不足,导致输出单一;本文提出PC-Talk框架,通过隐式关键点变形实现唇音对齐和情感控制,提升说话视频的多样性和真实感。
English Summary: Recent audio-driven talking face generation has improved lip sync but lacks control over facial expressions, leading to uniform outputs; this paper introduces PC-Talk, a framework that enhances lip-audio alignment and emotion control for more diverse and realistic talking videos.

Authors:Yuxuan Jiang, Chengxi Zeng, Siyue Teng, Fan Zhang, Xiaoqing Zhu, Joel Sole, David Bull
Title: C2D-ISR: Optimizing Attention-based Image Super-resolution from Continuous to Discrete Scales
Abstract:
In recent years, attention mechanisms have been exploited in single image super-resolution (SISR), achieving impressive reconstruction results. However, these advancements are still limited by the reliance on simple training strategies and network architectures designed for discrete up-sampling scales, which hinder the model's ability to effectively capture information across multiple scales. To address these limitations, we propose a novel framework, \textbf{C2D-ISR}, for optimizing attention-based image super-resolution models from both performance and complexity perspectives. Our approach is based on a two-stage training methodology and a hierarchical encoding mechanism. The new training methodology involves continuous-scale training for discrete scale models, enabling the learning of inter-scale correlations and multi-scale feature representation. In addition, we generalize the hierarchical encoding mechanism with existing attention-based network structures, which can achieve improved spatial feature fusion, cross-scale information aggregation, and more importantly, much faster inference. We have evaluated the C2D-ISR framework based on three efficient attention-based backbones, SwinIR-L, SRFormer-L and MambaIRv2-L, and demonstrated significant improvements over the other existing optimization framework, HiT, in terms of super-resolution performance (up to 0.2dB) and computational complexity reduction (up to 11%). The source code will be made publicly available at www.github.com.
中文:提出的C2D-ISR框架通过连续尺度训练和分层编码机制优化注意力超分辨率模型,在保持高性能的同时显著提升了推理速度。
English: The proposed C2D-ISR framework enhances attention-based super-resolution models through continuous-scale training and hierarchical encoding, achieving superior performance and faster inference compared to existing methods.

Authors:Mo Zhou, Jianwei Wang, Xuanmeng Zhang, Dylan Campbell, Kai Wang, Long Yuan, Wenjie Zhang, Xuemin Lin
Title: ProbDiffFlow: An Efficient Learning-Free Framework for Probabilistic Single-Image Optical Flow Estimation
Abstract:
This paper studies optical flow estimation, a critical task in motion analysis with applications in autonomous navigation, action recognition, and film production. Traditional optical flow methods require consecutive frames, which are often unavailable due to limitations in data acquisition or real-world scene disruptions. Thus, single-frame optical flow estimation is emerging in the literature. However, existing single-frame approaches suffer from two major limitations: (1) they rely on labeled training data, making them task-specific, and (2) they produce deterministic predictions, failing to capture motion uncertainty. To overcome these challenges, we propose ProbDiffFlow, a training-free framework that estimates optical flow distributions from a single image. Instead of directly predicting motion, ProbDiffFlow follows an estimation-by-synthesis paradigm: it first generates diverse plausible future frames using a diffusion-based model, then estimates motion from these synthesized samples using a pre-trained optical flow model, and finally aggregates the results into a probabilistic flow distribution. This design eliminates the need for task-specific training while capturing multiple plausible motions. Experiments on both synthetic and real-world datasets demonstrate that ProbDiffFlow achieves superior accuracy, diversity, and efficiency, outperforming existing single-image and two-frame baselines.
中文: 本文提出ProbDiffFlow框架,无需训练即可从单张图像估计光流概率分布,通过生成多样化未来帧并聚合运动预测,克服现有方法依赖标注数据和确定性输出的局限。
English: This paper introduces ProbDiffFlow, a training-free framework that estimates probabilistic optical flow distributions from a single image by synthesizing diverse future frames and aggregating motion predictions, overcoming limitations of existing methods that require labeled data and deterministic outputs.

Authors:Nasim Borazjanizadeh, Roei Herzig, Eduard Oks, Trevor Darrell, Rogerio Feris, Leonid Karlinsky
Title: Visualizing Thought: Conceptual Diagrams Enable Robust Combinatorial Planning in LMMs
Abstract:
Human reasoning relies on constructing and manipulating mental models -- simplified internal representations of situations that we use to understand and solve problems. Conceptual diagrams (e.g., a sketch drawn by a human to aid reasoning) externalize these mental models, abstracting irrelevant details to efficiently capture how entities interact with each other. In contrast, Large Language Models (LLMs) and Large MultiModal Models (LMMs) predominantly reason through text, limiting their effectiveness in complex multi-step tasks. In this paper, we propose Visual Thinking, a zero-shot framework that enables LMMs to reason through multiple chains of (self-generated) conceptual diagrams, significantly enhancing their combinatorial planning capabilities. Our approach does not require any human initialization beyond the natural language description of the task. It integrates both textual and diagrammatic reasoning within an optimized Graph-of-Thought inference framework, enhanced by beam search and depth-wise backtracking. Evaluated on multiple challenging PDDL planning domains, our method substantially improves LMMs' performance (e.g., GPT-4o: 35.5% -> 90.2% in Blocksworld) and consistently outperforms other text-only search-based inference methods. On more difficult planning domains with solution depths up to 40, our approach outperforms even the o1-preview reasoning model (e.g., 16 percentage points improvement in Floor Tiles). These results highlight the value of conceptual diagrams as a reasoning medium in LMMs.
Chinese: Visual Thinking框架通过让大型多模态模型利用自生成的概念图进行推理,无需人工干预即可显著提升其组合规划能力,在复杂任务中全面超越纯文本推理方法。
English: The Visual Thinking framework enables Large MultiModal Models to reason through self-generated conceptual diagrams, significantly boosting their planning capabilities without human input and outperforming text-only methods in complex tasks.

Authors:Nasim Borazjanizadeh, Roei Herzig, Eduard Oks, Trevor Darrell, Rogerio Feris, Leonid Karlinsky
Title: Visualizing Thought: Conceptual Diagrams Enable Robust Planning in LMMs
Abstract:
Human reasoning relies on constructing and manipulating mental models -- simplified internal representations of situations used to understand and solve problems. Conceptual diagrams (e.g., a sketch drawn to aid reasoning) externalize these mental models, abstracting irrelevant details to efficiently capture how entities interact. In contrast, Large Language Models (LLMs) and Large MultiModal Models (LMMs) predominantly reason through text, limiting their effectiveness on complex multi-step tasks. In this paper, we propose Visual Thinking, a generalizable framework that enables LMMs to reason through multiple chains of self-generated conceptual diagrams, significantly enhancing their combinatorial planning capabilities. Our approach requires no human input beyond the natural language description of the task. It integrates textual and diagrammatic reasoning within an optimized Graph-of-Thought inference framework, enhanced by beam search and depth-wise backtracking. Evaluated on multiple challenging PDDL planning domains, our method substantially improves LMM performance (e.g., GPT-4o: 35.5% -> 90.2% in Blocksworld) and consistently outperforms text-only search-based inference methods. On more difficult domains with solution depths up to 40, it also surpasses the o1-preview reasoning model (e.g., 16 percentage points improvement in Floor Tiles). These results demonstrate the power of conceptual diagrams as a reasoning medium in LMMs.
Chinese: Visual Thinking框架通过让大型多模态模型利用自生成的概念图进行推理,无需人工干预即可显著提升其组合规划能力,在复杂任务中全面超越纯文本推理方法。
English: The Visual Thinking framework enables Large MultiModal Models to reason through self-generated conceptual diagrams, significantly boosting their planning capabilities without human input and outperforming text-only methods in complex tasks.

Authors:Jiachi Chen, Zhenzhe Shao, Shuo Yang, Yiming Shen, Yanlin Wang, Ting Chen, Zhenyu Shan, Zibin Zheng
Title: NumScout: Unveiling Numerical Defects in Smart Contracts using LLM-Pruning Symbolic Execution
Abstract:
In recent years, the Ethereum platform has witnessed a proliferation of smart contracts, accompanied by exponential growth in total value locked (TVL). High-TVL smart contracts often require complex numerical computations, particularly in mathematical financial models used by many decentralized applications (DApps). Improper calculations can introduce numerical defects, posing potential security risks. Existing research primarily focuses on traditional numerical defects like integer overflow, and there is currently a lack of systematic research and effective detection methods targeting new types of numerical defects. In this paper, we identify five new types of numerical defects through the analysis of 1,199 audit reports by utilizing the open card method. Each defect is defined and illustrated with a code example to highlight its features and potential consequences. We also propose NumScout, a symbolic execution-based tool designed to detect these five defects. Specifically, the tool combines information from source code and bytecode, analyzing key operations such as comparisons and transfers, to effectively locate defects and report them based on predefined detection patterns. Furthermore, NumScout uses a large language model (LLM) to prune functions which are unrelated to numerical operations. This step allows symbolic execution to quickly enter the target function and improve runtime speed by 28.4%. We run NumScout on 6,617 real-world contracts and evaluated its performance based on manually labeled results. We find that 1,774 contracts contained at least one of the five defects, and the tool achieved an overall precision of 89.7%.
中文摘要:本文通过审计报告分析识别出以太坊智能合约中五类新型数值缺陷,并提出结合符号执行与大语言模型剪枝的检测工具NumScout,在实际合约测试中达到89.7%的准确率。
English Summary: This paper identifies five new types of numerical defects in Ethereum smart contracts through audit report analysis and introduces NumScout, a symbolic execution-based detection tool enhanced with LLM pruning that achieves 89.7% precision in real-world testing.

Authors:Qi Wang, Zhipeng Zhang, Baao Xie, Xin Jin, Yunbo Wang, Shiyu Wang, Liaomo Zheng, Xiaokang Yang, Wenjun Zeng
Title: Disentangled World Models: Learning to Transfer Semantic Knowledge from Distracting Videos for Reinforcement Learning
Abstract:
Training visual reinforcement learning (RL) in practical scenarios presents a significant challenge, $\textit{i.e.,}$ RL agents suffer from low sample efficiency in environments with variations. While various approaches have attempted to alleviate this issue by disentangled representation learning, these methods usually start learning from scratch without prior knowledge of the world. This paper, in contrast, tries to learn and understand underlying semantic variations from distracting videos via offline-to-online latent distillation and flexible disentanglement constraints. To enable effective cross-domain semantic knowledge transfer, we introduce an interpretable model-based RL framework, dubbed Disentangled World Models (DisWM). Specifically, we pretrain the action-free video prediction model offline with disentanglement regularization to extract semantic knowledge from distracting videos. The disentanglement capability of the pretrained model is then transferred to the world model through latent distillation. For finetuning in the online environment, we exploit the knowledge from the pretrained model and introduce a disentanglement constraint to the world model. During the adaptation phase, the incorporation of actions and rewards from online environment interactions enriches the diversity of the data, which in turn strengthens the disentangled representation learning. Experimental results validate the superiority of our approach on various benchmarks.
Chinese Summary: 本文提出解耦世界模型(DisWM),通过离线到在线的潜在蒸馏和灵活解耦约束,从干扰视频中迁移语义知识,从而在视觉强化学习中提高样本效率。
English Summary: This paper introduces Disentangled World Models (DisWM), an interpretable model-based reinforcement learning framework that enhances sample efficiency by transferring semantic knowledge from distracting videos through offline-to-online latent distillation and flexible disentanglement constraints.

Authors:Mengke Zhang, Zhihao Tian, Yaoguang Xia, Chao Xu, Fei Gao, Yanjun Cao
Title: Efficient Trajectory Generation Based on Traversable Planes in 3D Complex Architectural Spaces
Abstract:
With the increasing integration of robots into human life, their role in architectural spaces where people spend most of their time has become more prominent. While motion capabilities and accurate localization for automated robots have rapidly developed, the challenge remains to generate efficient, smooth, comprehensive, and high-quality trajectories in these areas. In this paper, we propose a novel efficient planner for ground robots to autonomously navigate in large complex multi-layered architectural spaces. Considering that traversable regions typically include ground, slopes, and stairs, which are planar or nearly planar structures, we simplify the problem to navigation within and between complex intersecting planes. We first extract traversable planes from 3D point clouds through segmenting, merging, classifying, and connecting to build a plane-graph, which is lightweight but fully represents the traversable regions. We then build a trajectory optimization based on motion state trajectory and fully consider special constraints when crossing multi-layer planes to maximize the robot's maneuverability. We conduct experiments in simulated environments and test on a CubeTrack robot in real-world scenarios, validating the method's effectiveness and practicality.
中文摘要:本文提出一种针对地面机器人的高效导航规划器,通过将可通行区域表示为相互连接的平面图,并考虑多层平面穿越的特殊约束进行轨迹优化,实现在复杂多层建筑空间中的自主导航。
English Summary: This paper introduces an efficient navigation planner for ground robots that simplifies movement in complex multi-layered architectural spaces by representing traversable areas as interconnected planes and optimizing trajectories with special multi-layer constraints.

Authors:Qiang Zhu, Yuxuan Jiang, Shuyuan Zhu, Fan Zhang, David Bull, Bing Zeng
Title: Blind Video Super-Resolution based on Implicit Kernels
Abstract:
Blind video super-resolution (BVSR) is a low-level vision task which aims to generate high-resolution videos from low-resolution counterparts in unknown degradation scenarios. Existing approaches typically predict blur kernels that are spatially invariant in each video frame or even the entire video. These methods do not consider potential spatio-temporal varying degradations in videos, resulting in suboptimal BVSR performance. In this context, we propose a novel BVSR model based on Implicit Kernels, BVSR-IK, which constructs a multi-scale kernel dictionary parameterized by implicit neural representations. It also employs a newly designed recurrent Transformer to predict the coefficient weights for accurate filtering in both frame correction and feature alignment. Experimental results have demonstrated the effectiveness of the proposed BVSR-IK, when compared with four state-of-the-art BVSR models on three commonly used datasets, with BVSR-IK outperforming the second best approach, FMA-Net, by up to 0.59 dB in PSNR. Source code will be available at https://github.com.
中文: 本文提出了一种基于隐式核的新盲视频超分辨率模型BVSR-IK,它通过隐式神经表示构建多尺度核字典,并采用循环Transformer处理时空变化退化,在性能上超越现有最佳方法达0.59 dB的PSNR提升。
English: The paper introduces BVSR-IK, a novel blind video super-resolution model that uses implicit neural representations to create a multi-scale kernel dictionary and a recurrent Transformer for handling spatio-temporal varying degradations, achieving superior performance over existing methods by up to 0.59 dB in PSNR.

Authors:Mingqi Yuan, Bo Li, Xin Jin, Wenjun Zeng
Title: ULTHO: Ultra-Lightweight yet Efficient Hyperparameter Optimization in Deep Reinforcement Learning
Abstract:
Hyperparameter optimization (HPO) is a billion-dollar problem in machine learning, which significantly impacts the training efficiency and model performance. However, achieving efficient and robust HPO in deep reinforcement learning (RL) is consistently challenging due to its high non-stationarity and computational cost. To tackle this problem, existing approaches attempt to adapt common HPO techniques (e.g., population-based training or Bayesian optimization) to the RL scenario. However, they remain sample-inefficient and computationally expensive, which cannot facilitate a wide range of applications. In this paper, we propose ULTHO, an ultra-lightweight yet powerful framework for fast HPO in deep RL within single runs. Specifically, we formulate the HPO process as a multi-armed bandit with clustered arms (MABC) and link it directly to long-term return optimization. ULTHO also provides a quantified and statistical perspective to filter the HPs efficiently. We test ULTHO on benchmarks including ALE, Procgen, MiniGrid, and PyBullet. Extensive experiments demonstrate that the ULTHO can achieve superior performance with a simple architecture, contributing to the development of advanced and automated RL systems.
中文: 深度强化学习中的超参数优化因高成本与低效而充满挑战,但提出的ULTHO框架通过多臂老虎机与聚类机制,以轻量级设计实现了高效优异的性能。
English: Hyperparameter optimization in deep reinforcement learning is challenging due to high costs and inefficiency, but the proposed ULTHO framework offers a lightweight solution using multi-armed bandit with clustered arms to achieve superior performance efficiently.

Authors:Xinyi Hou, Yanjie Zhao, Haoyu Wang
Title: The Next Frontier of LLM Applications: Open Ecosystems and Hardware Synergy
Abstract:
Large Language Model (LLM) applications, including LLM app stores and autonomous agents, are shaping the future of AI ecosystems. However, platform silos, fragmented hardware integration, and the absence of standardized interfaces limit scalability, interoperability, and resource efficiency. While LLM app stores democratize AI, their closed ecosystems restrict modular AI reuse and cross-platform portability. Meanwhile, agent-based frameworks offer flexibility but often lack seamless integration across diverse environments. This paper envisions the future of LLM applications and proposes a three-layer decoupled architecture grounded in software engineering principles such as layered system design, service-oriented architectures, and hardware-software co-design. This architecture separates application logic, communication protocols, and hardware execution, enhancing modularity, efficiency, and cross-platform compatibility. Beyond architecture, we highlight key security and privacy challenges for safe, scalable AI deployment and outline research directions in software and security engineering. This vision aims to foster open, secure, and interoperable LLM ecosystems, guiding future advancements in AI applications.
中文: 本文综述了LLM的四大应用范式——应用商店、智能体、自托管服务和终端设备,分析其架构、生态与挑战,提出基础设施、协议和应用三层框架以提升互操作性、安全性和可扩展性,并展望了未来研究方向。
English: This paper reviews four major LLM application paradigms—app stores, agents, self-hosted services, and powered devices—analyzing their architectures, ecosystems, and challenges, while proposing a three-layer framework (infrastructure, protocol, application) to enhance interoperability, security, and scalability, alongside future research directions.

Authors:Xinyi Hou, Yanjie Zhao, Haoyu Wang
Title: LLM Applications: Current Paradigms and the Next Frontier
Abstract:
The development of large language models (LLMs) has given rise to four major application paradigms: LLM app stores, LLM agents, self-hosted LLM services, and LLM-powered devices. Each has its advantages but also shares common challenges. LLM app stores lower the barrier to development but lead to platform lock-in; LLM agents provide autonomy but lack a unified communication mechanism; self-hosted LLM services enhance control but increase deployment complexity; and LLM-powered devices improve privacy and real-time performance but are limited by hardware. This paper reviews and analyzes these paradigms, covering architecture design, application ecosystem, research progress, as well as the challenges and open problems they face. Based on this, we outline the next frontier of LLM applications, characterizing them through three interconnected layers: infrastructure, protocol, and application. We describe their responsibilities and roles of each layer and demonstrate how to mitigate existing fragmentation limitations and improve security and scalability. Finally, we discuss key future challenges, identify opportunities such as protocol-driven cross-platform collaboration and device integration, and propose a research roadmap for openness, security, and sustainability.
中文: 本文综述了LLM的四大应用范式——应用商店、智能体、自托管服务和终端设备,分析其架构、生态与挑战,提出基础设施、协议和应用三层框架以提升互操作性、安全性和可扩展性,并展望了未来研究方向。
English: This paper reviews four major LLM application paradigms—app stores, agents, self-hosted services, and powered devices—analyzing their architectures, ecosystems, and challenges, while proposing a three-layer framework (infrastructure, protocol, application) to enhance interoperability, security, and scalability, alongside future research directions.

Authors:Sreyan Ghosh, Zhifeng Kong, Sonal Kumar, S Sakshi, Jaehyeon Kim, Wei Ping, Rafael Valle, Dinesh Manocha, Bryan Catanzaro
Title: Audio Flamingo 2: An Audio-Language Model with Long-Audio Understanding and Expert Reasoning Abilities
Abstract:
Understanding and reasoning over non-speech sounds and music are crucial for both humans and AI agents to interact effectively with their environments. In this paper, we introduce Audio Flamingo 2 (AF2), an Audio-Language Model (ALM) with advanced audio understanding and reasoning capabilities. AF2 leverages (i) a custom CLAP model, (ii) synthetic Audio QA data for fine-grained audio reasoning, and (iii) a multi-stage curriculum learning strategy. AF2 achieves state-of-the-art performance with only a 3B parameter small language model, surpassing large open-source and proprietary models across over 20 benchmarks. Next, for the first time, we extend audio understanding to long audio segments (30 secs to 5 mins) and propose LongAudio, a large and novel dataset for training ALMs on long audio captioning and question-answering tasks. Fine-tuning AF2 on LongAudio leads to exceptional performance on our proposed LongAudioBench, an expert annotated benchmark for evaluating ALMs on long audio understanding capabilities. We conduct extensive ablation studies to confirm the efficacy of our approach. Project Website: https://research.nvidia.com/labs/adlr/AF2/.
中文: Audio Flamingo 2 (AF2) 作为先进的音频语言模型,通过3B参数架构在音频理解与推理任务中实现最优性能,并借助创新的LongAudio数据集将处理能力扩展至长音频领域。
English: Audio Flamingo 2 (AF2) is a state-of-the-art audio-language model that achieves superior performance in audio understanding and reasoning using a compact 3B parameter architecture and extends capabilities to long audio processing through the novel LongAudio dataset.

Authors:Sha Ye, Qiong Wu, Pingyi Fan, Qiang Fan
Title: A Survey on Semantic Communications in Internet of Vehicles
Abstract:
Internet of Vehicles (IoV), as the core of intelligent transportation system, enables comprehensive interconnection between vehicles and their surroundings through multiple communication modes, which is significant for autonomous driving and intelligent traffic management. However, with the emergence of new applications, traditional communication technologies face the problems of scarce spectrum resources and high latency. Semantic communication, which focuses on extracting, transmitting, and recovering some useful semantic information from messages, can reduce redundant data transmission, improve spectrum utilization, and provide innovative solutions to communication challenges in the IoV. This paper systematically reviews state of art of semantic communications in the IoV, elaborates the technical background of IoV and semantic communications, and deeply discusses key technologies of semantic communications in IoV, including semantic information extraction, semantic communication architecture, resource allocation and management, and so on. Through specific case studies, it demonstrates that semantic communications can be effectively employed in the scenarios of traffic environment perception and understanding, intelligent driving decision support, IoV service optimization, and intelligent traffic management. Additionally, it analyzes the current challenges and future research directions. This survey reveals that semantic communications has broad application prospects in IoV, but it is necessary to solve the real existing problems by combining advanced technologies to promote its wide application in IoV and contributing to the development of intelligent transportation system.
中文: 语义通信通过提取和传输关键语义信息,有效应对车联网中的频谱资源紧张和延迟问题,为自动驾驶和智能交通管理提供高效解决方案,尽管仍需克服实际应用中的挑战。
English: Semantic communication offers a promising solution to spectrum scarcity and latency issues in the Internet of Vehicles by focusing on essential information extraction and transmission, enhancing efficiency for autonomous driving and traffic management despite existing challenges.

Authors:Masoumeh Sharafi, Emma Ollivier, Muhammad Osama Zeeshan, Soufiane Belharbi, Marco Pedersoli, Alessandro Lameiras Koerich, Simon Bacon, Eric Granger
Title: Disentangled Source-Free Personalization for Facial Expression Recognition with Neutral Target Data
Abstract:
Facial Expression Recognition (FER) from videos is a crucial task in various application areas, such as human-computer interaction and health diagnosis and monitoring (e.g., assessing pain and depression). Beyond the challenges of recognizing subtle emotional or health states, the effectiveness of deep FER models is often hindered by the considerable inter-subject variability in expressions. Source-free (unsupervised) domain adaptation (SFDA) methods may be employed to adapt a pre-trained source model using only unlabeled target domain data, thereby avoiding data privacy, storage, and transmission issues. Typically, SFDA methods adapt to a target domain dataset corresponding to an entire population and assume it includes data from all recognition classes. However, collecting such comprehensive target data can be difficult or even impossible for FER in healthcare applications. In many real-world scenarios, it may be feasible to collect a short neutral control video (which displays only neutral expressions) from target subjects before deployment. These videos can be used to adapt a model to better handle the variability of expressions among subjects. This paper introduces the Disentangled SFDA (DSFDA) method to address the challenge posed by adapting models with missing target expression data. DSFDA leverages data from a neutral target control video for end-to-end generation and adaptation of target data with missing non-neutral data. Our method learns to disentangle features related to expressions and identity while generating the missing non-neutral expression data for the target subject, thereby enhancing model accuracy. Additionally, our self-supervision strategy improves model adaptation by reconstructing target images that maintain the same identity and source expression.
中文: 本文提出解耦无源域自适应方法,通过利用中性表情控制视频生成并适配缺失的非中性表情数据,在分离身份与表情特征的同时提升模型在目标域数据不完整情况下的面部表情识别准确率。
English: This paper introduces the Disentangled SFDA method, which addresses the challenge of adapting facial expression recognition models when target expression data is missing by leveraging neutral control videos to generate and adapt missing non-neutral expressions while disentangling identity and expression features for improved accuracy.

Authors:Yucheng Suo, Fan Ma, Linchao Zhu, Tianyi Wang, Fengyun Rao, Yi Yang
Title: From Trial to Triumph: Advancing Long Video Understanding via Visual Context Sample Scaling and Self-reward Alignment
Abstract:
Multi-modal Large language models (MLLMs) show remarkable ability in video understanding. Nevertheless, understanding long videos remains challenging as the models can only process a finite number of frames in a single inference, potentially omitting crucial visual information. To address the challenge, we propose generating multiple predictions through visual context sampling, followed by a scoring mechanism to select the final prediction. Specifically, we devise a bin-wise sampling strategy that enables MLLMs to generate diverse answers based on various combinations of keyframes, thereby enriching the visual context. To determine the final prediction from the sampled answers, we employ a self-reward by linearly combining three scores: (1) a frequency score indicating the prevalence of each option, (2) a marginal confidence score reflecting the inter-intra sample certainty of MLLM predictions, and (3) a reasoning score for different question types, including clue-guided answering for global questions and temporal self-refocusing for local questions. The frequency score ensures robustness through majority correctness, the confidence-aligned score reflects prediction certainty, and the typed-reasoning score addresses cases with sparse key visual information using tailored strategies. Experiments show that this approach covers the correct answer for a high percentage of long video questions, on seven datasets show that our method improves the performance of three MLLMs.
中文摘要:针对多模态大语言模型处理长视频时因帧数限制导致的视觉信息缺失问题,我们提出通过视觉上下文采样生成多组预测,并结合频率、置信度和推理三类评分的选择机制,实验证明该方法能有效提升三个模型在七个数据集上的长视频问答性能。
English Summary: Multi-modal large language models struggle with long videos due to limited frame processing, so we propose a method using visual context sampling and a scoring mechanism to generate and select optimal predictions, significantly improving performance across multiple datasets.

Authors:Haoqin Tu, Weitao Feng, Hardy Chen, Hui Liu, Xianfeng Tang, Cihang Xie
Title: ViLBench: A Suite for Vision-Language Process Reward Modeling
Abstract:
Process-supervised reward models serve as a fine-grained function that provides detailed step-wise feedback to model responses, facilitating effective selection of reasoning trajectories for complex tasks. Despite its advantages, evaluation on PRMs remains less explored, especially in the multimodal domain. To address this gap, this paper first benchmarks current vision large language models (VLLMs) as two types of reward models: output reward models (ORMs) and process reward models (PRMs) on multiple vision-language benchmarks, which reveal that neither ORM nor PRM consistently outperforms across all tasks, and superior VLLMs do not necessarily yield better rewarding performance. To further advance evaluation, we introduce ViLBench, a vision-language benchmark designed to require intensive process reward signals. Notably, OpenAI's GPT-4o with Chain-of-Thought (CoT) achieves only 27.3% accuracy, indicating the benchmark's challenge for current VLLMs. Lastly, we preliminarily showcase a promising pathway towards bridging the gap between general VLLMs and reward models -- by collecting 73.6K vision-language process reward data using an enhanced tree-search algorithm, our 3B model is able to achieve an average improvement of 3.3% over standard CoT and up to 2.5% compared to its untrained counterpart on ViLBench by selecting OpenAI o1's generations. We release the implementations at https://ucsc-vlaa.github.io/ViLBench with our code, model, and data.
Chinese: 本文对视觉语言模型作为奖励模型进行了基准测试,发现其在不同任务中表现不一,并引入ViLBench评估过程奖励,表明即使先进模型也面临挑战,同时证明利用过程数据进行针对性训练可有效提升性能。
English: This paper benchmarks vision-language models as reward models, revealing inconsistent performance across tasks, and introduces ViLBench to evaluate process rewards, showing that even advanced models struggle while demonstrating that targeted training with process data can enhance performance.

Authors:Junwei Zheng, Ruiping Liu, Yufan Chen, Zhenfang Chen, Kailun Yang, Jiaming Zhang, Rainer Stiefelhagen
Title: Scene-agnostic Pose Regression for Visual Localization
Abstract:
Absolute Pose Regression (APR) predicts 6D camera poses but lacks the adaptability to unknown environments without retraining, while Relative Pose Regression (RPR) generalizes better yet requires a large image retrieval database. Visual Odometry (VO) generalizes well in unseen environments but suffers from accumulated error in open trajectories. To address this dilemma, we introduce a new task, Scene-agnostic Pose Regression (SPR), which can achieve accurate pose regression in a flexible way while eliminating the need for retraining or databases. To benchmark SPR, we created a large-scale dataset, 360SPR, with over 200K photorealistic panoramas, 3.6M pinhole images and camera poses in 270 scenes at three different sensor heights. Furthermore, a SPR-Mamba model is initially proposed to address SPR in a dual-branch manner. Extensive experiments and studies demonstrate the effectiveness of our SPR paradigm, dataset, and model. In the unknown scenes of both 360SPR and 360Loc datasets, our method consistently outperforms APR, RPR and VO. The dataset and code are available at https://junweizheng93.github.io/publications/SPR/SPR.html.
中文: 本文提出场景无关姿态回归(SPR),通过无需重新训练或数据库的灵活方法实现精确相机姿态估计,并基于新构建的360SPR数据集和双分支SPR-Mamba模型,在未知场景中持续优于现有技术。
English: The paper introduces Scene-agnostic Pose Regression (SPR), a flexible method that achieves accurate camera pose estimation without retraining or databases, supported by the new 360SPR dataset and a dual-branch SPR-Mamba model, outperforming existing techniques in unknown environments.

Authors:Zekai Deng, Ye Shi, Kaiyang Ji, Lan Xu, Shaoli Huang, Jingya Wang
Title: Human-Object Interaction via Automatically Designed VLM-Guided Motion Policy
Abstract:
Human-object interaction (HOI) synthesis is crucial for applications in animation, simulation, and robotics. However, existing approaches either rely on expensive motion capture data or require manual reward engineering, limiting their scalability and generalizability. In this work, we introduce the first unified physics-based HOI framework that leverages Vision-Language Models (VLMs) to enable long-horizon interactions with diverse object types, including static, dynamic, and articulated objects. We introduce VLM-Guided Relative Movement Dynamics (RMD), a fine-grained spatio-temporal bipartite representation that automatically constructs goal states and reward functions for reinforcement learning. By encoding structured relationships between human and object parts, RMD enables VLMs to generate semantically grounded, interaction-aware motion guidance without manual reward tuning. To support our methodology, we present Interplay, a novel dataset with thousands of long-horizon static and dynamic interaction plans. Extensive experiments demonstrate that our framework outperforms existing methods in synthesizing natural, human-like motions across both simple single-task and complex multi-task scenarios. For more details, please refer to our project webpage: https://vlm-rmd.github.io/.
中文摘要:本文提出首个基于物理的人-物交互统一框架,利用视觉语言模型自动生成运动指导和奖励函数,无需人工调整即可在不同交互场景中实现自然拟人化运动。
English Summary: This paper introduces a unified physics-based human-object interaction framework using Vision-Language Models to automatically generate motion guidance and reward functions, enabling natural human-like motions across diverse interaction scenarios without manual tuning.

Authors:Qingyue Long, Can Rong, Huandong Wang, Shaw Rajib, Yong Li
Title: DiffMove: Group Mobility Tendency Enhanced Trajectory Recovery via Diffusion Model
Abstract:
In the real world, trajectory data is often sparse and incomplete due to low collection frequencies or limited device coverage. Trajectory recovery aims to recover these missing trajectory points, making the trajectories denser and more complete. However, this task faces two key challenges: 1) The excessive sparsity of individual trajectories makes it difficult to effectively leverage historical information for recovery; 2) Sparse trajectories make it harder to capture complex individual mobility preferences. To address these challenges, we propose a novel method called DiffMove. Firstly, we harness crowd wisdom for trajectory recovery. Specifically, we construct a group tendency graph using the collective trajectories of all users and then integrate the group mobility trends into the location representations via graph embedding. This solves the challenge of sparse trajectories being unable to rely on individual historical trajectories for recovery. Secondly, we capture individual mobility preferences from both historical and current perspectives. Finally, we integrate group mobility tendencies and individual preferences into the spatiotemporal distribution of the trajectory to recover high-quality trajectories. Extensive experiments on two real-world datasets demonstrate that DiffMove outperforms existing state-of-the-art methods. Further analysis validates the robustness of our method.
Chinese: DiffMove通过整合群体移动趋势与个人偏好到轨迹的时空分布中,有效解决了稀疏轨迹恢复问题,在实际数据集测试中超越了现有最优方法。
English: DiffMove addresses sparse trajectory recovery by integrating crowd-sourced mobility trends and individual preferences into spatiotemporal distributions, outperforming existing methods in real-world experiments.

Authors:Xiaodan Shao, Weidong Mei, Changsheng You, Qingqing Wu, Beixiong Zheng, Cheng-Xiang Wang, Junling Li, Rui Zhang, Robert Schober, Lipeng Zhu, Weihua Zhuang, Xuemin Shen
Title: A Tutorial on Six-Dimensional Movable Antenna for 6G Networks: Synergizing Positionable and Rotatable Antennas
Abstract:
Six-dimensional movable antenna (6DMA) is a new and revolutionary technique that fully exploits the wireless channel spatial variations at the transmitter/receiver by flexibly adjusting the three-dimensional (3D) positions and/or 3D rotations of antennas/antenna surfaces (sub-arrays), thereby improving the performance of wireless networks cost-effectively without the need to deploy additional antennas. It is thus expected that the integration of new 6DMAs into future sixth-generation (6G) wireless networks will fundamentally enhance antenna agility and adaptability, and introduce new degrees of freedom (DoFs) for system design. Despite its great potential, 6DMA faces new challenges to be efficiently implemented in wireless networks, including corresponding architectures, antenna position and rotation optimization, channel estimation, and system design from both communication and sensing perspectives. In this paper, we provide a tutorial on 6DMA-enhanced wireless networks to address the above issues by unveiling associated new channel models, hardware implementations and practical position/rotation constraints, as well as various appealing applications in wireless networks. Moreover, we discuss two special cases of 6DMA, namely, rotatable 6DMA with fixed antenna position and positionable 6DMA with fixed antenna rotation, and highlight their respective design challenges and applications. We further present prototypes developed for 6DMA-enhanced communication along with experimental results obtained with these prototypes. Finally, we outline promising directions for further investigation.
中文: 六维可动天线(6DMA)是一项革命性技术,通过灵活调整天线的三维位置和旋转来提升无线网络性能,为6G系统引入新自由度,但需解决架构设计和优化等实施难题。
English: Six-dimensional movable antenna (6DMA) is a transformative technology that enhances wireless network performance by dynamically adjusting antennas' positions and rotations, offering new design freedoms for 6G systems while facing implementation challenges in architecture and optimization.

Authors:Yibo Yan, Shen Wang, Jiahao Huo, Philip S. Yu, Xuming Hu, Qingsong Wen
Title: MathAgent: Leveraging a Mixture-of-Math-Agent Framework for Real-World Multimodal Mathematical Error Detection
Abstract:
Mathematical error detection in educational settings presents a significant challenge for Multimodal Large Language Models (MLLMs), requiring a sophisticated understanding of both visual and textual mathematical content along with complex reasoning capabilities. Though effective in mathematical problem-solving, MLLMs often struggle with the nuanced task of identifying and categorizing student errors in multimodal mathematical contexts. Therefore, we introduce MathAgent, a novel Mixture-of-Math-Agent framework designed specifically to address these challenges. Our approach decomposes error detection into three phases, each handled by a specialized agent: an image-text consistency validator, a visual semantic interpreter, and an integrative error analyzer. This architecture enables more accurate processing of mathematical content by explicitly modeling relationships between multimodal problems and student solution steps. We evaluate MathAgent on real-world educational data, demonstrating approximately 5% higher accuracy in error step identification and 3% improvement in error categorization compared to baseline models. Besides, MathAgent has been successfully deployed in an educational platform that has served over one million K-12 students, achieving nearly 90% student satisfaction while generating significant cost savings by reducing manual error detection.
中文: MathAgent是一种专为多模态数学错误检测设计的混合代理框架,通过分解为三个代理处理阶段,在错误步骤识别和分类上实现更高准确率,并已成功应用于教育平台,显著节省成本且获得近90%的学生满意度。
English: MathAgent is a specialized Mixture-of-Math-Agent framework that enhances multimodal mathematical error detection by decomposing it into three agent-managed phases, achieving higher accuracy in step identification and categorization while being successfully deployed in educational platforms with significant cost savings and student satisfaction.

Authors:Wentao Jiang, Jingya Wang, Kaiyang Ji, Baoxiong Jia, Siyuan Huang, Ye Shi
Title: ARFlow: Human Action-Reaction Flow Matching with Physical Guidance
Abstract:
Human action-reaction synthesis, a fundamental challenge in modeling causal human interactions, plays a critical role in applications ranging from virtual reality to social robotics. While diffusion-based models have demonstrated promising performance, they exhibit two key limitations for interaction synthesis: reliance on complex noise-to-reaction generators with intricate conditional mechanisms, and frequent physical violations in generated motions. To address these issues, we propose Action-Reaction Flow Matching (ARFlow), a novel framework that establishes direct action-to-reaction mappings, eliminating the need for complex conditional mechanisms. Our approach introduces a physical guidance mechanism specifically designed for Flow Matching (FM) that effectively prevents body penetration artifacts during sampling. Moreover, we discover the bias of traditional flow matching sampling algorithm and employ a reprojection method to revise the sampling direction of FM. To further enhance the reaction diversity, we incorporate randomness into the sampling process. Extensive experiments on NTU120, Chi3D and InterHuman datasets demonstrate that ARFlow not only outperforms existing methods in terms of Fréchet Inception Distance and motion diversity but also significantly reduces body collisions, as measured by our new Intersection Volume and Intersection Frequency metrics.
中文摘要:提出的ARFlow框架通过建立直接的动作-反应映射关系并引入物理引导机制,克服了扩散模型的局限性,在运动质量和多样性方面表现优异,同时显著减少了身体碰撞。
English Summary: The proposed ARFlow framework overcomes limitations of diffusion models by establishing direct action-to-reaction mappings with physical guidance, achieving superior performance in motion quality and diversity while significantly reducing body collisions.

Authors:Jingying Zeng, Zhenwei Dai, Hui Liu, Samarth Varshney, Zhiji Liu, Chen Luo, Zhen Li, Qi He, Xianfeng Tang
Title: Examples as the Prompt: A Scalable Approach for Efficient LLM Adaptation in E-Commerce
Abstract:
Prompting LLMs offers an efficient way to guide output generation without explicit model training. In the e-commerce domain, prompting-based applications are widely used for tasks such as query understanding, recommender systems, and customer support. However, adapting LLMs to different tasks often requires extensive prompt engineering by domain experts, along with frequent updates to align with evolving business needs. Additionally, crafting fully unbiased natural language prompts remains a challenge for humans. To address these challenges, we propose a novel framework, Examples as the Prompt (EaP) which leverages labeled data to enhance prompts. Specifically, EaP automatically selects the most representative examples to maximize the few-shot capability of LLMs. It is efficient due to its unsupervised example selection and adaptive to potential data distribution shifts. We validate EaP on four real-world production use cases, demonstrating that it achieves comparable or even superior performance comparing to hand-crafted prompts designed by domain experts. Additionally, we introduce EaP_lite, which entirely replaces the natural language components of prompts with labeled examples. EaP_lite improves LLM inference speed by up to 70% without compromising performance. Latest online A/B test shows that using EaP and EaP_lite for data labeling can bring significant composite revenue gain by 0.06%.
中文:提出的“示例即提示”(EaP)框架通过自动选择代表性示例来提升大语言模型性能,在实现与专家设计提示相当或更优效果的同时,提高了推理速度并带来收益增长。
English: The proposed Examples as the Prompt (EaP) framework automatically selects representative examples to enhance LLM performance, achieving comparable or superior results to expert-crafted prompts while improving inference speed and revenue gains.

Authors:Yucheng Suo, Fan Ma, Kaixin Shen, Linchao Zhu, Yi Yang
Title: Long-horizon Visual Instruction Generation with Logic and Attribute Self-reflection
Abstract:
Visual instructions for long-horizon tasks are crucial as they intuitively clarify complex concepts and enhance retention across extended steps. Directly generating a series of images using text-to-image models without considering the context of previous steps results in inconsistent images, increasing cognitive load. Additionally, the generated images often miss objects or the attributes such as color, shape, and state of the objects are inaccurate. To address these challenges, we propose LIGER, the first training-free framework for Long-horizon Instruction GEneration with logic and attribute self-Reflection. LIGER first generates a draft image for each step with the historical prompt and visual memory of previous steps. This step-by-step generation approach maintains consistency between images in long-horizon tasks. Moreover, LIGER utilizes various image editing tools to rectify errors including wrong attributes, logic errors, object redundancy, and identity inconsistency in the draft images. Through this self-reflection mechanism, LIGER improves the logic and object attribute correctness of the images. To verify whether the generated images assist human understanding, we manually curated a new benchmark consisting of various long-horizon tasks. Human-annotated ground truth expressions reflect the human-defined criteria for how an image should appear to be illustrative. Experiments demonstrate the visual instructions generated by LIGER are more comprehensive compared with baseline methods.
中文: LIGER是一种无需训练的框架,通过利用历史上下文和自反思机制纠正错误,为长时程任务生成一致且准确的视觉指令,在人类评估中优于基线方法。
English: LIGER is a training-free framework that generates consistent and accurate visual instructions for long-horizon tasks by using historical context and self-reflection to correct errors, outperforming baseline methods in human evaluations.

Authors:Mengyao Lyu, Yan Li, Huasong Zhong, Wenhao Yang, Hui Chen, Jungong Han, Guiguang Ding, Zhenheng Yang
Title: Cream of the Crop: Harvesting Rich, Scalable and Transferable Multi-Modal Data for Instruction Fine-Tuning
Abstract:
The hypothesis that pretrained large language models (LLMs) necessitate only minimal supervision during the fine-tuning (SFT) stage (Zhou et al., 2024) has been substantiated by recent advancements in data curation and selection research. However, their stability and generalizability are compromised due to the vulnerability to experimental setups and validation protocols, falling short of surpassing random sampling (Diddee & Ippolito, 2024; Xia et al., 2024b). Built upon LLMs, multi-modal LLMs (MLLMs), combined with the sheer token volume and heightened heterogeneity of data sources, amplify both the significance and complexity of data selection. To harvest multi-modal instructional data in a robust and efficient manner, we re-define the granularity of the quality metric by decomposing it into 14 vision-language-related capabilities, and introduce multi-modal rich scorers to evaluate the capabilities of each data candidate. To promote diversity, in light of the inherent objective of the alignment stage, we take interaction style as diversity indicator and use a multi-modal rich styler to identify data instruction patterns. In doing so, our multi-modal rich scorers and styler (mmSSR) guarantee that high-scoring information is conveyed to users in diversified forms. Free from embedding-based clustering or greedy sampling, mmSSR efficiently scales to millions of data with varying budget constraints, supports customization for general or specific capability acquisition, and facilitates training-free generalization to new domains for curation. Across 10+ experimental settings, validated by 14 multi-modal benchmarks, we demonstrate consistent improvements over random sampling, baseline strategies and state-of-the-art selection methods, achieving 99.1% of full performance with only 30% of the 2.6M data.
中文摘要:近期研究表明大语言模型仅需少量监督即可微调,但其稳定性受实验设置影响,为此我们提出多模态数据选择方法mmSSR,通过评估14种视觉语言能力和交互模式,仅用30%数据即可实现99.1%的完整性能表现。
English Summary: Recent research suggests that large language models require minimal fine-tuning supervision, but their stability is compromised by experimental vulnerabilities, prompting the development of a multi-modal data selection method called mmSSR that evaluates 14 vision-language capabilities and interaction styles to efficiently achieve 99.1% performance with only 30% of data.

Authors:Daniil Selikhanovych, David Li, Aleksei Leonov, Nikita Gushchin, Sergei Kushneriuk, Alexander Filippov, Evgeny Burnaev, Iaroslav Koshelev, Alexander Korotin
Title: One-Step Residual Shifting Diffusion for Image Super-Resolution via Distillation
Abstract:
Diffusion models for super-resolution (SR) produce high-quality visual results but require expensive computational costs. Despite the development of several methods to accelerate diffusion-based SR models, some (e.g., SinSR) fail to produce realistic perceptual details, while others (e.g., OSEDiff) may hallucinate non-existent structures. To overcome these issues, we present RSD, a new distillation method for ResShift, one of the top diffusion-based SR models. Our method is based on training the student network to produce such images that a new fake ResShift model trained on them will coincide with the teacher model. RSD achieves single-step restoration and outperforms the teacher by a large margin. We show that our distillation method can surpass the other distillation-based method for ResShift - SinSR - making it on par with state-of-the-art diffusion-based SR distillation methods. Compared to SR methods based on pre-trained text-to-image models, RSD produces competitive perceptual quality, provides images with better alignment to degraded input images, and requires fewer parameters and GPU memory. We provide experimental results on various real-world and synthetic datasets, including RealSR, RealSet65, DRealSR, ImageNet, and DIV2K.
中文:RSD是一种针对ResShift的新型蒸馏方法,实现了单步超分辨率重建,在感知质量上超越教师模型和竞争方法,同时资源消耗更低。
English: RSD is a novel distillation method for ResShift that enables single-step super-resolution with superior perceptual quality, outperforming the teacher model and competing methods while using fewer resources.

Authors:Daniil Selikhanovych, David Li, Aleksei Leonov, Nikita Gushchin, Sergei Kushneriuk, Alexander Filippov, Evgeny Burnaev, Iaroslav Koshelev, Alexander Korotin
Title: One-Step Residual Shifting Diffusion for Image Super-Resolution via Distillation
Abstract:
Diffusion models for super-resolution (SR) produce high-quality visual results but require expensive computational costs. Despite the development of several methods to accelerate diffusion-based SR models, some (e.g., SinSR) fail to produce realistic perceptual details, while others (e.g., OSEDiff) may hallucinate non-existent structures. To overcome these issues, we present RSD, a new distillation method for ResShift, one of the top diffusion-based SR models. Our method is based on training the student network to produce such images that a new fake ResShift model trained on them will coincide with the teacher model. RSD achieves single-step restoration and outperforms the teacher by a large margin. We show that our distillation method can surpass the other distillation-based method for ResShift - SinSR - making it on par with state-of-the-art diffusion-based SR distillation methods. Compared to SR methods based on pre-trained text-to-image models, RSD produces competitive perceptual quality, provides images with better alignment to degraded input images, and requires fewer parameters and GPU memory. We provide experimental results on various real-world and synthetic datasets, including RealSR, RealSet65, DRealSR, ImageNet, and DIV2K.
中文:RSD是一种针对ResShift的新型蒸馏方法,实现了单步超分辨率重建,在感知质量上超越教师模型和竞争方法,同时资源消耗更低。
English: RSD is a novel distillation method for ResShift that enables single-step super-resolution with superior perceptual quality, outperforming the teacher model and competing methods while using fewer resources.

Authors:Roba Al Majzoub, Hashmat Malik, Muzammal Naseer, Zaigham Zaheer, Tariq Mahmood, Salman Khan, Fahad Khan
Title: How Good is my Histopathology Vision-Language Foundation Model? A Holistic Benchmark
Abstract:
Recently, histopathology vision-language foundation models (VLMs) have gained popularity due to their enhanced performance and generalizability across different downstream tasks. However, most existing histopathology benchmarks are either unimodal or limited in terms of diversity of clinical tasks, organs, and acquisition instruments, as well as their partial availability to the public due to patient data privacy. As a consequence, there is a lack of comprehensive evaluation of existing histopathology VLMs on a unified benchmark setting that better reflects a wide range of clinical scenarios. To address this gap, we introduce HistoVL, a fully open-source comprehensive benchmark comprising images acquired using up to 11 various acquisition tools that are paired with specifically crafted captions by incorporating class names and diverse pathology descriptions. Our Histo-VL includes 26 organs, 31 cancer types, and a wide variety of tissue obtained from 14 heterogeneous patient cohorts, totaling more than 5 million patches obtained from over 41K WSIs viewed under various magnification levels. We systematically evaluate existing histopathology VLMs on Histo-VL to simulate diverse tasks performed by experts in real-world clinical scenarios. Our analysis reveals interesting findings, including large sensitivity of most existing histopathology VLMs to textual changes with a drop in balanced accuracy of up to 25% in tasks such as Metastasis detection, low robustness to adversarial attacks, as well as improper calibration of models evident through high ECE values and low model prediction confidence, all of which can affect their clinical implementation.
中文: 组织病理学视觉语言模型因缺乏全面评估基准,故推出开源基准HistoVL,其包含多样化临床数据,揭示模型对文本变化敏感且在真实任务中鲁棒性不足的问题。
English: Histopathology vision-language models lack comprehensive evaluation due to limited benchmarks, so HistoVL is introduced as an open-source benchmark with diverse clinical data, revealing models' sensitivity to text changes and low robustness in real-world tasks.

Authors:Zijian He, Yuwei Ning, Yipeng Qin, Guangrun Wang, Sibei Yang, Liang Lin, Guanbin Li
Title: VTON 360: High-Fidelity Virtual Try-On from Any Viewing Direction
Abstract:
Virtual Try-On (VTON) is a transformative technology in e-commerce and fashion design, enabling realistic digital visualization of clothing on individuals. In this work, we propose VTON 360, a novel 3D VTON method that addresses the open challenge of achieving high-fidelity VTON that supports any-view rendering. Specifically, we leverage the equivalence between a 3D model and its rendered multi-view 2D images, and reformulate 3D VTON as an extension of 2D VTON that ensures 3D consistent results across multiple views. To achieve this, we extend 2D VTON models to include multi-view garments and clothing-agnostic human body images as input, and propose several novel techniques to enhance them, including: i) a pseudo-3D pose representation using normal maps derived from the SMPL-X 3D human model, ii) a multi-view spatial attention mechanism that models the correlations between features from different viewing angles, and iii) a multi-view CLIP embedding that enhances the garment CLIP features used in 2D VTON with camera information. Extensive experiments on large-scale real datasets and clothing images from e-commerce platforms demonstrate the effectiveness of our approach. Project page: https://scnuhealthy.github.io/VTON360.
中文:VTON 360提出了一种创新的3D虚拟试穿方法,通过扩展2D模型为多视图输入,并结合伪3D姿态表征、多视角注意力机制和增强CLIP嵌入等技术,实现了支持任意视角的高保真渲染。
English: VTON 360 introduces a novel 3D virtual try-on method that achieves high-fidelity, any-view rendering by extending 2D VTON models with multi-view inputs and innovative techniques including pseudo-3D pose representation, multi-view attention mechanisms, and enhanced CLIP embeddings.

Authors:Zhendong Chu, Shen Wang, Jian Xie, Tinghui Zhu, Yibo Yan, Jinheng Ye, Aoxiao Zhong, Xuming Hu, Jing Liang, Philip S. Yu, Qingsong Wen
Title: LLM Agents for Education: Advances and Applications
Abstract:
Large Language Model (LLM) agents have demonstrated remarkable capabilities in automating tasks and driving innovation across diverse educational applications. In this survey, we provide a systematic review of state-of-the-art research on LLM agents in education, categorizing them into two broad classes: (1) \emph{Pedagogical Agents}, which focus on automating complex pedagogical tasks to support both teachers and students; and (2) \emph{Domain-Specific Educational Agents}, which are tailored for specialized fields such as science education, language learning, and professional development. We comprehensively examine the technological advancements underlying these LLM agents, including key datasets, benchmarks, and algorithmic frameworks that drive their effectiveness. Furthermore, we discuss critical challenges such as privacy, bias and fairness concerns, hallucination mitigation, and integration with existing educational ecosystems. This survey aims to provide a comprehensive technological overview of LLM agents for education, fostering further research and collaboration to enhance their impact for the greater good of learners and educators alike.
中文: 本综述系统梳理了大语言模型智能体在教育领域的应用进展,将其划分为教学辅助与专业领域两类,并针对隐私保护、算法偏见等关键挑战提出见解,以推动该领域研究发展。
English: This survey systematically reviews the advancements of Large Language Model agents in education, categorizing them into pedagogical and domain-specific agents while addressing key challenges like privacy and bias to guide future research.

Authors:Zengyu Wan, Wei Zhai, Yang Cao, Zhengjun Zha
Title: EMoTive: Event-guided Trajectory Modeling for 3D Motion Estimation
Abstract:
Visual 3D motion estimation aims to infer the motion of 2D pixels in 3D space based on visual cues. The key challenge arises from depth variation induced spatio-temporal motion inconsistencies, disrupting the assumptions of local spatial or temporal motion smoothness in previous motion estimation frameworks. In contrast, event cameras offer new possibilities for 3D motion estimation through continuous adaptive pixel-level responses to scene changes. This paper presents EMoTive, a novel event-based framework that models spatio-temporal trajectories via event-guided non-uniform parametric curves, effectively characterizing locally heterogeneous spatio-temporal motion. Specifically, we first introduce Event Kymograph - an event projection method that leverages a continuous temporal projection kernel and decouples spatial observations to encode fine-grained temporal evolution explicitly. For motion representation, we introduce a density-aware adaptation mechanism to fuse spatial and temporal features under event guidance, coupled with a non-uniform rational curve parameterization framework to adaptively model heterogeneous trajectories. The final 3D motion estimation is achieved through multi-temporal sampling of parametric trajectories, yielding optical flow and depth motion fields. To facilitate evaluation, we introduce CarlaEvent3D, a multi-dynamic synthetic dataset for comprehensive validation. Extensive experiments on both this dataset and a real-world benchmark demonstrate the effectiveness of the proposed method.
中文摘要:EMoTive是一种基于事件的新型框架,通过事件引导的非均匀参数曲线建模时空轨迹来解决运动不一致性,并在合成数据集和真实基准测试中验证了有效性。
English Summary: EMoTive is an event-based framework that models 3D motion using event-guided parametric curves to address spatio-temporal inconsistencies, validated through a new synthetic dataset and real-world benchmarks.

Authors:Kaixuan Jiang, Yang Liu, Weixing Chen, Jingzhou Luo, Ziliang Chen, Ling Pan, Guanbin Li, Liang Lin
Title: Beyond the Destination: A Novel Benchmark for Exploration-Aware Embodied Question Answering
Abstract:
Embodied Question Answering (EQA) is a challenging task in embodied intelligence that requires agents to dynamically explore 3D environments, actively gather visual information, and perform multi-step reasoning to answer questions. However, current EQA approaches suffer from critical limitations in exploration efficiency, dataset design, and evaluation metrics. Moreover, existing datasets often introduce biases or prior knowledge, leading to disembodied reasoning, while frontier-based exploration strategies struggle in cluttered environments and fail to ensure fine-grained exploration of task-relevant areas. To address these challenges, we construct the EXPloration-awaRe Embodied queStion anSwering Benchmark (EXPRESS-Bench), the largest dataset designed specifically to evaluate both exploration and reasoning capabilities. EXPRESS-Bench consists of 777 exploration trajectories and 2,044 question-trajectory pairs. To improve exploration efficiency, we propose Fine-EQA, a hybrid exploration model that integrates frontier-based and goal-oriented navigation to guide agents toward task-relevant regions more effectively. Additionally, we introduce a novel evaluation metric, Exploration-Answer Consistency (EAC), which ensures faithful assessment by measuring the alignment between answer grounding and exploration reliability. Extensive experimental comparisons with state-of-the-art EQA models demonstrate the effectiveness of our EXPRESS-Bench in advancing embodied exploration and question reasoning.
Chinese: 本研究提出了EXPRESS-Bench数据集,通过提升探索效率和推理评估来解决具身问答中的关键局限,并开发了Fine-EQA模型与EAC评估指标,显著提高了智能体在三维环境中的任务执行能力。
English: The study introduces EXPRESS-Bench, a comprehensive dataset addressing limitations in Embodied Question Answering by enhancing exploration efficiency and reasoning evaluation, and proposes the Fine-EQA model and EAC metric to improve agent performance in 3D environments.

Authors:Zhiyu Mou, Miao Xu, Rongquan Bai, Zhuoran Yang, Chuan Yu, Jian Xu, Bo Zheng
Title: Nash Equilibrium Constrained Auto-bidding With Bi-level Reinforcement Learning
Abstract:
Many online advertising platforms provide advertisers with auto-bidding services to enhance their advertising performance. However, most existing auto-bidding algorithms fail to accurately capture the auto-bidding problem formulation that the platform truly faces, let alone solve it. Actually, we argue that the platform should try to help optimize each advertiser's performance to the greatest extent -- which makes $ε$-Nash Equilibrium ($ε$-NE) a necessary solution concept -- while maximizing the social welfare of all the advertisers for the platform's long-term value. Based on this, we introduce the \emph{Nash-Equilibrium Constrained Bidding} (NCB), a new formulation of the auto-bidding problem from the platform's perspective. Specifically, it aims to maximize the social welfare of all advertisers under the $ε$-NE constraint. However, the NCB problem presents significant challenges due to its constrained bi-level structure and the typically large number of advertisers involved. To address these challenges, we propose a \emph{Bi-level Policy Gradient} (BPG) framework with theoretical guarantees. Notably, its computational complexity is independent of the number of advertisers, and the associated gradients are straightforward to compute. Extensive simulated and real-world experiments validate the effectiveness of the BPG framework.
中文: 本文提出了纳什均衡约束竞价(NCB)这一新的自动竞价框架,旨在ε-纳什均衡约束下最大化广告主社会福利,并设计了计算复杂度与广告主数量无关的双层策略梯度(BPG)方法,实验验证了其有效性。
English: The paper introduces Nash-Equilibrium Constrained Bidding (NCB), a novel auto-bidding formulation that maximizes advertiser social welfare under ε-Nash Equilibrium constraints, and proposes a computationally efficient Bi-level Policy Gradient (BPG) framework with proven effectiveness in experiments.

Authors:Xinghan Li, Jingjing Chen, Yue Yu, Xue Song, Haijun Shan, Yu-Gang Jiang
Title: Revealing the Implicit Noise-based Imprint of Generative Models
Abstract:
With the rapid advancement of vision generation models, the potential security risks stemming from synthetic visual content have garnered increasing attention, posing significant challenges for AI-generated image detection. Existing methods suffer from inadequate generalization capabilities, resulting in unsatisfactory performance on emerging generative models. To address this issue, this paper presents a novel framework that leverages noise-based model-specific imprint for the detection task. Specifically, we propose a novel noise-based imprint simulator to capture intrinsic patterns imprinted in images generated by different models. By aggregating imprints from various generative models, imprints of future models can be extrapolated to expand training data, thereby enhancing generalization and robustness. Furthermore, we design a new pipeline that pioneers the use of noise patterns, derived from a noise-based imprint extractor, alongside other visual features for AI-generated image detection, resulting in a significant improvement in performance. Our approach achieves state-of-the-art performance across three public benchmarks including GenImage, Synthbuster and Chameleon.
中文: 本文提出了一种新颖的框架,利用基于噪声的模型特定印记来提升AI生成图像检测的泛化能力和鲁棒性,在多个基准测试中达到了最先进的性能。
English: This paper introduces a novel framework that uses noise-based model-specific imprints to enhance the generalization and robustness of AI-generated image detection, achieving state-of-the-art performance across multiple benchmarks.

Authors:Boyang Xue, Qi Zhu, Hongru Wang, Rui Wang, Sheng Wang, Hongling Xu, Fei Mi, Yasheng Wang, Lifeng Shang, Qun Liu, Kam-Fai Wong
Title: DAST: Difficulty-Aware Self-Training on Large Language Models
Abstract:
Present Large Language Models (LLM) self-training methods always under-sample on challenging queries, leading to inadequate learning on difficult problems which limits LLMs' ability. Therefore, this work proposes a difficulty-aware self-training (DAST) framework that focuses on improving both the quantity and quality of self-generated responses on challenging queries during self-training. DAST is specified in three components: 1) sampling-based difficulty level estimation, 2) difficulty-aware data augmentation, and 3) the self-training algorithm using SFT and DPO respectively. Experiments on mathematical tasks demonstrate the effectiveness and generalization of DAST, highlighting the critical role of difficulty-aware strategies in advancing LLM self-training.
当前大语言模型自训练方法常忽视难题,因此本研究提出一种难度感知自训练框架,通过针对性数据增强和训练算法提升困难问题的回答数量与质量。
Current large language model self-training methods often neglect challenging queries, so this study introduces a difficulty-aware self-training framework to enhance both the quantity and quality of responses on difficult problems through targeted data augmentation and training algorithms.

Authors:Chengjun Yu, Wei Zhai, Yuhang Yang, Yang Cao, Zheng-Jun Zha
Title: HERO: Human Reaction Generation from Videos
Abstract:
Human reaction generation represents a significant research domain for interactive AI, as humans constantly interact with their surroundings. Previous works focus mainly on synthesizing the reactive motion given a human motion sequence. This paradigm limits interaction categories to human-human interactions and ignores emotions that may influence reaction generation. In this work, we propose to generate 3D human reactions from RGB videos, which involves a wider range of interaction categories and naturally provides information about expressions that may reflect the subject's emotions. To cope with this task, we present HERO, a simple yet powerful framework for Human rEaction geneRation from videOs. HERO considers both global and frame-level local representations of the video to extract the interaction intention, and then uses the extracted interaction intention to guide the synthesis of the reaction. Besides, local visual representations are continuously injected into the model to maximize the exploitation of the dynamic properties inherent in videos. Furthermore, the ViMo dataset containing paired Video-Motion data is collected to support the task. In addition to human-human interactions, these video-motion pairs also cover animal-human interactions and scene-human interactions. Extensive experiments demonstrate the superiority of our methodology. The code and dataset will be publicly available at https://jackyu6.github.io/HERO.
中文: 本研究提出HERO框架,通过整合视频的全局和局部表征,从RGB视频生成涵盖更广交互类别和情感线索的3D人体反应,并在新收集的ViMo数据集上验证了其优越性。
English: This study introduces HERO, a framework for generating 3D human reactions from RGB videos that captures a broader range of interactions and emotional cues by integrating global and local video representations, validated on the new ViMo dataset.

Authors:Tao Feng, Yunke Zhang, Huandong Wang, Yong Li
Title: Causality Enhanced Origin-Destination Flow Prediction in Data-Scarce Cities
Abstract:
Accurate origin-destination (OD) flow prediction is of great importance to developing cities, as it can contribute to optimize urban structures and layouts. However, with the common issues of missing regional features and lacking OD flow data, it is quite daunting to predict OD flow in developing cities. To address this challenge, we propose a novel Causality-Enhanced OD Flow Prediction (CE-OFP), a unified framework that aims to transfer urban knowledge between cities and achieve accuracy improvements in OD flow predictions across data-scarce cities. In specific, we propose a novel reinforcement learning model to discover universal causalities among urban features in data-rich cities and build corresponding causal graphs. Then, we further build Causality-Enhanced Variational Auto-Encoder (CE-VAE) to incorporate causal graphs for effective feature reconstruction in data-scarce cities. Finally, with the reconstructed features, we devise a knowledge distillation method with a graph attention network to migrate the OD prediction model from data-rich cities to data-scare cities. Extensive experiments on two pairs of real-world datasets validate that the proposed CE-OFP remarkably outperforms state-of-the-art baselines, which can reduce the RMSE of OD flow prediction for data-scarce cities by up to 11%.
中文: 提出的CE-OFP框架通过因果增强的强化学习和变分自编码器,将城市知识从数据丰富城市迁移至数据稀缺城市,显著提升OD流量预测精度达11%。
English: The proposed CE-OFP framework transfers urban knowledge from data-rich to data-scarce cities using causality-enhanced reinforcement learning and variational auto-encoders, significantly improving OD flow prediction accuracy by up to 11%.

Authors:Tao Feng, Yunke Zhang, Xiaochen Fan, Huandong Wang, Yong Li
Title: Causal Discovery and Inference towards Urban Elements and Associated Factors
Abstract:
To uncover the city's fundamental functioning mechanisms, it is important to acquire a deep understanding of complicated relationships among citizens, location, and mobility behaviors. Previous research studies have applied direct correlation analysis to investigate such relationships. Nevertheless, due to the ubiquitous confounding effects, empirical correlation analysis may not accurately reflect underlying causal relationships among basic urban elements. In this paper, we propose a novel urban causal computing framework to comprehensively explore causalities and confounding effects among a variety of factors across different types of urban elements. In particular, we design a reinforcement learning algorithm to discover the potential causal graph, which depicts the causal relations between urban factors. The causal graph further serves as the guidance for estimating causal effects between pair-wise urban factors by propensity score matching. After removing the confounding effects from correlations, we leverage significance levels of causal effects in downstream urban mobility prediction tasks. Experimental studies on open-source urban datasets show that the discovered causal graph demonstrates a hierarchical structure, where citizens affect locations, and they both cause changes in urban mobility behaviors. Experimental results in urban mobility prediction tasks further show that the proposed method can effectively reduce confounding effects and enhance performance of urban computing tasks.
中文摘要:本文提出了一种创新的城市因果计算框架,通过强化学习揭示城市要素间的因果关系,有效减少混杂效应并提升城市移动预测任务的性能。
English Summary: This paper introduces a novel urban causal computing framework that employs reinforcement learning to uncover causal relationships among urban elements, effectively reducing confounding effects and enhancing performance in urban mobility prediction tasks.

Authors:Tao Feng, Yunke Zhang, Huandong Wang, Yong Li
Title: EPR-GAIL: An EPR-Enhanced Hierarchical Imitation Learning Framework to Simulate Complex User Consumption Behaviors
Abstract:
User consumption behavior data, which records individuals' online spending history at various types of stores, has been widely used in various applications, such as store recommendation, site selection, and sale forecasting. However, its high worth is limited due to deficiencies in data comprehensiveness and changes of application scenarios. Thus, generating high-quality sequential consumption data by simulating complex user consumption behaviors is of great importance to real-world applications. Two branches of existing sequence generation methods are both limited in quality. Model-based methods with simplified assumptions fail to model the complex decision process of user consumption, while data-driven methods that emulate real-world data are prone to noises, unobserved behaviors, and dynamic decision space. In this work, we propose to enhance the fidelity and trustworthiness of the data-driven Generative Adversarial Imitation Learning (GAIL) method by blending it with the Exploration and Preferential Return EPR model . The core idea of our EPR-GAIL framework is to model user consumption behaviors as a complex EPR decision process, which consists of purchase, exploration, and preference decisions. Specifically, we design the hierarchical policy function in the generator as a realization of the EPR decision process and employ the probability distributions of the EPR model to guide the reward function in the discriminator. Extensive experiments on two real-world datasets of user consumption behaviors on an online platform demonstrate that the EPR-GAIL framework outperforms the best state-of-the-art baseline by over 19\% in terms of data fidelity. Furthermore, the generated consumption behavior data can improve the performance of sale prediction and location recommendation by up to 35.29% and 11.19%, respectively, validating its advantage for practical applications.
Chinese: 提出的EPR-GAIL框架通过将探索偏好返回模型与生成对抗模仿学习相结合,显著提升了生成用户消费行为数据的质量,并在销售预测和位置推荐等实际应用中有效提高了性能。
English: The proposed EPR-GAIL framework enhances data fidelity by integrating the Exploration and Preferential Return model with Generative Adversarial Imitation Learning, significantly improving the quality of generated user consumption behavior data and boosting performance in practical applications like sales prediction and location recommendation.

Authors:Jingying Zeng, Hui Liu, Zhenwei Dai, Xianfeng Tang, Chen Luo, Samarth Varshney, Zhen Li, Qi He
Title: Cite Before You Speak: Enhancing Context-Response Grounding in E-commerce Conversational LLM-Agents
Abstract:
With the advancement of conversational large language models (LLMs), several LLM-based Conversational Shopping Agents (CSA) have been developed to help customers smooth their online shopping. The primary objective in building an engaging and trustworthy CSA is to ensure the agent's responses about product factoids are accurate and factually grounded. However, two challenges remain. First, LLMs produce hallucinated or unsupported claims. Such inaccuracies risk spreading misinformation and diminishing customer trust. Second, without providing knowledge source attribution in CSA response, customers struggle to verify LLM-generated information. To address both challenges, we present an easily productionized solution that enables a ''citation experience'' to our customers. We build auto-evaluation metrics to holistically evaluate LLM's grounding and attribution capabilities, suggesting that citation generation paradigm substantially improves grounding performance by 13.83%. To deploy this capability at scale, we introduce Multi-UX-Inference system, which appends source citations to LLM outputs while preserving existing user experience features and supporting scalable inference. Large-scale online A/B tests show that grounded CSA responses improves customer engagement by 3% - 10%, depending on UX variations.
Chinese: 为解决对话购物助手中大语言模型的事实错误和缺乏可验证性问题,我们开发了可扩展的引用系统,将事实依据性能提升13.83%,并通过增强回答可靠性使客户参与度提高3-10%。
English: To address LLMs' factual inaccuracies and lack of verifiability in conversational shopping agents, we developed a scalable citation system that improves grounding performance by 13.83% and boosts customer engagement by 3-10% through enhanced response reliability.

Authors:Bingbing Fan, Lin Chen, Songwei Li, Jian Yuan, Fengli Xu, Pan Hui, Yong Li
Title: Invisible Walls in Cities: Leveraging Large Language Models to Predict Urban Segregation Experience with Social Media Content
Abstract:
Understanding experienced segregation in urban daily life is crucial for addressing societal inequalities and fostering inclusivity. The abundance of user-generated reviews on social media encapsulates nuanced perceptions and feelings associated with different places, offering rich insights into segregation. However, leveraging this data poses significant challenges due to its vast volume, ambiguity, and confluence of diverse perspectives. To tackle these challenges, we propose using Large Language Models (LLMs) to automate online review mining for segregation prediction. We design a Reflective LLM Coder to digest social media content into insights consistent with real-world feedback, and eventually produce a codebook capturing key dimensions that signal segregation experience, such as cultural resonance and appeal, accessibility and convenience, and community engagement and local involvement. Guided by the codebook, LLMs can generate both informative review summaries and ratings for segregation prediction. Moreover, we design a REasoning-and-EMbedding (RE'EM) framework, which combines the reasoning and embedding capabilities of language models to integrate multi-channel features for segregation prediction. Experiments on real-world data demonstrate that our framework greatly improves prediction accuracy, with a 22.79% elevation in R2 and a 9.33% reduction in MSE. The derived codebook is generalizable across three different cities, consistently improving prediction accuracy. Moreover, our user study confirms that the codebook-guided summaries provide cognitive gains for human participants in perceiving POIs' social inclusiveness. Our study marks an important step toward understanding implicit social barriers and inequalities, demonstrating the great potential of promoting social inclusiveness with AI.
中文摘要:本研究提出了一种利用大型语言模型分析社交媒体评论以预测城市隔离的新框架,显著提高了预测准确性,并为理解社会包容性提供了认知层面的见解。
English Summary: This study introduces a novel framework using Large Language Models to analyze social media reviews for predicting urban segregation, significantly enhancing prediction accuracy and providing cognitive insights into social inclusiveness.

Authors:Zhiqiang Yan, Zhengxue Wang, Haoye Dong, Jun Li, Jian Yang, Gim Hee Lee
Title: DuCos: Duality Constrained Depth Super-Resolution via Foundation Model
Abstract:
We introduce DuCos, a novel depth super-resolution framework grounded in Lagrangian duality theory, offering a flexible integration of multiple constraints and reconstruction objectives to enhance accuracy and robustness. Our DuCos is the first to significantly improve generalization across diverse scenarios with foundation models as prompts. The prompt design consists of two key components: Correlative Fusion (CF) and Gradient Regulation (GR). CF facilitates precise geometric alignment and effective fusion between prompt and depth features, while GR refines depth predictions by enforcing consistency with sharp-edged depth maps derived from foundation models. Crucially, these prompts are seamlessly embedded into the Lagrangian constraint term, forming a synergistic and principled framework. Extensive experiments demonstrate that DuCos outperforms existing state-of-the-art methods, achieving superior accuracy, robustness, and generalization.
中文: DuCos是一种基于拉格朗日对偶理论的深度超分辨率新框架,通过相关融合和梯度调节提示将多重约束与重建目标灵活结合,在多种场景下实现了卓越的精度、鲁棒性和泛化能力。
English: DuCos is a novel depth super-resolution framework based on Lagrangian duality that integrates multiple constraints and reconstruction objectives through correlative fusion and gradient regulation prompts, achieving superior accuracy, robustness, and generalization across diverse scenarios.

Authors:Qianliang Wu, Haobo Jiang, Yaqing Ding, Lei Luo, Jun Li, Jin Xie, Xiaojun Wu, Jian Yang
Title: Diff-Reg v2: Diffusion-Based Matching Matrix Estimation for Image Matching and 3D Registration
Abstract:
Establishing reliable correspondences is crucial for all registration tasks, including 2D image registration, 3D point cloud registration, and 2D-3D image-to-point cloud registration. However, these tasks are often complicated by challenges such as scale inconsistencies, symmetry, and large deformations, which can lead to ambiguous matches. Previous feature-based and correspondence-based methods typically rely on geometric or semantic features to generate or polish initial potential correspondences. Some methods typically leverage specific geometric priors, such as topological preservation, to devise diverse and innovative strategies tailored to a given enhancement goal, which cannot be exhaustively enumerated. Additionally, many previous approaches rely on a single-step prediction head, which can struggle with local minima in complex matching scenarios. To address these challenges, we introduce an innovative paradigm that leverages a diffusion model in matrix space for robust matching matrix estimation. Our model treats correspondence estimation as a denoising diffusion process in the matching matrix space, gradually refining the intermediate matching matrix to the optimal one. Specifically, we apply the diffusion model in the doubly stochastic matrix space for 3D-3D and 2D-3D registration tasks. In the 2D image registration task, we deploy the diffusion model in a matrix subspace where dual-softmax projection regularization is applied. For all three registration tasks, we provide adaptive matching matrix embedding implementations tailored to the specific characteristics of each task while maintaining a consistent "match-to-warp" encoding pattern. Furthermore, we adopt a lightweight design for the denoising module. In inference, once points or image features are extracted and fixed, this module performs multi-step denoising predictions through reverse sampling.
Chinese: 本文提出了一种基于扩散模型的新范式,将对应关系估计视为匹配矩阵空间中的去噪过程,为2D、3D和2D-3D配准任务提供了自适应实现方案,同时保持一致的编码模式。
English: This paper introduces a diffusion model-based paradigm that treats correspondence estimation as a denoising process in matching matrix space, providing adaptive implementations for 2D, 3D, and 2D-3D registration tasks while maintaining consistent encoding patterns.

Authors:Shijie Zhu, Hui Zhao, Tianshu Wu, Pengjie Wang, Hongbo Deng, Jian Xu, Bo Zheng
Title: Gradient Deconfliction via Orthogonal Projections onto Subspaces For Multi-task Learning
Abstract:
Although multi-task learning (MTL) has been a preferred approach and successfully applied in many real-world scenarios, MTL models are not guaranteed to outperform single-task models on all tasks mainly due to the negative effects of conflicting gradients among the tasks. In this paper, we fully examine the influence of conflicting gradients and further emphasize the importance and advantages of achieving non-conflicting gradients which allows simple but effective trade-off strategies among the tasks with stable performance. Based on our findings, we propose the Gradient Deconfliction via Orthogonal Projections onto Subspaces (GradOPS) spanned by other task-specific gradients. Our method not only solves all conflicts among the tasks, but can also effectively search for diverse solutions towards different trade-off preferences among the tasks. Theoretical analysis on convergence is provided, and performance of our algorithm is fully testified on multiple benchmarks in various domains. Results demonstrate that our method can effectively find multiple state-of-the-art solutions with different trade-off strategies among the tasks on multiple datasets.
Chinese: 多任务学习模型常因梯度冲突而表现不佳,但我们提出的GradOPS方法通过正交投影消除冲突,在多个基准测试中以不同权衡策略实现了最优性能。
English: Multi-task learning models often underperform due to conflicting gradients, but our proposed GradOPS method eliminates these conflicts and achieves state-of-the-art performance with diverse trade-off strategies across multiple benchmarks.

Authors:Yi Wang, Mushui Liu, Wanggui He, Longxiang Zhang, Ziwei Huang, Guanghao Zhang, Fangxun Shu, Zhong Tao, Dong She, Zhelun Yu, Haoyuan Li, Weilong Dai, Mingli Song, Jie Song, Hao Jiang
Title: MINT: Multi-modal Chain of Thought in Unified Generative Models for Enhanced Image Generation
Abstract:
Unified generative models have demonstrated extraordinary performance in both text and image generation. However, they tend to underperform when generating intricate images with various interwoven conditions, which is hard to solely rely on straightforward text-to-image generation. In response to this challenge, we introduce MINT, an innovative unified generative model, empowered with native multimodal chain of thought (MCoT) for enhanced image generation for the first time. Firstly, we design Mixture of Transformer Experts (MTXpert), an expert-parallel structure that effectively supports both natural language generation (NLG) and visual capabilities, while avoiding potential modality conflicts that could hinder the full potential of each modality. Building on this, we propose an innovative MCoT training paradigm, a step-by-step approach to multimodal thinking, reasoning, and reflection specifically designed to enhance image generation. This paradigm equips MINT with nuanced, element-wise decoupled alignment and a comprehensive understanding of textual and visual components. Furthermore, it fosters advanced multimodal reasoning and self-reflection, enabling the construction of images that are firmly grounded in the logical relationships between these elements. Notably, MINT has been validated to exhibit superior performance across multiple benchmarks for text-to-image (T2I) and image-to-text (I2T) tasks.
Chinese: 统一生成模型在处理复杂多条件图像生成时存在不足,而新型MINT模型通过引入多模态思维链(MCoT)和混合Transformer专家架构(MTXpert),在文生图与图生文任务中均展现出卓越性能。
English: Unified generative models often struggle with complex multi-condition image generation, but the new MINT model introduces multimodal chain of thought (MCoT) and Mixture of Transformer Experts (MTXpert) to achieve superior performance in both text-to-image and image-to-text tasks.

Authors:Zijian Li, Shunxing Fan, Yujia Zheng, Ignavier Ng, Shaoan Xie, Guangyi Chen, Xinshuai Dong, Ruichu Cai, Kun Zhang
Title: Synergy Between Sufficient Changes and Sparse Mixing Procedure for Disentangled Representation Learning
Abstract:
Disentangled representation learning aims to uncover latent variables underlying the observed data, and generally speaking, rather strong assumptions are needed to ensure identifiability. Some approaches rely on sufficient changes on the distribution of latent variables indicated by auxiliary variables such as domain indices, but acquiring enough domains is often challenging. Alternative approaches exploit structural sparsity assumptions on the mixing procedure, but such constraints are usually (partially) violated in practice. Interestingly, we find that these two seemingly unrelated assumptions can actually complement each other to achieve identifiability. Specifically, when conditioned on auxiliary variables, the sparse mixing procedure assumption provides structural constraints on the mapping from estimated to true latent variables and hence compensates for potentially insufficient distribution changes. Building on this insight, we propose an identifiability theory with less restrictive constraints regarding distribution changes and the sparse mixing procedure, enhancing applicability to real-world scenarios. Additionally, we develop an estimation framework incorporating a domain encoding network and a sparse mixing constraint and provide two implementations based on variational autoencoders and generative adversarial networks, respectively. Experiment results on synthetic and real-world datasets support our theoretical results.
Chinese: 我们的研究发现,结合辅助变量条件化与稀疏混合假设,可在放宽约束条件下实现可识别的解耦表示学习,并提出了经合成和真实数据集验证的实用框架。
English: Our study reveals that combining auxiliary variable conditioning with sparse mixing assumptions enables identifiable disentangled representation learning under relaxed constraints, leading to a practical framework validated across synthetic and real-world datasets.

Authors:Tianci Liu, Ruirui Li, Yunzhe Qi, Hui Liu, Xianfeng Tang, Tianqi Zheng, Qingyu Yin, Monica Xiao Cheng, Jun Huan, Haoyu Wang, Jing Gao
Title: Unlocking Efficient, Scalable, and Continual Knowledge Editing with Basis-Level Representation Fine-Tuning
Abstract:
Large language models (LLMs) have achieved remarkable performance on various natural language tasks. However, they are trained on static corpora and their knowledge can become outdated quickly in the fast-changing world. This motivates the development of knowledge editing methods designed to update certain knowledge in LLMs without changing unrelated others. To make selective edits, previous efforts often sought to update a small amount of parameters in some specific layer(s) of a LLM. Nonetheless, in challenging scenarios, they still fall short in making successful edits while preserving knowledge irrelevant to the updates simultaneously, resulting in a notable editing-locality trade-off. In this work, we question if the trade-offs are caused by the fact that parameter-based updates have a global effect, i.e., edited parameters affect all inputs indiscriminately. In light of this, we explore the feasibility of representation fine-tuning, which applied some linear update to a few representations in a learned subspace, for knowledge editing. While being effective to enhance an LLM's general ability as demonstrated in the previous work, we theoretically show that this linear update imposes a tension in editing-locality trade-off. Subsequently, BaFT is proposed to break the linearity. BaFT computes a weight for each basis that spans a dimension of the subspace based on the input representation. This input-dependent weighting mechanism allows BaFT to manage different types of knowledge in an adaptive way, thereby achieving a better editing-locality trade-off. Experiments on three LLMs with five editing benchmarks in diverse scenarios show the superiority of our method.
中文: 大语言模型在更新特定知识时难以避免影响无关信息,而提出的BaFT方法通过输入依赖的加权机制自适应管理知识,在实验中实现了更优的编辑局部性平衡。
English: Large language models face challenges in updating specific knowledge without affecting unrelated information, and the proposed BaFT method addresses this by adaptively managing knowledge through input-dependent weighting, achieving a superior editing-locality trade-off in experiments.

Authors:Yuchen Liu, Junhao Hu, Yingdi Shan, Ge Li, Yanzhen Zou, Yihong Dong, Tao Xie
Title: LLMigrate: Transforming "Lazy" Large Language Models into Efficient Source Code Migrators
Abstract:
Rewriting C code in Rust provides stronger memory safety, yet migrating large codebases such as the 32-million-line Linux kernel remains challenging. While rule-based translators (e.g., C2Rust) provide accurate yet largely unsafe Rust programs, recent Large Language Model (LLM) approaches produce more idiomatic, safe Rust programs but frequently exhibit "laziness", omitting significant portions of the target code. To address the issue, in this paper, we present LLMigrate, an LLM-based C-to-Rust translation tool that splits modules into discrete functions, translating them individually, and then reintegrating them. LLMigrate uses static analysis to retain necessary context, pairs GPT-4o (a state-of-the-art LLM) with compiler-driven translation and program-repair techniques for complex core functions, and leverages call-graph-guided translation to ensure consistent interfaces. Evaluations on three representative Linux kernel modules (math, sort, and ramfs) show that LLMigrate requires modifying less than 15\% of the target code, significantly outperforming a pure GPT-4o-based migration.
中文:LLMigrate是一种基于大语言模型的先进工具,通过将模块拆分为函数、结合静态分析和编译器驱动修复来将C代码转换为Rust,在确保安全性和完整性的同时,相比纯GPT-4o方法显著减少了手动修改量。
English: LLMigrate is an advanced LLM-based tool that translates C code to Rust by breaking modules into functions, using static analysis and compiler-driven repair to ensure safety and completeness, significantly reducing manual edits compared to pure GPT-4o approaches.

Authors:Song Lai, Zhe Zhao, Fei Zhu, Xi Lin, Qingfu Zhang, Gaofeng Meng
Title: Pareto Continual Learning: Preference-Conditioned Learning and Adaption for Dynamic Stability-Plasticity Trade-off
Abstract:
Continual learning aims to learn multiple tasks sequentially. A key challenge in continual learning is balancing between two objectives: retaining knowledge from old tasks (stability) and adapting to new tasks (plasticity). Experience replay methods, which store and replay past data alongside new data, have become a widely adopted approach to mitigate catastrophic forgetting. However, these methods neglect the dynamic nature of the stability-plasticity trade-off and aim to find a fixed and unchanging balance, resulting in suboptimal adaptation during training and inference. In this paper, we propose Pareto Continual Learning (ParetoCL), a novel framework that reformulates the stability-plasticity trade-off in continual learning as a multi-objective optimization (MOO) problem. ParetoCL introduces a preference-conditioned model to efficiently learn a set of Pareto optimal solutions representing different trade-offs and enables dynamic adaptation during inference. From a generalization perspective, ParetoCL can be seen as an objective augmentation approach that learns from different objective combinations of stability and plasticity. Extensive experiments across multiple datasets and settings demonstrate that ParetoCL outperforms state-of-the-art methods and adapts to diverse continual learning scenarios.
中文: ParetoCL将持续学习重构为多目标优化问题,通过偏好条件模型动态平衡稳定性与可塑性,在多种场景中优于现有方法。
English: ParetoCL reframes continual learning as a multi-objective optimization problem, dynamically balancing stability and plasticity through a preference-conditioned model that outperforms existing methods across diverse scenarios.

Authors:Ziping Dong, Chao Shuai, Zhongjie Ba, Peng Cheng, Zhan Qin, Qinglong Wang, Kui Ren
Title: WMCopier: Forging Invisible Image Watermarks on Arbitrary Images
Abstract:
Invisible Image Watermarking is crucial for ensuring content provenance and accountability in generative AI. While Gen-AI providers are increasingly integrating invisible watermarking systems, the robustness of these schemes against forgery attacks remains poorly characterized. This is critical, as forging traceable watermarks onto illicit content leads to false attribution, potentially harming the reputation and legal standing of Gen-AI service providers who are not responsible for the content. In this work, we propose WMCopier, an effective watermark forgery attack that operates without requiring any prior knowledge of or access to the target watermarking algorithm. Our approach first models the target watermark distribution using an unconditional diffusion model, and then seamlessly embeds the target watermark into a non-watermarked image via a shallow inversion process. We also incorporate an iterative optimization procedure that refines the reconstructed image to further trade off the fidelity and forgery efficiency. Experimental results demonstrate that WMCopier effectively deceives both open-source and closed-source watermark systems (e.g., Amazon's system), achieving a significantly higher success rate than existing methods. Additionally, we evaluate the robustness of forged samples and discuss the potential defenses against our attack.
WMCopier presents a novel watermark forgery attack that successfully bypasses both open-source and commercial systems without requiring knowledge of the target algorithm, posing significant challenges to content authentication in generative AI.
English Summary:

Authors:Puzhen Yuan, Angyuan Ma, Yunchao Yao, Huaxiu Yao, Masayoshi Tomizuka, Mingyu Ding
Title: REMAC: Self-Reflective and Self-Evolving Multi-Agent Collaboration for Long-Horizon Robot Manipulation
Abstract:
Vision-language models (VLMs) have demonstrated remarkable capabilities in robotic planning, particularly for long-horizon tasks that require a holistic understanding of the environment for task decomposition. Existing methods typically rely on prior environmental knowledge or carefully designed task-specific prompts, making them struggle with dynamic scene changes or unexpected task conditions, e.g., a robot attempting to put a carrot in the microwave but finds the door was closed. Such challenges underscore two critical issues: adaptability and efficiency. To address them, in this work, we propose an adaptive multi-agent planning framework, termed REMAC, that enables efficient, scene-agnostic multi-robot long-horizon task planning and execution through continuous reflection and self-evolution. REMAC incorporates two key modules: a self-reflection module performing pre-condition and post-condition checks in the loop to evaluate progress and refine plans, and a self-evolvement module dynamically adapting plans based on scene-specific reasoning. It offers several appealing benefits: 1) Robots can initially explore and reason about the environment without complex prompt design. 2) Robots can keep reflecting on potential planning errors and adapting the plan based on task-specific insights. 3) After iterations, a robot can call another one to coordinate tasks in parallel, maximizing the task execution efficiency. To validate REMAC's effectiveness, we build a multi-agent environment for long-horizon robot manipulation and navigation based on RoboCasa, featuring 4 task categories with 27 task styles and 50+ different objects. Based on it, we further benchmark state-of-the-art reasoning models, including DeepSeek-R1, o3-mini, QwQ, and Grok3, demonstrating REMAC's superiority by boosting average success rates by 40% and execution efficiency by 52.7% over the single robot baseline.
中文摘要:针对视觉语言模型在机器人规划中的适应性与效率问题,我们提出REMAC自适应多智能体框架,通过持续反思与自我进化提升任务执行能力,显著提高了任务成功率与执行效率。
English Summary: Vision-language models face challenges in adaptability and efficiency for robotic planning, so we propose REMAC, an adaptive multi-agent framework that enhances task execution through continuous reflection and self-evolution, significantly improving success rates and efficiency.

Authors:Yong Xie, Yunlian Sun, Hongwen Zhang, Yebin Liu, Jinhui Tang
Title: ReCoM: Realistic Co-Speech Motion Generation with Recurrent Embedded Transformer
Abstract:
We present ReCoM, an efficient framework for generating high-fidelity and generalizable human body motions synchronized with speech. The core innovation lies in the Recurrent Embedded Transformer (RET), which integrates Dynamic Embedding Regularization (DER) into a Vision Transformer (ViT) core architecture to explicitly model co-speech motion dynamics. This architecture enables joint spatial-temporal dependency modeling, thereby enhancing gesture naturalness and fidelity through coherent motion synthesis. To enhance model robustness, we incorporate the proposed DER strategy, which equips the model with dual capabilities of noise resistance and cross-domain generalization, thereby improving the naturalness and fluency of zero-shot motion generation for unseen speech inputs. To mitigate inherent limitations of autoregressive inference, including error accumulation and limited self-correction, we propose an iterative reconstruction inference (IRI) strategy. IRI refines motion sequences via cyclic pose reconstruction, driven by two key components: (1) classifier-free guidance improves distribution alignment between generated and real gestures without auxiliary supervision, and (2) a temporal smoothing process eliminates abrupt inter-frame transitions while ensuring kinematic continuity. Extensive experiments on benchmark datasets validate ReCoM's effectiveness, achieving state-of-the-art performance across metrics. Notably, it reduces the Fréchet Gesture Distance (FGD) from 18.70 to 2.48, demonstrating an 86.7% improvement in motion realism. Our project page is https://yong-xie-xy.github.io/ReCoM/.
中文: ReCoM是一个高效框架,通过结合动态嵌入正则化的循环嵌入式Transformer和迭代重建推理策略,生成高保真且与语音同步的人体动作,在运动真实感上实现了86.7%的提升并达到最先进性能。
English: ReCoM is an efficient framework that generates high-fidelity, speech-synchronized human motions using a Recurrent Embedded Transformer with Dynamic Embedding Regularization and iterative reconstruction inference, achieving state-of-the-art performance and an 86.7% improvement in motion realism.

Authors:Yupeng Cao, Haohang Li, Yangyang Yu, Shashidhar Reddy Javaji, Yueru He, Jimin Huang, Zining Zhu, Qianqian Xie, Xiao-yang Liu, Koduvayur Subbalakshmi, Meikang Qiu, Sophia Ananiadou, Jian-Yun Nie
Title: FinAudio: A Benchmark for Audio Large Language Models in Financial Applications
Abstract:
Audio Large Language Models (AudioLLMs) have received widespread attention and have significantly improved performance on audio tasks such as conversation, audio understanding, and automatic speech recognition (ASR). Despite these advancements, there is an absence of a benchmark for assessing AudioLLMs in financial scenarios, where audio data, such as earnings conference calls and CEO speeches, are crucial resources for financial analysis and investment decisions. In this paper, we introduce \textsc{FinAudio}, the first benchmark designed to evaluate the capacity of AudioLLMs in the financial domain. We first define three tasks based on the unique characteristics of the financial domain: 1) ASR for short financial audio, 2) ASR for long financial audio, and 3) summarization of long financial audio. Then, we curate two short and two long audio datasets, respectively, and develop a novel dataset for financial audio summarization, comprising the \textsc{FinAudio} benchmark. Then, we evaluate seven prevalent AudioLLMs on \textsc{FinAudio}. Our evaluation reveals the limitations of existing AudioLLMs in the financial domain and offers insights for improving AudioLLMs. All datasets and codes will be released.
Chinese: AudioLLMs在音频任务上取得进展,但缺乏金融场景的基准,因此我们推出FinAudio来评估其在金融音频数据上的能力,并揭示现有模型的不足。
English: AudioLLMs have advanced audio tasks but lack financial benchmarks, prompting the introduction of FinAudio to evaluate their performance on financial audio data and reveal current limitations.

Authors:Hongru Li, Songjie Xie, Jiawei Shao, Zixin Wang, Hengtao He, Shenghui Song, Jun Zhang, Khaled B. Letaief
Title: Mutual Information-Empowered Task-Oriented Communication: Principles, Applications and Challenges
Abstract:
Mutual information (MI)-based guidelines have recently proven to be effective for designing task-oriented communication systems, where the ultimate goal is to extract and transmit task-relevant information for downstream task. This paper provides a comprehensive overview of MI-empowered task-oriented communication, highlighting how MI-based methods can serve as a unifying design framework in various task-oriented communication scenarios. We begin with the roadmap of MI for designing task-oriented communication systems, and then introduce the roles and applications of MI to guide feature encoding, transmission optimization, and efficient training with two case studies. We further elaborate the limitations and challenges of MI-based methods. Finally, we identify several open issues in MI-based task-oriented communication to inspire future research.
中文: 本文全面综述了基于互信息的方法作为任务导向通信系统的统一设计框架,涵盖其在特征编码、传输优化和训练中的应用,并探讨了相关局限性与未来研究方向。
English: This paper presents a comprehensive overview of mutual information-based methods as a unifying framework for designing task-oriented communication systems, covering their applications in feature encoding, transmission optimization, and training while addressing limitations and future research directions.

Authors:Junsong Li, Jie Zhou, Yutao Yang, Bihao Zhan, Qianjun Pan, Yuyang Ding, Qin Chen, Jiang Bo, Xin Lin, Liang He
Title: Teaching LLMs for Step-Level Automatic Math Correction via Reinforcement Learning
Abstract:
Automatic math correction aims to check students' solutions to mathematical problems via artificial intelligence technologies. Most existing studies focus on judging the final answer at the problem level, while they ignore detailed feedback on each step in a math problem-solving process, which requires abilities of semantic understanding and reasoning. In this paper, we propose a reinforcement learning (RL)-based method to boost large language model (LLM) for step-level automatic math correction, named StepAMC. Particularly, we convert the step-level automatic math correction within the text classification task into an RL problem to enhance the reasoning capabilities of LLMs. Then, we design a space-constrained policy network to improve the stability of RL. Then, we introduce a fine-grained reward network to convert the binary human feedback into a continuous value. We conduct extensive experiments over two benchmark datasets and the results show that our model outperforms the eleven strong baselines.
中文: 本文提出StepAMC方法,通过强化学习将数学解题步骤的自动批改转化为RL问题,设计稳定策略网络和细粒度奖励机制,显著提升大语言模型的推理能力,在基准测试中优于现有模型。
English: This paper introduces StepAMC, a reinforcement learning-based method that enhances large language models for step-level automatic math correction by converting it into an RL problem, designing a stable policy network, and using a fine-grained reward system, achieving superior performance over existing baselines.

Authors:Farshad Rostami Ghadi, Masoud Kaveh, Francisco Hernando-Gallego, Diego Martin, Kai-Kit Wong, Chan-Byoung Chae
Title: UAV-Relay Assisted RSMA Fluid Antenna System: Outage Probability Analysis
Abstract:
This letter studies the impact of fluid antenna system (FAS) technology on the performance of unmanned aerial vehicle (UAV)-assisted multiuser communication networks. Specifically, we consider a scenario where a fixed-position antenna (FPA) base station (BS) serves K FAS-equipped users with the assistance of a UAV acting as an aerial relay. The BS employs rate-splitting multiple access (RSMA), while the UAV operates in half-duplex (HD) mode using the decode-and-forward (DF) strategy. For this system, we derive a compact analytical expression for the outage probability (OP) and its asymptotic behavior in the high signal-to-noise ratio (SNR) regime, leveraging the multivariate t-distribution. Our results show how deploying FAS at ground users (GUs) in UAV-aided communications improves overall system performance compared to using FPA GUs.
Chinese: 研究表明,在无人机辅助多用户通信网络中,地面用户采用流体天线系统相比固定位置天线能显著提升整体系统性能,这通过推导出的中断概率表达式得到验证。
English: This study demonstrates that equipping ground users with fluid antenna systems in UAV-assisted multiuser networks significantly enhances overall system performance compared to fixed-position antennas, as shown through derived outage probability expressions.

Authors:Kai Zhang, Hengtao He, Shenghui Song, Jun Zhang, Khaled B. Letaief
Title: Communication-Efficient Distributed On-Device LLM Inference Over Wireless Networks
Abstract:
Large language models (LLMs) have demonstrated remarkable success across various application domains, but their enormous sizes and computational demands pose significant challenges for deployment on resource-constrained edge devices. To address this issue, we propose a novel distributed on-device LLM inference framework that leverages tensor parallelism to partition the neural network tensors (e.g., weight matrices) of one LLM across multiple edge devices for collaborative inference. A key challenge in tensor parallelism is the frequent all-reduce operations for aggregating intermediate layer outputs across participating devices, which incurs significant communication overhead. To alleviate this bottleneck, we propose an over-the-air computation (AirComp) approach that harnesses the analog superposition property of wireless multiple-access channels to perform fast all-reduce steps. To utilize the heterogeneous computational capabilities of edge devices and mitigate communication distortions, we investigate a joint model assignment and transceiver optimization problem to minimize the average transmission error. The resulting mixed-timescale stochastic non-convex optimization problem is intractable, and we propose an efficient two-stage algorithm to solve it. Moreover, we prove that the proposed algorithm converges almost surely to a stationary point of the original problem. Comprehensive simulation results will show that the proposed framework outperforms existing benchmark schemes, achieving up to 5x inference speed acceleration and improving inference accuracy.
中文摘要:本文提出了一种基于张量并行和空中计算的分布式设备端大语言模型推理框架,通过联合优化模型分配与收发器设计,有效降低通信开销并提升推理性能。
English Summary: This paper introduces a distributed on-device LLM inference framework using tensor parallelism and over-the-air computation to reduce communication overhead, achieving significant speed acceleration and improved accuracy.

Authors:Han Xiao, Xiaoyan Hu, Kai-Kit Wong, Hanjiang Hong, George C. Alexandropoulos, Chan-Byoung Chae
Title: Fluid Reconfigurable Intelligent Surfaces: Joint On-Off Selection and Beamforming with Discrete Phase Shifts
Abstract:
This letter proposes a fluid reconfigurable intelligent surface (FRIS) paradigm, extending the conventional reconfigurable intelligent surface (RIS) technology to incorporate position reconfigurability of the elements. In our model, a `fluid' element is realized by a dense matrix of subelements over a given space and dynamically selecting specific elements for signal modulation based on channel conditions. Specifically, we consider a FRIS-assisted single-user single-input single-output (SU-SISO) system and formulate an optimization problem that can jointly optimize element selection and their discrete phase shifts to maximize the achievable rate. To address this problem efficiently, we propose an iterative algorithm based on the cross-entropy optimization (CEO) framework. Simulation results reveal that FRIS achieves significant performance gains over traditional RIS.
中文摘要:本文提出一种流体可重构智能表面(FRIS)范式,通过动态选择密集子元件进行信号调制,并采用交叉熵优化算法联合优化元件选择与相位偏移,相比传统RIS实现了显著性能提升。
English Summary: This letter introduces a fluid reconfigurable intelligent surface (FRIS) that enhances traditional RIS by enabling dynamic element positioning and selection for optimized signal modulation, achieving superior performance through a cross-entropy-based iterative algorithm.

Authors:Yuxuan Liang, Haomin Wen, Yutong Xia, Ming Jin, Bin Yang, Flora Salim, Qingsong Wen, Shirui Pan, Gao Cong
Title: Foundation Models for Spatio-Temporal Data Science: A Tutorial and Survey
Abstract:
Spatio-Temporal (ST) data science, which includes sensing, managing, and mining large-scale data across space and time, is fundamental to understanding complex systems in domains such as urban computing, climate science, and intelligent transportation. Traditional deep learning approaches have significantly advanced this field, particularly in the stage of ST data mining. However, these models remain task-specific and often require extensive labeled data. Inspired by the success of Foundation Models (FM), especially large language models, researchers have begun exploring the concept of Spatio-Temporal Foundation Models (STFMs) to enhance adaptability and generalization across diverse ST tasks. Unlike prior architectures, STFMs empower the entire workflow of ST data science, ranging from data sensing, management, to mining, thereby offering a more holistic and scalable approach. Despite rapid progress, a systematic study of STFMs for ST data science remains lacking. This survey aims to provide a comprehensive review of STFMs, categorizing existing methodologies and identifying key research directions to advance ST general intelligence.
中文摘要:时空基础模型(STFMs)作为新兴的综合性解决方案,通过覆盖从感知到挖掘的时空数据科学全流程,突破了传统深度学习模型的任务局限性,正推动时空通用智能的发展。
English Summary: Spatio-Temporal Foundation Models (STFMs) are emerging as holistic solutions that enhance adaptability across all stages of spatio-temporal data science, from sensing to mining, addressing limitations of task-specific deep learning models and advancing ST general intelligence.

Authors:Mengbing Liu, Xin Li, Jiancheng An, Chau Yuen
Title: Onboard Terrain Classification via Stacked Intelligent Metasurface-Diffractive Deep Neural Networks from SAR Level-0 Raw Data
Abstract:
This paper introduces a novel approach for real-time onboard terrain classification from Sentinel-1 (S1) level-0 raw In-phase/Quadrature (IQ) data, leveraging a Stacked Intelligent Metasurface (SIM) to perform inference directly in the analog wave domain. Unlike conventional digital deep neural networks, the proposed multi-layer Diffractive Deep Neural Network (D$^2$NN) setup implements automatic feature extraction as electromagnetic waves propagate through stacked metasurface layers. This design not only reduces reliance on expensive downlink bandwidth and high-power computing at terrestrial stations but also achieves performance levels around 90\% directly from the real raw IQ data, in terms of accuracy, precision, recall, and F1 Score. Our method therefore helps bridge the gap between next-generation remote sensing tasks and in-orbit processing needs, paving the way for computationally efficient remote sensing applications.
中文: 本文提出了一种利用哨兵一号原始数据和堆叠智能超表面进行实时星上地形分类的新方法,通过模拟域推理实现约90%的性能,同时显著降低带宽需求和计算负荷。
English: This paper presents a novel method for real-time onboard terrain classification using Sentinel-1 raw data and a Stacked Intelligent Metasurface, enabling analog domain inference with approximately 90% performance while reducing bandwidth and computational demands.

Authors:Kedi Chen, Zhikai Lei, Fan Zhang, Yinqi Zhang, Qin Chen, Jie Zhou, Liang He, Qipeng Guo, Kai Chen, Wei Zhang
Title: Code-Driven Inductive Synthesis: Enhancing Reasoning Abilities of Large Language Models with Sequences
Abstract:
Large language models make remarkable progress in reasoning capabilities. Existing works focus mainly on deductive reasoning tasks (e.g., code and math), while another type of reasoning mode that better aligns with human learning, inductive reasoning, is not well studied. We attribute the reason to the fact that obtaining high-quality process supervision data is challenging for inductive reasoning. Towards this end, we novelly employ number sequences as the source of inductive reasoning data. We package sequences into algorithmic problems to find the general term of each sequence through a code solution. In this way, we can verify whether the code solution holds for any term in the current sequence, and inject case-based supervision signals by using code unit tests. We build a sequence synthetic data pipeline and form a training dataset CodeSeq. Experimental results show that the models tuned with CodeSeq improve on both code and comprehensive reasoning benchmarks.
中文: 大语言模型在推理能力上取得显著进展,但归纳推理因高质量过程监督数据难以获取而研究不足,为此我们利用数列构建了CodeSeq数据集,通过代码解决方案提升模型在推理基准测试中的表现。
English: Large language models have advanced in reasoning, but inductive reasoning remains understudied due to challenges in obtaining high-quality process supervision data, leading to the creation of the CodeSeq dataset using number sequences and code solutions to enhance model performance on reasoning benchmarks.

Authors:Andong Lu, Yuanzhi Guo, Wanyu Wang, Chenglong Li, Jin Tang, Bin Luo
Title: Breaking Shallow Limits: Task-Driven Pixel Fusion for Gap-free RGBT Tracking
Abstract:
Current RGBT tracking methods often overlook the impact of fusion location on mitigating modality gap, which is key factor to effective tracking. Our analysis reveals that shallower fusion yields smaller distribution gap. However, the limited discriminative power of shallow networks hard to distinguish task-relevant information from noise, limiting the potential of pixel-level fusion. To break shallow limits, we propose a novel \textbf{T}ask-driven \textbf{P}ixel-level \textbf{F}usion network, named \textbf{TPF}, which unveils the power of pixel-level fusion in RGBT tracking through a progressive learning framework. In particular, we design a lightweight Pixel-level Fusion Adapter (PFA) that exploits Mamba's linear complexity to ensure real-time, low-latency RGBT tracking. To enhance the fusion capabilities of the PFA, our task-driven progressive learning framework first utilizes adaptive multi-expert distillation to inherits fusion knowledge from state-of-the-art image fusion models, establishing robust initialization, and then employs a decoupled representation learning scheme to achieve task-relevant information fusion. Moreover, to overcome appearance variations between the initial template and search frames, we presents a nearest-neighbor dynamic template updating scheme, which selects the most reliable frame closest to the current search frame as the dynamic template. Extensive experiments demonstrate that TPF significantly outperforms existing most of advanced trackers on four public RGBT tracking datasets. The code will be released upon acceptance.
中文摘要:现有RGBT跟踪方法常忽略融合位置对缩小模态差异的影响,而本文提出的任务驱动像素级融合网络TPF通过渐进式学习框架突破浅层网络局限,在多个公开数据集上显著优于先进跟踪器。
English Summary: Current RGBT tracking methods neglect fusion location's role in reducing modality gaps, but our proposed Task-driven Pixel-level Fusion (TPF) network overcomes shallow network limitations through progressive learning and achieves superior performance on benchmark datasets.

Authors:Andong Lu, Mai Wen, Jinhu Wang, Yuanzhi Guo, Chenglong Li, Jin Tang, Bin Luo
Title: Towards General Multimodal Visual Tracking
Abstract:
Existing multimodal tracking studies focus on bi-modal scenarios such as RGB-Thermal, RGB-Event, and RGB-Language. Although promising tracking performance is achieved through leveraging complementary cues from different sources, it remains challenging in complex scenes due to the limitations of bi-modal scenarios. In this work, we introduce a general multimodal visual tracking task that fully exploits the advantages of four modalities, including RGB, thermal infrared, event, and language, for robust tracking under challenging conditions. To provide a comprehensive evaluation platform for general multimodal visual tracking, we construct QuadTrack600, a large-scale, high-quality benchmark comprising 600 video sequences (totaling 384.7K high-resolution (640x480) frame groups). In each frame group, all four modalities are spatially aligned and meticulously annotated with bounding boxes, while 21 sequence-level challenge attributes are provided for detailed performance analysis. Despite quad-modal data provides richer information, the differences in information quantity among modalities and the computational burden from four modalities are two challenging issues in fusing four modalities. To handle these issues, we propose a novel approach called QuadFusion, which incorporates an efficient Multiscale Fusion Mamba with four different scanning scales to achieve sufficient interactions of the four modalities while overcoming the exponential computational burden, for general multimodal visual tracking. Extensive experiments on the QuadTrack600 dataset and three bi-modal tracking datasets, including LasHeR, VisEvent, and TNL2K, validate the effectiveness of our QuadFusion.
中文:本研究提出了一种融合RGB、热红外、事件和语言四种模态的通用多模态视觉跟踪框架,通过创新的QuadFusion方法和多尺度融合曼巴结构有效解决模态差异与计算负担问题,并在QuadTrack600新基准和现有数据集上验证了其优越性能。
English: This work introduces a general multimodal visual tracking framework that integrates RGB, thermal infrared, event, and language modalities, proposing the QuadFusion method with an efficient Multiscale Fusion Mamba to address computational challenges and validate its effectiveness on the new QuadTrack600 benchmark and existing datasets.

Authors:Zhe Zhao, Haibin Wen, Pengkun Wang, Ye Wei, Zaixi Zhang, Xi Lin, Fei Liu, Bo An, Hui Xiong, Yang Wang, Qingfu Zhang
Title: From Understanding to Excelling: Template-Free Algorithm Design through Structural-Functional Co-Evolution
Abstract:
Large language models (LLMs) have greatly accelerated the automation of algorithm generation and optimization. However, current methods such as EoH and FunSearch mainly rely on predefined templates and expert-specified functions that focus solely on the local evolution of key functionalities. Consequently, they fail to fully leverage the synergistic benefits of the overall architecture and the potential of global optimization. In this paper, we introduce an end-to-end algorithm generation and optimization framework based on LLMs. Our approach utilizes the deep semantic understanding of LLMs to convert natural language requirements or human-authored papers into code solutions, and employs a two-dimensional co-evolution strategy to optimize both functional and structural aspects. This closed-loop process spans problem analysis, code generation, and global optimization, automatically identifying key algorithm modules for multi-level joint optimization and continually enhancing performance and design innovation. Extensive experiments demonstrate that our method outperforms traditional local optimization approaches in both performance and innovation, while also exhibiting strong adaptability to unknown environments and breakthrough potential in structural design. By building on human research, our framework generates and optimizes novel algorithms that surpass those designed by human experts, broadening the applicability of LLMs for algorithm design and providing a novel solution pathway for automated algorithm development.
中文摘要:本文提出了一种基于大语言模型的端到端算法生成与优化框架,通过深度语义理解将自然语言需求转化为代码解决方案,并采用二维协同进化策略进行全局优化,在性能和创新性上均超越传统方法,展现出强大的适应性和突破潜力。
English Summary: This paper introduces an end-to-end LLM-based framework that converts natural language requirements into code solutions and employs a two-dimensional co-evolution strategy for global optimization, outperforming traditional methods in both performance and innovation while demonstrating strong adaptability.

Authors:Jing Xu, Franziska Boenisch, Iyiola Emmanuel Olatunji, Adam Dziedzic
Title: DP-GPL: Differentially Private Graph Prompt Learning
Abstract:
Graph Neural Networks (GNNs) have shown remarkable performance in various applications. Recently, graph prompt learning has emerged as a powerful GNN training paradigm, inspired by advances in language and vision foundation models. Here, a GNN is pre-trained on public data and then adapted to sensitive tasks using lightweight graph prompts. However, using prompts from sensitive data poses privacy risks. In this work, we are the first to investigate these practical risks in graph prompts by instantiating a membership inference attack that reveals significant privacy leakage. We also find that the standard privacy method, DP-SGD, fails to provide practical privacy-utility trade-offs in graph prompt learning, likely due to the small number of sensitive data points used to learn the prompts. As a solution, we propose DP-GPL for differentially private graph prompt learning based on the PATE framework, that generates a graph prompt with differential privacy guarantees. Our evaluation across various graph prompt learning methods, GNN architectures, and pre-training strategies demonstrates that our algorithm achieves high utility at strong privacy, effectively mitigating privacy concerns while preserving the powerful capabilities of prompted GNNs as powerful foundation models in the graph domain.
Chinese: 本研究提出了DP-GPL,一种差分隐私图提示学习方法,在保护敏感数据隐私的同时保持了图神经网络的高效性能,解决了DP-SGD等标准方法在图提示学习中的局限性。
English: This study introduces DP-GPL, a differentially private graph prompt learning method that effectively mitigates privacy risks in sensitive data adaptation while maintaining high utility for GNNs, addressing the limitations of standard approaches like DP-SGD.

Authors:Luca Scimeca, Siddarth Venkatraman, Moksh Jain, Minsu Kim, Marcin Sendera, Mohsin Hasan, Luke Rowe, Sarthak Mittal, Pablo Lemos, Emmanuel Bengio, Alexandre Adam, Jarrid Rector-Brooks, Yashar Hezaveh, Laurence Perreault-Levasseur, Yoshua Bengio, Glen Berseth, Nikolay Malkin
Title: Solving Bayesian inverse problems with diffusion priors and off-policy RL
Abstract:
This paper presents a practical application of Relative Trajectory Balance (RTB), a recently introduced off-policy reinforcement learning (RL) objective that can asymptotically solve Bayesian inverse problems optimally. We extend the original work by using RTB to train conditional diffusion model posteriors from pretrained unconditional priors for challenging linear and non-linear inverse problems in vision, and science. We use the objective alongside techniques such as off-policy backtracking exploration to improve training. Importantly, our results show that existing training-free diffusion posterior methods struggle to perform effective posterior inference in latent space due to inherent biases.
中文摘要:本研究应用相对轨迹平衡方法训练条件扩散模型以解决复杂逆问题,结果表明标准方法因固有偏差而难以在潜在空间进行有效推断。
English Summary: This study applies Relative Trajectory Balance to train conditional diffusion models for solving complex inverse problems, demonstrating that standard methods fail in latent space inference due to inherent biases.

Authors:Farshad Rostami Ghadi, Kai-Kit Wong, Masoud Kaveh, F. Javier Lopez-Martinez, Yuanwei Liu, Chan-Byoung Chae, Ross Murch
Title: Phase-mismatched STAR-RIS with FAS-assisted RSMA Users
Abstract:
This paper considers communication between a base station (BS) to two users, each from one side of a simultaneously transmitting-reflecting reconfigurable intelligent surface (STAR-RIS) in the absence of a direct link. Rate-splitting multiple access (RSMA) strategy is employed and the STAR-RIS is subjected to phase errors. The users are equipped with a planar fluid antenna system (FAS) with position reconfigurability for spatial diversity. First, we derive the distribution of the equivalent channel gain at the FAS-equipped users, characterized by a t-distribution. We then obtain analytical expressions for the outage probability (OP) and average capacity (AC), with the latter obtained via a heuristic approach. Our findings highlight the potential of FAS to mitigate phase imperfections in STAR-RIS-assisted communications, significantly enhancing system performance compared to traditional antenna systems (TAS). Also, we quantify the impact of practical phase errors on system efficiency, emphasizing the importance of robust strategies for next-generation wireless networks.
Chinese: 本研究探讨了在STAR-RIS辅助通信中采用速率分割多址接入和流体天线系统来减轻相位误差的影响,相比传统天线系统显著提升了性能表现。
English: This study explores the use of rate-splitting multiple access and fluid antenna systems to mitigate phase errors in STAR-RIS-assisted communications, demonstrating significant performance improvements over traditional antenna systems.

Authors:Yifei Deng, Chenglong Li, Zhenyu Chen, Zihen Xu, Jin Tang
Title: Decoupled Cross-Modal Alignment Network for Text-RGBT Person Retrieval and A High-Quality Benchmark
Abstract:
The performance of traditional text-image person retrieval task is easily affected by lighting variations due to imaging limitations of visible spectrum sensors. In recent years, cross-modal information fusion has emerged as an effective strategy to enhance retrieval robustness. By integrating complementary information from different spectral modalities, it becomes possible to achieve more stable person recognition and matching under complex real-world conditions. Motivated by this, we introduce a novel task: Text-RGBT Person Retrieval, which incorporates cross-spectrum information fusion by combining the complementary cues from visible and thermal modalities for robust person retrieval in challenging environments. The key challenge of Text-RGBT person retrieval lies in aligning text with multi-modal visual features. However, the inherent heterogeneity between visible and thermal modalities may interfere with the alignment between vision and language. To handle this problem, we propose a Decoupled Cross-modal Alignment network (DCAlign), which sufficiently mines the relationships between modality-specific and modality-collaborative visual with the text, for Text-RGBT person retrieval. To promote the research and development of this field, we create a high-quality Text-RGBT person retrieval dataset, RGBT-PEDES. RGBT-PEDES contains 1,822 identities from different age groups and genders with 4,723 pairs of calibrated RGB and T images, and covers high-diverse scenes from both daytime and nighttime with a various of challenges such as occlusion, weak alignment and adverse lighting conditions. Additionally, we carefully annotate 7,987 fine-grained textual descriptions for all RGBT person image pairs. Extensive experiments on RGBT-PEDES demonstrate that our method outperforms existing text-image person retrieval methods.
中文: 传统文本-图像行人检索易受光照变化影响,因此引入新型文本-RGBT行人检索任务,通过融合可见光与热成像模态的互补信息,提出了解耦跨模态对齐网络(DCAlign),并在新构建的RGBT-PEDES数据集上验证了其优越性能。
English: Traditional text-image person retrieval is vulnerable to lighting variations, prompting the introduction of a novel Text-RGBT Person Retrieval task that leverages cross-spectrum fusion of visible and thermal modalities, addressed by a Decoupled Cross-modal Alignment network (DCAlign) and validated on a newly created RGBT-PEDES dataset.

Authors:Junkang Wu, Kexin Huang, Xue Wang, Jinyang Gao, Bolin Ding, Jiancan Wu, Xiangnan He, Xiang Wang
Title: RePO: ReLU-based Preference Optimization
Abstract:
Aligning large language models (LLMs) with human preferences is critical for real-world deployment, yet existing methods like RLHF face computational and stability challenges. While DPO establishes an offline paradigm with single hyperparameter $β$, subsequent methods like SimPO reintroduce complexity through dual parameters ($β$, $γ$). We propose {ReLU-based Preference Optimization (RePO)}, a streamlined algorithm that eliminates $β$ via two advances: (1) retaining SimPO's reference-free margins but removing $β$ through gradient analysis, and (2) adopting a ReLU-based max-margin loss that naturally filters trivial pairs. Theoretically, RePO is characterized as SimPO's limiting case ($β\to \infty$), where the logistic weighting collapses to binary thresholding, forming a convex envelope of the 0-1 loss. Empirical results on AlpacaEval 2 and Arena-Hard show that RePO outperforms DPO and SimPO across multiple base models, requiring only one hyperparameter to tune.
中文: RePO提出了一种简化算法,通过梯度分析和基于ReLU的最大间隔损失消除了超参数β,仅需调整一个超参数即可在性能上超越DPO和SimPO。
English: RePO introduces a streamlined algorithm that eliminates the hyperparameter β through gradient analysis and a ReLU-based max-margin loss, outperforming DPO and SimPO with only one tunable hyperparameter.

Authors:Hao Xu, Tengfei Xue, Dongnan Liu, Yuqian Chen, Fan Zhang, Carl-Fredrik Westin, Ron Kikinis, Lauren J. O'Donnell, Weidong Cai
Title: MultiCo3D: Multi-Label Voxel Contrast for One-Shot Incremental Segmentation of 3D Neuroimages
Abstract:
3D neuroimages provide a comprehensive view of brain structure and function, aiding in precise localization and functional connectivity analysis. Segmentation of white matter (WM) tracts using 3D neuroimages is vital for understanding the brain's structural connectivity in both healthy and diseased states. One-shot Class Incremental Semantic Segmentation (OCIS) refers to effectively segmenting new (novel) classes using only a single sample while retaining knowledge of old (base) classes without forgetting. Voxel-contrastive OCIS methods adjust the feature space to alleviate the feature overlap problem between the base and novel classes. However, since WM tract segmentation is a multi-label segmentation task, existing single-label voxel contrastive-based methods may cause inherent contradictions. To address this, we propose a new multi-label voxel contrast framework called MultiCo3D for one-shot class incremental tract segmentation. Our method utilizes uncertainty distillation to preserve base tract segmentation knowledge while adjusting the feature space with multi-label voxel contrast to alleviate feature overlap when learning novel tracts and dynamically weighting multi losses to balance overall loss. We compare our method against several state-of-the-art (SOTA) approaches. The experimental results show that our method significantly enhances one-shot class incremental tract segmentation accuracy across five different experimental setups on HCP and Preto datasets.
中文摘要:提出的MultiCo3D框架通过结合不确定性蒸馏与多标签体素对比方法,有效解决了单次类增量白质束分割中的特征重叠问题,在多个实验设置中显著优于现有先进方法。
English Summary: The proposed MultiCo3D framework addresses limitations in one-shot class incremental white matter tract segmentation by combining uncertainty distillation with multi-label voxel contrast, significantly outperforming existing methods across multiple experimental setups.

Authors:Piotr Żelasko, Kunal Dhawan, Daniel Galvez, Krishna C. Puvvada, Ankita Pasad, Nithin Rao Koluguri, Ke Hu, Vitaly Lavrukhin, Jagadeesh Balam, Boris Ginsburg
Title: Training and Inference Efficiency of Encoder-Decoder Speech Models
Abstract:
Attention encoder-decoder model architecture is the backbone of several recent top performing foundation speech models: Whisper, Seamless, OWSM, and Canary-1B. However, the reported data and compute requirements for their training are prohibitive for many in the research community. In this work, we focus on the efficiency angle and ask the questions of whether we are training these speech models efficiently, and what can we do to improve? We argue that a major, if not the most severe, detrimental factor for training efficiency is related to the sampling strategy of sequential data. We show that negligence in mini-batch sampling leads to more than 50% computation being spent on padding. To that end, we study, profile, and optimize Canary-1B training to show gradual improvement in GPU utilization leading up to 5x increase in average batch sizes versus its original training settings. This in turn allows us to train an equivalent model using 4x less GPUs in the same wall time, or leverage the original resources and train it in 2x shorter wall time. Finally, we observe that the major inference bottleneck lies in the autoregressive decoder steps. We find that adjusting the model architecture to transfer model parameters from the decoder to the encoder results in a 3x inference speedup as measured by inverse real-time factor (RTFx) while preserving the accuracy and compute requirements for convergence. The training code and models will be available as open-source.
中文: 主流语音模型的注意力编码器-解码器架构因序列数据采样低效导致超50%计算浪费于填充,但通过优化训练和调整模型参数,可在保持精度下实现批量增大5倍、GPU减少4倍及推理加速3倍。
English: The attention encoder-decoder architecture in top speech models suffers from inefficient sequential data sampling, causing over 50% computational waste on padding, but optimizing training and adjusting model parameters can achieve up to 5x larger batch sizes, 4x fewer GPUs, and 3x faster inference while maintaining accuracy.

Authors:Junyuan Mao, Fanci Meng, Yifan Duan, Miao Yu, Xiaojun Jia, Junfeng Fang, Yuxuan Liang, Kun Wang, Qingsong Wen
Title: AgentSafe: Safeguarding Large Language Model-based Multi-agent Systems via Hierarchical Data Management
Abstract:
Large Language Model based multi-agent systems are revolutionizing autonomous communication and collaboration, yet they remain vulnerable to security threats like unauthorized access and data breaches. To address this, we introduce AgentSafe, a novel framework that enhances MAS security through hierarchical information management and memory protection. AgentSafe classifies information by security levels, restricting sensitive data access to authorized agents. AgentSafe incorporates two components: ThreatSieve, which secures communication by verifying information authority and preventing impersonation, and HierarCache, an adaptive memory management system that defends against unauthorized access and malicious poisoning, representing the first systematic defense for agent memory. Experiments across various LLMs show that AgentSafe significantly boosts system resilience, achieving defense success rates above 80% under adversarial conditions. Additionally, AgentSafe demonstrates scalability, maintaining robust performance as agent numbers and information complexity grow. Results underscore effectiveness of AgentSafe in securing MAS and its potential for real-world application.
Chinese: AgentSafe是一种新型框架,通过分层信息管理和内存保护增强多智能体系统的安全性,在实验中防御成功率超过80%,并展现出强大的可扩展性。
English: AgentSafe is a new framework that enhances the security of multi-agent systems by using hierarchical information management and memory protection, achieving over 80% defense success rates in tests and showing strong scalability.

Authors:Tianyi Liao, Wei Guo, Hengtao He, Shenghui Song, Jun Zhang, Khaled B. Letaief
Title: Joint Beamforming and Antenna Position Optimization for Fluid Antenna-Assisted MU-MIMO Networks
Abstract:
The fluid antenna system (FAS) is a disruptive tech-nology for future wireless communication networks. This paper considers the joint optimization of beamforming matrices and antenna positions for weighted sum rate (WSR) maximization in fluid antenna (FA)-assisted multiuser multiple-input multiple-output (MU-MIMO) networks, which presents significant chal-lenges due to the strong coupling between beamforming and FA positions, the non-concavity of the WSR objective function, and high computational complexity. To address these challenges, we first propose a novel block coordinate ascent (BCA)-based method that employs matrix fractional programming techniques to reformulate the original complex problem into a more tractable form. Then, we develop a parallel majorization maximization (MM) algorithm capable of optimizing all FA positions simul-taneously. To further reduce computational costs, we propose a decentralized implementation based on the decentralized base-band processing (DBP) architecture. Simulation results demon-strate that our proposed algorithm not only achieves significant WSR improvements over conventional MIMO networks but also outperforms the existing method. Moreover, the decentralized implementation substantially reduces computation time while maintaining similar performance compared with the centralized implementation.
中文: 本文提出了一种结合块坐标上升与并行最大化技术的创新算法,通过联合优化多用户MIMO网络中的波束成形和流体天线位置,在降低计算复杂度的同时实现了比传统系统更优的加权和速率性能,并采用分布式架构进一步提升效率。
English: This paper introduces a novel algorithm combining block coordinate ascent and parallel majorization maximization to optimize beamforming and fluid antenna positions in multiuser MIMO networks, achieving significant weighted sum rate improvements over conventional systems with reduced computational complexity through decentralized implementation.

Authors:Changliang Zhou, Xi Lin, Zhenkun Wang, Qingfu Zhang
Title: Learning to Reduce Search Space for Generalizable Neural Routing Solver
Abstract:
Constructive neural combinatorial optimization (NCO) has attracted growing research attention due to its ability to solve complex routing problems without relying on handcrafted rules. However, existing NCO methods face significant challenges in generalizing to large-scale problems due to high computational complexity and inefficient capture of structural patterns. To address this issue, we propose a novel learning-based search space reduction method that adaptively selects a small set of promising candidate nodes at each step of the constructive NCO process. Unlike traditional methods that rely on fixed heuristics, our selection model dynamically prioritizes nodes based on learned patterns, significantly reducing the search space while maintaining solution quality. Experimental results demonstrate that our method, trained solely on 100-node instances from uniform distribution, generalizes remarkably well to large-scale Traveling Salesman Problem (TSP) and Capacitated Vehicle Routing Problem (CVRP) instances with up to 1 million nodes from the uniform distribution and over 80K nodes from other distributions.
Chinese: 针对构造式神经组合优化难以扩展至大规模问题,我们提出一种基于学习的搜索空间缩减方法,通过动态选择候选节点,实现了对百万节点规模问题的泛化能力且保持解的质量。
English: Constructive neural combinatorial optimization faces scalability challenges, so we propose a learning-based search space reduction method that dynamically selects promising nodes, enabling generalization to large-scale problems with up to one million nodes while maintaining solution quality.

Authors:Kexin Huang, Junkang Wu, Ziqian Chen, Xue Wang, Jinyang Gao, Bolin Ding, Jiancan Wu, Xiangnan He, Xiang Wang
Title: Larger or Smaller Reward Margins to Select Preferences for Alignment?
Abstract:
Preference learning is critical for aligning large language models (LLMs) with human values, with the quality of preference datasets playing a crucial role in this process. While existing metrics primarily assess data quality based on either explicit or implicit reward margins, they often provide contradictory evaluations for the same data. To address this issue, we introduce the alignment potential metric, which quantifies the gap from the model's current implicit reward margin to the target explicit reward margin, thereby estimating the model's potential to align with the preference data. Empirical results demonstrate that training on data selected by this metric consistently enhances alignment performance, surpassing existing metrics across different base models and optimization objectives. Furthermore, our method extends to self-play data generation frameworks, where the metric is used to identify high-quality data within the self-generated content by LLMs. Under this data generation scenario, our method surpasses current state-of-the-art (SOTA) results across various training settings and demonstrates continuous improvements in alignment performance as dataset size and training iterations increase.
中文: 我们提出了对齐潜力指标,通过量化模型当前隐含奖励与目标显性奖励之间的差距来解决数据质量评估的矛盾,该指标在不同模型和自生成数据场景中均能持续提升对齐性能并超越现有最优结果。
English: The alignment potential metric is introduced to resolve contradictory data quality assessments by measuring the gap between implicit and explicit reward margins, consistently improving alignment performance across various models and enabling superior self-play data generation.

Authors:Jiahui Geng, Qing Li, Herbert Woisetschlaeger, Zongxiong Chen, Fengyu Cai, Yuxia Wang, Preslav Nakov, Hans-Arno Jacobsen, Fakhri Karray
Title: A Comprehensive Survey of Machine Unlearning Techniques for Large Language Models
Abstract:
This study investigates the machine unlearning techniques within the context of large language models (LLMs), referred to as \textit{LLM unlearning}. LLM unlearning offers a principled approach to removing the influence of undesirable data (e.g., sensitive or illegal information) from LLMs, while preserving their overall utility without requiring full retraining. Despite growing research interest, there is no comprehensive survey that systematically organizes existing work and distills key insights; here, we aim to bridge this gap. We begin by introducing the definition and the paradigms of LLM unlearning, followed by a comprehensive taxonomy of existing unlearning studies. Next, we categorize current unlearning approaches, summarizing their strengths and limitations. Additionally, we review evaluation metrics and benchmarks, providing a structured overview of current assessment methodologies. Finally, we outline promising directions for future research, highlighting key challenges and opportunities in the field.
中文: 本研究对大型语言模型的机器遗忘技术进行全面综述,系统梳理了现有方法、评估指标及未来研究方向,旨在有效移除不良数据影响的同时保持模型性能。
English: This study provides a comprehensive survey on machine unlearning techniques for large language models, systematically organizing existing approaches, evaluation metrics, and future research directions to remove undesirable data while maintaining model utility.

Authors:Fajri Koto, Rituraj Joshi, Nurdaulet Mukhituly, Yuxia Wang, Zhuohan Xie, Rahul Pal, Daniil Orel, Parvez Mullah, Diana Turmakhan, Maiya Goloburda, Mohammed Kamran, Samujjwal Ghosh, Bokang Jia, Jonibek Mansurov, Mukhammed Togmanov, Debopriyo Banerjee, Nurkhan Laiyk, Akhmed Sakip, Xudong Han, Ekaterina Kochmar, Alham Fikri Aji, Aaryamonvikram Singh, Alok Anil Jadhav, Satheesh Katipomu, Samta Kamboj, Monojit Choudhury, Gurpreet Gosal, Gokul Ramakrishnan, Biswajit Mishra, Sarath Chandran, Avraham Sheinin, Natalia Vassilieva, Neha Sengupta, Larry Murray, Preslav Nakov
Title: Llama-3.1-Sherkala-8B-Chat: An Open Large Language Model for Kazakh
Abstract:
Llama-3.1-Sherkala-8B-Chat, or Sherkala-Chat (8B) for short, is a state-of-the-art instruction-tuned open generative large language model (LLM) designed for Kazakh. Sherkala-Chat (8B) aims to enhance the inclusivity of LLM advancements for Kazakh speakers. Adapted from the LLaMA-3.1-8B model, Sherkala-Chat (8B) is trained on 45.3B tokens across Kazakh, English, Russian, and Turkish. With 8 billion parameters, it demonstrates strong knowledge and reasoning abilities in Kazakh, significantly outperforming existing open Kazakh and multilingual models of similar scale while achieving competitive performance in English. We release Sherkala-Chat (8B) as an open-weight instruction-tuned model and provide a detailed overview of its training, fine-tuning, safety alignment, and evaluation, aiming to advance research and support diverse real-world applications.
中文:Sherkala-Chat (8B) 是一款专为哈萨克语设计的先进开放生成语言模型,通过多语言训练和负责任的对齐,在知识和推理方面表现卓越,旨在提升哈萨克语使用者的包容性。
English: Sherkala-Chat (8B) is an advanced open generative language model tailored for Kazakh, excelling in knowledge and reasoning while promoting inclusivity through multilingual training and responsible alignment.

Authors:Fajri Koto, Rituraj Joshi, Nurdaulet Mukhituly, Yuxia Wang, Zhuohan Xie, Rahul Pal, Daniil Orel, Parvez Mullah, Diana Turmakhan, Maiya Goloburda, Mohammed Kamran, Samujjwal Ghosh, Bokang Jia, Jonibek Mansurov, Mukhammed Togmanov, Debopriyo Banerjee, Nurkhan Laiyk, Akhmed Sakip, Xudong Han, Ekaterina Kochmar, Alham Fikri Aji, Aaryamonvikram Singh, Alok Anil Jadhav, Satheesh Katipomu, Samta Kamboj, Monojit Choudhury, Gurpreet Gosal, Gokulakrishnan Ramakrishnan, Biswajit Mishra, Sarath Chandran, Avraham Sheinin, Natalia Vassilieva, Neha Sengupta, Preslav Nakov
Title: Sherkala-Chat: Building a State-of-the-Art LLM for Kazakh in a Moderately Resourced Setting
Abstract:
Llama-3.1-Sherkala-8B-Chat, or Sherkala-Chat (8B) for short, is a state-of-the-art instruction-tuned open generative large language model (LLM) designed for Kazakh. Sherkala-Chat (8B) aims to enhance the inclusivity of LLM advancements for Kazakh speakers. Adapted from the LLaMA-3.1-8B model, Sherkala-Chat (8B) is trained on 45.3B tokens across Kazakh, English, Russian, and Turkish. With 8 billion parameters, it demonstrates strong knowledge and reasoning abilities in Kazakh, significantly outper-forming existing open Kazakh and multilingual models of similar scale while achieving competitive performance in English. To ensure effective and responsible alignment, we leverage translated instruction datasets, a Kazakhstan-specific instruction dataset that is automatically constructed and manually verified, and Kazakh-specific safety data. We release Sherkala-Chat (8B) as an open-weight model, along with a detailed description of its training, alignment, and evaluation, to support research and real-world applications for Kazakh speakers.
中文:Sherkala-Chat (8B) 是一款专为哈萨克语设计的先进开放生成语言模型,通过多语言训练和负责任的对齐,在知识和推理方面表现卓越,旨在提升哈萨克语使用者的包容性。
English: Sherkala-Chat (8B) is an advanced open generative language model tailored for Kazakh, excelling in knowledge and reasoning while promoting inclusivity through multilingual training and responsible alignment.

Authors:Tianyu Huai, Jie Zhou, Xingjiao Wu, Qin Chen, Qingchun Bai, Ze Zhou, Liang He
Title: CL-MoE: Enhancing Multimodal Large Language Model with Dual Momentum Mixture-of-Experts for Continual Visual Question Answering
Abstract:
Multimodal large language models (MLLMs) have garnered widespread attention from researchers due to their remarkable understanding and generation capabilities in visual language tasks (e.g., visual question answering). However, the rapid pace of knowledge updates in the real world makes offline training of MLLMs costly, and when faced with non-stationary data streams, MLLMs suffer from catastrophic forgetting during learning. In this paper, we propose an MLLMs-based dual momentum Mixture-of-Experts (CL-MoE) framework for continual visual question answering (VQA). We integrate MLLMs with continual learning to utilize the rich commonsense knowledge in LLMs. We introduce a Dual-Router MoE (RMoE) strategy to select the global and local experts using task-level and instance-level routers, to robustly assign weights to the experts most appropriate for the task. Then, we design a dynamic Momentum MoE (MMoE) to update the parameters of experts dynamically based on the relationships between the experts and tasks/instances, so that the model can absorb new knowledge while maintaining existing knowledge. The extensive experimental results indicate that our method achieves state-of-the-art performance on 10 VQA tasks, proving the effectiveness of our approach.
中文: 本文提出了一种基于双动量专家混合(CL-MoE)的框架,通过任务级和实例级路由器动态更新专家参数,使多模态大语言模型在持续学习视觉问答任务时既能吸收新知识又能保留原有知识,有效解决了灾难性遗忘问题。
English: This paper introduces a dual momentum Mixture-of-Experts (CL-MoE) framework that integrates multimodal large language models with continual learning, employing task-level and instance-level routers to dynamically update expert parameters for effective visual question answering while preventing catastrophic forgetting.

Authors:Jing Xu, Franziska Boenisch, Adam Dziedzic
Title: ADAGE: Active Defenses Against GNN Extraction
Abstract:
Graph Neural Networks (GNNs) achieve high performance in various real-world applications, such as drug discovery, traffic states prediction, and recommendation systems. The fact that building powerful GNNs requires a large amount of training data, powerful computing resources, and human expertise turns the models into lucrative targets for model stealing attacks. Prior work has revealed that the threat vector of stealing attacks against GNNs is large and diverse, as an attacker can leverage various heterogeneous signals ranging from node labels to high-dimensional node embeddings to create a local copy of the target GNN at a fraction of the original training costs. This diversity in the threat vector renders the design of effective and general defenses challenging and existing defenses usually focus on one particular stealing setup. Additionally, they solely provide means to identify stolen model copies rather than preventing the attack. To close this gap, we propose the first and general Active Defense Against GNN Extraction (ADAGE). ADAGE builds on the observation that stealing a model's full functionality requires highly diverse queries to leak its behavior across the input space. Our defense monitors this query diversity and progressively perturbs outputs as the accumulated leakage grows. In contrast to prior work, ADAGE can prevent stealing across all common attack setups. Our extensive experimental evaluation using six benchmark datasets, four GNN models, and three types of adaptive attackers shows that ADAGE penalizes attackers to the degree of rendering stealing impossible, whilst preserving predictive performance on downstream tasks. ADAGE, thereby, contributes towards securely sharing valuable GNNs in the future.
中文: 图神经网络(GNNs)易受模型窃取攻击,而提出的ADAGE防御方法通过监控查询多样性并逐步扰动输出,能有效阻止各类攻击,同时保持模型性能。
English: Graph Neural Networks (GNNs) are vulnerable to model stealing attacks, but the proposed ADAGE defense effectively prevents such attacks across all setups by monitoring query diversity and perturbing outputs, while maintaining model performance.

Authors:Haochen Liu, Song Wang, Chen Chen, Jundong Li
Title: Question-Aware Knowledge Graph Prompting for Enhancing Large Language Models
Abstract:
Large Language Models (LLMs) often struggle with tasks requiring external knowledge, such as knowledge-intensive Multiple Choice Question Answering (MCQA). Integrating Knowledge Graphs (KGs) can enhance reasoning; however, existing methods typically demand costly fine-tuning or retrieve noisy KG information. Recent approaches leverage Graph Neural Networks (GNNs) to generate KG-based input embedding prefixes as soft prompts for LLMs but fail to account for question relevance, resulting in noisy prompts. Moreover, in MCQA tasks, the absence of relevant KG knowledge for certain answer options remains a significant challenge. To address these issues, we propose Question-Aware Knowledge Graph Prompting (QAP), which incorporates question embeddings into GNN aggregation to dynamically assess KG relevance. QAP employs global attention to capture inter-option relationships, enriching soft prompts with inferred knowledge. Experimental results demonstrate that QAP outperforms state-of-the-art methods across multiple datasets, highlighting its effectiveness.
中文: 提出的问题感知知识图谱提示方法通过动态整合问题相关知识图谱信息并利用全局注意力机制应对知识缺失,显著提升了大型语言模型在知识密集型任务中的表现,并在多个数据集上取得领先结果。
English: The proposed Question-Aware Knowledge Graph Prompting (QAP) method enhances LLM performance in knowledge-intensive tasks by dynamically integrating question-relevant KG information and leveraging global attention to address missing knowledge, achieving superior results across multiple datasets.

Authors:Song Wang, Junhong Lin, Xiaojie Guo, Julian Shun, Jundong Li, Yada Zhu
Title: Reasoning of Large Language Models over Knowledge Graphs with Super-Relations
Abstract:
While large language models (LLMs) have made significant progress in processing and reasoning over knowledge graphs, current methods suffer from a high non-retrieval rate. This limitation reduces the accuracy of answering questions based on these graphs. Our analysis reveals that the combination of greedy search and forward reasoning is a major contributor to this issue. To overcome these challenges, we introduce the concept of super-relations, which enables both forward and backward reasoning by summarizing and connecting various relational paths within the graph. This holistic approach not only expands the search space, but also significantly improves retrieval efficiency. In this paper, we propose the ReKnoS framework, which aims to Reason over Knowledge Graphs with Super-Relations. Our framework's key advantages include the inclusion of multiple relation paths through super-relations, enhanced forward and backward reasoning capabilities, and increased efficiency in querying LLMs. These enhancements collectively lead to a substantial improvement in the successful retrieval rate and overall reasoning performance. We conduct extensive experiments on nine real-world datasets to evaluate ReKnoS, and the results demonstrate the superior performance of ReKnoS over existing state-of-the-art baselines, with an average accuracy gain of 2.92%.
Chinese: 针对知识图谱推理中高未检索率的问题,我们提出ReKnoS框架,通过超关系概念实现双向推理并扩展搜索空间,在九个真实数据集上平均准确率提升2.92%,显著提高了检索效率和整体性能。
English: To address the high non-retrieval rate in knowledge graph reasoning, we introduce the ReKnoS framework with super-relations, enabling bidirectional reasoning and expanding the search space to significantly boost retrieval efficiency and accuracy by an average of 2.92% across nine datasets.

Authors:Sen Zhang, Qingqing Ye, Haibo Hu, Jianliang Xu
Title: AdvSGM: Differentially Private Graph Learning via Adversarial Skip-gram Model
Abstract:
The skip-gram model (SGM), which employs a neural network to generate node vectors, serves as the basis for numerous popular graph embedding techniques. However, since the training datasets contain sensitive linkage information, the parameters of a released SGM may encode private information and pose significant privacy risks. Differential privacy (DP) is a rigorous standard for protecting individual privacy in data analysis. Nevertheless, when applying differential privacy to skip-gram in graphs, it becomes highly challenging due to the complex link relationships, which potentially result in high sensitivity and necessitate substantial noise injection. To tackle this challenge, we present AdvSGM, a differentially private skip-gram for graphs via adversarial training. Our core idea is to leverage adversarial training to privatize skip-gram while improving its utility. Towards this end, we develop a novel adversarial training module by devising two optimizable noise terms that correspond to the parameters of a skip-gram. By fine-tuning the weights between modules within AdvSGM, we can achieve differentially private gradient updates without additional noise injection. Extensive experimental results on six real-world graph datasets show that AdvSGM preserves high data utility across different downstream tasks.
中文: 图嵌入中的跳字模型因包含敏感链接信息而存在隐私风险,但提出的AdvSGM通过对抗性训练与可优化噪声,无需额外噪声注入即可实现差分隐私,并在下游任务中保持高数据效用。
English: The skip-gram model in graph embedding poses privacy risks due to sensitive linkage data, but the proposed AdvSGM uses adversarial training with optimizable noise to achieve differential privacy without extra noise while maintaining high utility in downstream tasks.

Authors:Yujie Liu, Zonglin Yang, Tong Xie, Jinjie Ni, Ben Gao, Yuqiang Li, Shixiang Tang, Wanli Ouyang, Erik Cambria, Dongzhan Zhou
Title: ResearchBench: Benchmarking LLMs in Scientific Discovery via Inspiration-Based Task Decomposition
Abstract:
Large language models (LLMs) have demonstrated potential in assisting scientific research, yet their ability to discover high-quality research hypotheses remains unexamined due to the lack of a dedicated benchmark. To address this gap, we introduce the first large-scale benchmark for evaluating LLMs with a near-sufficient set of sub-tasks of scientific discovery: inspiration retrieval, hypothesis composition, and hypothesis ranking. We develop an automated framework that extracts critical components - research questions, background surveys, inspirations, and hypotheses - from scientific papers across 12 disciplines, with expert validation confirming its accuracy. To prevent data contamination, we focus exclusively on papers published in 2024, ensuring minimal overlap with LLM pretraining data. Our evaluation reveals that LLMs perform well in retrieving inspirations, an out-of-distribution task, suggesting their ability to surface novel knowledge associations. This positions LLMs as "research hypothesis mines", capable of facilitating automated scientific discovery by generating innovative hypotheses at scale with minimal human intervention.
中文摘要:大语言模型在跨学科研究中展现出作为“研究假设挖掘器”的潜力,尤其在检索新颖灵感方面表现突出,但其完整的科学发现能力仍需通过专门基准测试来系统评估。
English Summary: Large language models show promise as "research hypothesis mines" by excelling at retrieving novel inspirations and generating innovative hypotheses across disciplines, though their full potential requires dedicated benchmarks to evaluate scientific discovery capabilities.

Authors:Yuhao Huang, Ao Chang, Haoran Dou, Xing Tao, Xinrui Zhou, Yan Cao, Ruobing Huang, Alejandro F Frangi, Lingyun Bao, Xin Yang, Dong Ni
Title: Flip Learning: Weakly Supervised Erase to Segment Nodules in Breast Ultrasound
Abstract:
Accurate segmentation of nodules in both 2D breast ultrasound (BUS) and 3D automated breast ultrasound (ABUS) is crucial for clinical diagnosis and treatment planning. Therefore, developing an automated system for nodule segmentation can enhance user independence and expedite clinical analysis. Unlike fully-supervised learning, weakly-supervised segmentation (WSS) can streamline the laborious and intricate annotation process. However, current WSS methods face challenges in achieving precise nodule segmentation, as many of them depend on inaccurate activation maps or inefficient pseudo-mask generation algorithms. In this study, we introduce a novel multi-agent reinforcement learning-based WSS framework called Flip Learning, which relies solely on 2D/3D boxes for accurate segmentation. Specifically, multiple agents are employed to erase the target from the box to facilitate classification tag flipping, with the erased region serving as the predicted segmentation mask. The key contributions of this research are as follows: (1) Adoption of a superpixel/supervoxel-based approach to encode the standardized environment, capturing boundary priors and expediting the learning process. (2) Introduction of three meticulously designed rewards, comprising a classification score reward and two intensity distribution rewards, to steer the agents' erasing process precisely, thereby avoiding both under- and over-segmentation. (3) Implementation of a progressive curriculum learning strategy to enable agents to interact with the environment in a progressively challenging manner, thereby enhancing learning efficiency. Extensively validated on the large in-house BUS and ABUS datasets, our Flip Learning method outperforms state-of-the-art WSS methods and foundation models, and achieves comparable performance as fully-supervised learning algorithms.
中文: 本研究提出名为翻转学习的多智能体强化学习框架,仅使用2D/3D边界框实现乳腺超声结节的弱监督分割,通过创新的环境编码和奖励机制,达到了与全监督方法相当的性能。
English: This study introduces Flip Learning, a multi-agent reinforcement learning framework that uses only 2D/3D boxes for weakly-supervised nodule segmentation in breast ultrasound, achieving performance comparable to fully-supervised methods through innovative environment encoding and reward mechanisms.

Authors:Tianqi He, Xiaohan Huang, Yi Du, Qingqing Long, Ziyue Qiao, Min Wu, Yanjie Fu, Yuanchun Zhou, Meng Xiao
Title: FastFT: Accelerating Reinforced Feature Transformation via Advanced Exploration Strategies
Abstract:
Feature Transformation is crucial for classic machine learning that aims to generate feature combinations to enhance the performance of downstream tasks from a data-centric perspective. Current methodologies, such as manual expert-driven processes, iterative-feedback techniques, and exploration-generative tactics, have shown promise in automating such data engineering workflow by minimizing human involvement. However, three challenges remain in those frameworks: (1) It predominantly depends on downstream task performance metrics, as assessment is time-consuming, especially for large datasets. (2) The diversity of feature combinations will hardly be guaranteed after random exploration ends. (3) Rare significant transformations lead to sparse valuable feedback that hinders the learning processes or leads to less effective results. In response to these challenges, we introduce FastFT, an innovative framework that leverages a trio of advanced strategies.We first decouple the feature transformation evaluation from the outcomes of the generated datasets via the performance predictor. To address the issue of reward sparsity, we developed a method to evaluate the novelty of generated transformation sequences. Incorporating this novelty into the reward function accelerates the model's exploration of effective transformations, thereby improving the search productivity. Additionally, we combine novelty and performance to create a prioritized memory buffer, ensuring that essential experiences are effectively revisited during exploration. Our extensive experimental evaluations validate the performance, efficiency, and traceability of our proposed framework, showcasing its superiority in handling complex feature transformation tasks.
中文摘要:特征变换在机器学习中至关重要,旨在生成特征组合以提升下游任务性能,但现有方法在评估耗时、多样性不足和反馈稀疏方面存在挑战;FastFT框架通过解耦评估、引入新颖性奖励和优先级记忆缓冲,有效解决了这些问题,提高了处理复杂特征变换任务的效率和性能。
English Summary: Feature Transformation is essential in machine learning for creating feature combinations to improve task performance, but current methods face challenges in evaluation time, diversity, and sparse feedback, which FastFT addresses by decoupling evaluation, incorporating novelty in rewards, and using a prioritized memory buffer to enhance efficiency and effectiveness.

Authors:Yaofei Duan, Tao Tan, Zhiyuan Zhu, Yuhao Huang, Yuanji Zhang, Rui Gao, Patrick Cheong-Iao Pang, Xinru Gao, Guowei Tao, Xiang Cong, Zhou Li, Lianying Liang, Guangzhi He, Linliang Yin, Xuedong Deng, Xin Yang, Dong Ni
Title: FetalFlex: Anatomy-Guided Diffusion Model for Flexible Control on Fetal Ultrasound Image Synthesis
Abstract:
Fetal ultrasound (US) examinations require the acquisition of multiple planes, each providing unique diagnostic information to evaluate fetal development and screening for congenital anomalies. However, obtaining a comprehensive, multi-plane annotated fetal US dataset remains challenging, particularly for rare or complex anomalies owing to their low incidence and numerous subtypes. This poses difficulties in training novice radiologists and developing robust AI models, especially for detecting abnormal fetuses. In this study, we introduce a Flexible Fetal US image generation framework (FetalFlex) to address these challenges, which leverages anatomical structures and multimodal information to enable controllable synthesis of fetal US images across diverse planes. Specifically, FetalFlex incorporates a pre-alignment module to enhance controllability and introduces a repaint strategy to ensure consistent texture and appearance. Moreover, a two-stage adaptive sampling strategy is developed to progressively refine image quality from coarse to fine levels. We believe that FetalFlex is the first method capable of generating both in-distribution normal and out-of-distribution abnormal fetal US images, without requiring any abnormal data. Experiments on multi-center datasets demonstrate that FetalFlex achieved state-of-the-art performance across multiple image quality metrics. A reader study further confirms the close alignment of the generated results with expert visual assessments. Furthermore, synthetic images by FetalFlex significantly improve the performance of six typical deep models in downstream classification and anomaly detection tasks. Lastly, FetalFlex's anatomy-level controllable generation offers a unique advantage for anomaly simulation and creating paired or counterfactual data at the pixel level. The demo is available at: https://dyf1023.github.io/FetalFlex/.
中文摘要:本研究提出的FetalFlex框架无需异常数据即可生成多平面胎儿超声图像,通过可控合成技术显著提升AI模型训练效果与异常检测能力,在多项评估中表现优异。
English Summary: The study introduces FetalFlex, a novel framework that generates high-quality fetal ultrasound images across multiple planes without abnormal data, enhancing AI model training and diagnostic capabilities through controllable synthesis.

Authors:Varich Boonsanong, Vidhisha Balachandran, Xiaochuang Han, Shangbin Feng, Lucy Lu Wang, Yulia Tsvetkov
Title: FACTS&EVIDENCE: An Interactive Tool for Transparent Fine-Grained Factual Verification of Machine-Generated Text
Abstract:
With the widespread consumption of AI-generated content, there has been an increased focus on developing automated tools to verify the factual accuracy of such content. However, prior research and tools developed for fact verification treat it as a binary classification or a linear regression problem. Although this is a useful mechanism as part of automatic guardrails in systems, we argue that such tools lack transparency in the prediction reasoning and diversity in source evidence to provide a trustworthy user experience. We develop Facts&Evidence - an interactive and transparent tool for user-driven verification of complex text. The tool facilitates the intricate decision-making involved in fact-verification, presenting its users a breakdown of complex input texts to visualize the credibility of individual claims along with an explanation of model decisions and attribution to multiple, diverse evidence sources. Facts&Evidence aims to empower consumers of machine-generated text and give them agency to understand, verify, selectively trust and use such text.
中文: 本文介绍了Facts&Evidence工具,它通过分解复杂文本、可视化声明可信度并关联多元证据来源,旨在提升AI生成内容验证的透明度和用户参与度。
English: The paper introduces Facts&Evidence, an interactive tool designed to enhance the transparency and user-driven verification of AI-generated content by breaking down complex texts, visualizing claim credibility, and attributing evidence from diverse sources.

Authors:Qiushuo Hou, Sangwoo Park, Matteo Zecchin, Yunlong Cai, Guanding Yu, Osvaldo Simeone
Title: Online Conformal Probabilistic Numerics via Adaptive Edge-Cloud Offloading
Abstract:
Consider an edge computing setting in which a user submits queries for the solution of a linear system to an edge processor, which is subject to time-varying computing availability. The edge processor applies a probabilistic linear solver (PLS) so as to be able to respond to the user's query within the allotted time and computing budget. Feedback to the user is in the form of a set of plausible solutions. Due to model misspecification, the highest-probability-density (HPD) set obtained via a direct application of PLS does not come with coverage guarantees with respect to the true solution of the linear system. This work introduces a new method to calibrate the HPD sets produced by PLS with the aim of guaranteeing long-term coverage requirements. The proposed method, referred to as online conformal prediction-PLS (OCP-PLS), assumes sporadic feedback from cloud to edge. This enables the online calibration of uncertainty thresholds via online conformal prediction (OCP), an online optimization method previously studied in the context of prediction models. The validity of OCP-PLS is verified via experiments that bring insights into trade-offs between coverage, prediction set size, and cloud usage.
在边缘计算场景中,本研究提出OCP-PLS方法,通过在线共形预测校准概率线性求解器的输出,利用云端的间歇反馈来保证线性系统真解在长期统计中的覆盖可靠性。
In an edge computing scenario, this work introduces OCP-PLS, an online conformal prediction method that calibrates probabilistic linear solver outputs to maintain long-term coverage guarantees for linear system solutions despite sporadic cloud feedback and model limitations.

Authors:Yichao Zhang, Ningyuan Deng, Xinyuan Song, Ziqian Bi, Tianyang Wang, Zheyu Yao, Keyu Chen, Ming Li, Qian Niu, Junyu Liu, Benji Peng, Sen Zhang, Ming Liu, Li Zhang, Xuanhe Pan, Jinlang Wang, Pohsun Feng, Yizhu Wen, Lawrence KQ Yan, Hongming Tseng, Yan Zhong, Yunze Wang, Ziyuan Qin, Bowen Jing, Junjie Yang, Jun Zhou, Chia Xin Liang, Junhao Song
Title: Advanced Deep Learning Methods for Protein Structure Prediction and Design
Abstract:
After AlphaFold won the Nobel Prize, protein prediction with deep learning once again became a hot topic. We comprehensively explore advanced deep learning methods applied to protein structure prediction and design. It begins by examining recent innovations in prediction architectures, with detailed discussions on improvements such as diffusion based frameworks and novel pairwise attention modules. The text analyses key components including structure generation, evaluation metrics, multiple sequence alignment processing, and network architecture, thereby illustrating the current state of the art in computational protein modelling. Subsequent chapters focus on practical applications, presenting case studies that range from individual protein predictions to complex biomolecular interactions. Strategies for enhancing prediction accuracy and integrating deep learning techniques with experimental validation are thoroughly explored. The later sections review the industry landscape of protein design, highlighting the transformative role of artificial intelligence in biotechnology and discussing emerging market trends and future challenges. Supplementary appendices provide essential resources such as databases and open source tools, making this volume a valuable reference for researchers and students.
中文摘要:本文系统探讨了深度学习在蛋白质结构预测与设计中的前沿应用,涵盖架构创新、实践案例和行业变革,并附有重要研究资源供学者参考。
English Summary: This abstract comprehensively reviews cutting-edge deep learning applications in protein structure prediction and design, covering architectural innovations, practical implementations, and industry impacts while providing essential research resources.

Authors:Chuan Qin, Xin Chen, Chengrui Wang, Pengmin Wu, Xi Chen, Yihang Cheng, Jingyi Zhao, Meng Xiao, Xiangchao Dong, Qingqing Long, Boya Pan, Han Wu, Chengzan Li, Yuanchun Zhou, Hui Xiong, Hengshu Zhu
Title: SciHorizon: Benchmarking AI-for-Science Readiness from Scientific Data to Large Language Models
Abstract:
In recent years, the rapid advancement of Artificial Intelligence (AI) technologies, particularly Large Language Models (LLMs), has revolutionized the paradigm of scientific discovery, establishing AI-for-Science (AI4Science) as a dynamic and evolving field. However, there is still a lack of an effective framework for the overall assessment of AI4Science, particularly from a holistic perspective on data quality and model capability. Therefore, in this study, we propose SciHorizon, a comprehensive assessment framework designed to benchmark the readiness of AI4Science from both scientific data and LLM perspectives. First, we introduce a generalizable framework for assessing AI-ready scientific data, encompassing four key dimensions: Quality, FAIRness, Explainability, and Compliance-which are subdivided into 15 sub-dimensions. Drawing on data resource papers published between 2018 and 2023 in peer-reviewed journals, we present recommendation lists of AI-ready datasets for Earth, Life, and Materials Sciences, making a novel and original contribution to the field. Concurrently, to assess the capabilities of LLMs across multiple scientific disciplines, we establish 16 assessment dimensions based on five core indicators Knowledge, Understanding, Reasoning, Multimodality, and Values spanning Mathematics, Physics, Chemistry, Life Sciences, and Earth and Space Sciences. Using the developed benchmark datasets, we have conducted a comprehensive evaluation of over 50 representative open-source and closed source LLMs. All the results are publicly available and can be accessed online at www.scihorizon.cn/en.
中文: 本研究提出SciHorizon框架,通过从科学数据的四个维度和跨16个科学领域的50多个大语言模型能力两方面,全面评估AI4Science的发展水平,所有评估结果均在线公开。
English: This study introduces SciHorizon, a comprehensive framework for evaluating AI4Science by assessing both AI-ready scientific data across four dimensions and the capabilities of over 50 large language models across 16 scientific disciplines, with all results accessible online.

Authors:Dewei Zhou, Mingwei Li, Zongxin Yang, Yi Yang
Title: DreamRenderer: Taming Multi-Instance Attribute Control in Large-Scale Text-to-Image Models
Abstract:
Image-conditioned generation methods, such as depth- and canny-conditioned approaches, have demonstrated remarkable abilities for precise image synthesis. However, existing models still struggle to accurately control the content of multiple instances (or regions). Even state-of-the-art models like FLUX and 3DIS face challenges, such as attribute leakage between instances, which limits user control. To address these issues, we introduce DreamRenderer, a training-free approach built upon the FLUX model. DreamRenderer enables users to control the content of each instance via bounding boxes or masks, while ensuring overall visual harmony. We propose two key innovations: 1) Bridge Image Tokens for Hard Text Attribute Binding, which uses replicated image tokens as bridge tokens to ensure that T5 text embeddings, pre-trained solely on text data, bind the correct visual attributes for each instance during Joint Attention; 2) Hard Image Attribute Binding applied only to vital layers. Through our analysis of FLUX, we identify the critical layers responsible for instance attribute rendering and apply Hard Image Attribute Binding only in these layers, using soft binding in the others. This approach ensures precise control while preserving image quality. Evaluations on the COCO-POS and COCO-MIG benchmarks demonstrate that DreamRenderer improves the Image Success Ratio by 17.7% over FLUX and enhances the performance of layout-to-image models like GLIGEN and 3DIS by up to 26.8%. Project Page: https://limuloo.github.io/DreamRenderer/.
中文:DreamRenderer提出了一种无需训练的方法,通过桥接图像令牌和针对性硬属性绑定技术,有效提升了图像合成中对多实例内容的精确控制能力,显著优于现有模型的性能表现。
English: DreamRenderer introduces a training-free method that enhances multi-instance content control in image synthesis by using bridge image tokens and targeted hard attribute binding, significantly improving precision and performance over existing models.

Authors:Yilong Wu, Yifan Duan, Yuxi Chen, Xinran Zhang, Yedong Shen, Jianmin Ji, Yanyong Zhang, Lu Zhang
Title: MT-PCR: Leveraging Modality Transformation for Large-Scale Point Cloud Registration with Limited Overlap
Abstract:
Large-scale scene point cloud registration with limited overlap is a challenging task due to computational load and constrained data acquisition. To tackle these issues, we propose a point cloud registration method, MT-PCR, based on Modality Transformation. MT-PCR leverages a BEV capturing the maximal overlap information to improve the accuracy and utilizes images to provide complementary spatial features. Specifically, MT-PCR converts 3D point clouds to BEV images and eastimates correspondence by 2D image keypoints extraction and matching. Subsequently, the 2D correspondence estimates are then transformed back to 3D point clouds using inverse mapping. We have applied MT-PCR to Terrestrial Laser Scanning and Aerial Laser Scanning point cloud registration on the GrAco dataset, involving 8 low-overlap, square-kilometer scale registration scenarios. Experiments and comparisons with commonly used methods demonstrate that MT-PCR can achieve superior accuracy and robustness in large-scale scenes with limited overlap.
中文摘要:MT-PCR方法通过将3D点云转换为鸟瞰图进行二维特征匹配,再将对应关系逆向映射回三维空间,有效解决了大范围低重叠点云配准难题,在有限重叠场景中展现出卓越精度。
English Summary: The MT-PCR method addresses large-scale point cloud registration with limited overlap by converting 3D data to BEV images for 2D feature matching, then transforming correspondences back to 3D, achieving superior accuracy in low-overlap scenarios.

Authors:Zhuoyuan Mao, Mengjie Zhao, Qiyu Wu, Zhi Zhong, Wei-Hsiang Liao, Hiromi Wakaki, Yuki Mitsufuji
Title: Cross-Modal Learning for Music-to-Music-Video Description Generation
Abstract:
Music-to-music-video generation is a challenging task due to the intrinsic differences between the music and video modalities. The advent of powerful text-to-video diffusion models has opened a promising pathway for music-video (MV) generation by first addressing the music-to-MV description task and subsequently leveraging these models for video generation. In this study, we focus on the MV description generation task and propose a comprehensive pipeline encompassing training data construction and multimodal model fine-tuning. We fine-tune existing pre-trained multimodal models on our newly constructed music-to-MV description dataset based on the Music4All dataset, which integrates both musical and visual information. Our experimental results demonstrate that music representations can be effectively mapped to textual domains, enabling the generation of meaningful MV description directly from music inputs. We also identify key components in the dataset construction pipeline that critically impact the quality of MV description and highlight specific musical attributes that warrant greater focus for improved MV description generation.
中文: 本研究构建了一个音乐视频描述生成流程,通过在结合音乐与视觉信息的数据集上微调多模态模型,成功将音乐映射至文本领域,并识别出提升描述质量的关键要素。
English: This study develops a pipeline for generating music video descriptions by fine-tuning multimodal models on a dataset that links music with visual elements, effectively mapping music to text and identifying key factors for improving description quality.

Authors:Rong Du, Qingqing Ye, Yue Fu, Haibo Hu
Title: Privacy for Free: Leveraging Local Differential Privacy Perturbed Data from Multiple Services
Abstract:
Local Differential Privacy (LDP) has emerged as a widely adopted privacy-preserving technique in modern data analytics, enabling users to share statistical insights while maintaining robust privacy guarantees. However, current LDP applications assume a single service gathering perturbed information from users. In reality, multiple services may be interested in collecting users' data, which poses privacy burdens to users as more such services emerge. To address this issue, this paper proposes a framework for collecting and aggregating data based on perturbed information from multiple services, regardless of their estimated statistics (e.g., mean or distribution) and perturbation mechanisms. Then for mean estimation, we introduce the Unbiased Averaging (UA) method and its optimized version, User-level Weighted Averaging (UWA). The former utilizes biased perturbed data, while the latter assigns weights to different perturbed results based on perturbation information, thereby achieving minimal variance. For distribution estimation, we propose the User-level Likelihood Estimation (ULE), which treats all perturbed results from a user as a whole for maximum likelihood estimation. Experimental results demonstrate that our framework and constituting methods significantly improve the accuracy of both mean and distribution estimation.
中文摘要:本文提出了一种在本地差分隐私下从多个服务收集和聚合扰动数据的框架,并针对均值和分布估计提出了专门方法,显著提高了估计准确性。
English Summary: This paper introduces a framework for collecting and aggregating perturbed data from multiple services under Local Differential Privacy, proposing specialized methods for mean and distribution estimation that significantly enhance accuracy.

Authors:Xiaoming Shi, Zeming Liu, Yiming Lei, Chenkai Zhang, Haitao Leng, Chuan Wang, Qingjie Liu, Wanxiang Che, Shaoguo Liu, Size Li, Yunhong Wang
Title: KwaiChat: A Large-Scale Video-Driven Multilingual Mixed-Type Dialogue Corpus
Abstract:
Video-based dialogue systems, such as education assistants, have compelling application value, thereby garnering growing interest. However, the current video-based dialogue systems are limited by their reliance on a single dialogue type, which hinders their versatility in practical applications across a range of scenarios, including question-answering, emotional dialog, etc. In this paper, we identify this challenge as how to generate video-driven multilingual mixed-type dialogues. To mitigate this challenge, we propose a novel task and create a human-to-human video-driven multilingual mixed-type dialogue corpus, termed KwaiChat, containing a total of 93,209 videos and 246,080 dialogues, across 4 dialogue types, 30 domains, 4 languages, and 13 topics. Additionally, we establish baseline models on KwaiChat. An extensive analysis of 7 distinct LLMs on KwaiChat reveals that GPT-4o achieves the best performance but still cannot perform well in this situation even with the help of in-context learning and fine-tuning, which indicates that the task is not trivial and needs further research.
中文: 视频对话系统因依赖单一对话类型而限制了其多功能性,为此我们构建了KwaiChat语料库以解决多语言混合类型对话生成问题,其中GPT-4o表现最佳但仍需深入研究。
English: Video-based dialogue systems face limitations in versatility due to their reliance on a single dialogue type, prompting the creation of the KwaiChat corpus to address multilingual mixed-type dialogue generation, with GPT-4o showing the best performance but still requiring further research.

Authors:Ziyue Huang, Yongchao Feng, Shuai Yang, Ziqi Liu, Qingjie Liu, Yunhong Wang
Title: OpenRSD: Towards Open-prompts for Object Detection in Remote Sensing Images
Abstract:
Remote sensing object detection has made significant progress, but most studies still focus on closed-set detection, limiting generalization across diverse datasets. Open-vocabulary object detection (OVD) provides a solution by leveraging multimodal associations between text prompts and visual features. However, existing OVD methods for remote sensing (RS) images are constrained by small-scale datasets and fail to address the unique challenges of remote sensing interpretation, include oriented object detection and the need for both high precision and real-time performance in diverse scenarios. To tackle these challenges, we propose OpenRSD, a universal open-prompt RS object detection framework. OpenRSD supports multimodal prompts and integrates multi-task detection heads to balance accuracy and real-time requirements. Additionally, we design a multi-stage training pipeline to enhance the generalization of model. Evaluated on seven public datasets, OpenRSD demonstrates superior performance in oriented and horizontal bounding box detection, with real-time inference capabilities suitable for large-scale RS image analysis. Compared to YOLO-World, OpenRSD exhibits an 8.7\% higher average precision and achieves an inference speed of 20.8 FPS. Codes and models will be released.
中文:OpenRSD是一种创新的开放提示遥感目标检测框架,通过支持多模态提示和多任务检测头解决了现有方法的局限性,在多个数据集上实现了卓越的精度和实时性能。
English: OpenRSD is a novel open-prompt remote sensing object detection framework that addresses limitations in existing methods by supporting multimodal prompts and multi-task detection heads, achieving superior accuracy and real-time performance across diverse datasets.

Authors:Yaoru Li, Shunyu Liu, Tongya Zheng, Mingli Song
Title: Parallelized Planning-Acting for Efficient LLM-based Multi-Agent Systems
Abstract:
Recent advancements in Large Language Model(LLM)-based Multi-Agent Systems(MAS) have demonstrated remarkable potential for tackling complex decision-making tasks. However, existing frameworks inevitably rely on serialized execution paradigms, where agents must complete sequential LLM planning before taking action. This fundamental constraint severely limits real-time responsiveness and adaptation, which is crucial in dynamic environments with ever-changing scenarios. In this paper, we propose a novel parallelized planning-acting framework for LLM-based MAS, featuring a dual-thread architecture with interruptible execution to enable concurrent planning and acting. Specifically, our framework comprises two core threads:(1) a planning thread driven by a centralized memory system, maintaining synchronization of environmental states and agent communication to support dynamic decision-making; and (2) an acting thread equipped with a comprehensive skill library, enabling automated task execution through recursive decomposition. Extensive experiments on challenging Minecraft demonstrate the effectiveness of the proposed framework.
中文: 本文提出了一种基于大语言模型的多智能体系统并行规划-行动新框架,采用双线程架构与可中断执行机制,突破了传统串行执行的限制,有效提升了动态环境中的实时响应能力。
English: This paper introduces a novel parallelized planning-acting framework for LLM-based multi-agent systems, utilizing a dual-thread architecture with interruptible execution to overcome the limitations of traditional serialized approaches and enhance real-time responsiveness in dynamic environments.

Authors:Yixing Li, Ruobing Xie, Zhen Yang, Xingwu Sun, Shuaipeng Li, Weidong Han, Zhanhui Kang, Yu Cheng, Chengzhong Xu, Di Wang, Jie Jiang
Title: TransMamba: Flexibly Switching between Transformer and Mamba
Abstract:
Transformers are the cornerstone of modern large language models, but their quadratic computational complexity limits efficiency in long-sequence processing. Recent advancements in Mamba, a state space model (SSM) with linear complexity, offer promising efficiency gains but suffer from unstable contextual learning and multitask generalization. This paper proposes TransMamba, a novel framework that unifies Transformer and Mamba through shared parameter matrices (e.g., QKV and CBx), and thus could dynamically switch between attention and SSM mechanisms at different token lengths and layers. We design the Memory converter to bridge Transformer and Mamba by converting attention outputs into SSM-compatible states, ensuring seamless information flow at TransPoints where the transformation happens. The TransPoint scheduling is also thoroughly explored for further improvements. We conducted extensive experiments demonstrating that TransMamba achieves superior training efficiency and performance compared to baselines, and validated the deeper consistency between Transformer and Mamba paradigms, offering a scalable solution for next-generation sequence modeling.
中文摘要:TransMamba通过共享参数矩阵和动态切换机制,将Transformer与Mamba模型相统一,在序列建模中实现了更优的效率和性能,同时为两种架构建立了深度关联。
English Summary: TransMamba is a novel framework that unifies Transformer and Mamba models through shared parameters and dynamic switching mechanisms, achieving superior efficiency and performance in sequence modeling while bridging the two architectures.

Authors:Zhifan Ye, Yonggan Fu, Jingqun Zhang, Leshu Li, Yongan Zhang, Sixu Li, Cheng Wan, Chenxi Wan, Chaojian Li, Sreemanth Prathipati, Yingyan Celine Lin
Title: Gaussian Blending Unit: An Edge GPU Plug-in for Real-Time Gaussian-Based Rendering in AR/VR
Abstract:
The rapidly advancing field of Augmented and Virtual Reality (AR/VR) demands real-time, photorealistic rendering on resource-constrained platforms. 3D Gaussian Splatting, delivering state-of-the-art (SOTA) performance in rendering efficiency and quality, has emerged as a promising solution across a broad spectrum of AR/VR applications. However, despite its effectiveness on high-end GPUs, it struggles on edge systems like the Jetson Orin NX Edge GPU, achieving only 7-17 FPS -- well below the over 60 FPS standard required for truly immersive AR/VR experiences. Addressing this challenge, we perform a comprehensive analysis of Gaussian-based AR/VR applications and identify the Gaussian Blending Stage, which intensively calculates each Gaussian's contribution at every pixel, as the primary bottleneck. In response, we propose a Gaussian Blending Unit (GBU), an edge GPU plug-in module for real-time rendering in AR/VR applications. Notably, our GBU can be seamlessly integrated into conventional edge GPUs and collaboratively supports a wide range of AR/VR applications. Specifically, GBU incorporates an intra-row sequential shading (IRSS) dataflow that shades each row of pixels sequentially from left to right, utilizing a two-step coordinate transformation. When directly deployed on a GPU, the proposed dataflow achieved a non-trivial 1.72x speedup on real-world static scenes, though still falls short of real-time rendering performance. Recognizing the limited compute utilization in the GPU-based implementation, GBU enhances rendering speed with a dedicated rendering engine that balances the workload across rows by aggregating computations from multiple Gaussians. Experiments across representative AR/VR applications demonstrate that our GBU provides a unified solution for on-device real-time rendering while maintaining SOTA rendering quality.
中文: 针对增强与虚拟现实应用中实时渲染的瓶颈,提出的高斯混合单元作为边缘GPU插件,在保持顶尖渲染质量的同时实现了设备端的统一实时渲染解决方案。
English: The Gaussian Blending Unit (GBU) is proposed as an edge GPU plug-in to overcome the bottleneck of real-time photorealistic rendering in AR/VR applications, achieving unified on-device performance while maintaining state-of-the-art quality.

Authors:Shangyi Shi, Husheng Han, Jianan Mu, Xinyao Zheng, Ling Liang, Hang Lu, Zidong Du, Xiaowei Li, Xing Hu, Qi Guo
Title: FlexMem: High-Parallel Near-Memory Architecture for Flexible Dataflow in Fully Homomorphic Encryption
Abstract:
Fully Homomorphic Encryption (FHE) imposes substantial memory bandwidth demands, presenting significant challenges for efficient hardware acceleration. Near-memory Processing (NMP) has emerged as a promising architectural solution to alleviate the memory bottleneck. However, the irregular memory access patterns and flexible dataflows inherent to FHE limit the effectiveness of existing NMP accelerators, which fail to fully utilize the available near-memory bandwidth. In this work, we propose FlexMem, a near-memory accelerator featuring high-parallel computational units with varying memory access strides and interconnect topologies to effectively handle irregular memory access patterns. Furthermore, we design polynomial and ciphertext-level dataflows to efficiently utilize near-memory bandwidth under varying degrees of polynomial parallelism and enhance parallel performance. Experimental results demonstrate that FlexMem achieves 1.12 times of performance improvement over state-of-the-art near-memory architectures, with 95.7% of near-memory bandwidth utilization.
Chinese: FlexMem作为一种近内存加速器,通过可配置计算单元和优化数据流有效处理全同态加密中的不规则内存访问模式,相比现有架构实现了1.12倍的性能提升和95.7%的带宽利用率。
English: FlexMem is a near-memory accelerator that effectively handles irregular memory access patterns in Fully Homomorphic Encryption through configurable computational units and optimized dataflows, achieving 1.12× performance improvement and 95.7% bandwidth utilization over existing architectures.

Authors:Sasindu Wijeratne, Rajgopal Kannan, Viktor Prasanna
Title: Accelerating Sparse MTTKRP for Small Tensor Decomposition on GPU
Abstract:
Sparse Matricized Tensor Times Khatri-Rao Product (spMTTKRP) is the bottleneck kernel of sparse tensor decomposition. In tensor decomposition, spMTTKRP is performed iteratively along all the modes of an input tensor. In this work, we propose a mode-specific tensor layout on GPU that uses multiple tensor copies, where each copy is optimized for a specific mode. The proposed tensor layout increases the data locality of external memory accesses and eliminates the intermediate values communicated between the GPU thread blocks and the GPU global memory. We also propose a tensor partitioning scheme to optimally distribute the total computations among GPU streaming multiprocessors based on the sparsity and the dimensions of the input tensor. Our approach achieves a geometric mean speedup of 2.4x, 7.9x, and 8.9x in total execution time compared with the state-of-the-art GPU baselines.
中文: 本研究提出了一种针对GPU优化的模式特定张量布局和分区方案,显著提升了稀疏张量分解中的数据局部性和计算效率,相比现有技术实现了大幅加速。
English: This work introduces a mode-specific tensor layout and partitioning scheme for GPU-accelerated sparse tensor decomposition, significantly enhancing data locality and computational efficiency to achieve substantial speedups over existing methods.

Authors:Yingmao Miao, Zhanpeng Huang, Rui Han, Zibin Wang, Chenhao Lin, Chao Shen
Title: Shining Yourself: High-Fidelity Ornaments Virtual Try-on with Diffusion Model
Abstract:
While virtual try-on for clothes and shoes with diffusion models has gained attraction, virtual try-on for ornaments, such as bracelets, rings, earrings, and necklaces, remains largely unexplored. Due to the intricate tiny patterns and repeated geometric sub-structures in most ornaments, it is much more difficult to guarantee identity and appearance consistency under large pose and scale variances between ornaments and models. This paper proposes the task of virtual try-on for ornaments and presents a method to improve the geometric and appearance preservation of ornament virtual try-ons. Specifically, we estimate an accurate wearing mask to improve the alignments between ornaments and models in an iterative scheme alongside the denoising process. To preserve structure details, we further regularize attention layers to map the reference ornament mask to the wearing mask in an implicit way. Experimental results demonstrate that our method successfully wears ornaments from reference images onto target models, handling substantial differences in scale and pose while preserving identity and achieving realistic visual effects.
中文: 本文提出了一种新颖的饰品虚拟试戴方法,通过迭代式掩码估计和注意力正则化来提升几何对齐与细节保持能力,在应对显著姿态和尺度变化的同时有效维持了饰品的真实外观一致性。
English: This paper introduces a novel method for virtual try-on of ornaments that enhances geometric alignment and detail preservation through iterative mask estimation and attention regularization, effectively handling significant pose and scale variations while maintaining realistic appearance.

Authors:Lin-Han Jia, Lan-Zhe Guo, Zhi Zhou, Si-Ye Han, Zi-Wen Li, Yu-Feng Li
Title: Detecting Scarce and Sparse Anomalous: Solving Dual Imbalance in Multi-Instance Learning
Abstract:
In real-world applications, it is highly challenging to detect anomalous samples with extremely sparse anomalies, as they are highly similar to and thus easily confused with normal samples. Moreover, the number of anomalous samples is inherently scarce. This results in a dual imbalance Multi-Instance Learning (MIL) problem, manifesting at both the macro and micro levels. To address this "needle-in-a-haystack problem", we find that MIL problem can be reformulated as a fine-grained PU learning problem. This allows us to address the imbalance issue in an unbiased manner using micro-level balancing mechanisms. To this end, we propose a novel framework, Balanced Fine-Grained Positive-Unlabeled (BFGPU)-based on rigorous theoretical foundations. Extensive experiments on both synthetic and real-world datasets demonstrate the effectiveness of BFGPU.
中文: 本研究通过将多示例学习重新定义为细粒度PU学习问题,提出了基于严格理论基础的BFGPU框架,利用微观平衡机制有效解决了异常样本稀缺的双重不平衡问题,并在合成和真实数据集上验证了其有效性。
English: The study tackles the dual imbalance in Multi-Instance Learning by reframing it as a fine-grained PU learning problem and introduces the BFGPU framework, which effectively addresses the scarcity of anomalies through micro-level balancing mechanisms, as validated by experiments on synthetic and real-world datasets.

Authors:Lin-Han Jia, Wen-Chao Hu, Jie-Jing Shao, Lan-Zhe Guo, Yu-Feng Li
Title: Verification Learning: Make Unsupervised Neuro-Symbolic System Feasible
Abstract:
The current Neuro-Symbolic (NeSy) Learning paradigm suffers from an over-reliance on labeled data, so if we completely disregard labels, it leads to less symbol information, a larger solution space, and more shortcuts-issues that current Nesy systems cannot resolve. This paper introduces a novel learning paradigm, Verification Learning (VL), which addresses this challenge by transforming the label-based reasoning process in Nesy into a label-free verification process. VL achieves excellent learning results solely by relying on unlabeled data and a function that verifies whether the current predictions conform to the rules. We formalize this problem as a Constraint Optimization Problem (COP) and propose a Dynamic Combinatorial Sorting (DCS) algorithm that accelerates the solution by reducing verification attempts, effectively lowering computational costs and introduce a prior alignment method to address potential shortcuts. Our theoretical analysis points out which tasks in Nesy systems can be completed without labels and explains why rules can replace infinite labels for some tasks, while for others the rules have no effect. We validate the proposed framework through several fully unsupervised tasks including addition, sort, match, and chess, each showing significant performance and efficiency improvements.
中文: 本文提出验证学习(VL)这一新型神经符号学习范式,通过将基于标签的推理过程转化为无标签验证过程,仅依赖未标注数据和规则验证函数,以约束优化问题形式化并采用动态组合排序算法提升效率、避免捷径,在多项无监督任务中实现显著性能提升。
English: This paper introduces Verification Learning (VL), a novel neuro-symbolic paradigm that replaces label-based reasoning with label-free verification using unlabeled data and rule-checking functions, formalized as a constraint optimization problem and enhanced with a dynamic combinatorial sorting algorithm to improve efficiency and prevent shortcuts across various tasks.

Authors:Chenhao Lin, Chenyang Zhao, Shiwei Wang, Longtian Wang, Chao Shen, Zhengyu Zhao
Title: Revisiting Training-Inference Trigger Intensity in Backdoor Attacks
Abstract:
Backdoor attacks typically place a specific trigger on certain training data, such that the model makes prediction errors on inputs with that trigger during inference. Despite the core role of the trigger, existing studies have commonly believed a perfect match between training-inference triggers is optimal. In this paper, for the first time, we systematically explore the training-inference trigger relation, particularly focusing on their mismatch, based on a Training-Inference Trigger Intensity Manipulation (TITIM) workflow. TITIM specifically investigates the training-inference trigger intensity, such as the size or the opacity of a trigger, and reveals new insights into trigger generalization and overfitting. These new insights challenge the above common belief by demonstrating that the training-inference trigger mismatch can facilitate attacks in two practical scenarios, posing more significant security threats than previously thought. First, when the inference trigger is fixed, using training triggers with mixed intensities leads to stronger attacks than using any single intensity. For example, on CIFAR-10 with ResNet-18, mixing training triggers with 1.0 and 0.1 opacities improves the worst-case attack success rate (ASR) (over different testing opacities) of the best single-opacity attack from 10.61\% to 92.77\%. Second, intentionally using certain mismatched training-inference triggers can improve the attack stealthiness, i.e., better bypassing defenses. For example, compared to the training/inference intensity of 1.0/1.0, using 1.0/0.7 decreases the area under the curve (AUC) of the Scale-Up defense from 0.96 to 0.62, while maintaining a high attack ASR (99.65\% vs. 91.62\%). The above new insights are validated to be generalizable across different backdoor attacks, models, datasets, tasks, and (digital/physical) domains.
中文摘要:本研究通过系统实验挑战了传统观念,证明后门攻击中故意使用不匹配的训练-推理触发器反而能显著提升攻击效果和隐蔽性,而非追求完美的触发器对齐。
English Summary: This study challenges the conventional belief that perfect training-inference trigger alignment is optimal for backdoor attacks, demonstrating through systematic experiments that intentional trigger mismatches can significantly enhance both attack effectiveness and stealthiness across various scenarios.

Authors:HyunJin Kim, Xiaoyuan Yi, Jing Yao, Muhua Huang, JinYeong Bak, James Evans, Xing Xie
Title: Research on Superalignment Should Advance Now with Parallel Optimization of Competence and Conformity
Abstract:
The recent leap in AI capabilities, driven by big generative models, has sparked the possibility of achieving Artificial General Intelligence (AGI) and further triggered discussions on Artificial Superintelligence (ASI), a system surpassing all humans across all domains. This gives rise to the critical research question of: If we realize ASI, how do we align it with human values, ensuring it benefits rather than harms human society, a.k.a., the Superalignment problem. Despite ASI being regarded by many as solely a hypothetical concept, in this paper, we argue that superalignment is achievable and research on it should advance immediately, through simultaneous and alternating optimization of task competence and value conformity. We posit that superalignment is not merely a safeguard for ASI but also necessary for its realization. To support this position, we first provide a formal definition of superalignment rooted in the gap between capability and capacity and elaborate on our argument. Then we review existing paradigms, explore their interconnections and limitations, and illustrate a potential path to superalignment centered on two fundamental principles. We hope this work sheds light on a practical approach for developing the value-aligned next-generation AI, garnering greater benefits and reducing potential harms for humanity.
中文摘要:本文认为实现超级对齐——确保人工超级智能与人类价值观一致——既是可行的也是必要的,并提出通过任务能力与价值遵从的双重优化作为切实可行的实现路径。
English Summary: This paper argues that achieving superalignment—ensuring artificial superintelligence aligns with human values—is both feasible and essential, proposing a dual optimization approach for task competence and value conformity as a practical path forward.

Authors:Yujie Wei, Shiwei Zhang, Hangjie Yuan, Biao Gong, Longxiang Tang, Xiang Wang, Haonan Qiu, Hengjia Li, Shuai Tan, Yingya Zhang, Hongming Shan
Title: DreamRelation: Relation-Centric Video Customization
Abstract:
Relational video customization refers to the creation of personalized videos that depict user-specified relations between two subjects, a crucial task for comprehending real-world visual content. While existing methods can personalize subject appearances and motions, they still struggle with complex relational video customization, where precise relational modeling and high generalization across subject categories are essential. The primary challenge arises from the intricate spatial arrangements, layout variations, and nuanced temporal dynamics inherent in relations; consequently, current models tend to overemphasize irrelevant visual details rather than capturing meaningful interactions. To address these challenges, we propose DreamRelation, a novel approach that personalizes relations through a small set of exemplar videos, leveraging two key components: Relational Decoupling Learning and Relational Dynamics Enhancement. First, in Relational Decoupling Learning, we disentangle relations from subject appearances using relation LoRA triplet and hybrid mask training strategy, ensuring better generalization across diverse relationships. Furthermore, we determine the optimal design of relation LoRA triplet by analyzing the distinct roles of the query, key, and value features within MM-DiT's attention mechanism, making DreamRelation the first relational video generation framework with explainable components. Second, in Relational Dynamics Enhancement, we introduce space-time relational contrastive loss, which prioritizes relational dynamics while minimizing the reliance on detailed subject appearances. Extensive experiments demonstrate that DreamRelation outperforms state-of-the-art methods in relational video customization. Code and models will be made publicly available.
Chinese: DreamRelation通过关系解耦学习和关系动态增强的新方法,实现了优于现有技术的关系视频定制性能。
English: DreamRelation introduces a novel approach for relational video customization by decoupling relations from subject appearances and enhancing relational dynamics, achieving superior performance over existing methods.

Authors:Long Peng, Anran Wu, Wenbo Li, Peizhe Xia, Xueyuan Dai, Xinjie Zhang, Xin Di, Haoze Sun, Renjing Pei, Yang Wang, Yang Cao, Zheng-Jun Zha
Title: Pixel to Gaussian: Ultra-Fast Continuous Super-Resolution with 2D Gaussian Modeling
Abstract:
Arbitrary-scale super-resolution (ASSR) aims to reconstruct high-resolution (HR) images from low-resolution (LR) inputs with arbitrary upsampling factors using a single model, addressing the limitations of traditional SR methods constrained to fixed-scale factors (\textit{e.g.}, $\times$ 2). Recent advances leveraging implicit neural representation (INR) have achieved great progress by modeling coordinate-to-pixel mappings. However, the efficiency of these methods may suffer from repeated upsampling and decoding, while their reconstruction fidelity and quality are constrained by the intrinsic representational limitations of coordinate-based functions. To address these challenges, we propose a novel ContinuousSR framework with a Pixel-to-Gaussian paradigm, which explicitly reconstructs 2D continuous HR signals from LR images using Gaussian Splatting. This approach eliminates the need for time-consuming upsampling and decoding, enabling extremely fast arbitrary-scale super-resolution. Once the Gaussian field is built in a single pass, ContinuousSR can perform arbitrary-scale rendering in just 1ms per scale. Our method introduces several key innovations. Through statistical ana
中文摘要:提出的ContinuousSR框架采用高斯泼溅技术,通过从低分辨率输入直接重建连续高分辨率信号,实现了快速任意尺度图像超分辨率,无需重复上采样过程。
English Summary: The proposed ContinuousSR framework uses Gaussian Splatting to achieve fast arbitrary-scale image super-resolution by directly reconstructing continuous high-resolution signals from low-resolution inputs, eliminating repetitive upsampling processes.

Authors:Francesco Daghero, Daniele Jahier Pagliari, Francesco Conti, Luca Benini, Massimo Poncino, Alessio Burrello
Title: Lightweight Software Kernels and Hardware Extensions for Efficient Sparse Deep Neural Networks on Microcontrollers
Abstract:
The acceleration of pruned Deep Neural Networks (DNNs) on edge devices such as Microcontrollers (MCUs) is a challenging task, given the tight area- and power-constraints of these devices. In this work, we propose a three-fold contribution to address this problem. First, we design a set of optimized software kernels for N:M pruned layers, targeting ultra-low-power, multicore RISC-V MCUs, which are up to 2.1x and 3.4x faster than their dense counterparts at 1:8 and 1:16 sparsity, respectively. Then, we implement a lightweight Instruction-Set Architecture (ISA) extension to accelerate the indirect load and non-zero indices decompression operations required by our kernels, obtaining up to 1.9x extra speedup, at the cost of a 5% area overhead. Lastly, we extend an open-source DNN compiler to utilize our sparse kernels for complete networks, showing speedups of 3.21x and 1.81x on a ResNet18 and a Vision Transformer (ViT), with less than 1.5% accuracy drop compared to a dense baseline.
中文: 本研究提出优化的软件内核和轻量级ISA扩展,用于在RISC-V微控制器上加速剪枝深度神经网络,实现了显著的速度提升,同时仅带来微小面积开销和精度损失。
English: This study introduces optimized software kernels and a lightweight ISA extension to accelerate pruned DNNs on RISC-V MCUs, achieving significant speed improvements with minimal area overhead and accuracy loss.

Authors:Hongjia Zhai, Boming Zhao, Hai Li, Xiaokun Pan, Yijia He, Zhaopeng Cui, Hujun Bao, Guofeng Zhang
Title: NeuraLoc: Visual Localization in Neural Implicit Map with Dual Complementary Features
Abstract:
Recently, neural radiance fields (NeRF) have gained significant attention in the field of visual localization. However, existing NeRF-based approaches either lack geometric constraints or require extensive storage for feature matching, limiting their practical applications. To address these challenges, we propose an efficient and novel visual localization approach based on the neural implicit map with complementary features. Specifically, to enforce geometric constraints and reduce storage requirements, we implicitly learn a 3D keypoint descriptor field, avoiding the need to explicitly store point-wise features. To further address the semantic ambiguity of descriptors, we introduce additional semantic contextual feature fields, which enhance the quality and reliability of 2D-3D correspondences. Besides, we propose descriptor similarity distribution alignment to minimize the domain gap between 2D and 3D feature spaces during matching. Finally, we construct the matching graph using both complementary descriptors and contextual features to establish accurate 2D-3D correspondences for 6-DoF pose estimation. Compared with the recent NeRF-based approaches, our method achieves a 3$\times$ faster training speed and a 45$\times$ reduction in model storage. Extensive experiments on two widely used datasets demonstrate that our approach outperforms or is highly competitive with other state-of-the-art NeRF-based visual localization methods. Project page: \href{https://zju3dv.github.io/neuraloc}{https://zju3dv.github.io/neuraloc}
Chinese: 本文提出了一种基于神经隐式地图与互补特征的高效视觉定位方法,通过结合几何约束和语义上下文特征,实现了3倍训练加速和45倍存储缩减,性能优于现有基于NeRF的方法。
English: This paper introduces an efficient visual localization method using neural implicit maps with complementary features, which accelerates training by 3 times, reduces storage by 45 times, and outperforms existing NeRF-based approaches by integrating geometric constraints and semantic contextual features.

Authors:Mattia Sinigaglia, Amirhossein Kiamarzi, Marco Bertuletti, Luigi Ghionda, Mattia Orlandi, Riccardo Tedeschi, Aurora Di Giampietro, Yvan Tortorella, Luca Bertaccini, Simone Benatti, Giuseppe Tagliavini, Luca Benini, Francesco Conti, Davide Rossi
Title: Maestro: A 302 GFLOPS/W and 19.8GFLOPS RISC-V Vector-Tensor Architecture for Wearable Ultrasound Edge Computing
Abstract:
Most Wearable Ultrasound (WUS) devices lack the computational power to process signals at the edge, instead relying on remote offload, which introduces latency, high power consumption, and privacy concerns. We present Maestro, a RISC-V SoC with unified Vector-Tensor Unit (VTU) and memory-coupled Fast Fourier Transform (FFT) accelerators targeting edge processing for wearable ultrasound devices, fabricated using low-cost TSMC 65nm CMOS technology. The VTU achieves peak 302GFLOPS/W and 19.8GFLOPS at FP16, while the multi-precision 16/32-bit floating-point FFT accelerator delivers peak 60.6GFLOPS/W and 3.6GFLOPS at FP16, We evaluate Maestro on a US-based gesture recognition task, achieving 1.62GFLOPS in signal processing at 26.68GFLOPS/W, and 19.52GFLOPS in Convolutional Neural Network (CNN) workloads at 298.03GFLOPS/W. Compared to a state-of-the-art SoC with a similar mission profile, Maestro achieves a 5x speedup while consuming only 12mW, with an energy consumption of 2.5mJ in a wearable US channel preprocessing and ML-based postprocessing pipeline.
中文: Maestro是一款集成向量张量和FFT加速器的RISC-V片上系统,专为可穿戴超声设备边缘计算设计,在仅12mW功耗下实现比现有方案快5倍的性能表现。
English: Maestro is a RISC-V SoC with integrated vector-tensor and FFT accelerators designed for wearable ultrasound edge processing, delivering up to 5x faster performance at just 12mW power consumption compared to existing solutions.

Authors:Yihao Huang, Xin Luo, Qing Guo, Felix Juefei-Xu, Xiaojun Jia, Weikai Miao, Geguang Pu, Yang Liu
Title: Scale-Invariant Adversarial Attack against Arbitrary-scale Super-resolution
Abstract:
The advent of local continuous image function (LIIF) has garnered significant attention for arbitrary-scale super-resolution (SR) techniques. However, while the vulnerabilities of fixed-scale SR have been assessed, the robustness of continuous representation-based arbitrary-scale SR against adversarial attacks remains an area warranting further exploration. The elaborately designed adversarial attacks for fixed-scale SR are scale-dependent, which will cause time-consuming and memory-consuming problems when applied to arbitrary-scale SR. To address this concern, we propose a simple yet effective ``scale-invariant'' SR adversarial attack method with good transferability, termed SIAGT. Specifically, we propose to construct resource-saving attacks by exploiting finite discrete points of continuous representation. In addition, we formulate a coordinate-dependent loss to enhance the cross-model transferability of the attack. The attack can significantly deteriorate the SR images while introducing imperceptible distortion to the targeted low-resolution (LR) images. Experiments carried out on three popular LIIF-based SR approaches and four classical SR datasets show remarkable attack performance and transferability of SIAGT.
Chinese: 本文提出SIAGT,一种尺度不变的对抗攻击方法,通过利用连续表示和坐标相关损失,有效针对任意尺度超分辨率模型,在多个模型和数据集上展现出卓越的攻击性能和迁移能力。
English: This paper introduces SIAGT, a scale-invariant adversarial attack method that efficiently targets arbitrary-scale super-resolution models by exploiting continuous representations and a coordinate-dependent loss, demonstrating strong performance and transferability across multiple models and datasets.

Authors:Iraklis Premptis, Maria Lymperaiou, Giorgos Filandrianos, Orfeas Menis Mastromichalakis, Athanasios Voulodimos, Giorgos Stamou
Title: AILS-NTUA at SemEval-2025 Task 4: Parameter-Efficient Unlearning for Large Language Models using Data Chunking
Abstract:
The Unlearning Sensitive Content from Large Language Models task aims to remove targeted datapoints from trained models while minimally affecting their general knowledge. In our work, we leverage parameter-efficient, gradient-based unlearning using low-rank (LoRA) adaptation and layer-focused fine-tuning. To further enhance unlearning effectiveness, we employ data chunking, splitting forget data into disjoint partitions and merging them with cyclically sampled retain samples at a pre-defined ratio. Our task-agnostic method achieves an outstanding forget-retain balance, ranking first on leaderboards and significantly outperforming baselines and competing systems.
中文摘要:本研究提出一种参数高效的遗忘方法,通过LoRA适配和分层微调,结合数据分块与循环采样技术,在消除敏感数据的同时出色保持了模型性能,显著优于现有方案。
English Summary: This study introduces a parameter-efficient unlearning method using LoRA adaptation and layer-focused fine-tuning, enhanced by data chunking and cyclic sampling, which achieves superior performance in removing sensitive data while preserving model knowledge.

Authors:Dimitra Karkani, Maria Lymperaiou, Giorgos Filandrianos, Nikolaos Spanos, Athanasios Voulodimos, Giorgos Stamou
Title: AILS-NTUA at SemEval-2025 Task 3: Leveraging Large Language Models and Translation Strategies for Multilingual Hallucination Detection
Abstract:
Multilingual hallucination detection stands as an underexplored challenge, which the Mu-SHROOM shared task seeks to address. In this work, we propose an efficient, training-free LLM prompting strategy that enhances detection by translating multilingual text spans into English. Our approach achieves competitive rankings across multiple languages, securing two first positions in low-resource languages. The consistency of our results highlights the effectiveness of our translation strategy for hallucination detection, demonstrating its applicability regardless of the source language.
Chinese: 本研究提出一种无需训练的提示方法,通过将多语言文本翻译成英语,在多种语言的幻觉检测中取得了领先成果,尤其在低资源语言中表现突出。
English: The study introduces a training-free prompting method that translates multilingual text into English, achieving top results in hallucination detection across various languages, especially in low-resource settings.

Authors:Yingji Zhong, Kaichen Zhou, Zhihao Li, Lanqing Hong, Zhenguo Li, Dan Xu
Title: Empowering Sparse-Input Neural Radiance Fields with Dual-Level Semantic Guidance from Dense Novel Views
Abstract:
Neural Radiance Fields (NeRF) have shown remarkable capabilities for photorealistic novel view synthesis. One major deficiency of NeRF is that dense inputs are typically required, and the rendering quality will drop drastically given sparse inputs. In this paper, we highlight the effectiveness of rendered semantics from dense novel views, and show that rendered semantics can be treated as a more robust form of augmented data than rendered RGB. Our method enhances NeRF's performance by incorporating guidance derived from the rendered semantics. The rendered semantic guidance encompasses two levels: the supervision level and the feature level. The supervision-level guidance incorporates a bi-directional verification module that decides the validity of each rendered semantic label, while the feature-level guidance integrates a learnable codebook that encodes semantic-aware information, which is queried by each point via the attention mechanism to obtain semantic-relevant predictions. The overall semantic guidance is embedded into a self-improved pipeline. We also introduce a more challenging sparse-input indoor benchmark, where the number of inputs is limited to as few as 6. Experiments demonstrate the effectiveness of our method and it exhibits superior performance compared to existing approaches.
中文: 本文提出一种方法,通过引入渲染密集视图的语义指导来增强神经辐射场(NeRF)在稀疏输入下的性能,结合监督级和特征级指导提升鲁棒性,并在更具挑战性的室内基准测试中取得优越效果。
English: This paper introduces a method that enhances Neural Radiance Fields (NeRF) performance under sparse inputs by incorporating semantic guidance from rendered dense views, utilizing both supervision and feature levels to improve robustness and achieve superior results on a new challenging indoor benchmark.

Authors:Wenxin Zhao, Fangyu Yu, Peng Zhang, Hansu Gu, Lin Wang, Siyuan Qiao, Tun Lu, Ning Gu
Title: YouthCare: Building a Personalized Collaborative Video Censorship Tool to Support Parent-Child Joint Media Engagement
Abstract:
To mitigate the negative impacts of online videos on teenagers, existing research and platforms have implemented various parental mediation mechanisms, such as Parent-Child Joint Media Engagement (JME). However, JME generally relies heavily on parents' time, knowledge, and experience. To fill this gap, we aim to design an automatic tool to help parents/children censor videos more effectively and efficiently in JME. For this goal, we first conducted a formative study to identify the needs and expectations of teenagers and parents for such a system. Based on the findings, we designed YouthCare, a personalized collaborative video censorship tool that supports parents and children to collaboratively filter out inappropriate content and select appropriate content in JME. An evaluation with 10 parent-child pairs demonstrated YouthCare's several strengths in supporting video censorship, while also highlighting some potential problems. These findings inspire us to propose several insights for the future design of parent-child collaborative JME systems.
中文摘要:为弥补亲子共同媒介参与中家长时间与经验的不足,本研究开发了YouthCare自动化工具,通过协同过滤机制帮助家庭管理视频内容,实验验证其有效性的同时为未来设计提供了改进方向。
English Summary: To address the limitations of time and expertise in Parent-Child Joint Media Engagement, this study designed YouthCare, an automated tool that helps families collaboratively filter online video content, which was evaluated positively while revealing areas for future improvement.

Authors:Maria Lymperaiou, Giorgos Stamou
Title: Conceptual Contrastive Edits in Textual and Vision-Language Retrieval
Abstract:
As deep learning models grow in complexity, achieving model-agnostic interpretability becomes increasingly vital. In this work, we employ post-hoc conceptual contrastive edits to expose noteworthy patterns and biases imprinted in representations of retrieval models. We systematically design optimal and controllable contrastive interventions targeting various parts of speech, and effectively apply them to explain both linguistic and visiolinguistic pre-trained models in a black-box manner. Additionally, we introduce a novel metric to assess the per-word impact of contrastive interventions on model outcomes, providing a comprehensive evaluation of each intervention's effectiveness.
中文摘要:本研究采用事后概念对比编辑方法,揭示检索模型中隐含的模式与偏见,通过系统性干预措施和新评估指标,全面衡量干预对模型输出的影响。
English Summary: This study introduces post-hoc conceptual contrastive edits to uncover patterns and biases in retrieval models, employing systematic interventions and a novel metric for evaluating their impact on model outputs.

Authors:Xuechen Zhang, Changyang He, Peng Zhang, Hansu Gu, Ning Gu, Qi Shen, Zhan Hu, Tun Lu
Title: RemiHaven: Integrating "In-Town" and "Out-of-Town" Peers to Provide Personalized Reminiscence Support for Older Drifters
Abstract:
With increasing social mobility and an aging society, more older adults in China are migrating to new cities, known as "older drifters." Due to fewer social connections and cultural adaptation challenges, they face negative emotions such as loneliness and depression. While reminiscence-based interventions have been used to improve older adults' psychological well-being, challenges such as the lack of tangible materials and limited social resources constrain the feasibility of traditional reminiscence approaches for older drifters. To address this challenge, we designed RemiHaven, a personalized reminiscence support tool based on a two-phase formative study. It integrates "In-Town" and "Out-of-Town" peer agents to enhance personalization, engagement, and emotional resonance in the reminiscence process, powered by Multimodal Large Language Models (MLLMs). Our evaluations show RemiHaven's strengths in supporting reminiscence while identifying potential challenges. We conclude by offering insights for the future design of reminiscence support tools for older migrants.
Chinese: 为解决中国老年漂群体的孤独抑郁问题,RemiHaven这一基于多模态大语言模型和同伴代理的个性化怀旧工具被开发出来,并通过怀旧过程被证明能有效支持情感健康。
English: To address the loneliness and depression faced by older drifters in China, RemiHaven, a personalized reminiscence tool using multimodal large language models and peer agents, was developed and shown to effectively support emotional well-being through reminiscence.

Authors:Jinhong Wang, Jintai Chen, Jian Liu, Dongqi Tang, Danny Z. Chen, Jian Wu
Title: A Survey on Ordinal Regression: Applications, Advances and Prospects
Abstract:
Ordinal regression refers to classifying object instances into ordinal categories. Ordinal regression is crucial for applications in various areas like facial age estimation, image aesthetics assessment, and even cancer staging, due to its capability to utilize ordered information effectively. More importantly, it also enhances model interpretation by considering category order, aiding the understanding of data trends and causal relationships. Despite significant recent progress, challenges remain, and further investigation of ordinal regression techniques and applications is essential to guide future research. In this survey, we present a comprehensive examination of advances and applications of ordinal regression. By introducing a systematic taxonomy, we meticulously classify the pertinent techniques and applications into three well-defined categories based on different strategies and objectives: Continuous Space Discretization, Distribution Ordering Learning, and Ambiguous Instance Delving. This categorization enables a structured exploration of diverse insights in ordinal regression problems, providing a framework for a more comprehensive understanding and evaluation of this field and its related applications. To our best knowledge, this is the first systematic survey of ordinal regression, which lays a foundation for future research in this fundamental and generic domain.
中文摘要:本综述系统梳理了序数回归的研究进展,通过建立三大技术分类框架,为该基础领域的深入研究和应用发展奠定了重要基础。
English Summary: This survey provides a comprehensive review of ordinal regression, categorizing techniques into three strategies to enhance understanding and guide future research in this fundamental field.

Authors:Maria Lymperaiou, Giorgos Filandrianos, Angeliki Dimitriou, Athanasios Voulodimos, Giorgos Stamou
Title: HalCECE: A Framework for Explainable Hallucination Detection through Conceptual Counterfactuals in Image Captioning
Abstract:
In the dynamic landscape of artificial intelligence, the exploration of hallucinations within vision-language (VL) models emerges as a critical frontier. This work delves into the intricacies of hallucinatory phenomena exhibited by widely used image captioners, unraveling interesting patterns. Specifically, we step upon previously introduced techniques of conceptual counterfactual explanations to address VL hallucinations. The deterministic and efficient nature of the employed conceptual counterfactuals backbone is able to suggest semantically minimal edits driven by hierarchical knowledge, so that the transition from a hallucinated caption to a non-hallucinated one is performed in a black-box manner. HalCECE, our proposed hallucination detection framework is highly interpretable, by providing semantically meaningful edits apart from standalone numbers, while the hierarchical decomposition of hallucinated concepts leads to a thorough hallucination analysis. Another novelty tied to the current work is the investigation of role hallucinations, being one of the first works to involve interconnections between visual concepts in hallucination detection. Overall, HalCECE recommends an explainable direction to the crucial field of VL hallucination detection, thus fostering trustworthy evaluation of current and future VL systems.
中文: 本研究提出HalCECE框架,通过分层概念反事实方法对视觉语言模型中的幻觉现象进行可解释检测与语义修正,为可信人工智能评估开辟了新方向。
English: This study introduces HalCECE, an interpretable framework that uses hierarchical conceptual counterfactuals to detect and correct hallucinations in vision-language models through semantic edits, advancing trustworthy AI evaluation.

Authors:Andreas Evangelatos, Giorgos Filandrianos, Maria Lymperaiou, Athanasios Voulodimos, Giorgos Stamou
Title: AILS-NTUA at SemEval-2025 Task 8: Language-to-Code prompting and Error Fixing for Tabular Question Answering
Abstract:
In this paper, we present our submission to SemEval-2025 Task 8: Question Answering over Tabular Data. This task, evaluated on the DataBench dataset, assesses Large Language Models' (LLMs) ability to answer natural language questions over structured data while addressing topic diversity and table size limitations in previous benchmarks. We propose a system that employs effective LLM prompting to translate natural language queries into executable code, enabling accurate responses, error correction, and interpretability. Our approach ranks first in both subtasks of the competition in the proprietary model category, significantly outperforming the organizer's baseline.
中文摘要:我们的系统在SemEval-2025表格问答任务中荣获第一名,通过先进的LLM提示技术将自然语言查询转化为可执行代码,实现了精准回答并具备良好可解释性。
English Summary: Our system won first place in SemEval-2025's table-based question answering task by using advanced LLM prompting to convert natural language queries into executable code for accurate and interpretable responses.

Authors:Yewei Song, Lujun Li, Cedric Lothritz, Saad Ezzini, Lama Sleem, Niccolo Gentile, Radu State, Tegawendé F. Bissyandé, Jacques Klein
Title: Is Small Language Model the Silver Bullet to Low-Resource Languages Machine Translation?
Abstract:
Low-resource languages (LRLs) lack sufficient linguistic resources and are underrepresented in benchmark datasets, resulting in persistently lower translation quality than high-resource languages, especially in privacy-sensitive and resource-limited contexts. Firstly, this study systematically evaluates state-of-the-art smaller Large Language Models in 200 languages using the FLORES-200 benchmark, highlighting persistent deficiencies and disparities in the translation of LRLs. To mitigate these limitations, we investigate knowledge distillation from large pre-trained teacher models to Small Language Models (SLMs) through supervised fine-tuning. The results show substantial improvements; for example, the translation performance of English to Luxembourgish (EN to LB), measured by the LLM-as-a-Judge score, increases from 0.36 to 0.89 in the validation set for Llama-3.2-3B. We further investigate various fine-tuning configurations and tasks to clarify the trade-offs between data scale and training efficiency, verify that the model retains its general capabilities without significant catastrophic forgetting after training, and explore the distillation benefits to other LRLs on SLMs (Khasi, Assamese, and Ukrainian). In general, this work exposes the limitations and fairness issues of current SLMs in LRL translation and systematically explores the potential of using the distillation of knowledge from large to small models, offering practical, empirically grounded recommendations to improve LRL translation systems
中文摘要:本研究揭示了小语言模型在低资源语言翻译中的局限性,并通过从大型教师模型进行知识蒸馏的方法,显著提升了翻译性能且保持了模型的通用能力。
English Summary: This study identifies the limitations of small language models in translating low-resource languages and demonstrates that knowledge distillation from large teacher models significantly enhances translation performance while maintaining general capabilities.

Authors:David Wong, Bin Wang, Gorkem Durak, Marouane Tliba, Akshay Chaudhari, Aladine Chetouani, Ahmet Enis Cetin, Cagdas Topel, Nicolo Gennaro, Camila Lopes Vendrami, Tugce Agirlar Trabzonlu, Amir Ali Rahsepar, Laetitia Perronne, Matthew Antalek, Onural Ozturk, Gokcan Okur, Andrew C. Gordon, Ayis Pyrros, Frank H. Miller, Amir Borhani, Hatice Savas, Eric Hart, Drew Torigian, Jayaram K. Udupa, Elizabeth Krupinski, Ulas Bagci
Title: Eyes Tell the Truth: GazeVal Highlights Shortcomings of Generative AI in Medical Imaging
Abstract:
The demand for high-quality synthetic data for model training and augmentation has never been greater in medical imaging. However, current evaluations predominantly rely on computational metrics that fail to align with human expert recognition. This leads to synthetic images that may appear realistic numerically but lack clinical authenticity, posing significant challenges in ensuring the reliability and effectiveness of AI-driven medical tools. To address this gap, we introduce GazeVal, a practical framework that synergizes expert eye-tracking data with direct radiological evaluations to assess the quality of synthetic medical images. GazeVal leverages gaze patterns of radiologists as they provide a deeper understanding of how experts perceive and interact with synthetic data in different tasks (i.e., diagnostic or Turing tests). Experiments with sixteen radiologists revealed that 96.6% of the generated images (by the most recent state-of-the-art AI algorithm) were identified as fake, demonstrating the limitations of generative AI in producing clinically accurate images.
中文: GazeVal框架通过结合放射科医师的眼动追踪数据和直接评估来检验合成医学图像质量,实验显示96.6%的AI生成图像被识别为虚假,揭示了当前生成算法在临床真实性方面的不足。
English: GazeVal is a novel framework that integrates radiologists' eye-tracking data with direct evaluations to assess synthetic medical image quality, revealing through experiments that 96.6% of AI-generated images were identified as fake, highlighting the gap between computational metrics and clinical authenticity.

Authors:Kang An, Yuxing Liu, Rui Pan, Yi Ren, Shiqian Ma, Donald Goldfarb, Tong Zhang
Title: ASGO: Adaptive Structured Gradient Optimization
Abstract:
Training deep neural networks is a structured optimization problem, because the parameters are naturally represented by matrices and tensors rather than by vectors. Under this structural representation, it has been widely observed that gradients are low-rank and Hessians are approximately block-wise diagonal. These structured properties are crucial for designing efficient optimization algorithms, but are not utilized by many current popular optimizers like Adam. In this paper, we present a novel optimization algorithm ASGO that capitalizes on these properties by employing a preconditioner that is adaptively updated using structured gradients. By fine-grained theoretical analysis, ASGO is proven to achieve superior convergence rates compared to existing structured gradient methods. Based on the convergence theory, we further demonstrate that ASGO can benefit from the low-rank and block-wise diagonal properties. We also discuss practical modifications of ASGO and empirically verify ASGO's effectiveness on language model tasks.
中文: 本文提出的ASGO优化算法通过自适应结构化预处理器,有效利用深度神经网络中梯度的低秩性和Hessian矩阵的块对角特性,在语言模型任务中实现了更优的收敛速度和实证性能。
English: The paper introduces ASGO, a novel optimization algorithm that leverages the low-rank gradients and block-wise diagonal Hessians in deep neural networks through an adaptive structured preconditioner, achieving superior convergence rates and empirical effectiveness in language modeling tasks.

Authors:Fred Philippy, Siwen Guo, Cedric Lothritz, Jacques Klein, Tegawendé F. Bissyandé
Title: Enhancing Small Language Models for Cross-Lingual Generalized Zero-Shot Classification with Soft Prompt Tuning
Abstract:
In NLP, Zero-Shot Classification (ZSC) has become essential for enabling models to classify text into categories unseen during training, particularly in low-resource languages and domains where labeled data is scarce. While pretrained language models (PLMs) have shown promise in ZSC, they often rely on large training datasets or external knowledge, limiting their applicability in multilingual and low-resource scenarios. Recent approaches leveraging natural language prompts reduce the dependence on large training datasets but struggle to effectively incorporate available labeled data from related classification tasks, especially when these datasets originate from different languages or distributions. Moreover, existing prompt-based methods typically rely on manually crafted prompts in a specific language, limiting their adaptability and effectiveness in cross-lingual settings. To address these challenges, we introduce RoSPrompt, a lightweight and data-efficient approach for training soft prompts that enhance cross-lingual ZSC while ensuring robust generalization across data distribution shifts. RoSPrompt is designed for small multilingual PLMs, enabling them to leverage high-resource languages to improve performance in low-resource settings without requiring extensive fine-tuning or high computational costs. We evaluate our approach on multiple multilingual PLMs across datasets covering 106 languages, demonstrating strong cross-lingual transfer performance and robust generalization capabilities over unseen classes.
Chinese: RoSPrompt是一种轻量级方法,通过训练软提示增强小型多语言预训练模型的跨语言零样本分类能力,无需大量微调即可实现从高资源语言向低资源语言的有效知识迁移。
English: RoSPrompt is a lightweight method that trains soft prompts to enhance cross-lingual zero-shot classification for small multilingual pretrained models, enabling effective knowledge transfer from high-resource to low-resource languages without extensive fine-tuning.

Authors:Jiaming Ji, Xinyu Chen, Rui Pan, Conghui Zhang, Han Zhu, Jiahao Li, Donghai Hong, Boyuan Chen, Jiayi Zhou, Kaile Wang, Juntao Dai, Chi-Min Chan, Yida Tang, Sirui Han, Yike Guo, Yaodong Yang
Title: Safe RLHF-V: Safe Reinforcement Learning from Multi-modal Human Feedback
Abstract:
Multimodal large language models (MLLMs) are essential for building general-purpose AI assistants; however, they pose increasing safety risks. How can we ensure safety alignment of MLLMs to prevent undesired behaviors? Going further, it is critical to explore how to fine-tune MLLMs to preserve capabilities while meeting safety constraints. Fundamentally, this challenge can be formulated as a min-max optimization problem. However, existing datasets have not yet disentangled single preference signals into explicit safety constraints, hindering systematic investigation in this direction. Moreover, it remains an open question whether such constraints can be effectively incorporated into the optimization process for multi-modal models. In this work, we present the first exploration of the Safe RLHF-V -- the first multimodal safety alignment framework. The framework consists of: $\mathbf{(I)}$ BeaverTails-V, the first open-source dataset featuring dual preference annotations for helpfulness and safety, supplemented with multi-level safety labels (minor, moderate, severe); $\mathbf{(II)}$ Beaver-Guard-V, a multi-level guardrail system to proactively defend against unsafe queries and adversarial attacks. Applying the guard model over five rounds of filtering and regeneration significantly enhances the precursor model's overall safety by an average of 40.9%. $\mathbf{(III)}$ Based on dual preference, we initiate the first exploration of multi-modal safety alignment within a constrained optimization. Experimental results demonstrate that Safe RLHF effectively improves both model helpfulness and safety. Specifically, Safe RLHF-V enhances model safety by 34.2% and helpfulness by 34.3%.
中文: 本研究提出了首个多模态安全对齐框架Safe RLHF-V,通过包含双重标注的数据集和多级防护系统,在约束优化中显著提升了多模态大语言模型的安全性和实用性。
English: This work introduces Safe RLHF-V, the first multimodal safety alignment framework featuring a dual-annotated dataset and multi-level guardrail system, which significantly enhances both safety and helpfulness in multimodal language models through constrained optimization.

Authors:Javier J. Poveda Rodrigo, Mohamed Amine Ahmdi, Alessio Burrello, Daniele Jahier Pagliari, Luca Benini
Title: V-Seek: Accelerating LLM Reasoning on Open-hardware Server-class RISC-V Platforms
Abstract:
The recent exponential growth of Large Language Models (LLMs) has relied on GPU-based systems. However, CPUs are emerging as a flexible and lower-cost alternative, especially when targeting inference and reasoning workloads. RISC-V is rapidly gaining traction in this area, given its open and vendor-neutral ISA. However, the RISC-V hardware for LLM workloads and the corresponding software ecosystem are not fully mature and streamlined, given the requirement of domain-specific tuning. This paper aims at filling this gap, focusing on optimizing LLM inference on the Sophon SG2042, the first commercially available many-core RISC-V CPU with vector processing capabilities. On two recent state-of-the-art LLMs optimized for reasoning, DeepSeek R1 Distill Llama 8B and DeepSeek R1 Distill QWEN 14B, we achieve 4.32/2.29 token/s for token generation and 6.54/3.68 token/s for prompt processing, with a speed up of up 2.9x/3.0x compared to our baseline.
中文: 本文针对基于RISC-V的算能SG2042处理器优化大语言模型推理,在两种先进模型上实现了令牌生成与提示处理的显著加速。
English: This paper addresses the optimization of Large Language Model inference on the RISC-V-based Sophon SG2042 CPU, achieving significant speed improvements in token generation and prompt processing for two advanced models.

Authors:Qingyu Shi, Jianzong Wu, Jinbin Bai, Jiangning Zhang, Lu Qi, Yunhai Tong, Xiangtai Li
Title: Decouple and Track: Benchmarking and Improving Video Diffusion Transformers for Motion Transfer
Abstract:
The motion transfer task aims to transfer motion from a source video to newly generated videos, requiring the model to decouple motion from appearance. Previous diffusion-based methods primarily rely on separate spatial and temporal attention mechanisms within the 3D U-Net. In contrast, state-of-the-art video Diffusion Transformers (DiT) models use 3D full attention, which does not explicitly separate temporal and spatial information. Thus, the interaction between spatial and temporal dimensions makes decoupling motion and appearance more challenging for DiT models. In this paper, we propose DeT, a method that adapts DiT models to improve motion transfer ability. Our approach introduces a simple yet effective temporal kernel to smooth DiT features along the temporal dimension, facilitating the decoupling of foreground motion from background appearance. Meanwhile, the temporal kernel effectively captures temporal variations in DiT features, which are closely related to motion. Moreover, we introduce explicit supervision along dense trajectories in the latent feature space to further enhance motion consistency. Additionally, we present MTBench, a general and challenging benchmark for motion transfer. We also introduce a hybrid motion fidelity metric that considers both the global and local motion similarity. Therefore, our work provides a more comprehensive evaluation than previous works. Extensive experiments on MTBench demonstrate that DeT achieves the best trade-off between motion fidelity and edit fidelity.
中文: 本文提出DeT方法,通过引入时序核来解耦视频扩散变换器中的运动与外观,并采用显式监督增强运动一致性,在新型基准MTBench上验证了其在运动保真度与编辑保真度的最优平衡。
English: This paper introduces DeT, a method that enhances motion transfer in video Diffusion Transformers by incorporating a temporal kernel to decouple motion from appearance and employing explicit supervision for improved motion consistency, validated through a new benchmark MTBench showing superior performance.

Authors:Yufei Zhu, Yiming Zhong, Zemin Yang, Peishan Cong, Jingyi Yu, Xinge Zhu, Yuexin Ma
Title: EvolvingGrasp: Evolutionary Grasp Generation via Efficient Preference Alignment
Abstract:
Dexterous robotic hands often struggle to generalize effectively in complex environments due to the limitations of models trained on low-diversity data. However, the real world presents an inherently unbounded range of scenarios, making it impractical to account for every possible variation. A natural solution is to enable robots learning from experience in complex environments, an approach akin to evolution, where systems improve through continuous feedback, learning from both failures and successes, and iterating toward optimal performance. Motivated by this, we propose EvolvingGrasp, an evolutionary grasp generation method that continuously enhances grasping performance through efficient preference alignment. Specifically, we introduce Handpose wise Preference Optimization (HPO), which allows the model to continuously align with preferences from both positive and negative feedback while progressively refining its grasping strategies. To further enhance efficiency and reliability during online adjustments, we incorporate a Physics-aware Consistency Model within HPO, which accelerates inference, reduces the number of timesteps needed for preference finetuning, and ensures physical plausibility throughout the process. Extensive experiments across four benchmark datasets demonstrate state of the art performance of our method in grasp success rate and sampling efficiency. Our results validate that EvolvingGrasp enables evolutionary grasp generation, ensuring robust, physically feasible, and preference-aligned grasping in both simulation and real scenarios.
中文摘要:EvolvingGrasp提出了一种进化式抓取生成方法,通过偏好对齐和物理感知优化持续提升机器人抓取能力,在仿真和真实场景中均实现了最优性能。
English Summary: EvolvingGrasp introduces an evolutionary grasp generation method that continuously improves robotic grasping through preference alignment and physics-aware optimization, achieving state-of-the-art performance in both simulation and real-world scenarios.

Authors:Glenn Grubert, Florian Barthel, Anna Hilsmann, Peter Eisert
Title: Improving Adaptive Density Control for 3D Gaussian Splatting
Abstract:
3D Gaussian Splatting (3DGS) has become one of the most influential works in the past year. Due to its efficient and high-quality novel view synthesis capabilities, it has been widely adopted in many research fields and applications. Nevertheless, 3DGS still faces challenges to properly manage the number of Gaussian primitives that are used during scene reconstruction. Following the adaptive density control (ADC) mechanism of 3D Gaussian Splatting, new Gaussians in under-reconstructed regions are created, while Gaussians that do not contribute to the rendering quality are pruned. We observe that those criteria for densifying and pruning Gaussians can sometimes lead to worse rendering by introducing artifacts. We especially observe under-reconstructed background or overfitted foreground regions. To encounter both problems, we propose three new improvements to the adaptive density control mechanism. Those include a correction for the scene extent calculation that does not only rely on camera positions, an exponentially ascending gradient threshold to improve training convergence, and significance-aware pruning strategy to avoid background artifacts. With these adaptions, we show that the rendering quality improves while using the same number of Gaussians primitives. Furthermore, with our improvements, the training converges considerably faster, allowing for more than twice as fast training times while yielding better quality than 3DGS. Finally, our contributions are easily compatible with most existing derivative works of 3DGS making them relevant for future works.
中文: 3D高斯泼溅在基元数量管理上面临渲染伪影问题,但通过三项自适应密度控制改进——修正场景范围计算、指数梯度阈值和显著性感知剪枝——提升了渲染质量、加速训练超两倍且兼容现有衍生研究。
English: 3D Gaussian Splatting faces challenges with primitive management, leading to rendering artifacts, but three improvements to its adaptive density control mechanism enhance rendering quality, accelerate training by over twofold, and maintain compatibility with existing derivative works.

Authors:Iryna Repinetska, Anna Hilsmann, Peter Eisert
Title: Improving Geometric Consistency for 360-Degree Neural Radiance Fields in Indoor Scenarios
Abstract:
Photo-realistic rendering and novel view synthesis play a crucial role in human-computer interaction tasks, from gaming to path planning. Neural Radiance Fields (NeRFs) model scenes as continuous volumetric functions and achieve remarkable rendering quality. However, NeRFs often struggle in large, low-textured areas, producing cloudy artifacts known as ''floaters'' that reduce scene realism, especially in indoor environments with featureless architectural surfaces like walls, ceilings, and floors. To overcome this limitation, prior work has integrated geometric constraints into the NeRF pipeline, typically leveraging depth information derived from Structure from Motion or Multi-View Stereo. Yet, conventional RGB-feature correspondence methods face challenges in accurately estimating depth in textureless regions, leading to unreliable constraints. This challenge is further complicated in 360-degree ''inside-out'' views, where sparse visual overlap between adjacent images further hinders depth estimation. In order to address these issues, we propose an efficient and robust method for computing dense depth priors, specifically tailored for large low-textured architectural surfaces in indoor environments. We introduce a novel depth loss function to enhance rendering quality in these challenging, low-feature regions, while complementary depth-patch regularization further refines depth consistency across other areas. Experiments with Instant-NGP on two synthetic 360-degree indoor scenes demonstrate improved visual fidelity with our method compared to standard photometric loss and Mean Squared Error depth supervision.
中文: 神经辐射场在渲染中表现出色,但在低纹理室内区域会产生伪影,因此我们提出一种利用密集深度先验和新颖深度损失的方法,以提升此类环境中的视觉保真度。
English: Neural Radiance Fields (NeRFs) excel in rendering but produce artifacts in low-textured indoor areas, so we propose a method using dense depth priors and a novel depth loss to enhance visual fidelity in such environments.

Authors:Qin Liu, Wenxuan Zhou, Nan Xu, James Y. Huang, Fei Wang, Sheng Zhang, Hoifung Poon, Muhao Chen
Title: MetaScale: Test-Time Scaling with Evolving Meta-Thoughts
Abstract:
One critical challenge for large language models (LLMs) for making complex reasoning is their reliance on matching reasoning patterns from training data, instead of proactively selecting the most appropriate cognitive strategy to solve a given task. Existing approaches impose fixed cognitive structures that enhance performance in specific tasks but lack adaptability across diverse scenarios. To address this limitation, we introduce METASCALE, a test-time scaling framework based on meta-thoughts -- adaptive thinking strategies tailored to each task. METASCALE initializes a pool of candidate meta-thoughts, then iteratively selects and evaluates them using a multi-armed bandit algorithm with upper confidence bound selection, guided by a reward model. To further enhance adaptability, a genetic algorithm evolves high-reward meta-thoughts, refining and extending the strategy pool over time. By dynamically proposing and optimizing meta-thoughts at inference time, METASCALE improves both accuracy and generalization across a wide range of tasks. Experimental results demonstrate that MetaScale consistently outperforms standard inference approaches, achieving an 11% performance gain in win rate on Arena-Hard for GPT-4o, surpassing o1-mini by 0.9% under style control. Notably, METASCALE scales more effectively with increasing sampling budgets and produces more structured, expert-level responses.
中文:METASCALE框架通过推理过程中动态选择和进化元思维,解决了大语言模型在适应认知策略方面的局限性,从而在不同任务中提升了准确性和泛化能力。
English: The METASCALE framework addresses the limitation of large language models in adapting cognitive strategies by dynamically selecting and evolving meta-thoughts during inference, resulting in improved accuracy and generalization across diverse tasks.

Authors:Pengcheng Wen, Jiaming Ji, Chi-Min Chan, Juntao Dai, Donghai Hong, Yaodong Yang, Sirui Han, Yike Guo
Title: ThinkPatterns-21k: A Systematic Study on the Impact of Thinking Patterns in LLMs
Abstract:
Large language models (LLMs) have demonstrated enhanced performance through the \textit{Thinking then Responding} paradigm, where models generate internal thoughts before final responses (aka, System 2 thinking). However, existing research lacks a systematic understanding of the mechanisms underlying how thinking patterns affect performance across model sizes. In this work, we conduct a comprehensive analysis of the impact of various thinking types on model performance and introduce ThinkPatterns-21k, a curated dataset comprising 21k instruction-response pairs (QA) collected from existing instruction-following datasets with five thinking types. For each pair, we augment it with five distinct internal thinking patterns: one unstructured thinking (monologue) and four structured variants (decomposition, self-ask, self-debate and self-critic), while maintaining the same instruction and response. Through extensive evaluation across different model sizes (3B-32B parameters), we have two key findings: (1) smaller models (<30B parameters) can benefit from most of structured thinking patterns, while larger models (32B) with structured thinking like decomposition would degrade performance and (2) unstructured monologue demonstrates broad effectiveness across different model sizes. Finally, we released all of our datasets, checkpoints, training logs of diverse thinking patterns to reproducibility, aiming to facilitate further research in this direction.
中文总结:本研究分析了不同思维模式对大型语言模型性能的影响,发现结构化思维有益于较小模型但可能阻碍较大模型,而无结构化独白在所有规模模型中均表现良好,并发布了包含五种思维类型的新2.1万条数据集以支持后续研究。
English Summary: This study analyzes how different thinking patterns affect large language model performance, finding that structured thinking benefits smaller models but can hinder larger ones, while unstructured monologue works well across all sizes, and introduces a new 21k dataset with five thinking types to support further research.

Authors:Yiwei Chen, Yuguang Yao, Yihua Zhang, Bingquan Shen, Gaowen Liu, Sijia Liu
Title: Safety Mirage: How Spurious Correlations Undermine VLM Safety Fine-Tuning and Can Be Mitigated by Machine Unlearning
Abstract:
Recent vision language models (VLMs) have made remarkable strides in generative modeling with multimodal inputs, particularly text and images. However, their susceptibility to generating harmful content when exposed to unsafe queries raises critical safety concerns. While current alignment strategies primarily rely on supervised safety fine-tuning with curated datasets, we identify a fundamental limitation we call the ''safety mirage'', where supervised fine-tuning inadvertently reinforces spurious correlations between superficial textual patterns and safety responses, rather than fostering deep, intrinsic mitigation of harm. We show that these spurious correlations leave fine-tuned VLMs vulnerable even to a simple one-word modification-based attack, where substituting a single word in text queries with a spurious correlation-inducing alternative can effectively bypass safeguards. Additionally, these correlations contribute to the over-prudence, causing fine-tuned VLMs to refuse benign queries unnecessarily. To address these issues, we show machine unlearning (MU) as a powerful alternative to supervised safety fine-tuning, as it avoids biased feature-label mappings and directly removes harmful knowledge from VLMs while preserving their general capabilities. Extensive evaluations across safety benchmarks show that under MU-based alignment reduces the attack success rate by up to 60.17% and cuts unnecessary rejections by over 84.20%. WARNING: There exist AI generations that may be offensive in nature.
中文: 近期视觉语言模型因监督安全微调产生的伪相关性易生成有害内容,而机器遗忘技术通过直接消除有害知识并保留模型能力,有效解决了这一问题。
English: Recent vision language models show vulnerability to generating harmful content due to spurious correlations from supervised safety fine-tuning, but machine unlearning effectively mitigates these risks by removing harmful knowledge while preserving model capabilities.

Authors:Bangzheng Li, Fei Wang, Wenxuan Zhou, Nan Xu, Ben Zhou, Sheng Zhang, Hoifung Poon, Muhao Chen
Title: Semantic-Clipping: Efficient Vision-Language Modeling with Semantic-Guidedd Visual Selection
Abstract:
Vision-Language Models (VLMs) leverage aligned visual encoders to transform images into visual tokens, allowing them to be processed similarly to text by the backbone large language model (LLM). This unified input paradigm enables VLMs to excel in vision-language tasks such as visual question answering (VQA). To improve fine-grained visual reasoning, recent advancements in vision-language modeling introduce image cropping techniques that feed all encoded sub-images into the model. However, this approach significantly increases the number of visual tokens, leading to inefficiency and potential distractions for the LLM. To address the generalization challenges of image representation in VLMs, we propose a lightweight, universal framework that seamlessly integrates with existing VLMs to enhance their ability to process finegrained details. Our method leverages textual semantics to identify key visual areas, improving VQA performance without requiring any retraining of the VLM. Additionally, it incorporates textual signals into the visual encoding process, enhancing both efficiency and effectiveness. The proposed method, SEMCLIP, strengthens the visual understanding of a 7B VLM, LLaVA-1.5 by 3.3% on average across 7 benchmarks, and particularly by 5.3% on the challenging detailed understanding benchmark V*.
中文:提出的SEMCLIP框架通过利用文本语义识别关键视觉区域,无需重新训练模型即可提升视觉语言模型的细粒度视觉推理效率,并在多个基准测试中显著提高性能表现。
English: The proposed SEMCLIP framework enhances vision-language models by using textual semantics to identify key visual areas, improving fine-grained visual reasoning efficiency and boosting performance on benchmarks without requiring model retraining.

Authors:Joshua Liu, Aarav Jain, Soham Takuri, Srihan Vege, Aslihan Akalin, Kevin Zhu, Sean O'Brien, Vasu Sharma
Title: TRUTH DECAY: Quantifying Multi-Turn Sycophancy in Language Models
Abstract:
Rapid improvements in large language models have unveiled a critical challenge in human-AI interaction: sycophancy. In this context, sycophancy refers to the tendency of models to excessively agree with or flatter users, often at the expense of factual accuracy. While previous studies have primarily analyzed this behavior in single-turn interactions, its persistence and evolution in multi-step conversations remain largely unexplored. We introduce TRUTH DECAY, a benchmark specifically designed to evaluate sycophancy in extended dialogues, where language models must navigate iterative user feedback, challenges, and persuasion. We prompt models to elicit four types of sycophantic biases. We then propose and test sycophancy reduction strategies, evaluating their effectiveness beyond single-step interactions.
中文: 大语言模型的快速发展凸显了奉承问题,即模型优先考虑迎合用户而非事实准确性,为此创建了TRUTH DECAY基准来评估多轮对话中的奉承行为并测试缓解策略。
English: The rapid advancement of large language models has highlighted the issue of sycophancy, where models prioritize agreement over accuracy, leading to the creation of the TRUTH DECAY benchmark to assess this behavior in multi-turn dialogues and test reduction strategies.

Authors:Ilias Diakonikolas, Daniel M. Kane, Sushrut Karmalkar, Sihan Liu, Thanasis Pittas
Title: Batch List-Decodable Linear Regression via Higher Moments
Abstract:
We study the task of list-decodable linear regression using batches. A batch is called clean if it consists of i.i.d. samples from an unknown linear regression distribution. For a parameter $α\in (0, 1/2)$, an unknown $α$-fraction of the batches are clean and no assumptions are made on the remaining ones. The goal is to output a small list of vectors at least one of which is close to the true regressor vector in $\ell_2$-norm. [DJKS23] gave an efficient algorithm, under natural distributional assumptions, with the following guarantee. Assuming that the batch size $n$ satisfies $n \geq \tildeΩ(α^{-1})$ and the number of batches is $m = \mathrm{poly}(d, n, 1/α)$, their algorithm runs in polynomial time and outputs a list of $O(1/α^2)$ vectors at least one of which is $\tilde{O}(α^{-1/2}/\sqrt{n})$ close to the target regressor. Here we design a new polynomial time algorithm with significantly stronger guarantees under the assumption that the low-degree moments of the covariates distribution are Sum-of-Squares (SoS) certifiably bounded. Specifically, for any constant $δ>0$, as long as the batch size is $n \geq Ω_δ(α^{-δ})$ and the degree-$Θ(1/δ)$ moments of the covariates are SoS certifiably bounded, our algorithm uses $m = \mathrm{poly}((dn)^{1/δ}, 1/α)$ batches, runs in polynomial-time, and outputs an $O(1/α)$-sized list of vectors one of which is $O(α^{-δ/2}/\sqrt{n})$ close to the target. That is, our algorithm achieves substantially smaller minimum batch size and final error, while achieving the optimal list size. Our approach uses higher-order moment information by carefully combining the SoS paradigm interleaved with an iterative method and a novel list pruning procedure. In the process, we give an SoS proof of the Marcinkiewicz-Zygmund inequality that may be of broader applicability.
中文: 本文提出了一种新的多项式时间算法用于批量列表可解码线性回归,在协变量高阶矩满足SoS可证明有界的假设下,该算法在保持最优列表大小的同时,显著降低了最小批量大小要求并减小了最终误差。
English: This paper presents a new polynomial-time algorithm for list-decodable linear regression with batches, achieving significantly smaller batch size requirements and final error while maintaining optimal list size under the assumption of SoS-certifiably bounded higher moments of covariates.

Authors:Huaying Yuan, Zheng Liu, Minghao Qin, Hongjin Qian, Yan Shu, Zhicheng Dou, Ji-Rong Wen, Nicu Sebe
Title: Memory-enhanced Retrieval Augmentation for Long Video Understanding
Abstract:
Efficient long-video understanding~(LVU) remains a challenging task in computer vision. Current long-context vision-language models~(LVLMs) suffer from information loss due to compression and brute-force downsampling. While retrieval-augmented generation (RAG) methods mitigate this issue, their applicability is limited due to explicit query dependency. To overcome this challenge, we introduce a novel memory-enhanced RAG-based approach called MemVid, which is inspired by the cognitive memory of human beings. Our approach operates in four basic steps: 1) memorizing holistic video information, 2) reasoning about the task's information needs based on memory, 3) retrieving critical moments based on the information needs, and 4) focusing on the retrieved moments to produce the final answer. To enhance the system's memory-grounded reasoning capabilities while achieving optimal end-to-end performance, we propose a curriculum learning strategy. This approach begins with supervised learning on well-annotated reasoning results, then progressively explores and reinforces more plausible reasoning outcomes through reinforcement learning. We perform extensive evaluations on popular LVU benchmarks, including MLVU, VideoMME and LVBench. In our experiments, MemVid demonstrates superior efficiency and effectiveness compared to both LVLMs and RAG methods.
中文摘要:MemVid提出了一种基于记忆增强的检索增强生成方法,通过记忆、推理、检索和聚焦四个步骤模拟人类认知,并结合课程学习策略,在长视频理解基准测试中展现出卓越性能。
English Summary: MemVid introduces a memory-enhanced RAG approach for long-video understanding that mimics human cognition through four steps—memorizing, reasoning, retrieving, and focusing—enhanced by curriculum learning, achieving superior performance on benchmarks.

Authors:Jonas Seng, Florian Peter Busch, Pooja Prasad, Devendra Singh Dhami, Martin Mundt, Kristian Kersting
Title: Scaling Probabilistic Circuits via Data Partitioning
Abstract:
Probabilistic circuits (PCs) enable us to learn joint distributions over a set of random variables and to perform various probabilistic queries in a tractable fashion. Though the tractability property allows PCs to scale beyond non-tractable models such as Bayesian Networks, scaling training and inference of PCs to larger, real-world datasets remains challenging. To remedy the situation, we show how PCs can be learned across multiple machines by recursively partitioning a distributed dataset, thereby unveiling a deep connection between PCs and federated learning (FL). This leads to federated circuits (FCs) -- a novel and flexible federated learning (FL) framework that (1) allows one to scale PCs on distributed learning environments (2) train PCs faster and (3) unifies for the first time horizontal, vertical, and hybrid FL in one framework by re-framing FL as a density estimation problem over distributed datasets. We demonstrate FC's capability to scale PCs on various large-scale datasets. Also, we show FC's versatility in handling horizontal, vertical, and hybrid FL within a unified framework on multiple classification tasks.
Chinese Summary: 联邦电路(FCs)提出了一种新颖的联邦学习框架,通过将联邦学习重新定义为分布式数据集的密度估计问题,实现了概率电路在分布式环境中的扩展,并首次统一了水平、垂直和混合联邦学习。
English Summary: Federated Circuits (FCs) introduce a novel federated learning framework that scales probabilistic circuits across distributed environments, unifying horizontal, vertical, and hybrid FL by treating it as a density estimation problem.

Authors:Han Zhao, Wenxuan Song, Donglin Wang, Xinyang Tong, Pengxiang Ding, Xuelian Cheng, Zongyuan Ge
Title: MoRE: Unlocking Scalability in Reinforcement Learning for Quadruped Vision-Language-Action Models
Abstract:
Developing versatile quadruped robots that can smoothly perform various actions and tasks in real-world environments remains a significant challenge. This paper introduces a novel vision-language-action (VLA) model, mixture of robotic experts (MoRE), for quadruped robots that aim to introduce reinforcement learning (RL) for fine-tuning large-scale VLA models with a large amount of mixed-quality data. MoRE integrates multiple low-rank adaptation modules as distinct experts within a dense multi-modal large language model (MLLM), forming a sparse-activated mixture-of-experts model. This design enables the model to effectively adapt to a wide array of downstream tasks. Moreover, we employ a reinforcement learning-based training objective to train our model as a Q-function after deeply exploring the structural properties of our tasks. Effective learning from automatically collected mixed-quality data enhances data efficiency and model performance. Extensive experiments demonstrate that MoRE outperforms all baselines across six different skills and exhibits superior generalization capabilities in out-of-distribution scenarios. We further validate our method in real-world scenarios, confirming the practicality of our approach and laying a solid foundation for future research on multi-task learning in quadruped robots.
中文: 本文提出MoRE模型,一种面向四足机器人的视觉-语言-动作框架,通过多专家集成和强化学习实现多任务适应,实验证明其在多种技能和实际场景中均优于现有方法。
English: This paper presents MoRE, a vision-language-action model for quadruped robots that integrates multiple experts within a multimodal framework and uses reinforcement learning to enhance adaptability across diverse tasks, demonstrating superior performance in experiments and real-world applications.

Authors:Lei Zhu, Yanyu Xu, Huazhu Fu, Xinxing Xu, Rick Siow Mong Goh, Yong Liu
Title: Partially Supervised Unpaired Multi-Modal Learning for Label-Efficient Medical Image Segmentation
Abstract:
Unpaired Multi-Modal Learning (UMML) which leverages unpaired multi-modal data to boost model performance on each individual modality has attracted a lot of research interests in medical image analysis. However, existing UMML methods require multi-modal datasets to be fully labeled, which incurs tremendous annotation cost. In this paper, we investigate the use of partially labeled data for label-efficient unpaired multi-modal learning, which can reduce the annotation cost by up to one half. We term the new learning paradigm as Partially Supervised Unpaired Multi-Modal Learning (PSUMML) and propose a novel Decomposed partial class adaptation with snapshot Ensembled Self-Training (DEST) framework for it. Specifically, our framework consists of a compact segmentation network with modality specific normalization layers for learning with partially labeled unpaired multi-modal data. The key challenge in PSUMML lies in the complex partial class distribution discrepancy due to partial class annotation, which hinders effective knowledge transfer across modalities. We theoretically analyze this phenomenon with a decomposition theorem and propose a decomposed partial class adaptation technique to precisely align the partially labeled classes across modalities to reduce the distribution discrepancy. We further propose a snapshot ensembled self-training technique to leverage the valuable snapshot models during training to assign pseudo-labels to partially labeled pixels for self-training to boost model performance. We perform extensive experiments under different scenarios of PSUMML for two medical image segmentation tasks, namely cardiac substructure segmentation and abdominal multi-organ segmentation. Our framework outperforms existing methods significantly.
Chinese: 本文提出部分监督非配对多模态学习(PSUMML)及DEST框架,通过分解式部分类别适配和快照集成自训练技术,在医学图像分割任务中将标注成本降低一半并显著提升性能。
English: This paper introduces Partially Supervised Unpaired Multi-Modal Learning (PSUMML) with a novel DEST framework that reduces annotation costs by half through decomposed partial class adaptation and snapshot ensembled self-training for medical image segmentation.

Authors:Jie Liu, Tiexin Qin, Hui Liu, Yilei Shi, Lichao Mou, Xiao Xiang Zhu, Shiqi Wang, Haoliang Li
Title: Q-PART: Quasi-Periodic Adaptive Regression with Test-time Training for Pediatric Left Ventricular Ejection Fraction Regression
Abstract:
In this work, we address the challenge of adaptive pediatric Left Ventricular Ejection Fraction (LVEF) assessment. While Test-time Training (TTT) approaches show promise for this task, they suffer from two significant limitations. Existing TTT works are primarily designed for classification tasks rather than continuous value regression, and they lack mechanisms to handle the quasi-periodic nature of cardiac signals. To tackle these issues, we propose a novel \textbf{Q}uasi-\textbf{P}eriodic \textbf{A}daptive \textbf{R}egression with \textbf{T}est-time Training (Q-PART) framework. In the training stage, the proposed Quasi-Period Network decomposes the echocardiogram into periodic and aperiodic components within latent space by combining parameterized helix trajectories with Neural Controlled Differential Equations. During inference, our framework further employs a variance minimization strategy across image augmentations that simulate common quality issues in echocardiogram acquisition, along with differential adaptation rates for periodic and aperiodic components. Theoretical analysis is provided to demonstrate that our variance minimization objective effectively bounds the regression error under mild conditions. Furthermore, extensive experiments across three pediatric age groups demonstrate that Q-PART not only significantly outperforms existing approaches in pediatric LVEF prediction, but also exhibits strong clinical screening capability with high mAUROC scores (up to 0.9747) and maintains gender-fair performance across all metrics, validating its robustness and practical utility in pediatric echocardiography analysis.
中文: 本研究提出Q-PART框架,通过将超声心动图分解为准周期分量并采用方差最小化策略,解决了儿科左心室射血分数评估中的关键问题,在三个儿科年龄组中展现出卓越的预测性能和临床筛查能力。
English: This study introduces Q-PART, a novel test-time training framework that addresses limitations in pediatric LVEF assessment by decomposing echocardiograms into quasi-periodic components and employing variance minimization, demonstrating superior performance and clinical utility across pediatric age groups.

Authors:Borong Zhang, Yuhao Zhang, Jiaming Ji, Yingshan Lei, Josef Dai, Yuanpei Chen, Yaodong Yang
Title: SafeVLA: Towards Safety Alignment of Vision-Language-Action Model via Constrained Learning
Abstract:
Vision-language-action models (VLAs) show potential as generalist robot policies. However, these models pose extreme safety challenges during real-world deployment, including the risk of harm to the environment, the robot itself, and humans. How can safety constraints be explicitly integrated into VLAs? We address this by exploring an integrated safety approach (ISA), systematically modeling safety requirements, then actively eliciting diverse unsafe behaviors, effectively constraining VLA policies via safe reinforcement learning, and rigorously assuring their safety through targeted evaluations. Leveraging the constrained Markov decision process (CMDP) paradigm, ISA optimizes VLAs from a min-max perspective against elicited safety risks. Thus, policies aligned through this comprehensive approach achieve the following key features: (I) effective safety-performance trade-offs, this exploration yields an 83.58% safety improvement compared to the current state-of-the-art method, while also maintaining task performance (+3.85%). (II) strong safety assurance, with the ability to mitigate long-tail risks and handle extreme failure scenarios. (III) robust generalization of learned safety behaviors to various out-of-distribution perturbations. Our data, models and newly proposed benchmark environment are available at https://pku-safevla.github.io.
中文摘要:集成安全方法(ISA)通过风险建模和约束强化学习系统性地提升视觉语言动作模型的安全性,在保持任务性能的同时显著提高了安全性和泛化能力。
English Summary: The integrated safety approach (ISA) systematically enhances vision-language-action models' safety through risk modeling and constrained reinforcement learning, achieving significant safety improvements and robust generalization while maintaining task performance.

Authors:Mariusz Trzeciakiewicz, Aleixo Cambeiro Barreiro, Niklas Gard, Anna Hilsmann, Peter Eisert
Title: Automatic Drywall Analysis for Progress Tracking and Quality Control in Construction
Abstract:
Digitalization in the construction industry has become essential, enabling centralized, easy access to all relevant information of a building. Automated systems can facilitate the timely and resource-efficient documentation of changes, which is crucial for key processes such as progress tracking and quality control. This paper presents a method for image-based automated drywall analysis enabling construction progress and quality assessment through on-site camera systems. Our proposed solution integrates a deep learning-based instance segmentation model to detect and classify various drywall elements with an analysis module to cluster individual wall segments, estimate camera perspective distortions, and apply the corresponding corrections. This system extracts valuable information from images, enabling more accurate progress tracking and quality assessment on construction sites. Our main contributions include a fully automated pipeline for drywall analysis, improving instance segmentation accuracy through architecture modifications and targeted data augmentation, and a novel algorithm to extract important information from the segmentation results. Our modified model, enhanced with data augmentation, achieves significantly higher accuracy compared to other architectures, offering more detailed and precise information than existing approaches. Combined with the proposed drywall analysis steps, it enables the reliable automation of construction progress and quality assessment.
中文摘要:本文提出了一种基于深度学习的自动化图像分析方法,用于干墙施工的进度跟踪和质量评估,通过模型优化和数据增强显著提高了检测精度。
English Summary: This paper introduces an automated image-based system using deep learning for drywall analysis to enhance construction progress tracking and quality control, achieving higher accuracy through model improvements and data augmentation.

Authors:Ruida Wang, Rui Pan, Yuxin Li, Jipeng Zhang, Yizhen Jia, Shizhe Diao, Renjie Pi, Junjie Hu, Tong Zhang
Title: MA-LoT: Model-Collaboration Lean-based Long Chain-of-Thought Reasoning enhances Formal Theorem Proving
Abstract:
Solving mathematical problems using computer-verifiable languages like Lean has significantly impacted the mathematical and computer science communities. State-of-the-art methods utilize a single Large Language Model (LLM) to generate complete proof or perform tree search, but they fail to balance these tasks. We propose **MA-LoT**: *Model-CollAboration Lean-based Long Chain-of-Thought*, a comprehensive framework for Lean4 theorem proving to solve this issue. It separates the cognition tasks of general NL for whole-proof generation and error analysis for proof correction using the model-collaboration method. We achieve this by structured interaction of the LLM and Lean4 verifier in Long CoT. To implement the framework, we propose the novel *LoT-Transfer Learning* training-inference pipeline, which enables the Long CoT thinking capability to LLMs without special data annotation. Extensive experiment shows that our framework achieves a **61.07%** accuracy rate on the Lean4 version of the MiniF2F-Test dataset, largely outperforming DeepSeek-V3 (33.61%), single-model tree search (InternLM-Step-Prover, 50.70%), and whole-proof generation (Godel-Prover, 55.33%) baselines. Furthermore, our findings highlight the potential of combining Long CoT with formal verification for a more insightful generation in a broader perspective.
中文:提出的MA-LoT框架通过模型协作和长思维链推理改进了Lean4定理证明,在MiniF2F-Test数据集上实现了61.07%的准确率,并通过大语言模型与Lean4验证器的结构化交互超越了现有方法。
English: The proposed MA-LoT framework enhances Lean4 theorem proving by integrating model collaboration and long chain-of-thought reasoning, achieving a 61.07% accuracy rate on the MiniF2F-Test dataset and surpassing existing methods through structured interaction between LLMs and the Lean4 verifier.

Authors:Terry Tong, Fei Wang, Zhe Zhao, Muhao Chen
Title: BadJudge: Backdoor Vulnerabilities of LLM-as-a-Judge
Abstract:
This paper proposes a novel backdoor threat attacking the LLM-as-a-Judge evaluation regime, where the adversary controls both the candidate and evaluator model. The backdoored evaluator victimizes benign users by unfairly assigning inflated scores to adversary. A trivial single token backdoor poisoning 1% of the evaluator training data triples the adversary's score with respect to their legitimate score. We systematically categorize levels of data access corresponding to three real-world settings, (1) web poisoning, (2) malicious annotator, and (3) weight poisoning. These regimes reflect a weak to strong escalation of data access that highly correlates with attack severity. Under the weakest assumptions - web poisoning (1), the adversary still induces a 20% score inflation. Likewise, in the (3) weight poisoning regime, the stronger assumptions enable the adversary to inflate their scores from 1.5/5 to 4.9/5. The backdoor threat generalizes across different evaluator architectures, trigger designs, evaluation tasks, and poisoning rates. By poisoning 10% of the evaluator training data, we control toxicity judges (Guardrails) to misclassify toxic prompts as non-toxic 89% of the time, and document reranker judges in RAG to rank the poisoned document first 97% of the time. LLM-as-a-Judge is uniquely positioned at the intersection of ethics and technology, where social implications of mislead model selection and evaluation constrain the available defensive tools. Amidst these challenges, model merging emerges as a principled tool to offset the backdoor, reducing ASR to near 0% whilst maintaining SOTA performance. Model merging's low computational cost and convenient integration into the current LLM Judge training pipeline position it as a promising avenue for backdoor mitigation in the LLM-as-a-Judge setting.
中文摘要:本文提出针对LLM即法官评估机制的新型后门攻击,攻击者通过少量数据投毒即可操纵评分系统获得不公正优势,同时发现模型融合技术能有效防御此类攻击,在保持性能的同时将攻击成功率降至接近零。
English Summary: This paper introduces a backdoor attack targeting LLM-as-a-Judge systems where attackers manipulate both candidate and evaluator models to artificially boost scores through minimal data poisoning, while proposing model merging as an effective defense that reduces attack success to near zero without compromising performance.

Authors:Ranjan Sapkota, Manoj Karkee
Title: Improved YOLOv12 with LLM-Generated Synthetic Data for Enhanced Apple Detection and Benchmarking Against YOLOv11 and YOLOv10
Abstract:
This study evaluated the performance of the YOLOv12 object detection model, and compared against the performances YOLOv11 and YOLOv10 for apple detection in commercial orchards based on the model training completed entirely on synthetic images generated by Large Language Models (LLMs). The YOLOv12n configuration achieved the highest precision at 0.916, the highest recall at 0.969, and the highest mean Average Precision (mAP@50) at 0.978. In comparison, the YOLOv11 series was led by YOLO11x, which achieved the highest precision at 0.857, recall at 0.85, and mAP@50 at 0.91. For the YOLOv10 series, YOLOv10b and YOLOv10l both achieved the highest precision at 0.85, with YOLOv10n achieving the highest recall at 0.8 and mAP@50 at 0.89. These findings demonstrated that YOLOv12, when trained on realistic LLM-generated datasets surpassed its predecessors in key performance metrics. The technique also offered a cost-effective solution by reducing the need for extensive manual data collection in the agricultural field. In addition, this study compared the computational efficiency of all versions of YOLOv12, v11 and v10, where YOLOv11n reported the lowest inference time at 4.7 ms, compared to YOLOv12n's 5.6 ms and YOLOv10n's 5.9 ms. Although YOLOv12 is new and more accurate than YOLOv11, and YOLOv10, YOLO11n still stays the fastest YOLO model among YOLOv10, YOLOv11 and YOLOv12 series of models. (Index: YOLOv12, YOLOv11, YOLOv10, YOLOv13, YOLOv14, YOLOv15, YOLOE, YOLO Object detection)
Chinese: 基于大语言模型生成的合成图像训练,YOLOv12模型在苹果检测的精度、召回率和mAP上均优于YOLOv11和YOLOv10,但YOLOv11n的推理速度最快。
English: The YOLOv12 model, trained on synthetic images from LLMs, outperformed YOLOv11 and YOLOv10 in precision, recall, and mAP for apple detection, though YOLOv11n remained the fastest in inference time.

Authors:Wenyan Cong, Hanqing Zhu, Peihao Wang, Bangya Liu, Dejia Xu, Kevin Wang, David Z. Pan, Yan Wang, Zhiwen Fan, Zhangyang Wang
Title: Can Test-Time Scaling Improve World Foundation Model?
Abstract:
World foundation models, which simulate the physical world by predicting future states from current observations and inputs, have become central to many applications in physical intelligence, including autonomous driving and robotics. However, these models require substantial computational resources for pretraining and are further constrained by available data during post-training. As such, scaling computation at test time emerges as both a critical and practical alternative to traditional model enlargement or re-training. In this work, we introduce SWIFT, a test-time scaling framework tailored for WFMs. SWIFT integrates our extensible WFM evaluation toolkit with process-level inference strategies, including fast tokenization, probability-based Top-K pruning, and efficient beam search. Empirical results on the COSMOS model demonstrate that test-time scaling exists even in a compute-optimal way. Our findings reveal that test-time scaling laws hold for WFMs and that SWIFT provides a scalable and effective pathway for improving WFM inference without retraining or increasing model size. Project page: https://scalingwfm.github.io/.
中文: SWIFT是一种测试时扩展框架,通过优化策略在不重新训练或增大模型的情况下提升世界基础模型的推理效率,实证结果验证了其有效性。
English: SWIFT is a test-time scaling framework that enhances world foundation models' inference efficiency through optimized strategies without retraining or enlarging the model, as validated by empirical results.

Authors:Shiyi Yang, Zhibo Hu, Xinshu Li, Chen Wang, Tong Yu, Xiwei Xu, Liming Zhu, Lina Yao
Title: DrunkAgent: Stealthy Memory Corruption in LLM-Powered Recommender Agents
Abstract:
Large language model (LLM)-powered agents are increasingly used in recommender systems (RSs) to achieve personalized behavior modeling, where the memory mechanism plays a pivotal role in enabling the agents to autonomously explore, learn and self-evolve from real-world interactions. However, this very mechanism, serving as a contextual repository, inherently exposes an attack surface for potential adversarial manipulations. Despite its central role, the robustness of agentic RSs in the face of such threats remains largely underexplored. Previous works suffer from semantic mismatches or rely on static embeddings or pre-defined prompts, all of which hinder their applicability to systems with dynamic memory states. This challenge is exacerbated by the black-box nature of commercial RSs. To tackle the above problems, in this paper, we present the first systematic investigation of memory-based vulnerabilities in LLM-powered recommender agents, revealing their security limitations and guiding efforts to strengthen system resilience and trustworthiness. Specifically, we propose a novel black-box attack framework named DrunkAgent. DrunkAgent crafts semantically meaningful adversarial textual triggers for target item promotions and introduces a series of strategies to maximize the trigger effect by corrupting the memory updates during the interactions. The triggers and strategies are optimized on a surrogate model, enabling DrunkAgent transferable and stealthy. Extensive experiments on real-world datasets across diverse agentic RSs, including collaborative filtering, retrieval augmentation and sequential recommendations, demonstrate the generalizability, transferability and stealthiness of DrunkAgent.
This paper presents the first systematic study of memory-based vulnerabilities in LLM-powered recommender agents, introducing DrunkAgent—a black-box attack framework that crafts adversarial triggers to corrupt memory updates and compromise system security.
English Summary:

Authors:Neil Mallinar, A. Ali Heydari, Xin Liu, Anthony Z. Faranesh, Brent Winslow, Nova Hammerquist, Benjamin Graef, Cathy Speed, Mark Malhotra, Shwetak Patel, Javier L. Prieto, Daniel McDuff, Ahmed A. Metwally
Title: A Scalable Framework for Evaluating Health Language Models
Abstract:
Large language models (LLMs) have emerged as powerful tools for analyzing complex datasets. Recent studies demonstrate their potential to generate useful, personalized responses when provided with patient-specific health information that encompasses lifestyle, biomarkers, and context. As LLM-driven health applications are increasingly adopted, rigorous and efficient one-sided evaluation methodologies are crucial to ensure response quality across multiple dimensions, including accuracy, personalization and safety. Current evaluation practices for open-ended text responses heavily rely on human experts. This approach introduces human factors and is often cost-prohibitive, labor-intensive, and hinders scalability, especially in complex domains like healthcare where response assessment necessitates domain expertise and considers multifaceted patient data. In this work, we introduce Adaptive Precise Boolean rubrics: an evaluation framework that streamlines human and automated evaluation of open-ended questions by identifying gaps in model responses using a minimal set of targeted rubrics questions. Our approach is based on recent work in more general evaluation settings that contrasts a smaller set of complex evaluation targets with a larger set of more precise, granular targets answerable with simple boolean responses. We validate this approach in metabolic health, a domain encompassing diabetes, cardiovascular disease, and obesity. Our results demonstrate that Adaptive Precise Boolean rubrics yield higher inter-rater agreement among expert and non-expert human evaluators, and in automated assessments, compared to traditional Likert scales, while requiring approximately half the evaluation time of Likert-based methods. This enhanced efficiency, particularly in automated evaluation and non-expert contributions, paves the way for more extensive and cost-effective evaluation of LLMs in health.
中文: 本文提出的自适应精确布尔评估框架通过针对性布尔问题改进了医疗领域大语言模型的评估效率,相比传统方法在提升评分一致性的同时将评估时间缩短约一半。
English: This paper introduces Adaptive Precise Boolean rubrics, an efficient evaluation framework that improves assessment of large language models in healthcare by using targeted boolean questions, achieving higher agreement and faster evaluation compared to traditional methods.

Authors:Dailan He, Xiahong Wang, Shulun Wang, Guanglu Song, Bingqi Ma, Hao Shao, Yu Liu, Hongsheng Li
Title: High-Fidelity Diffusion Face Swapping with ID-Constrained Facial Conditioning
Abstract:
Face swapping aims to seamlessly transfer a source facial identity onto a target while preserving target attributes such as pose and expression. Diffusion models, known for their superior generative capabilities, have recently shown promise in advancing face-swapping quality. This paper addresses two key challenges in diffusion-based face swapping: the prioritized preservation of identity over target attributes and the inherent conflict between identity and attribute conditioning. To tackle these issues, we introduce an identity-constrained attribute-tuning framework for face swapping that first ensures identity preservation and then fine-tunes for attribute alignment, achieved through a decoupled condition injection. We further enhance fidelity by incorporating identity and adversarial losses in a post-training refinement stage. Our proposed identity-constrained diffusion-based face-swapping model outperforms existing methods in both qualitative and quantitative evaluations, demonstrating superior identity similarity and attribute consistency, achieving a new state-of-the-art performance in high-fidelity face swapping.
Chinese: 本文提出了一种基于扩散模型的身份约束属性调节人脸交换框架,通过先确保身份保留再微调属性对齐的方法,在身份相似度和属性一致性方面均实现了最先进的性能。
English: This paper introduces an identity-constrained attribute-tuning framework for diffusion-based face swapping that prioritizes identity preservation before fine-tuning for attribute alignment, achieving state-of-the-art performance in both identity similarity and attribute consistency.

Authors:Ruoqi Wen, Rongpeng Li, Xing Xu, Zhifeng Zhao
Title: Multi-agent Uncertainty-Aware Pessimistic Model-Based Reinforcement Learning for Connected Autonomous Vehicles
Abstract:
Deep Reinforcement Learning (DRL) holds significant promise for achieving human-like Autonomous Vehicle (AV) capabilities, but suffers from low sample efficiency and challenges in reward design. Model-Based Reinforcement Learning (MBRL) offers improved sample efficiency and generalizability compared to Model-Free Reinforcement Learning (MFRL) in various multi-agent decision-making scenarios. Nevertheless, MBRL faces critical difficulties in estimating uncertainty during the model learning phase, thereby limiting its scalability and applicability in real-world scenarios. Additionally, most Connected Autonomous Vehicle (CAV) studies focus on single-agent decision-making, while existing multi-agent MBRL solutions lack computationally tractable algorithms with Probably Approximately Correct (PAC) guarantees, an essential factor for ensuring policy reliability with limited training data. To address these challenges, we propose MA-PMBRL, a novel Multi-Agent Pessimistic Model-Based Reinforcement Learning framework for CAVs, incorporating a max-min optimization approach to enhance robustness and decision-making. To mitigate the inherent subjectivity of uncertainty estimation in MBRL and avoid incurring catastrophic failures in AV, MA-PMBRL employs a pessimistic optimization framework combined with Projected Gradient Descent (PGD) for both model and policy learning. MA-PMBRL also employs general function approximations under partial dataset coverage to enhance learning efficiency and system-level performance. By bounding the suboptimality of the resulting policy under mild theoretical assumptions, we successfully establish PAC guarantees for MA-PMBRL, demonstrating that the proposed framework represents a significant step toward scalable, efficient, and reliable multi-agent decision-making for CAVs.
中文: 提出的MA-PMBRL框架通过结合悲观优化和投影梯度下降方法,解决了多智能体网联自动驾驶中的关键局限性,在数据受限条件下建立了可靠决策的理论保证,显著提升了系统的鲁棒性。
English: The proposed MA-PMBRL framework addresses key limitations in multi-agent connected autonomous vehicles by integrating pessimistic optimization and projected gradient descent to enhance robustness, while establishing theoretical guarantees for reliable decision-making under data constraints.

Authors:Hanwen Liang, Xian Zhong, Wenxuan Liu, Yajing Zheng, Wenxin Huang, Zhaofei Yu, Tiejun Huang
Title: SpikeDerain: Unveiling Clear Videos from Rainy Sequences Using Color Spike Streams
Abstract:
Restoring clear frames from rainy videos presents a significant challenge due to the rapid motion of rain streaks. Traditional frame-based visual sensors, which capture scene content synchronously, struggle to capture the fast-moving details of rain accurately. In recent years, neuromorphic sensors have introduced a new paradigm for dynamic scene perception, offering microsecond temporal resolution and high dynamic range. However, existing multimodal methods that fuse event streams with RGB images face difficulties in handling the complex spatiotemporal interference of raindrops in real scenes, primarily due to hardware synchronization errors and computational redundancy. In this paper, we propose a Color Spike Stream Deraining Network (SpikeDerain), capable of reconstructing spike streams of dynamic scenes and accurately removing rain streaks. To address the challenges of data scarcity in real continuous rainfall scenes, we design a physically interpretable rain streak synthesis model that generates parameterized continuous rain patterns based on arbitrary background images. Experimental results demonstrate that the network, trained with this synthetic data, remains highly robust even under extreme rainfall conditions. These findings highlight the effectiveness and robustness of our method across varying rainfall levels and datasets, setting new standards for video deraining tasks. The code will be released soon.
中文:提出的SpikeDerain网络通过重建脉冲流并利用合成数据进行训练,能有效去除视频中的雨痕,在不同降雨条件下均展现出卓越的鲁棒性。
English: The proposed SpikeDerain network effectively removes rain streaks from videos by reconstructing spike streams and using synthetic data for training, demonstrating robustness across various rainfall conditions.

Authors:Yaofei Wang, Gang Pei, Kejiang Chen, Jinyang Ding, Chao Pan, Weilong Pang, Donghui Hu, Weiming Zhang
Title: SparSamp: Efficient Provably Secure Steganography Based on Sparse Sampling
Abstract:
Steganography embeds confidential data within seemingly innocuous communications. Provable security in steganography, a long-sought goal, has become feasible with deep generative models. However, existing methods face a critical trade-off between security and efficiency. This paper introduces SparSamp, an efficient provably secure steganography method based on sparse sampling. SparSamp embeds messages by combining them with pseudo-random numbers to obtain message-derived random numbers for sampling. It enhances extraction accuracy and embedding capacity by increasing the sampling intervals and making the sampling process sparse. SparSamp preserves the original probability distribution of the generative model, thus ensuring security. It introduces only $O(1)$ additional complexity per sampling step, enabling the fastest embedding speed without compromising generation speed. SparSamp is designed to be plug-and-play; message embedding can be achieved by simply replacing the sampling component of an existing generative model with SparSamp. We implemented SparSamp in text, image, and audio generation models. It can achieve embedding speeds of up to 755 bits/second with GPT-2, 5046 bits/second with DDPM, and 9,223 bits/second with WaveRNN.
中文:SparSamp是一种高效且可证明安全的隐写方法,通过稀疏采样嵌入信息,在保持生成模型原始分布的同时,以最小复杂度实现高速嵌入。
English: SparSamp is an efficient and provably secure steganography method that embeds messages through sparse sampling, maintaining the original generative model's distribution while achieving high embedding speeds with minimal complexity.

Authors:Vidya Srinivas, Xuhai Xu, Xin Liu, Kumar Ayush, Isaac Galatzer-Levy, Shwetak Patel, Daniel McDuff, Tim Althoff
Title: Substance over Style: Evaluating Proactive Conversational Coaching Agents
Abstract:
While NLP research has made strides in conversational tasks, many approaches focus on single-turn responses with well-defined objectives or evaluation criteria. In contrast, coaching presents unique challenges with initially undefined goals that evolve through multi-turn interactions, subjective evaluation criteria, mixed-initiative dialogue. In this work, we describe and implement five multi-turn coaching agents that exhibit distinct conversational styles, and evaluate them through a user study, collecting first-person feedback on 155 conversations. We find that users highly value core functionality, and that stylistic components in absence of core components are viewed negatively. By comparing user feedback with third-person evaluations from health experts and an LM, we reveal significant misalignment across evaluation approaches. Our findings provide insights into design and evaluation of conversational coaching agents and contribute toward improving human-centered NLP applications.
中文摘要:本研究探讨了对话式辅导代理的挑战,强调用户更看重核心功能而非风格,并揭示了用户反馈与专家评估之间的显著差异。
English Summary: This study explores the challenges of conversational coaching agents, highlighting user preference for core functionality over style and revealing misalignment between user feedback and expert evaluations.

Authors:Ruoxi Cheng, Haoxuan Ma, Weixin Wang, Ranjie Duan, Jiexi Liu, Xiaoshuang Jia, Simeng Qin, Xiaochun Cao, Yang Liu, Xiaojun Jia
Title: Inverse Reinforcement Learning with Dynamic Reward Scaling for LLM Alignment
Abstract:
Alignment is vital for safely deploying large language models (LLMs). Existing techniques are either reward-based (train a reward model on preference pairs and optimize with reinforcement learning) or reward-free (directly fine-tune on ranked outputs). Recent research shows that well-tuned reward-based pipelines remain robust, and single-response demonstrations can outperform pairwise preference data. However, two challenges persist: (1) imbalanced safety datasets that overrepresent common hazards while neglecting long-tail threats; and (2) static reward models that ignore task difficulty, limiting optimization efficiency and attainable gains. We propose DR-IRL (Dynamically adjusting Rewards through Inverse Reinforcement Learning). We first train category-specific reward models using a balanced safety dataset covering seven harmful categories via IRL. Then we enhance Group Relative Policy Optimization (GRPO) by introducing dynamic reward scaling--adjusting rewards by task difficulty--data-level hardness by text encoder cosine similarity, model-level responsiveness by reward gaps. Extensive experiments across various benchmarks and LLMs demonstrate that DR-IRL outperforms all baseline methods in safety alignment while maintaining usefulness.
Alignment is crucial for safely deploying LLMs, and the proposed DR-IRL method addresses dataset imbalance and static rewards by training category-specific reward models with balanced safety data and dynamically scaling rewards based on task difficulty, outperforming all baselines in safety while preserving utility.
English Summary:

Authors:Xu Zheng, Ziqiao Weng, Yuanhuiyi Lyu, Lutao Jiang, Haiwei Xue, Bin Ren, Danda Paudel, Nicu Sebe, Luc Van Gool, Xuming Hu
Title: Retrieval Augmented Generation and Understanding in Vision: A Survey and New Outlook
Abstract:
Retrieval-augmented generation (RAG) has emerged as a pivotal technique in artificial intelligence (AI), particularly in enhancing the capabilities of large language models (LLMs) by enabling access to external, reliable, and up-to-date knowledge sources. In the context of AI-Generated Content (AIGC), RAG has proven invaluable by augmenting model outputs with supplementary, relevant information, thus improving their quality. Recently, the potential of RAG has extended beyond natural language processing, with emerging methods integrating retrieval-augmented strategies into the computer vision (CV) domain. These approaches aim to address the limitations of relying solely on internal model knowledge by incorporating authoritative external knowledge bases, thereby improving both the understanding and generation capabilities of vision models. This survey provides a comprehensive review of the current state of retrieval-augmented techniques in CV, focusing on two main areas: (I) visual understanding and (II) visual generation. In the realm of visual understanding, we systematically review tasks ranging from basic image recognition to complex applications such as medical report generation and multimodal question answering. For visual content generation, we examine the application of RAG in tasks related to image, video, and 3D generation. Furthermore, we explore recent advancements in RAG for embodied AI, with a particular focus on applications in planning, task execution, multimodal perception, interaction, and specialized domains. Given that the integration of retrieval-augmented techniques in CV is still in its early stages, we also highlight the key limitations of current approaches and propose future research directions to drive the development of this promising area.
中文: 检索增强生成(RAG)通过整合外部知识提升AI模型能力,已从自然语言处理扩展到计算机视觉领域,以增强视觉任务的理解与生成,同时指出了当前局限性和未来研究方向。
English: Retrieval-augmented generation (RAG) enhances AI models by integrating external knowledge, extending from natural language processing to computer vision for improved understanding and generation in visual tasks, while current limitations and future directions are outlined.

Authors:Yawei Li, Bin Ren, Jingyun Liang, Rakesh Ranjan, Mengyuan Liu, Nicu Sebe, Ming-Hsuan Yang, Luca Benini
Title: Fractal-IR: A Unified Framework for Efficient and Scalable Image Restoration
Abstract:
While vision transformers achieve significant breakthroughs in various image restoration (IR) tasks, it is still challenging to efficiently scale them across multiple types of degradations and resolutions. In this paper, we propose Fractal-IR, a fractal-based design that progressively refines degraded images by repeatedly expanding local information into broader regions. This fractal architecture naturally captures local details at early stages and seamlessly transitions toward global context in deeper fractal stages, removing the need for computationally heavy long-range self-attention mechanisms. Moveover, we observe the challenge in scaling up vision transformers for IR tasks. Through a series of analyses, we identify a holistic set of strategies to effectively guide model scaling. Extensive experimental results show that Fractal-IR achieves state-of-the-art performance in seven common image restoration tasks, including super-resolution, denoising, JPEG artifact removal, IR in adverse weather conditions, motion deblurring, defocus deblurring, and demosaicking. For $2\times$ SR on Manga109, Fractal-IR achieves a 0.21 dB PSNR gain. For grayscale image denoising on Urban100, Fractal-IR surpasses the previous method by 0.2 dB for $σ=50$.
中文摘要:Fractal-IR采用分形架构,通过将局部信息逐步扩展至更广区域来优化退化图像,无需复杂的长程自注意力机制,便在七种图像复原任务中实现了顶尖性能。
English Summary: Fractal-IR introduces a fractal-based architecture that progressively refines degraded images by expanding local details into broader contexts, achieving state-of-the-art performance across seven image restoration tasks without heavy computational mechanisms.

Authors:Kechen Meng, Sinuo Zhang, Rongpeng Li, Chan Wang, Ming Lei, Zhifeng Zhao
Title: Conditional Diffusion Model with OOD Mitigation as High-Dimensional Offline Resource Allocation Planner in Clustered Ad Hoc Networks
Abstract:
Due to network delays and scalability limitations, clustered ad hoc networks widely adopt Reinforcement Learning (RL) for on-demand resource allocation. Albeit its demonstrated agility, traditional Model-Free RL (MFRL) solutions struggle to tackle the huge action space, which generally explodes exponentially along with the number of resource allocation units, enduring low sampling efficiency and high interaction cost. In contrast to MFRL, Model-Based RL (MBRL) offers an alternative solution to boost sample efficiency and stabilize the training by explicitly leveraging a learned environment model. However, establishing an accurate dynamic model for complex and noisy environments necessitates a careful balance between model accuracy and computational complexity $\&$ stability. To address these issues, we propose a Conditional Diffusion Model Planner (CDMP) for high-dimensional offline resource allocation in clustered ad hoc networks. By leveraging the astonishing generative capability of Diffusion Models (DMs), our approach enables the accurate modeling of high-quality environmental dynamics while leveraging an inverse dynamics model to plan a superior policy. Beyond simply adopting DMs in offline RL, we further incorporate the CDMP algorithm with a theoretically guaranteed, uncertainty-aware penalty metric, which theoretically and empirically manifests itself in mitigating the Out-of-Distribution (OOD)-induced distribution shift issue underlying scarce training data. Extensive experiments also show that our model outperforms MFRL in average reward and Quality of Service (QoS) while demonstrating comparable performance to other MBRL algorithms.
Chinese: 针对集群自组织网络中传统强化学习方法面临的高维动作空间和低效问题,本文提出了一种条件扩散模型规划器,通过精确建模环境动态并结合理论保证的不确定性惩罚,有效提升了策略规划性能并缓解了数据分布偏移。
English: To address the challenges of large action spaces and low efficiency in traditional reinforcement learning for clustered ad hoc networks, this paper introduces a Conditional Diffusion Model Planner that enhances environmental modeling and policy planning while mitigating distribution shift issues with theoretical guarantees.

Authors:Hassan S. Al Khatib, Sudip Mittal, Shahram Rahimi, Nina Marhamati, Sean Bozorgzad
Title: From Patient Consultations to Graphs: Leveraging LLMs for Patient Journey Knowledge Graph Construction
Abstract:
The transition towards patient-centric healthcare necessitates a comprehensive understanding of patient journeys, which encompass all healthcare experiences and interactions across the care spectrum. Existing healthcare data systems are often fragmented and lack a holistic representation of patient trajectories, creating challenges for coordinated care and personalized interventions. Patient Journey Knowledge Graphs (PJKGs) represent a novel approach to addressing the challenge of fragmented healthcare data by integrating diverse patient information into a unified, structured representation. This paper presents a methodology for constructing PJKGs using Large Language Models (LLMs) to process and structure both formal clinical documentation and unstructured patient-provider conversations. These graphs encapsulate temporal and causal relationships among clinical encounters, diagnoses, treatments, and outcomes, enabling advanced temporal reasoning and personalized care insights. The research evaluates four different LLMs, such as Claude 3.5, Mistral, Llama 3.1, and Chatgpt4o, in their ability to generate accurate and computationally efficient knowledge graphs. Results demonstrate that while all models achieved perfect structural compliance, they exhibited variations in medical entity processing and computational efficiency. The paper concludes by identifying key challenges and future research directions. This work contributes to advancing patient-centric healthcare through the development of comprehensive, actionable knowledge graphs that support improved care coordination and outcome prediction.
中文: 本文提出利用大语言模型构建患者旅程知识图谱的方法,将碎片化医疗数据整合为结构化表示以改善护理协调和个性化洞察,评估显示不同模型在实体处理和计算效率方面存在差异。
English: This paper introduces a methodology using Large Language Models to construct Patient Journey Knowledge Graphs, which integrate fragmented healthcare data into structured representations for improved care coordination and personalized insights, with evaluations showing variations in entity processing and efficiency among different models.

Authors:Jasper Stone, Raj Patel, Farbod Ghiasi, Sudip Mittal, Shahram Rahimi
Title: Navigating MLOps: Insights into Maturity, Lifecycle, Tools, and Careers
Abstract:
The adoption of Machine Learning Operations (MLOps) enables automation and reliable model deployments across industries. However, differing MLOps lifecycle frameworks and maturity models proposed by industry, academia, and organizations have led to confusion regarding standard adoption practices. This paper introduces a unified MLOps lifecycle framework, further incorporating Large Language Model Operations (LLMOps), to address this gap. Additionally, we outlines key roles, tools, and costs associated with MLOps adoption at various maturity levels. By providing a standardized framework, we aim to help organizations clearly define and allocate the resources needed to implement MLOps effectively.
中文: 本文提出一个融合LLMOps的统一MLOps生命周期框架,以解决现有框架的混乱问题,并为不同成熟度组织提供角色、工具和成本方面的实施指导。
English: This paper presents a unified MLOps lifecycle framework that integrates LLMOps to resolve inconsistencies in existing frameworks and provides guidance on roles, tools, and costs for effective implementation across maturity levels.

Authors:Hangtao Zhang, Yichen Wang, Shihui Yan, Chenyu Zhu, Ziqi Zhou, Linshan Hou, Shengshan Hu, Minghui Li, Yanjun Zhang, Leo Yu Zhang
Title: Test-Time Backdoor Detection for Object Detection Models
Abstract:
Object detection models are vulnerable to backdoor attacks, where attackers poison a small subset of training samples by embedding a predefined trigger to manipulate prediction. Detecting poisoned samples (i.e., those containing triggers) at test time can prevent backdoor activation. However, unlike image classification tasks, the unique characteristics of object detection -- particularly its output of numerous objects -- pose fresh challenges for backdoor detection. The complex attack effects (e.g., "ghost" object emergence or "vanishing" object) further render current defenses fundamentally inadequate. To this end, we design TRAnsformation Consistency Evaluation (TRACE), a brand-new method for detecting poisoned samples at test time in object detection. Our journey begins with two intriguing observations: (1) poisoned samples exhibit significantly more consistent detection results than clean ones across varied backgrounds. (2) clean samples show higher detection consistency when introduced to different focal information. Based on these phenomena, TRACE applies foreground and background transformations to each test sample, then assesses transformation consistency by calculating the variance in objects confidences. TRACE achieves black-box, universal backdoor detection, with extensive experiments showing a 30% improvement in AUROC over state-of-the-art defenses and resistance to adaptive attacks.
中文摘要:目标检测模型易受通过投毒训练样本实施的后门攻击,而TRACE方法通过评估变换背景和焦点信息时的检测一致性,能有效识别这些威胁,显著优于现有防御方案。
English Summary: Object detection models are susceptible to backdoor attacks through poisoned training samples, and the proposed TRACE method effectively detects these threats by evaluating detection consistency across transformed backgrounds and focal information, significantly outperforming existing defenses.

Authors:Yechao Zhang, Yingzhe Xu, Junyu Shi, Leo Yu Zhang, Shengshan Hu, Minghui Li, Yanjun Zhang
Title: Improving Generalization of Universal Adversarial Perturbation via Dynamic Maximin Optimization
Abstract:
Deep neural networks (DNNs) are susceptible to universal adversarial perturbations (UAPs). These perturbations are meticulously designed to fool the target model universally across all sample classes. Unlike instance-specific adversarial examples (AEs), generating UAPs is more complex because they must be generalized across a wide range of data samples and models. Our research reveals that existing universal attack methods, which optimize UAPs using DNNs with static model parameter snapshots, do not fully leverage the potential of DNNs to generate more effective UAPs. Rather than optimizing UAPs against static DNN models with a fixed training set, we suggest using dynamic model-data pairs to generate UAPs. In particular, we introduce a dynamic maximin optimization strategy, aiming to optimize the UAP across a variety of optimal model-data pairs. We term this approach DM-UAP. DM-UAP utilizes an iterative max-min-min optimization framework that refines the model-data pairs, coupled with a curriculum UAP learning algorithm to examine the combined space of model parameters and data thoroughly. Comprehensive experiments on the ImageNet dataset demonstrate that the proposed DM-UAP markedly enhances both cross-sample universality and cross-model transferability of UAPs. Using only 500 samples for UAP generation, DM-UAP outperforms the state-of-the-art approach with an average increase in fooling ratio of 12.108%.
中文: 深度神经网络易受通用对抗扰动影响,而提出的DM-UAP方法利用动态模型-数据对显著提升了这些攻击的通用性和迁移性。
English: Deep neural networks are vulnerable to universal adversarial perturbations, and the proposed DM-UAP method uses dynamic model-data pairs to significantly enhance both universality and transferability of these attacks.

Authors:Tuomas Jalonen, Mohammad Al-Sa'd, Serkan Kiranyaz, Moncef Gabbouj
Title: Semi-Supervised Co-Training of Time and Time-Frequency Models: Application to Bearing Fault Diagnosis
Abstract:
Neural networks require massive amounts of annotated data to train intelligent solutions. Acquiring many labeled data in industrial applications is often difficult; therefore, semi-supervised approaches are preferred. We propose a new semi-supervised co-training method, which combines time and time-frequency (TF) machine learning models to improve performance and reliability. The developed framework collaboratively co-trains fast time-domain models by utilizing high-performing TF techniques without increasing the inference complexity. Besides, it operates in cloud-edge networks and offers holistic support for many applications covering edge-real-time monitoring and cloud-based updates and corrections. Experimental results on bearing fault diagnosis verify the superiority of our technique compared to a competing self-training method. The results from two case studies show that our method outperforms self-training for different noise levels and amounts of available data with accuracy gains reaching from 10.6% to 33.9%. They demonstrate that fusing time-domain and TF-based models offers opportunities for developing high-performance industrial solutions.
中文摘要:提出的半监督协同训练方法融合时域和时频模型,在不增加推理复杂度的前提下,通过轴承故障诊断实验验证了该方法在不同数据条件下的优越性,准确率提升达10.6%至33.9%。
English Summary: The proposed semi-supervised co-training method combines time and time-frequency models to enhance industrial diagnostics, demonstrating superior accuracy over self-training across various data conditions without increasing inference complexity.

Authors:Yudong Liu, Jingwei Sun, Yueqian Lin, Jingyang Zhang, Ming Yin, Qinsi Wang, Jianyi Zhang, Hai Li, Yiran Chen
Title: Keyframe-oriented Vision Token Pruning: Enhancing Efficiency of Large Vision Language Models on Long-Form Video Processing
Abstract:
Vision language models (VLMs) demonstrate strong capabilities in jointly processing visual and textual data. However, they often incur substantial computational overhead due to redundant visual information, particularly in long-form video scenarios. Existing approaches predominantly focus on either vision token pruning, which may overlook spatio-temporal dependencies, or keyframe selection, which identifies informative frames but discards others, thus disrupting contextual continuity. In this work, we propose KVTP (Keyframe-oriented Vision Token Pruning), a novel framework that overcomes the drawbacks of token pruning and keyframe selection. By adaptively assigning pruning rates based on frame relevance to the query, KVTP effectively retains essential contextual information while significantly reducing redundant computation. To thoroughly evaluate the long-form video understanding capacities of VLMs, we curated and reorganized subsets from VideoMME, EgoSchema, and NextQA into a unified benchmark named SparseKV-QA that highlights real-world scenarios with sparse but crucial events. Our experiments with VLMs of various scales show that KVTP can reduce token usage by 80% without compromising spatiotemporal and contextual consistency, significantly cutting computation while maintaining the performance. These results demonstrate our approach's effectiveness in efficient long-video processing, facilitating more scalable VLM deployment.
中文: KVTP是一种新颖框架,通过结合关键帧选择与自适应令牌剪枝,在保持长视频理解所需的关键上下文和时空信息的同时,显著降低了视觉语言模型的计算开销。
English: KVTP is a novel framework that combines keyframe selection with adaptive token pruning to significantly reduce computational overhead in vision language models while preserving essential contextual and spatio-temporal information for long-form video understanding.

Authors:Yunli Wang, Zhen Zhang, Zhiqiang Wang, Zixuan Yang, Yu Li, Jian Yang, Shiyang Wen, Peng Jiang, Kun Gai
Title: Learning Cascade Ranking as One Network
Abstract:
Cascade Ranking is a prevalent architecture in large-scale top-k selection systems like recommendation and advertising platforms. Traditional training methods focus on single-stage optimization, neglecting interactions between stages. Recent advances have introduced interaction-aware training paradigms, but still struggle to 1) align training objectives with the goal of the entire cascade ranking (i.e., end-to-end recall of ground-truth items) and 2) learn effective collaboration patterns for different stages. To address these challenges, we propose LCRON, which introduces a novel surrogate loss function derived from the lower bound probability that ground truth items are selected by cascade ranking, ensuring alignment with the overall objective of the system. According to the properties of the derived bound, we further design an auxiliary loss for each stage to drive the reduction of this bound, leading to a more robust and effective top-k selection. LCRON enables end-to-end training of the entire cascade ranking system as a unified network. Experimental results demonstrate that LCRON achieves significant improvement over existing methods on public benchmarks and industrial applications, addressing key limitations in cascade ranking training and significantly enhancing system performance.
中文: LCRON通过引入基于级联排序选中真实项概率下界的新代理损失函数和辅助损失,使训练与系统整体目标对齐,实现了级联系统的端到端统一训练,在公开基准和工业应用中显著提升了性能。
English: LCRON introduces a novel surrogate loss function and auxiliary losses to align training with the end-to-end recall objective of cascade ranking systems, enabling unified network training and significantly improving performance on benchmarks and industrial applications.

Authors:Mingkang Zhu, Xi Chen, Zhongdao Wang, Bei Yu, Hengshuang Zhao, Jiaya Jia
Title: Modular Customization of Diffusion Models via Blockwise-Parameterized Low-Rank Adaptation
Abstract:
Recent diffusion model customization has shown impressive results in incorporating subject or style concepts with a handful of images. However, the modular composition of multiple concepts into a customized model, aimed to efficiently merge decentralized-trained concepts without influencing their identities, remains unresolved. Modular customization is essential for applications like concept stylization and multi-concept customization using concepts trained by different users. Existing post-training methods are only confined to a fixed set of concepts, and any different combinations require a new round of retraining. In contrast, instant merging methods often cause identity loss and interference of individual merged concepts and are usually limited to a small number of concepts. To address these issues, we propose BlockLoRA, an instant merging method designed to efficiently combine multiple concepts while accurately preserving individual concepts' identity. With a careful analysis of the underlying reason for interference, we develop the Randomized Output Erasure technique to minimize the interference of different customized models. Additionally, Blockwise LoRA Parameterization is proposed to reduce the identity loss during instant model merging. Extensive experiments validate the effectiveness of BlockLoRA, which can instantly merge 15 concepts of people, subjects, scenes, and styles with high fidelity.
中文: BlockLoRA是一种即时融合方法,能高效合并多个概念并保持各自特性,通过随机输出擦除和分块LoRA参数化技术解决干扰和身份损失问题。
English: BlockLoRA is an instant merging method that efficiently combines multiple concepts while preserving their individual identities, addressing interference and identity loss through Randomized Output Erasure and Blockwise LoRA Parameterization.

Authors:Renxuan Tan, Rongpeng Li, Zhifeng Zhao
Title: LLM4MAC: An LLM-Driven Reinforcement Learning Framework for MAC Protocol Emergence
Abstract:
With the advent of 6G systems, emerging hyper-connected ecosystems necessitate agile and adaptive medium access control (MAC) protocols to contend with network dynamics and diverse service requirements. We propose LLM4MAC, a novel framework that harnesses large language models (LLMs) within a reinforcement learning paradigm to drive MAC protocol emergence. By reformulating uplink data transmission scheduling as a semantics-generalized partially observable Markov game (POMG), LLM4MAC encodes network operations in natural language, while proximal policy optimization (PPO) ensures continuous alignment with the evolving network dynamics. A structured identity embedding (SIE) mechanism further enables robust coordination among heterogeneous agents. Extensive simulations demonstrate that on top of a compact LLM, which is purposefully selected to balance performance with resource efficiency, the protocol emerging from LLM4MAC outperforms comparative baselines in throughput and generalization.
中文摘要:LLM4MAC是一种创新框架,通过将大型语言模型与强化学习相结合,为6G网络开发自适应媒体接入控制协议,仿真结果显示其在吞吐量和泛化能力方面优于现有基准方案。
English Summary: LLM4MAC is a novel framework that integrates large language models with reinforcement learning to develop adaptive medium access control protocols for 6G networks, demonstrating superior throughput and generalization in simulations.

Authors:Shamsuddeen Hassan Muhammad, Nedjma Ousidhoum, Idris Abdulmumin, Seid Muhie Yimam, Jan Philip Wahle, Terry Ruas, Meriem Beloucif, Christine De Kock, Tadesse Destaw Belay, Ibrahim Said Ahmad, Nirmal Surange, Daniela Teodorescu, David Ifeoluwa Adelani, Alham Fikri Aji, Felermino Ali, Vladimir Araujo, Abinew Ali Ayele, Oana Ignat, Alexander Panchenko, Yi Zhou, Saif M. Mohammad
Title: SemEval-2025 Task 11: Bridging the Gap in Text-Based Emotion Detection
Abstract:
We present our shared task on text-based emotion detection, covering more than 30 languages from seven distinct language families. These languages are predominantly low-resource and are spoken across various continents. The data instances are multi-labeled with six emotional classes, with additional datasets in 11 languages annotated for emotion intensity. Participants were asked to predict labels in three tracks: (a) multilabel emotion detection, (b) emotion intensity score detection, and (c) cross-lingual emotion detection. The task attracted over 700 participants. We received final submissions from more than 200 teams and 93 system description papers. We report baseline results, along with findings on the best-performing systems, the most common approaches, and the most effective methods across different tracks and languages. The datasets for this task are publicly available. The dataset is available at SemEval2025 Task 11 https://brighter-dataset.github.io
中文: 这项基于文本的情感检测共享任务涵盖了七大语系的30多种低资源语言,采用多标签情感分类并吸引了700多名参与者提交200多份最终方案,相关数据集已公开。
English: This shared task on text-based emotion detection involved over 30 predominantly low-resource languages across seven language families, featuring multi-labeled emotional classes and attracting more than 700 participants with over 200 final submissions, with datasets now publicly available.

Authors:Yanling Wang, Yihan Zhao, Xiaodong Chen, Shasha Guo, Lixin Liu, Haoyang Li, Yong Xiao, Jing Zhang, Qi Li, Ke Xu
Title: VisualSimpleQA: A Benchmark for Decoupled Evaluation of Large Vision-Language Models in Fact-Seeking Question Answering
Abstract:
Large vision-language models (LVLMs) have demonstrated remarkable achievements, yet the generation of non-factual responses remains prevalent in fact-seeking question answering (QA). Current multimodal fact-seeking benchmarks primarily focus on comparing model outputs to ground truth answers, providing limited insights into the performance of modality-specific modules. To bridge this gap, we introduce VisualSimpleQA, a multimodal fact-seeking benchmark with two key features. First, it enables streamlined and decoupled evaluation of LVLMs in visual and linguistic modalities. Second, it incorporates well-defined difficulty criteria to guide human annotation and facilitates the extraction of a challenging subset, VisualSimpleQA-hard. Experiments on 15 LVLMs show that even state-of-the-art models such as GPT-4o achieve merely 60%+ correctness in multimodal fact-seeking QA on VisualSimpleQA and 30%+ on VisualSimpleQA-hard. Furthermore, the decoupled evaluation across these models highlights substantial opportunities for improvement in both visual and linguistic modules. The dataset is available at https://huggingface.co/datasets/WYLing/VisualSimpleQA.
中文: 大型视觉语言模型在事实性问答中仍频繁生成非真实答案,为此我们开发了VisualSimpleQA基准,支持对视觉与语言模块的分离评估,并发现即使顶尖模型也存在明显性能不足。
English: Large vision-language models still struggle with generating non-factual responses in fact-seeking QA, prompting the creation of VisualSimpleQA, a benchmark that enables decoupled evaluation of visual and linguistic modules and reveals significant performance gaps even in top models.

Authors:Di Kevin Gao, Sudip Mittal, Jiming Wu, Hongwei Du, Jingdao Chen, Shahram Rahimi
Title: The AI Pentad, the CHARME$^{2}$D Model, and an Assessment of Current-State AI Regulation
Abstract:
Artificial Intelligence (AI) has made remarkable progress in the past few years with AI-enabled applications beginning to permeate every aspect of our society. Despite the widespread consensus on the need to regulate AI, there remains a lack of a unified approach to framing, developing, and assessing AI regulations. Many of the existing methods take a value-based approach, for example, accountability, fairness, free from bias, transparency, and trust. However, these methods often face challenges at the outset due to disagreements in academia over the subjective nature of these definitions. This paper aims to establish a unifying model for AI regulation from the perspective of core AI components. We first introduce the AI Pentad, which comprises the five essential components of AI: humans and organizations, algorithms, data, computing, and energy. We then review AI regulatory enablers, including AI registration and disclosure, AI monitoring, and AI enforcement mechanisms. Subsequently, we present the CHARME$^{2}$D Model to explore further the relationship between the AI Pentad and AI regulatory enablers. Finally, we apply the CHARME$^{2}$D model to assess AI regulatory efforts in the European Union (EU), China, the United Arab Emirates (UAE), the United Kingdom (UK), and the United States (US), highlighting their strengths, weaknesses, and gaps. This comparative evaluation offers insights for future legislative work in the AI domain.
中文: 本文基于人工智能五要素和CHARME²D模型提出统一监管框架,通过多国案例评估揭示现有AI法规的优势与不足,为未来立法提供参考。
English: This paper proposes a unified AI regulation model based on the AI Pentad components and CHARME²D framework, evaluating regulatory approaches across multiple nations to identify strengths and gaps for future legislation.

Authors:Kejia Chen, Jiawen Zhang, Jiacong Hu, Jiazhen Yang, Jian Lou, Zunlei Feng, Mingli Song
Title: SHAPE : Self-Improved Visual Preference Alignment by Iteratively Generating Holistic Winner
Abstract:
Large Visual Language Models (LVLMs) increasingly rely on preference alignment to ensure reliability, which steers the model behavior via preference fine-tuning on preference data structured as ``image - winner text - loser text'' triplets. However, existing approaches often suffer from limited diversity and high costs associated with human-annotated preference data, hindering LVLMs from fully achieving their intended alignment capabilities. We present \projectname, a self-supervised framework capable of transforming the already abundant supervised text-image pairs into holistic preference triplets for more effective and cheaper LVLM alignment, eliminating the need for human preference annotations. Our approach facilitates LVLMs in progressively enhancing alignment capabilities through iterative self-improvement. The key design rationale is to devise preference triplets where the winner text consistently improves in holisticness and outperforms the loser response in quality, thereby pushing the model to ``strive to the utmost'' of alignment performance through preference fine-tuning. For each given text-image pair, SHAPE introduces multiple visual augmentations and pairs them with a summarized text to serve as the winner response, while designating the original text as the loser response. Experiments across \textbf{12} benchmarks on various model architectures and sizes, including LLaVA and DeepSeek-VL, show that SHAPE achieves significant gains, for example, achieving +11.3\% on MMVet (comprehensive evaluation), +1.4\% on MMBench (general VQA), and +8.0\% on POPE (hallucination robustness) over baselines in 7B models. Notably, qualitative analyses confirm enhanced attention to visual details and better alignment with human preferences for holistic descriptions.
中文: SHAPE是一种自监督框架,可将监督式文本-图像对转化为偏好三元组,无需人工标注即可提升大型视觉语言模型的对齐能力,并在多个基准测试中实现显著性能提升。
English: SHAPE is a self-supervised framework that converts supervised text-image pairs into preference triplets to enhance Large Visual Language Models' alignment capabilities without human annotations, achieving significant performance gains across multiple benchmarks.

Authors:Shun Liao, Paolo Di Achille, Jiang Wu, Silviu Borac, Jonathan Wang, Xin Liu, Eric Teasley, Lawrence Cai, Yuzhe Yang, Yun Liu, Daniel McDuff, Hao-Wei Su, Brent Winslow, Anupam Pathak, Shwetak Patel, James A. Taylor, Jameson K. Rogers, Ming-Zher Poh
Title: Passive Heart Rate Monitoring During Smartphone Use in Everyday Life
Abstract:
Resting heart rate (RHR) is an important biomarker of cardiovascular health and mortality, but tracking it longitudinally generally requires a wearable device, limiting its availability. We present PHRM, a deep learning system for passive heart rate (HR) and RHR measurements during everyday smartphone use, using facial video-based photoplethysmography. Our system was developed using 225,773 videos from 495 participants and validated on 185,970 videos from 205 participants in laboratory and free-living conditions, representing the largest validation study of its kind. Compared to reference electrocardiogram, PHRM achieved a mean absolute percentage error (MAPE) < 10% for HR measurements across three skin tone groups of light, medium and dark pigmentation; MAPE for each skin tone group was non-inferior versus the others. Daily RHR measured by PHRM had a mean absolute error < 5 bpm compared to a wearable HR tracker, and was associated with known risk factors. These results highlight the potential of smartphones to enable passive and equitable heart health monitoring.
中文: PHRM深度学习系统通过智能手机面部视频实现无接触心率与静息心率监测,在不同肤色人群中均保持高精度,展现了智能手机在心血管健康平等监测领域的应用潜力。
English: PHRM is a deep learning system that uses facial videos from smartphones to passively measure heart rate and resting heart rate, achieving high accuracy across diverse skin tones and demonstrating potential for equitable cardiovascular health monitoring.

Authors:Mingkang Zhu, Xi Chen, Zhongdao Wang, Bei Yu, Hengshuang Zhao, Jiaya Jia
Title: Enhancing LLM Knowledge Learning through Generalization
Abstract:
As Large language models (LLMs) are increasingly deployed in diverse applications, faithfully integrating evolving factual knowledge into these models remains a critical challenge. Continued pre-training on paraphrased data has shown empirical promise for enhancing knowledge acquisition. However, this approach is often costly and unreliable, as it relies on external models or manual effort for rewriting, and may inadvertently alter the factual content. In this work, we hypothesize and empirically show that an LLM's ability to continually predict the same factual knowledge tokens given diverse paraphrased contexts is positively correlated with its capacity to extract that knowledge via question-answering. Based on this view and aiming to improve generalization to diverse paraphrased contexts, we introduce two strategies to enhance LLMs' ability to predict the same knowledge tokens given varied contexts, thereby enhancing knowledge acquisition. First, we propose formatting-based data augmentation, which diversifies documents conveying the same knowledge by altering document formats rather than their content, thereby preserving factual integrity. Second, we adopt sharpness-aware minimization as the optimizer to better improve generalization. Extensive experiments demonstrate our methods' effectiveness in both continued pre-training and instruction tuning, and further gains can be achieved by combining with paraphrased data.
中文摘要:本研究提出基于格式的数据增强和锐度感知最小化两种策略,通过增强大语言模型在不同语境下对事实知识的一致性预测能力,在保持事实完整性的同时有效提升知识获取效果。
English Summary: This study introduces two strategies—formatting-based data augmentation and sharpness-aware minimization—to enhance large language models' ability to consistently predict factual knowledge across diverse contexts, improving knowledge acquisition while preserving factual integrity.

Authors:Alicia Russell-Gilbert, Sudip Mittal, Shahram Rahimi, Maria Seale, Joseph Jabour, Thomas Arnold, Joshua Church
Title: RAAD-LLM: Adaptive Anomaly Detection Using LLMs and RAG Integration
Abstract:
Anomaly detection in complex industrial environments poses unique challenges, particularly in contexts characterized by data sparsity and evolving operational conditions. Predictive maintenance (PdM) in such settings demands methodologies that are adaptive, transferable, and capable of integrating domain-specific knowledge. In this paper, we present RAAD-LLM, a novel framework for adaptive anomaly detection, leveraging large language models (LLMs) integrated with Retrieval-Augmented Generation (RAG). This approach addresses the aforementioned PdM challenges. By effectively utilizing domain-specific knowledge, RAAD-LLM enhances the detection of anomalies in time series data without requiring fine-tuning on specific datasets. The framework's adaptability mechanism enables it to adjust its understanding of normal operating conditions dynamically, thus increasing detection accuracy. We validate this methodology through a real-world application for a plastics manufacturing plant and the Skoltech Anomaly Benchmark (SKAB). Results show significant improvements over our previous model with an accuracy increase from 70.7% to 88.6% on the real-world dataset. By allowing for the enriching of input series data with semantics, RAAD-LLM incorporates multimodal capabilities that facilitate more collaborative decision-making between the model and plant operators. Overall, our findings support RAAD-LLM's ability to revolutionize anomaly detection methodologies in PdM, potentially leading to a paradigm shift in how anomaly detection is implemented across various industries.
中文: RAAD-LLM是一种自适应异常检测框架,通过结合大型语言模型与检索增强生成技术,能动态提升预测性维护中的检测精度,实际应用验证了其显著性能提升。
English: RAAD-LLM is an adaptive anomaly detection framework that integrates large language models with retrieval-augmented generation to dynamically improve detection accuracy in predictive maintenance, validated by significant performance gains in real-world applications.

Authors:Jiankai Tang, Xin Liu, Daniel McDuff, Zhang Jiang, Hongming Hu, Luxi Zhou, Nodoka Nagao, Haruta Suzuki, Yuki Nagahama, Wei Li, Linhong Ji, Yuanchun Shi, Izumi Nishidate, Yuntao Wang
Title: Camera Measurement of Blood Oxygen Saturation
Abstract:
Blood oxygen saturation (SpO2) is a crucial vital sign routinely monitored in medical settings. Traditional methods require dedicated contact sensors, limiting accessibility and comfort. This study presents a deep learning framework for contactless SpO2 measurement using an off-the-shelf camera, addressing challenges related to lighting variations and skin tone diversity. We conducted two large-scale studies with diverse participants and evaluated our method against traditional signal processing approaches in intra- and inter-dataset scenarios. Our approach demonstrated consistent accuracy across demographic groups, highlighting the feasibility of camera-based SpO2 monitoring as a scalable and non-invasive tool for remote health assessment.
中文摘要:本研究提出一种利用普通摄像头实现无接触血氧饱和度监测的深度学习框架,有效解决了光照变化与肤色差异的挑战,并在跨人群测试中保持精准度。
English Summary: This study introduces a deep learning framework that enables contactless blood oxygen saturation measurement using standard cameras, overcoming lighting and skin tone challenges while maintaining accuracy across diverse populations.

Authors:Mingzhe Du, Anh Tuan Luu, Bin Ji, Xiaobao Wu, Dong Huang, Terry Yue Zhuo, Qian Liu, See-Kiong Ng
Title: CodeArena: A Collective Evaluation Platform for LLM Code Generation
Abstract:
Large Language Models (LLMs) have reshaped code generation by synergizing their exceptional comprehension of natural language and programming syntax, thereby substantially boosting developer productivity. These advancements have prompted numerous efforts to quantitatively evaluate their coding capabilities. However, persistent challenges, such as benchmark leakage, data dissipation, and limited system accessibility, continue to impede a timely and accurate assessment. To address these limitations, we introduce CodeArena, an online evaluation framework tailored for LLM code generation. The key innovation is a collective evaluation mechanism, which dynamically recalibrates individual model scores based on the holistic performance of all participating models, mitigating score biases caused by widespread benchmark leakage. In addition, CodeArena ensures open access to all submitted solutions and test cases and provides automation-friendly APIs to streamline the code evaluation workflow. Our main contributions are: (1) a collective evaluation system for unbiased assessment, (2) a public repository of solutions and test cases, and (3) automation-ready APIs for seamless integration.
Chinese: CodeArena是一个创新的在线评估框架,通过集体评估机制、公开的解决方案与测试用例库以及自动化友好API,解决了大语言模型代码生成评估中的基准泄露和系统可访问性限制等挑战。
English: CodeArena is an innovative online framework designed to address challenges like benchmark leakage and limited accessibility in evaluating Large Language Models' code generation capabilities, featuring a collective evaluation mechanism, open access to solutions and test cases, and automation-friendly APIs.

Authors:Branislav Kveton, Xintong Li, Julian McAuley, Ryan Rossi, Jingbo Shang, Junda Wu, Tong Yu
Title: Active Learning for Direct Preference Optimization
Abstract:
Direct preference optimization (DPO) is a form of reinforcement learning from human feedback (RLHF) where the policy is learned directly from preferential feedback. Although many models of human preferences exist, the critical task of selecting the most informative feedback for training them is under-explored. We propose an active learning framework for DPO, which can be applied to collect human feedback online or to choose the most informative subset of already collected feedback offline. We propose efficient algorithms for both settings. The key idea is to linearize the DPO objective at the last layer of the neural network representation of the optimized policy and then compute the D-optimal design to collect preferential feedback. We prove that the errors in our DPO logit estimates diminish with more feedback. We show the effectiveness of our algorithms empirically in the setting that matches our theory and also on large language models.
Chinese: 直接偏好优化(DPO)通过引入主动学习框架,在线性和离线场景下高效选择最具信息量的人类反馈,采用线性化目标和D-最优设计,有效提升性能并减少估计误差。
English: Direct preference optimization (DPO) is enhanced with an active learning framework that efficiently selects the most informative human feedback, both online and offline, by linearizing the objective and applying D-optimal design, leading to improved performance and reduced estimation errors.

Authors:Theodore Curran, Chengqian Ma, Xin Liu, Daniel McDuff, Girish Narayanswamy, George Stergiou, Shwetak Patel, Eugene Yang
Title: Estimating Blood Pressure with a Camera: An Exploratory Study of Ambulatory Patients with Cardiovascular Disease
Abstract:
Hypertension is a leading cause of morbidity and mortality worldwide. The ability to diagnose and treat hypertension in the ambulatory population is hindered by limited access and poor adherence to current methods of monitoring blood pressure (BP), specifically, cuff-based devices. Remote photoplethysmography (rPPG) evaluates an individual's pulse waveform through a standard camera without physical contact. Cameras are readily available to the majority of the global population via embedded technologies such as smartphones, thus rPPG is a scalable and promising non-invasive method of BP monitoring. The few studies investigating rPPG for BP measurement have excluded high-risk populations, including those with cardiovascular disease (CVD) or its risk factors, as well as subjects in active cardiac arrhythmia. The impact of arrhythmia, like atrial fibrillation, on the prediction of BP using rPPG is currently uncertain. We performed a study to better understand the relationship between rPPG and BP in a real-world sample of ambulatory patients from a cardiology clinic with established CVD or risk factors for CVD. We collected simultaneous rPPG, PPG, BP, ECG, and other vital signs data from 143 subjects while at rest, and used this data plus demographics to train a deep learning model to predict BP. We report that facial rPPG yields a signal that is comparable to finger PPG. Pulse wave analysis (PWA)-based BP estimates on this cohort performed comparably to studies on healthier subjects, and notably, the accuracy of BP prediction in subjects with atrial fibrillation was not inferior to subjects with normal sinus rhythm. In a binary classification task, the rPPG model identified subjects with systolic BP $\geq$ 130 mm Hg with a positive predictive value of 71% (baseline prevalence 48.3%), highlighting the potential of rPPG for hypertension monitoring.
中文: 远程光电容积描记法(rPPG)通过普通摄像头提供可扩展的无创血压监测方案,深度学习模型在包括房颤患者在内的高风险人群中展现出与传统方法相当的预测准确性。
English: Remote photoplethysmography (rPPG) offers a scalable, non-invasive method for blood pressure monitoring using standard cameras, with a deep learning model demonstrating comparable accuracy to traditional methods even in high-risk patients, including those with atrial fibrillation.

Authors:Jiankai Tang, Jiacheng Liu, Renling Tong, Kai Zhu, Zhe Li, Xin Yi, Junliang Xing, Yuanchun Shi, Yuntao Wang
Title: Exploring Reliable PPG Authentication on Smartwatches in Daily Scenarios
Abstract:
Photoplethysmography (PPG) Sensors, widely deployed in smartwatches, offer a simple and non-invasive authentication approach for daily use. However, PPG authentication faces reliability issues due to motion artifacts from physical activity and physiological variability over time. To address these challenges, we propose MTL-RAPID, an efficient and reliable PPG authentication model, that employs a multitask joint training strategy, simultaneously assessing signal quality and verifying user identity. The joint optimization of these two tasks in MTL-RAPID results in a structure that outperforms models trained on individual tasks separately, achieving stronger performance with fewer parameters. In our comprehensive user studies regarding motion artifacts (N = 30), time variations (N = 32), and user preferences (N = 16), MTL-RAPID achieves a best AUC of 99.2\% and an EER of 3.5\%, outperforming existing baselines. We opensource our PPG authentication dataset along with the MTL-RAPID model to facilitate future research on GitHub.
Chinese: 提出的MTL-RAPID模型通过联合优化信号质量评估与身份验证任务,在存在运动干扰和生理变化的情况下,以更少参数实现99.2%的AUC优异性能,显著提升了PPG认证的可靠性。
English: The proposed MTL-RAPID model enhances PPG authentication by jointly optimizing signal quality assessment and identity verification, achieving superior performance with 99.2% AUC and reduced parameters despite motion and physiological variations.

Authors:Kai Liu, Wei Li, Lai Chen, Shengqiong Wu, Yanhao Zheng, Jiayi Ji, Fan Zhou, Rongxin Jiang, Jiebo Luo, Hao Fei, Tat-Seng Chua
Title: JavisDiT: Joint Audio-Video Diffusion Transformer with Hierarchical Spatio-Temporal Prior Synchronization
Abstract:
This paper introduces JavisDiT, a novel Joint Audio-Video Diffusion Transformer designed for synchronized audio-video generation (JAVG). Built upon the powerful Diffusion Transformer (DiT) architecture, JavisDiT is able to generate high-quality audio and video content simultaneously from open-ended user prompts. To ensure optimal synchronization, we introduce a fine-grained spatio-temporal alignment mechanism through a Hierarchical Spatial-Temporal Synchronized Prior (HiST-Sypo) Estimator. This module extracts both global and fine-grained spatio-temporal priors, guiding the synchronization between the visual and auditory components. Furthermore, we propose a new benchmark, JavisBench, consisting of 10,140 high-quality text-captioned sounding videos spanning diverse scenes and complex real-world scenarios. Further, we specifically devise a robust metric for evaluating the synchronization between generated audio-video pairs in real-world complex content. Experimental results demonstrate that JavisDiT significantly outperforms existing methods by ensuring both high-quality generation and precise synchronization, setting a new standard for JAVG tasks. Our code, model, and dataset will be made publicly available at https://javisdit.github.io/.
中文:JavisDiT是一种新颖的联合视听扩散变换器,通过分层对齐机制生成同步的高质量音视频内容,并凭借其JavisBench数据集设立了新基准。
English: JavisDiT is a novel Joint Audio-Video Diffusion Transformer that generates synchronized high-quality audio and video content using a hierarchical alignment mechanism and sets a new benchmark with its JavisBench dataset.

Authors:Zhenyang Liu, Yikai Wang, Sixiao Zheng, Tongying Pan, Longfei Liang, Yanwei Fu, Xiangyang Xue
Title: ReasonGrounder: LVLM-Guided Hierarchical Feature Splatting for Open-Vocabulary 3D Visual Grounding and Reasoning
Abstract:
Open-vocabulary 3D visual grounding and reasoning aim to localize objects in a scene based on implicit language descriptions, even when they are occluded. This ability is crucial for tasks such as vision-language navigation and autonomous robotics. However, current methods struggle because they rely heavily on fine-tuning with 3D annotations and mask proposals, which limits their ability to handle diverse semantics and common knowledge required for effective reasoning. In this work, we propose ReasonGrounder, an LVLM-guided framework that uses hierarchical 3D feature Gaussian fields for adaptive grouping based on physical scale, enabling open-vocabulary 3D grounding and reasoning. ReasonGrounder interprets implicit instructions using large vision-language models (LVLM) and localizes occluded objects through 3D Gaussian splatting. By incorporating 2D segmentation masks from the SAM and multi-view CLIP embeddings, ReasonGrounder selects Gaussian groups based on object scale, enabling accurate localization through both explicit and implicit language understanding, even in novel, occluded views. We also contribute ReasoningGD, a new dataset containing over 10K scenes and 2 million annotations for evaluating open-vocabulary 3D grounding and amodal perception under occlusion. Experiments show that ReasonGrounder significantly improves 3D grounding accuracy in real-world scenarios.
中文摘要:ReasonGrounder框架通过分层高斯场和大型视觉语言模型解析,实现了开放词汇的3D视觉定位与推理,能基于物理尺度自适应分组并融合多视角语义,精准定位包括遮挡物体在内的目标。
English Summary: The ReasonGrounder framework enhances open-vocabulary 3D visual grounding by leveraging hierarchical Gaussian fields and LVLM interpretation to accurately localize objects, including occluded ones, through adaptive scale-based grouping and multi-view semantic integration.

Authors:Jiangyong Huang, Baoxiong Jia, Yan Wang, Ziyu Zhu, Xiongkun Linghu, Qing Li, Song-Chun Zhu, Siyuan Huang
Title: Unveiling the Mist over 3D Vision-Language Understanding: Object-centric Evaluation with Chain-of-Analysis
Abstract:
Existing 3D vision-language (3D-VL) benchmarks fall short in evaluating 3D-VL models, creating a "mist" that obscures rigorous insights into model capabilities and 3D-VL tasks. This mist persists due to three key limitations. First, flawed test data, like ambiguous referential text in the grounding task, can yield incorrect and unreliable test results. Second, oversimplified metrics such as simply averaging accuracy per question answering (QA) pair, cannot reveal true model capability due to their vulnerability to language variations. Third, existing benchmarks isolate the grounding and QA tasks, disregarding the underlying coherence that QA should be based on solid grounding capabilities. To unveil the "mist", we propose Beacon3D, a benchmark for 3D-VL grounding and QA tasks, delivering a perspective shift in the evaluation of 3D-VL understanding. Beacon3D features (i) high-quality test data with precise and natural language, (ii) object-centric evaluation with multiple tests per object to ensure robustness, and (iii) a novel chain-of-analysis paradigm to address language robustness and model performance coherence across grounding and QA. Our evaluation of state-of-the-art 3D-VL models on Beacon3D reveals that (i) object-centric evaluation elicits true model performance and particularly weak generalization in QA; (ii) grounding-QA coherence remains fragile in current 3D-VL models, and (iii) incorporating large language models (LLMs) to 3D-VL models, though as a prevalent practice, hinders grounding capabilities and has yet to elevate QA capabilities. We hope Beacon3D and our comprehensive analysis could benefit the 3D-VL community towards faithful developments.
中文: 现有3D视觉语言基准存在测试数据缺陷、评估指标简化及任务割裂问题,为此提出的Beacon3D通过高质量数据、以对象为中心的评估和链式分析新范式,能有效揭示模型真实性能并促进该领域可靠发展。
English: Current 3D vision-language benchmarks suffer from flawed data, oversimplified metrics, and task isolation, prompting the creation of Beacon3D to provide robust evaluation and reveal true model capabilities through high-quality data, object-centric testing, and a chain-of-analysis approach.

Authors:Yizhang Zhu, Runzhi Jiang, Boyan Li, Nan Tang, Yuyu Luo
Title: EllieSQL: Cost-Efficient Text-to-SQL with Complexity-Aware Routing
Abstract:
Text-to-SQL automatically translates natural language queries to SQL, allowing non-technical users to retrieve data from databases without specialized SQL knowledge. Despite the success of advanced LLM-based Text-to-SQL approaches on leaderboards, their unsustainable computational costs--often overlooked--stand as the "elephant in the room" in current leaderboard-driven research, limiting their economic practicability for real-world deployment and widespread adoption. To tackle this, we exploratively propose EllieSQL, a complexity-aware routing framework that assigns queries to suitable SQL generation pipelines based on estimated complexity. We investigate multiple routers to direct simple queries to efficient approaches while reserving computationally intensive methods for complex cases. Drawing from economics, we introduce the Token Elasticity of Performance (TEP) metric, capturing cost-efficiency by quantifying the responsiveness of performance gains relative to token investment in SQL generation. Experiments show that compared to always using the most advanced methods in our study, EllieSQL with the Qwen2.5-0.5B-DPO router reduces token use by over 40% without compromising performance on Bird development set, achieving more than a 2x boost in TEP over non-routing approaches. This not only advances the pursuit of cost-efficient Text-to-SQL but also invites the community to weigh resource efficiency alongside performance, contributing to progress in sustainable Text-to-SQL. Our source code and model are available at https://elliesql.github.io/.
Chinese: 该研究提出EllieSQL,一种基于复杂度的路由框架,通过将查询分配给合适的SQL生成流程,在不损失性能的情况下将计算成本降低超过40%,并利用新提出的性能代币弹性指标提升了成本效益。
English: The study introduces EllieSQL, a complexity-aware routing framework that optimizes Text-to-SQL systems by directing queries to appropriate SQL generation pipelines, reducing computational costs by over 40% without performance loss and enhancing cost-efficiency through a novel Token Elasticity of Performance metric.

Authors:Ruiqi Liu, Boyu Diao, Libo Huang, Hangda Liu, Chuanguang Yang, Zhulin An, Yongjun Xu
Title: Efficient Continual Learning through Frequency Decomposition and Integration
Abstract:
Continual learning (CL) aims to learn new tasks while retaining past knowledge, addressing the challenge of forgetting during task adaptation. Rehearsal-based methods, which replay previous samples, effectively mitigate forgetting. However, research on enhancing the efficiency of these methods, especially in resource-constrained environments, remains limited, hindering their application in real-world systems with dynamic data streams. The human perceptual system processes visual scenes through complementary frequency channels: low-frequency signals capture holistic cues, while high-frequency components convey structural details vital for fine-grained discrimination. Inspired by this, we propose the Frequency Decomposition and Integration Network (FDINet), a novel framework that decomposes and integrates information across frequencies. FDINet designs two lightweight networks to independently process low- and high-frequency components of images. When integrated with rehearsal-based methods, this frequency-aware design effectively enhances cross-task generalization through low-frequency information, preserves class-specific details using high-frequency information, and facilitates efficient training due to its lightweight architecture. Experiments demonstrate that FDINet reduces backbone parameters by 78%, improves accuracy by up to 7.49% over state-of-the-art (SOTA) methods, and decreases peak memory usage by up to 80%. Additionally, on edge devices, FDINet accelerates training by up to 5$\times$.
Chinese: FDINet是一种新颖的持续学习框架,通过频率分解增强跨任务泛化能力并保留类别细节,在减少模型参数的同时显著提升了精度、内存效率和训练速度。
English: FDINet is a novel continual learning framework that leverages frequency decomposition to enhance cross-task generalization and preserve class-specific details, achieving significant improvements in accuracy, memory efficiency, and training speed while reducing model parameters.

Authors:Zhihan Zhang, Xunkai Li, Zhu Lei, Guang Zeng, Ronghua Li, Guoren Wang
Title: Rethinking Graph Structure Learning in the Era of LLMs
Abstract:
Recently, the emergence of LLMs has prompted researchers to integrate language descriptions into graphs, aiming to enhance model encoding capabilities from a data-centric perspective. This graph representation is called text-attributed graphs (TAGs). A review of prior advancements highlights that graph structure learning (GSL) is a pivotal technique for improving data utility, making it highly relevant to efficient TAG learning. However, most GSL methods are tailored for traditional graphs without textual information, underscoring the necessity of developing a new GSL paradigm. Despite clear motivations, it remains challenging: (1) How can we define a reasonable optimization objective for GSL in the era of LLMs, considering the massive parameters in LLM? (2) How can we design an efficient model architecture that enables seamless integration of LLM for this optimization objective? For Question 1, we reformulate existing GSL optimization objectives as a tree optimization framework, shifting the focus from obtaining a well-trained edge predictor to a language-aware tree sampler. For Question 2, we propose decoupled and training-free model design principles for LLM integration, shifting the focus from computation-intensive fine-tuning to more efficient inference. Based on this, we propose Large Language and Tree Assistant (LLaTA), which leverages tree-based LLM in-context learning to enhance the understanding of topology and text, enabling reliable inference and generating improved graph structure. Extensive experiments on 10 datasets demonstrate that LLaTA enjoys flexibility-incorporated with any backbone; scalability-outperforms other LLM-enhanced graph learning methods; effectiveness-achieves SOTA predictive performance.
中文摘要:针对文本属性图(TAGs)的图结构学习需求,研究者提出了LLaTA模型,通过树优化框架和免训练的大语言模型集成,实现了灵活可扩展的拓扑理解与预测性能提升。
English Summary: Recent advancements in text-attributed graphs (TAGs) require new graph structure learning approaches, leading to the development of LLaTA which uses tree-based optimization and LLM integration to achieve state-of-the-art performance through flexible and scalable inference.

Authors:Changlun Li, Yao Shi, Yuyu Luo, Nan Tang
Title: Rise of the Community Champions: From Reviewer Crunch to Community Power
Abstract:
Academic publishing is facing a crisis driven by exponential growth in submissions and an overwhelmed peer review system, leading to inconsistent decisions and a severe reviewer shortage. This paper introduces Panvas, a platform that reimagines academic publishing as a continuous, community-driven process. Panvas addresses these systemic failures with a novel combination of economic incentives (paid reviews) and rich interaction mechanisms (multi-dimensional ratings, threaded discussions, and expert-led reviews). By moving beyond the traditional accept/reject paradigm and integrating paper hosting with code/data repositories and social networking, Panvas fosters a meritocratic environment for scholarly communication and presents a radical rethinking of how we evaluate and disseminate scientific knowledge. We present the system design, development roadmap, and a user study plan to evaluate its effectiveness.
Chinese: Panvas平台通过引入付费评审和互动机制,将学术出版重塑为持续、社区驱动的流程,超越传统的接受/拒绝模式,构建了基于学术贡献的知识共享体系。
English: The Panvas platform tackles the crisis in academic publishing by transforming it into a continuous, community-driven process with paid reviews and interactive features, moving beyond traditional accept/reject decisions to create a merit-based system for sharing knowledge.

Authors:Yuheng Yuan, Qiuhong Shen, Xingyi Yang, Xinchao Wang
Title: 1000+ FPS 4D Gaussian Splatting for Dynamic Scene Rendering
Abstract:
4D Gaussian Splatting (4DGS) has recently gained considerable attention as a method for reconstructing dynamic scenes. Despite achieving superior quality, 4DGS typically requires substantial storage and suffers from slow rendering speed. In this work, we delve into these issues and identify two key sources of temporal redundancy. (Q1) \textbf{Short-Lifespan Gaussians}: 4DGS uses a large portion of Gaussians with short temporal span to represent scene dynamics, leading to an excessive number of Gaussians. (Q2) \textbf{Inactive Gaussians}: When rendering, only a small subset of Gaussians contributes to each frame. Despite this, all Gaussians are processed during rasterization, resulting in redundant computation overhead. To address these redundancies, we present \textbf{4DGS-1K}, which runs at over 1000 FPS on modern GPUs. For Q1, we introduce the Spatial-Temporal Variation Score, a new pruning criterion that effectively removes short-lifespan Gaussians while encouraging 4DGS to capture scene dynamics using Gaussians with longer temporal spans. For Q2, we store a mask for active Gaussians across consecutive frames, significantly reducing redundant computations in rendering. Compared to vanilla 4DGS, our method achieves a $41\times$ reduction in storage and $9\times$ faster rasterization speed on complex dynamic scenes, while maintaining comparable visual quality. Please see our project page at https://4DGS-1K.github.io.
中文: 4DGS-1K通过剔除短寿命高斯粒子和优化活动粒子选择,解决了4D高斯溅射的存储和渲染效率问题,在保持视觉质量的同时实现了41倍存储压缩和9倍渲染加速。
English: 4DGS-1K addresses storage and rendering inefficiencies in 4D Gaussian Splatting by pruning short-lifespan Gaussians and optimizing active Gaussian selection, achieving 41× storage reduction and 9× faster rendering while maintaining visual quality.

Authors:Jiyuan Wang, Chunyu Lin, Cheng Guan, Lang Nie, Jing He, Haodong Li, Kang Liao, Yao Zhao
Title: Jasmine: Harnessing Diffusion Prior for Self-supervised Depth Estimation
Abstract:
In this paper, we propose Jasmine, the first Stable Diffusion (SD)-based self-supervised framework for monocular depth estimation, which effectively harnesses SD's visual priors to enhance the sharpness and generalization of unsupervised prediction. Previous SD-based methods are all supervised since adapting diffusion models for dense prediction requires high-precision supervision. In contrast, self-supervised reprojection suffers from inherent challenges (e.g., occlusions, texture-less regions, illumination variance), and the predictions exhibit blurs and artifacts that severely compromise SD's latent priors. To resolve this, we construct a novel surrogate task of hybrid image reconstruction. Without any additional supervision, it preserves the detail priors of SD models by reconstructing the images themselves while preventing depth estimation from degradation. Furthermore, to address the inherent misalignment between SD's scale and shift invariant estimation and self-supervised scale-invariant depth estimation, we build the Scale-Shift GRU. It not only bridges this distribution gap but also isolates the fine-grained texture of SD output against the interference of reprojection loss. Extensive experiments demonstrate that Jasmine achieves SoTA performance on the KITTI benchmark and exhibits superior zero-shot generalization across multiple datasets.
中文: 本文提出首个基于稳定扩散的自监督单目深度估计框架Jasmine,通过混合图像重建和尺度偏移GRU模块增强预测清晰度与泛化能力,在KITTI基准测试中达到最优性能并展现卓越的零样本泛化表现。
English: This paper introduces Jasmine, the first self-supervised monocular depth estimation framework based on Stable Diffusion, which enhances prediction sharpness and generalization through hybrid image reconstruction and a Scale-Shift GRU module, achieving state-of-the-art results on KITTI and superior zero-shot generalization.

Authors:Jiyuan Wang, Chunyu Lin, Cheng Guan, Lang Nie, Jing He, Haodong Li, Kang Liao, Yao Zhao
Title: Jasmine: Harnessing Diffusion Prior for Self-supervised Depth Estimation
Abstract:
In this paper, we propose Jasmine, the first Stable Diffusion (SD)-based self-supervised framework for monocular depth estimation, which effectively harnesses SD's visual priors to enhance the sharpness and generalization of unsupervised prediction. Previous SD-based methods are all supervised since adapting diffusion models for dense prediction requires high-precision supervision. In contrast, self-supervised reprojection suffers from inherent challenges (e.g., occlusions, texture-less regions, illumination variance), and the predictions exhibit blurs and artifacts that severely compromise SD's latent priors. To resolve this, we construct a novel surrogate task of hybrid image reconstruction. Without any additional supervision, it preserves the detail priors of SD models by reconstructing the images themselves while preventing depth estimation from degradation. Furthermore, to address the inherent misalignment between SD's scale and shift invariant estimation and self-supervised scale-invariant depth estimation, we build the Scale-Shift GRU. It not only bridges this distribution gap but also isolates the fine-grained texture of SD output against the interference of reprojection loss. Extensive experiments demonstrate that Jasmine achieves SoTA performance on the KITTI benchmark and exhibits superior zero-shot generalization across multiple datasets.
中文: 本文提出首个基于稳定扩散的自监督单目深度估计框架Jasmine,通过混合图像重建和尺度偏移GRU模块增强预测清晰度与泛化能力,在KITTI基准测试中达到最优性能并展现卓越的零样本泛化表现。
English: This paper introduces Jasmine, the first self-supervised monocular depth estimation framework based on Stable Diffusion, which enhances prediction sharpness and generalization through hybrid image reconstruction and a Scale-Shift GRU module, achieving state-of-the-art results on KITTI and superior zero-shot generalization.

Authors:Shengqiong Wu, Hao Fei, Jingkang Yang, Xiangtai Li, Juncheng Li, Hanwang Zhang, Tat-seng Chua
Title: Learning 4D Panoptic Scene Graph Generation from Rich 2D Visual Scene
Abstract:
The latest emerged 4D Panoptic Scene Graph (4D-PSG) provides an advanced-ever representation for comprehensively modeling the dynamic 4D visual real world. Unfortunately, current pioneering 4D-PSG research can primarily suffer from data scarcity issues severely, as well as the resulting out-of-vocabulary problems; also, the pipeline nature of the benchmark generation method can lead to suboptimal performance. To address these challenges, this paper investigates a novel framework for 4D-PSG generation that leverages rich 2D visual scene annotations to enhance 4D scene learning. First, we introduce a 4D Large Language Model (4D-LLM) integrated with a 3D mask decoder for end-to-end generation of 4D-PSG. A chained SG inference mechanism is further designed to exploit LLMs' open-vocabulary capabilities to infer accurate and comprehensive object and relation labels iteratively. Most importantly, we propose a 2D-to-4D visual scene transfer learning framework, where a spatial-temporal scene transcending strategy effectively transfers dimension-invariant features from abundant 2D SG annotations to 4D scenes, effectively compensating for data scarcity in 4D-PSG. Extensive experiments on the benchmark data demonstrate that we strikingly outperform baseline models by a large margin, highlighting the effectiveness of our method.
中文: 本文提出了一种新颖的4D全景场景图生成框架,通过结合4D大语言模型与2D到4D的迁移学习,有效解决了数据稀缺问题并提升了场景理解能力,在基准测试中显著优于现有模型。
English: This paper introduces a novel framework for 4D Panoptic Scene Graph generation that integrates a 4D Large Language Model with 2D-to-4D transfer learning to overcome data scarcity and enhance scene understanding, achieving superior performance over baselines.

Authors:Shengqiong Wu, Hao Fei, Tat-Seng Chua
Title: Universal Scene Graph Generation
Abstract:
Scene graph (SG) representations can neatly and efficiently describe scene semantics, which has driven sustained intensive research in SG generation. In the real world, multiple modalities often coexist, with different types, such as images, text, video, and 3D data, expressing distinct characteristics. Unfortunately, current SG research is largely confined to single-modality scene modeling, preventing the full utilization of the complementary strengths of different modality SG representations in depicting holistic scene semantics. To this end, we introduce Universal SG (USG), a novel representation capable of fully characterizing comprehensive semantic scenes from any given combination of modality inputs, encompassing modality-invariant and modality-specific scenes. Further, we tailor a niche-targeting USG parser, USG-Par, which effectively addresses two key bottlenecks of cross-modal object alignment and out-of-domain challenges. We design the USG-Par with modular architecture for end-to-end USG generation, in which we devise an object associator to relieve the modality gap for cross-modal object alignment. Further, we propose a text-centric scene contrasting learning mechanism to mitigate domain imbalances by aligning multimodal objects and relations with textual SGs. Through extensive experiments, we demonstrate that USG offers a stronger capability for expressing scene semantics than standalone SGs, and also that our USG-Par achieves higher efficacy and performance.
中文摘要:本文提出通用场景图(USG)这一新型表示方法,能够融合多模态输入全面表征场景语义,并设计了专用解析器USG-Par,通过跨模态对象对齐和文本中心对比学习有效解决模态差异与领域不平衡问题。
English Summary: The paper introduces Universal Scene Graph (USG), a novel representation that integrates multiple modalities to comprehensively describe scene semantics, along with a specialized parser, USG-Par, which effectively addresses cross-modal alignment and domain imbalance challenges.

Authors:Tianqi Luo, Chuhan Huang, Leixian Shen, Boyan Li, Shuyu Shen, Wei Zeng, Nan Tang, Yuyu Luo
Title: nvBench 2.0: Resolving Ambiguity in Text-to-Visualization through Stepwise Reasoning
Abstract:
Text-to-Visualization (Text2VIS) enables users to create visualizations from natural language queries, making data insights more accessible. However, Text2VIS faces challenges in interpreting ambiguous queries, as users often express their visualization needs in imprecise language. To address this challenge, we introduce nBench 2.0, a new benchmark designed to evaluate Text2VIS systems in scenarios involving ambiguous queries. nvBench 2.0 includes 7,878 natural language queries and 24,076 corresponding visualizations, derived from 780 tables across 153 domains. It is built using a controlled ambiguity-injection pipeline that generates ambiguous queries through a reverse-generation workflow. By starting with unambiguous seed visualizations and selectively injecting ambiguities, the pipeline yields multiple valid interpretations for each query, with each ambiguous query traceable to its corresponding visualization through step-wise reasoning paths. We evaluate various Large Language Models (LLMs) on their ability to perform ambiguous Text2VIS tasks using nBench 2.0. We also propose Step-Text2Vis, an LLM-based model trained on nvBench 2.0, which enhances performance in ambiguous scenarios through step-wise preference optimization. Our results show that Step-Text2Vis outperforms all baselines, setting a new state-of-the-art for ambiguous Text2VIS tasks. Our source code and data are available at https://nvbench2.github.io/
中文: 文本到可视化面临模糊查询的挑战,nBench 2.0通过受控模糊注入流程和新模型Step-Text2Vis解决了这一问题,该模型利用逐步偏好优化超越了现有方法。
English: Text-to-Visualization faces challenges with ambiguous queries, which nBench 2.0 addresses through a controlled ambiguity-injection pipeline and a new model, Step-Text2Vis, that outperforms existing methods by using step-wise preference optimization.

Authors:Wupeng Wang, Zexu Pan, Jingru Lin, Shuai Wang, Haizhou Li
Title: Context-Aware Two-Step Training Scheme for Domain Invariant Speech Separation
Abstract:
Speech separation seeks to isolate individual speech signals from a multi-talk speech mixture. Despite much progress, a system well-trained on synthetic data often experiences performance degradation on out-of-domain data, such as real-world speech mixtures. To address this, we introduce a novel context-aware, two-stage training scheme for speech separation models. In this training scheme, the conventional end-to-end architecture is replaced with a framework that contains a context extractor and a segregator. The two modules are trained step by step to simulate the speech separation process of an auditory system. We evaluate the proposed training scheme through cross-domain experiments on both synthetic and real-world speech mixtures, and demonstrate that our new scheme effectively boosts separation quality across different domains without adaptation, as measured by signal quality metrics and word error rate (WER). Additionally, an ablation study on the real test set highlights that the context information, including phoneme and word representations from pretrained SSL models, serves as effective domain invariant training targets for separation models.
中文: 本文提出了一种上下文感知的两阶段语音分离训练方案,通过上下文提取器和分离器替代传统端到端架构,在合成和真实语音混合数据上均实现了无需适应的跨领域分离性能提升。
English: This paper introduces a context-aware, two-stage training scheme for speech separation that replaces conventional end-to-end architectures with a context extractor and segregator, demonstrating improved cross-domain performance on both synthetic and real-world speech mixtures without adaptation.

Authors:Bin Liu, Xinglin Lyu, Junhui Li, Daimeng Wei, Min Zhang, Shimin Tao, Hao Yang
Title: Improving LLM-based Document-level Machine Translation with Multi-Knowledge Fusion
Abstract:
Recent studies in prompting large language model (LLM) for document-level machine translation (DMT) primarily focus on the inter-sentence context by flatting the source document into a long sequence. This approach relies solely on the sequence of sentences within the document. However, the complexity of document-level sequences is greater than that of shorter sentence-level sequences, which may limit LLM's ability in DMT when only this single-source knowledge is used. In this paper, we propose an enhanced approach by incorporating multiple sources of knowledge, including both the document summarization and entity translation, to enhance the performance of LLM-based DMT. Given a source document, we first obtain its summarization and translation of entities via LLM as the additional knowledge. We then utilize LLMs to generate two translations of the source document by fusing these two single knowledge sources, respectively. Finally, recognizing that different sources of knowledge may aid or hinder the translation of different sentences, we refine and rank the translations by leveraging a multi-knowledge fusion strategy to ensure the best results. Experimental results in eight document-level translation tasks show that our approach achieves an average improvement of 0.8, 0.6, and 0.4 COMET scores over the baseline without extra knowledge for LLaMA3-8B-Instruct, Mistral-Nemo-Instruct, and GPT-4o-mini, respectively.
中文: 本文提出了一种改进的文档级机器翻译方法,通过结合文档摘要和实体翻译等多源知识来增强大语言模型的翻译能力,实验表明该方法在不同模型上均提升了翻译效果。
English: This paper introduces an enhanced approach for document-level machine translation using large language models by incorporating multiple knowledge sources, including document summarization and entity translation, which improves translation performance across various models.

Authors:Xinyu Liu, Shuyu Shen, Boyan Li, Nan Tang, Yuyu Luo
Title: NL2SQL-BUGs: A Benchmark for Detecting Semantic Errors in NL2SQL Translation
Abstract:
Natural Language to SQL (i.e., NL2SQL) translation is crucial for democratizing database access, but even state-of-the-art models frequently generate semantically incorrect SQL queries, hindering the widespread adoption of these techniques by database vendors. While existing NL2SQL benchmarks primarily focus on correct query translation, we argue that a benchmark dedicated to identifying common errors in NL2SQL translations is equally important, as accurately detecting these errors is a prerequisite for any subsequent correction-whether performed by humans or models. To address this gap, we propose NL2SQL-BUGs, the first benchmark dedicated to detecting and categorizing semantic errors in NL2SQL translation. NL2SQL-BUGs adopts a two-level taxonomy to systematically classify semantic errors, covering 9 main categories and 31 subcategories. The benchmark consists of 2,018 expert-annotated instances, each containing a natural language query, database schema, and SQL query, with detailed error annotations for semantically incorrect queries. Through comprehensive experiments, we demonstrate that current large language models exhibit significant limitations in semantic error detection, achieving an average detection accuracy of 75.16%. Specifically, our method successfully detected 106 errors (accounting for 6.91%) in BIRD, a widely-used NL2SQL dataset, which were previously undetected annotation errors. This highlights the importance of semantic error detection in NL2SQL systems. The benchmark is publicly available at https://nl2sql-bugs.github.io/.
中文: NL2SQL-BUGs是首个专注于检测和分类自然语言转SQL语义错误的基准,通过实验发现当前大语言模型在语义错误检测上存在明显不足,仅达到75.16%的准确率,并在现有数据集中成功识别出之前未被发现的标注错误。
English: NL2SQL-BUGs is the first benchmark designed to detect and classify semantic errors in NL2SQL translation, revealing significant limitations in current models with only 75.16% detection accuracy and uncovering previously missed annotation errors in existing datasets.

Authors:Weiming Ren, Wentao Ma, Huan Yang, Cong Wei, Ge Zhang, Wenhu Chen
Title: Vamba: Understanding Hour-Long Videos with Hybrid Mamba-Transformers
Abstract:
State-of-the-art transformer-based large multimodal models (LMMs) struggle to handle hour-long video inputs due to the quadratic complexity of the causal self-attention operations, leading to high computational costs during training and inference. Existing token compression-based methods reduce the number of video tokens but often incur information loss and remain inefficient for extremely long sequences. In this paper, we explore an orthogonal direction to build a hybrid Mamba-Transformer model (VAMBA) that employs Mamba-2 blocks to encode video tokens with linear complexity. Without any token reduction, VAMBA can encode more than 1024 frames (640$\times$360) on a single GPU, while transformer-based models can only encode 256 frames. On long video input, VAMBA achieves at least 50% reduction in GPU memory usage during training and inference, and nearly doubles the speed per training step compared to transformer-based LMMs. Our experimental results demonstrate that VAMBA improves accuracy by 4.3% on the challenging hour-long video understanding benchmark LVBench over prior efficient video LMMs, and maintains strong performance on a broad spectrum of long and short video understanding tasks.
中文:VAMBA混合Mamba-Transformer模型通过采用线性复杂度的Mamba-2模块,有效解决了传统Transformer模型处理长视频时的计算瓶颈,在保持所有视频令牌的同时大幅提升了训练推理效率,并在长视频理解任务上实现了准确率的显著提升。
English: VAMBA, a hybrid Mamba-Transformer model, overcomes the computational limitations of transformer-based large multimodal models by using Mamba-2 blocks to efficiently process hour-long videos with linear complexity, achieving significant improvements in GPU memory usage, speed, and accuracy on long video understanding benchmarks.

Authors:Mingjia Shi, Ruihan Lin, Xuxi Chen, Yuhao Zhou, Zezhen Ding, Pingzhi Li, Tong Wang, Kai Wang, Zhangyang Wang, Jiheng Zhang, Tianlong Chen
Title: Make Optimization Once and for All with Fine-grained Guidance
Abstract:
Learning to Optimize (L2O) enhances optimization efficiency with integrated neural networks. L2O paradigms achieve great outcomes, e.g., refitting optimizer, generating unseen solutions iteratively or directly. However, conventional L2O methods require intricate design and rely on specific optimization processes, limiting scalability and generalization. Our analyses explore general framework for learning optimization, called Diff-L2O, focusing on augmenting sampled solutions from a wider view rather than local updates in real optimization process only. Meanwhile, we give the related generalization bound, showing that the sample diversity of Diff-L2O brings better performance. This bound can be simply applied to other fields, discussing diversity, mean-variance, and different tasks. Diff-L2O's strong compatibility is empirically verified with only minute-level training, comparing with other hour-levels.
中文: Diff-L2O提出了一种通用框架,通过从更广泛的视角增强样本多样性来优化学习过程,显著提升了可扩展性和泛化能力,且训练时间极短。
English: Diff-L2O introduces a general framework that enhances optimization by focusing on diverse solution sampling from a broader perspective, improving scalability and generalization with minimal training time.

Authors:Kyle Sargent, Kyle Hsu, Justin Johnson, Li Fei-Fei, Jiajun Wu
Title: Flow to the Mode: Mode-Seeking Diffusion Autoencoders for State-of-the-Art Image Tokenization
Abstract:
Since the advent of popular visual generation frameworks like VQGAN and latent diffusion models, state-of-the-art image generation systems have generally been two-stage systems that first tokenize or compress visual data into a lower-dimensional latent space before learning a generative model. Tokenizer training typically follows a standard recipe in which images are compressed and reconstructed subject to a combination of MSE, perceptual, and adversarial losses. Diffusion autoencoders have been proposed in prior work as a way to learn end-to-end perceptually-oriented image compression, but have not yet shown state-of-the-art performance on the competitive task of ImageNet-1K reconstruction. We propose FlowMo, a transformer-based diffusion autoencoder that achieves a new state-of-the-art for image tokenization at multiple compression rates without using convolutions, adversarial losses, spatially-aligned two-dimensional latent codes, or distilling from other tokenizers. Our key insight is that FlowMo training should be broken into a mode-matching pre-training stage and a mode-seeking post-training stage. In addition, we conduct extensive analyses and explore the training of generative models atop the FlowMo tokenizer. Our code and models will be available at http://kylesargent.github.io/flowmo .
Chinese: FlowMo提出了一种基于Transformer的扩散自编码器,通过创新的两阶段训练方法,在多种压缩率下实现了图像标记化的新突破,无需依赖卷积、对抗损失等传统组件。
English: FlowMo introduces a transformer-based diffusion autoencoder that sets a new benchmark for image tokenization across multiple compression rates, utilizing a novel two-stage training approach without traditional components like convolutions or adversarial losses.

Authors:Pengcheng Wang, Xinghao Zhu, Yuxin Chen, Chenfeng Xu, Masayoshi Tomizuka, Chenran Li
Title: Residual Policy Gradient: A Reward View of KL-regularized Objective
Abstract:
Reinforcement Learning and Imitation Learning have achieved widespread success in many domains but remain constrained during real-world deployment. One of the main issues is the additional requirements that were not considered during training. To address this challenge, policy customization has been introduced, aiming to adapt a prior policy while preserving its inherent properties and meeting new task-specific requirements. A principled approach to policy customization is Residual Q-Learning (RQL), which formulates the problem as a Markov Decision Process (MDP) and derives a family of value-based learning algorithms. However, RQL has not yet been applied to policy gradient methods, which restricts its applicability, especially in tasks where policy gradient has already proven more effective. In this work, we first derive a concise form of Soft Policy Gradient as a preliminary. Building on this, we introduce Residual Policy Gradient (RPG), which extends RQL to policy gradient methods, allowing policy customization in gradient-based RL settings. With the view of RPG, we rethink the KL-regularized objective widely used in RL fine-tuning. We show that under certain assumptions, KL-regularized objective leads to a maximum-entropy policy that balances the inherent properties and task-specific requirements on a reward-level. Our experiments in MuJoCo demonstrate the effectiveness of Soft Policy Gradient and Residual Policy Gradient.
中文摘要:强化学习和模仿学习在现实部署中因未满足的额外要求而受限,为此提出的残差策略梯度(RPG)将策略定制扩展到策略梯度方法,并通过重新解读KL正则化目标实现了固有特性与任务需求在奖励层面的平衡。
English Summary: Reinforcement and Imitation Learning face deployment challenges due to unmet requirements, leading to the development of Residual Policy Gradient (RPG) that extends customization to policy gradient methods and reinterprets KL-regularized objectives for balancing inherent and task-specific properties.

Authors:Weisong Sun, Yiran Zhang, Jie Zhu, Zhihui Wang, Chunrong Fang, Yonglong Zhang, Yebo Feng, Jiangping Huang, Xingya Wang, Zhi Jin, Yang Liu
Title: Commenting Higher-level Code Unit: Full Code, Reduced Code, or Hierarchical Code Summarization
Abstract:
Commenting code is a crucial activity in software development, as it aids in facilitating future maintenance and updates. To enhance the efficiency of writing comments and reduce developers' workload, researchers has proposed various automated code summarization (ACS) techniques to automatically generate comments/summaries for given code units. However, these ACS techniques primarily focus on generating summaries for code units at the method level. There is a significant lack of research on summarizing higher-level code units, such as file-level and module-level code units, despite the fact that summaries of these higher-level code units are highly useful for quickly gaining a macro-level understanding of software components and architecture. To fill this gap, in this paper, we conduct a systematic study on how to use LLMs for commenting higher-level code units, including file level and module level. These higher-level units are significantly larger than method-level ones, which poses challenges in handling long code inputs within LLM constraints and maintaining efficiency. To address these issues, we explore various summarization strategies for ACS of higher-level code units, which can be divided into three types: full code summarization, reduced code summarization, and hierarchical code summarization. The experimental results suggest that for summarizing file-level code units, using the full code is the most effective approach, with reduced code serving as a cost-efficient alternative. However, for summarizing module-level code units, hierarchical code summarization becomes the most promising strategy. In addition, inspired by the research on method-level ACS, we also investigate using the LLM as an evaluator to evaluate the quality of summaries of higher-level code units. The experimental results demonstrate that the LLM's evaluation results strongly correlate with human evaluations.
Chinese: 本文针对文件级和模块级等高层代码单元自动摘要研究的不足,探索了基于大语言模型的摘要策略,发现完整代码对文件级最有效、分层摘要对模块级最优,且大语言模型的评估结果与人工评估高度一致。
English: This paper addresses the gap in automated code summarization for higher-level code units like files and modules by exploring LLM-based strategies, finding full code most effective for files and hierarchical summarization best for modules, with LLM evaluations closely matching human assessments.

Authors:Minjun Zhu, Yixuan Weng, Linyi Yang, Yue Zhang
Title: DeepReview: Improving LLM-based Paper Review with Human-like Deep Thinking Process
Abstract:
Large Language Models (LLMs) are increasingly utilized in scientific research assessment, particularly in automated paper review. However, existing LLM-based review systems face significant challenges, including limited domain expertise, hallucinated reasoning, and a lack of structured evaluation. To address these limitations, we introduce DeepReview, a multi-stage framework designed to emulate expert reviewers by incorporating structured analysis, literature retrieval, and evidence-based argumentation. Using DeepReview-13K, a curated dataset with structured annotations, we train DeepReviewer-14B, which outperforms CycleReviewer-70B with fewer tokens. In its best mode, DeepReviewer-14B achieves win rates of 88.21\% and 80.20\% against GPT-o1 and DeepSeek-R1 in evaluations. Our work sets a new benchmark for LLM-based paper review, with all resources publicly available. The code, model, dataset and demo have be released in http://ai-researcher.net.
中文: DeepReview通过引入多阶段框架,结合结构化分析和证据论证,在自动论文评审中超越更大模型,并对领先模型取得高胜率。
English: DeepReview introduces a multi-stage framework that outperforms larger models in automated paper review by incorporating structured analysis and evidence-based reasoning, achieving high win rates against leading models.

Authors:Zhenxiong Tan, Qiaochu Xue, Xingyi Yang, Songhua Liu, Xinchao Wang
Title: OminiControl2: Efficient Conditioning for Diffusion Transformers
Abstract:
Fine-grained control of text-to-image diffusion transformer models (DiT) remains a critical challenge for practical deployment. While recent advances such as OminiControl and others have enabled a controllable generation of diverse control signals, these methods face significant computational inefficiency when handling long conditional inputs. We present OminiControl2, an efficient framework that achieves efficient image-conditional image generation. OminiControl2 introduces two key innovations: (1) a dynamic compression strategy that streamlines conditional inputs by preserving only the most semantically relevant tokens during generation, and (2) a conditional feature reuse mechanism that computes condition token features only once and reuses them across denoising steps. These architectural improvements preserve the original framework's parameter efficiency and multi-modal versatility while dramatically reducing computational costs. Our experiments demonstrate that OminiControl2 reduces conditional processing overhead by over 90% compared to its predecessor, achieving an overall 5.9$\times$ speedup in multi-conditional generation scenarios. This efficiency enables the practical implementation of complex, multi-modal control for high-quality image synthesis with DiT models.
中文: OminiControl2 采用动态压缩策略和条件特征重用机制,将计算成本降低超过 90%,实现 5.9 倍加速,从而高效完成基于扩散变换器的图像生成。
English: OminiControl2 introduces a dynamic compression strategy and conditional feature reuse to significantly reduce computational costs by over 90% and achieve a 5.9× speedup for efficient image generation with diffusion transformer models.

Authors:Kangan Qian, Ziang Luo, Sicong Jiang, Zilin Huang, Jinyu Miao, Zhikun Ma, Tianze Zhu, Jiayin Li, Yangfan He, Zheng Fu, Yining Shi, Boyue Wang, Hezhe Lin, Ziyu Chen, Jiangbo Yu, Xinyu Jiao, Mengmeng Yang, Kun Jiang, Diange Yang
Title: FASIONAD++ : Integrating High-Level Instruction and Information Bottleneck in FAt-Slow fusION Systems for Enhanced Safety in Autonomous Driving with Adaptive Feedback
Abstract:
Ensuring safe, comfortable, and efficient planning is crucial for autonomous driving systems. While end-to-end models trained on large datasets perform well in standard driving scenarios, they struggle with complex low-frequency events. Recent Large Language Models (LLMs) and Vision Language Models (VLMs) advancements offer enhanced reasoning but suffer from computational inefficiency. Inspired by the dual-process cognitive model "Thinking, Fast and Slow", we propose $\textbf{FASIONAD}$ -- a novel dual-system framework that synergizes a fast end-to-end planner with a VLM-based reasoning module. The fast system leverages end-to-end learning to achieve real-time trajectory generation in common scenarios, while the slow system activates through uncertainty estimation to perform contextual analysis and complex scenario resolution. Our architecture introduces three key innovations: (1) A dynamic switching mechanism enabling slow system intervention based on real-time uncertainty assessment; (2) An information bottleneck with high-level plan feedback that optimizes the slow system's guidance capability; (3) A bidirectional knowledge exchange where visual prompts enhance the slow system's reasoning while its feedback refines the fast planner's decision-making. To strengthen VLM reasoning, we develop a question-answering mechanism coupled with reward-instruct training strategy. In open-loop experiments, FASIONAD achieves a $6.7\%$ reduction in average $L2$ trajectory error and $28.1\%$ lower collision rate.
中文:FASIONAD框架将快速端到端规划器与基于视觉语言模型的推理模块相结合,通过动态切换机制和双向知识交互,在不确定场景中激活慢速推理系统,显著提升了轨迹精度并降低了碰撞率。
English: The FASIONAD framework combines a fast end-to-end planner for real-time trajectory generation with a VLM-based reasoning module that activates during uncertain scenarios, achieving improved trajectory accuracy and collision avoidance through dynamic switching and bidirectional knowledge exchange.

Authors:Lavanya Ratnabala, Robinroy Peter, Aleksey Fedoseev, Dzmitry Tsetserukou
Title: HIPPO-MAT: Decentralized Task Allocation Using GraphSAGE and Multi-Agent Deep Reinforcement Learning
Abstract:
This paper tackles decentralized continuous task allocation in heterogeneous multi-agent systems. We present a novel framework HIPPO-MAT that integrates graph neural networks (GNN) employing a GraphSAGE architecture to compute independent embeddings on each agent with an Independent Proximal Policy Optimization (IPPO) approach for multi-agent deep reinforcement learning. In our system, unmanned aerial vehicles (UAVs) and unmanned ground vehicles (UGVs) share aggregated observation data via communication channels while independently processing these inputs to generate enriched state embeddings. This design enables dynamic, cost-optimal, conflict-aware task allocation in a 3D grid environment without the need for centralized coordination. A modified A* path planner is incorporated for efficient routing and collision avoidance. Simulation experiments demonstrate scalability with up to 30 agents and preliminary real-world validation on JetBot ROS AI Robots, each running its model on a Jetson Nano and communicating through an ESP-NOW protocol using ESP32-S3, which confirms the practical viability of the approach that incorporates simultaneous localization and mapping (SLAM). Experimental results revealed that our method achieves a high 92.5% conflict-free success rate, with only a 16.49% performance gap compared to the centralized Hungarian method, while outperforming the heuristic decentralized baseline based on greedy approach. Additionally, the framework exhibits scalability with up to 30 agents with allocation processing of 0.32 simulation step time and robustness in responding to dynamically generated tasks.
中文: 本文提出HIPPO-MAT框架,通过图神经网络与独立近端策略优化相结合,实现异构无人机与地面车辆的分散式动态任务分配,在仿真和实物实验中展现出高任务成功率、可扩展性和鲁棒性。
English: This paper introduces HIPPO-MAT, a decentralized framework combining Graph Neural Networks and Independent Proximal Policy Optimization for dynamic task allocation among heterogeneous agents like UAVs and UGVs, achieving high conflict-free success rates and scalability in simulations and real-world tests.

Authors:Grik Tadevosyan, Valerii Serpiva, Aleksey Fedoseev, Roohan Ahmed Khan, Demetros Aschu, Faryal Batool, Nickolay Efanov, Artem Mikhaylov, Dzmitry Tsetserukou
Title: AttentionSwarm: Reinforcement Learning with Attention Control Barier Function for Crazyflie Drones in Dynamic Environments
Abstract:
We introduce AttentionSwarm, a novel benchmark designed to evaluate safe and efficient swarm control across three challenging environments: a landing environment with obstacles, a competitive drone game setting, and a dynamic drone racing scenario. Central to our approach is the Attention Model Based Control Barrier Function (CBF) framework, which integrates attention mechanisms with safety-critical control theory to enable real-time collision avoidance and trajectory optimization. This framework dynamically prioritizes critical obstacles and agents in the swarms vicinity using attention weights, while CBFs formally guarantee safety by enforcing collision-free constraints. The safe attention net algorithm was developed and evaluated using a swarm of Crazyflie 2.1 micro quadrotors, which were tested indoors with the Vicon motion capture system to ensure precise localization and control. Experimental results show that our system achieves landing accuracy of 3.02 cm with a mean time of 23 s and collision-free landings in a dynamic landing environment, 100% and collision-free navigation in a drone game environment, and 95% and collision-free navigation for a dynamic multiagent drone racing environment, underscoring its effectiveness and robustness in real-world scenarios. This work offers a promising foundation for applications in dynamic environments where safety and fastness are paramount.
中文:AttentionSwarm是一种新颖的基准,采用基于注意力模型的控制屏障函数框架,在着陆、无人机游戏和竞速等挑战性环境中实现安全高效的集群控制,确保了高精度和无碰撞导航。
English: AttentionSwarm is a novel benchmark using an Attention Model Based Control Barrier Function framework to ensure safe and efficient swarm control, achieving high accuracy and collision-free navigation in challenging environments like landing, drone games, and racing.

Authors:Kangan Qian, Jinyu Miao, Ziang Luo, Zheng Fu, and Jinchen Li, Yining Shi, Yunlong Wang, Kun Jiang, Mengmeng Yang, Diange Yang
Title: LEGO-Motion: Learning-Enhanced Grids with Occupancy Instance Modeling for Class-Agnostic Motion Prediction
Abstract:
Accurate and reliable spatial and motion information plays a pivotal role in autonomous driving systems. However, object-level perception models struggle with handling open scenario categories and lack precise intrinsic geometry. On the other hand, occupancy-based class-agnostic methods excel in representing scenes but fail to ensure physics consistency and ignore the importance of interactions between traffic participants, hindering the model's ability to learn accurate and reliable motion. In this paper, we introduce a novel occupancy-instance modeling framework for class-agnostic motion prediction tasks, named LEGO-Motion, which incorporates instance features into Bird's Eye View (BEV) space. Our model comprises (1) a BEV encoder, (2) an Interaction-Augmented Instance Encoder, and (3) an Instance-Enhanced BEV Encoder, improving both interaction relationships and physics consistency within the model, thereby ensuring a more accurate and robust understanding of the environment. Extensive experiments on the nuScenes dataset demonstrate that our method achieves state-of-the-art performance, outperforming existing approaches. Furthermore, the effectiveness of our framework is validated on the advanced FMCW LiDAR benchmark, showcasing its practical applicability and generalization capabilities. The code will be made publicly available to facilitate further research.
中文: 本文提出LEGO-Motion新型占据-实例建模框架,通过将实例特征融入鸟瞰图空间来增强自动驾驶中的运动预测能力,改进了交互关系建模和物理一致性,在基准测试中实现了最先进的性能表现。
English: This paper introduces LEGO-Motion, a novel occupancy-instance framework that enhances motion prediction in autonomous driving by integrating instance features into BEV space, improving interaction modeling and physics consistency to achieve state-of-the-art performance on benchmark datasets.

Authors:Junjia Du, Yadi Liu, Hongcheng Guo, Jiawei Wang, Haojian Huang, Yunyi Ni, Zhoujun Li
Title: DependEval: Benchmarking LLMs for Repository Dependency Understanding
Abstract:
While large language models (LLMs) have shown considerable promise in code generation, real-world software development demands advanced repository-level reasoning. This includes understanding dependencies, project structures, and managing multi-file changes. However, the ability of LLMs to effectively comprehend and handle complex code repositories has yet to be fully explored. To address challenges, we introduce a hierarchical benchmark designed to evaluate repository dependency understanding (DependEval). Benchmark is based on 15,576 repositories collected from real-world websites. It evaluates models on three core tasks: Dependency Recognition, Repository Construction, and Multi-file Editing, across 8 programming languages from actual code repositories. Our evaluation of over 25 LLMs reveals substantial performance gaps and provides valuable insights into repository-level code understanding.
中文: 大型语言模型在代码生成方面展现出潜力,但缺乏高级的仓库级推理能力,为此我们推出了DependEval分层基准,通过真实仓库评估了25个以上模型在三大核心任务上的表现,揭示了显著的性能差距。
English: Large language models show promise in code generation but lack advanced repository-level reasoning, prompting the introduction of DependEval, a hierarchical benchmark that evaluates 25+ LLMs across three core tasks using real-world repositories and reveals significant performance gaps.

Authors:Chandan Kumar Sah, Ankit Kumar Shaw, Xiaoli Lian, Arsalan Shahid Baig, Tuopu Wen, Kun Jiang, Mengmeng Yang, Diange Yang
Title: Advancing Autonomous Vehicle Intelligence: Deep Learning and Multimodal LLM for Traffic Sign Recognition and Robust Lane Detection
Abstract:
Autonomous vehicles (AVs) require reliable traffic sign recognition and robust lane detection capabilities to ensure safe navigation in complex and dynamic environments. This paper introduces an integrated approach combining advanced deep learning techniques and Multimodal Large Language Models (MLLMs) for comprehensive road perception. For traffic sign recognition, we systematically evaluate ResNet-50, YOLOv8, and RT-DETR, achieving state-of-the-art performance of 99.8% with ResNet-50, 98.0% accuracy with YOLOv8, and achieved 96.6% accuracy in RT-DETR despite its higher computational complexity. For lane detection, we propose a CNN-based segmentation method enhanced by polynomial curve fitting, which delivers high accuracy under favorable conditions. Furthermore, we introduce a lightweight, Multimodal, LLM-based framework that directly undergoes instruction tuning using small yet diverse datasets, eliminating the need for initial pretraining. This framework effectively handles various lane types, complex intersections, and merging zones, significantly enhancing lane detection reliability by reasoning under adverse conditions. Despite constraints in available training resources, our multimodal approach demonstrates advanced reasoning capabilities, achieving a Frame Overall Accuracy (FRM) of 53.87%, a Question Overall Accuracy (QNS) of 82.83%, lane detection accuracies of 99.6% in clear conditions and 93.0% at night, and robust performance in reasoning about lane invisibility due to rain (88.4%) or road degradation (95.6%). The proposed comprehensive framework markedly enhances AV perception reliability, thus contributing significantly to safer autonomous driving across diverse and challenging road scenarios.
中文: 本文提出了一种结合深度学习与多模态大语言模型的综合框架,显著提升了自动驾驶车辆的感知能力,在交通标志识别和复杂场景下的车道检测方面均实现了优异的准确率。
English: This paper presents an integrated framework using deep learning and multimodal large language models to enhance autonomous vehicle perception, achieving high accuracy in traffic sign recognition and robust lane detection under various challenging conditions.

Authors:Wei Liu, Zhiying Deng, Zhongyu Niu, Jun Wang, Haozhao Wang, Zhigang Zeng, Ruixuan Li
Title: Breaking Free from MMI: A New Frontier in Rationalization by Probing Input Utilization
Abstract:
Extracting a small subset of crucial rationales from the full input is a key problem in explainability research. The most widely used fundamental criterion for rationale extraction is the maximum mutual information (MMI) criterion. In this paper, we first demonstrate that MMI suffers from diminishing marginal returns. Once part of the rationale has been identified, finding the remaining portions contributes only marginally to increasing the mutual information, making it difficult to use MMI to locate the rest. In contrast to MMI that aims to reproduce the prediction, we seek to identify the parts of the input that the network can actually utilize. This is achieved by comparing how different rationale candidates match the capability space of the weight matrix. The weight matrix of a neural network is typically low-rank, meaning that the linear combinations of its column vectors can only cover part of the directions in a high-dimensional space (high-dimension: the dimensions of an input vector). If an input is fully utilized by the network, {it generally matches these directions (e.g., a portion of a hypersphere), resulting in a representation with a high norm. Conversely, if an input primarily falls outside (orthogonal to) these directions}, its representation norm will approach zero, behaving like noise that the network cannot effectively utilize. Building on this, we propose using the norms of rationale candidates as an alternative objective to MMI. Through experiments on four text classification datasets and one graph classification dataset using three network architectures (GRUs, BERT, and GCN), we show that our method outperforms MMI and its improved variants in identifying better rationales. We also compare our method with a representative LLM (llama-3.1-8b-instruct) and find that our simple method gets comparable results to it and can sometimes even outperform it.
中文摘要:本文提出了一种新的理由提取方法,通过表示范数识别网络可实际利用的输入部分,在多个数据集和架构上优于最大互信息准则及其改进版本。
English Summary: This paper proposes a new method for rationale extraction that uses the representation norm to identify input parts the network can utilize, outperforming the maximum mutual information criterion and its variants across multiple datasets and architectures.

Authors:Yunfan Jiang, Ruohan Zhang, Josiah Wong, Chen Wang, Yanjie Ze, Hang Yin, Cem Gokmen, Shuran Song, Jiajun Wu, Li Fei-Fei
Title: BEHAVIOR Robot Suite: Streamlining Real-World Whole-Body Manipulation for Everyday Household Activities
Abstract:
Real-world household tasks present significant challenges for mobile manipulation robots. An analysis of existing robotics benchmarks reveals that successful task performance hinges on three key whole-body control capabilities: bimanual coordination, stable and precise navigation, and extensive end-effector reachability. Achieving these capabilities requires careful hardware design, but the resulting system complexity further complicates visuomotor policy learning. To address these challenges, we introduce the BEHAVIOR Robot Suite (BRS), a comprehensive framework for whole-body manipulation in diverse household tasks. Built on a bimanual, wheeled robot with a 4-DoF torso, BRS integrates a cost-effective whole-body teleoperation interface for data collection and a novel algorithm for learning whole-body visuomotor policies. We evaluate BRS on five challenging household tasks that not only emphasize the three core capabilities but also introduce additional complexities, such as long-range navigation, interaction with articulated and deformable objects, and manipulation in confined spaces. We believe that BRS's integrated robotic embodiment, data collection interface, and learning framework mark a significant step toward enabling real-world whole-body manipulation for everyday household tasks. BRS is open-sourced at https://behavior-robot-suite.github.io/
中文摘要:BEHAVIOR Robot Suite (BRS) 提出了一套集成框架,包含双臂轮式机器人躯干系统、遥操作接口及学习算法,旨在解决家庭复杂任务中全身操控的三大核心能力挑战。
English Summary: The BEHAVIOR Robot Suite (BRS) is introduced as an integrated framework featuring a bimanual wheeled robot with torso mobility, a teleoperation interface, and a learning algorithm to address whole-body manipulation challenges in complex household tasks.

Authors:Yijie Xu, Aiwei Liu, Xuming Hu, Lijie Wen, Hui Xiong
Title: Mark Your LLM: Detecting the Misuse of Open-Source Large Language Models via Watermarking
Abstract:
As open-source large language models (LLMs) like Llama3 become more capable, it is crucial to develop watermarking techniques to detect their potential misuse. Existing watermarking methods either add watermarks during LLM inference, which is unsuitable for open-source LLMs, or primarily target classification LLMs rather than recent generative LLMs. Adapting these watermarks to open-source LLMs for misuse detection remains an open challenge. This work defines two misuse scenarios for open-source LLMs: intellectual property (IP) violation and LLM Usage Violation. Then, we explore the application of inference-time watermark distillation and backdoor watermarking in these contexts. We propose comprehensive evaluation methods to assess the impact of various real-world further fine-tuning scenarios on watermarks and the effect of these watermarks on LLM performance. Our experiments reveal that backdoor watermarking could effectively detect IP Violation, while inference-time watermark distillation is applicable in both scenarios but less robust to further fine-tuning and has a more significant impact on LLM performance compared to backdoor watermarking. Exploring more advanced watermarking methods for open-source LLMs to detect their misuse should be an important future direction.
中文: 本研究针对开源大语言模型的滥用检测难题,评估了后门水印和推理时水印蒸馏方法,发现前者能有效检测知识产权侵权,后者虽适用但微调鲁棒性较差且对模型性能影响更大。
English: This study addresses the challenge of detecting misuse in open-source large language models by evaluating backdoor watermarking and inference-time watermark distillation, finding the former more effective for intellectual property violation detection and the latter applicable but less robust to fine-tuning.

Authors:Van Bach Nguyen, Christin Seifert, Jörg Schlötterer
Title: Guiding LLMs to Generate High-Fidelity and High-Quality Counterfactual Explanations for Text Classification
Abstract:
The need for interpretability in deep learning has driven interest in counterfactual explanations, which identify minimal changes to an instance that change a model's prediction. Current counterfactual (CF) generation methods require task-specific fine-tuning and produce low-quality text. Large Language Models (LLMs), though effective for high-quality text generation, struggle with label-flipping counterfactuals (i.e., counterfactuals that change the prediction) without fine-tuning. We introduce two simple classifier-guided approaches to support counterfactual generation by LLMs, eliminating the need for fine-tuning while preserving the strengths of LLMs. Despite their simplicity, our methods outperform state-of-the-art counterfactual generation methods and are effective across different LLMs, highlighting the benefits of guiding counterfactual generation by LLMs with classifier information. We further show that data augmentation by our generated CFs can improve a classifier's robustness. Our analysis reveals a critical issue in counterfactual generation by LLMs: LLMs rely on parametric knowledge rather than faithfully following the classifier.
中文摘要:本文提出了两种简单的分类器引导方法,使大语言模型无需微调即可生成高质量的反事实解释,在保持其文本生成优势的同时超越了现有方法,并通过数据增强提升了分类器的鲁棒性。
English Summary: This paper introduces two simple classifier-guided methods that enable Large Language Models to generate high-quality counterfactual explanations without fine-tuning, outperforming existing approaches while maintaining LLMs' text generation strengths and enhancing classifier robustness through data augmentation.

Authors:Jun Li, Che Liu, Wenjia Bai, Rossella Arcucci, Cosmin I. Bercea, Julia A. Schnabel
Title: Enhancing Abnormality Grounding for Vision Language Models with Knowledge Descriptions
Abstract:
Visual Language Models (VLMs) have demonstrated impressive capabilities in visual grounding tasks. However, their effectiveness in the medical domain, particularly for abnormality detection and localization within medical images, remains underexplored. A major challenge is the complex and abstract nature of medical terminology, which makes it difficult to directly associate pathological anomaly terms with their corresponding visual features. In this work, we introduce a novel approach to enhance VLM performance in medical abnormality detection and localization by leveraging decomposed medical knowledge. Instead of directly prompting models to recognize specific abnormalities, we focus on breaking down medical concepts into fundamental attributes and common visual patterns. This strategy promotes a stronger alignment between textual descriptions and visual features, improving both the recognition and localization of abnormalities in medical images.We evaluate our method on the 0.23B Florence-2 base model and demonstrate that it achieves comparable performance in abnormality grounding to significantly larger 7B LLaVA-based medical VLMs, despite being trained on only 1.5% of the data used for such models. Experimental results also demonstrate the effectiveness of our approach in both known and previously unseen abnormalities, suggesting its strong generalization capabilities.
中文: 本研究通过将医学概念分解为基本属性和视觉模式的新方法,提升了视觉语言模型在医学异常检测和定位中的性能,仅用少量训练数据即达到与大型模型相当的效果,并展现出强大的泛化能力。
English: This study introduces a novel approach that enhances Visual Language Models' performance in medical abnormality detection and localization by decomposing medical concepts into fundamental attributes and visual patterns, achieving comparable results to larger models with significantly less training data while demonstrating strong generalization capabilities.

Authors:Faryal Batool, Malaika Zafar, Yasheerah Yaqoot, Roohan Ahmed Khan, Muhammad Haris Khan, Aleksey Fedoseev, Dzmitry Tsetserukou
Title: ImpedanceGPT: VLM-driven Impedance Control of Swarm of Mini-drones for Intelligent Navigation in Dynamic Environment
Abstract:
Swarm robotics plays a crucial role in enabling autonomous operations in dynamic and unpredictable environments. However, a major challenge remains ensuring safe and efficient navigation in environments filled with both dynamic alive (e.g., humans) and dynamic inanimate (e.g., non-living objects) obstacles. In this paper, we propose ImpedanceGPT, a novel system that combines a Vision-Language Model (VLM) with retrieval-augmented generation (RAG) to enable real-time reasoning for adaptive navigation of mini-drone swarms in complex environments. The key innovation of ImpedanceGPT lies in the integration of VLM and RAG, which provides the drones with enhanced semantic understanding of their surroundings. This enables the system to dynamically adjust impedance control parameters in response to obstacle types and environmental conditions. Our approach not only ensures safe and precise navigation but also improves coordination between drones in the swarm. Experimental evaluations demonstrate the effectiveness of the system. The VLM-RAG framework achieved an obstacle detection and retrieval accuracy of 80 % under optimal lighting. In static environments, drones navigated dynamic inanimate obstacles at 1.4 m/s but slowed to 0.7 m/s with increased separation around humans. In dynamic environments, speed adjusted to 1.0 m/s near hard obstacles, while reducing to 0.6 m/s with higher deflection to safely avoid moving humans.
中文: 本文提出ImpedanceGPT系统,通过融合视觉语言模型与检索增强生成技术,实现无人机群在复杂环境中的实时自适应导航,能根据障碍物类型和环境条件动态调整阻抗控制参数,有效提升飞行安全性与协同效率。
English: This paper introduces ImpedanceGPT, a novel system integrating Vision-Language Model and retrieval-augmented generation to enable real-time adaptive navigation for drone swarms, enhancing safety and coordination by dynamically adjusting control parameters based on obstacle types and environmental conditions.

Authors:Faryal Batool, Yasheerah Yaqoot, Malaika Zafar, Roohan Ahmed Khan, Muhammad Haris Khan, Aleksey Fedoseev, Dzmitry Tsetserukou
Title: ImpedanceGPT: VLM-driven Impedance Control of Swarm of Mini-drones for Intelligent Navigation in Dynamic Environment
Abstract:
Swarm robotics plays a crucial role in enabling autonomous operations in dynamic and unpredictable environments. However, a major challenge remains ensuring safe and efficient navigation in environments filled with both dynamic alive (e.g., humans) and dynamic inanimate (e.g., non-living objects) obstacles. In this paper, we propose ImpedanceGPT, a novel system that combines a Vision-Language Model (VLM) with retrieval-augmented generation (RAG) to enable real-time reasoning for adaptive navigation of mini-drone swarms in complex environments. The key innovation of ImpedanceGPT lies in the integration of VLM and RAG, which provides the drones with enhanced semantic understanding of their surroundings. This enables the system to dynamically adjust impedance control parameters in response to obstacle types and environmental conditions. Our approach not only ensures safe and precise navigation but also improves coordination between drones in the swarm. Experimental evaluations demonstrate the effectiveness of the system. The VLM-RAG framework achieved an obstacle detection and retrieval accuracy of 80 % under optimal lighting. In static environments, drones navigated dynamic inanimate obstacles at 1.4 m/s but slowed to 0.7 m/s with increased separation around humans. In dynamic environments, speed adjusted to 1.0 m/s near hard obstacles, while reducing to 0.6 m/s with higher deflection to safely avoid moving humans.
中文: 本文提出ImpedanceGPT系统,通过融合视觉语言模型与检索增强生成技术,实现无人机群在复杂环境中的实时自适应导航,能根据障碍物类型和环境条件动态调整阻抗控制参数,有效提升飞行安全性与协同效率。
English: This paper introduces ImpedanceGPT, a novel system integrating Vision-Language Model and retrieval-augmented generation to enable real-time adaptive navigation for drone swarms, enhancing safety and coordination by dynamically adjusting control parameters based on obstacle types and environmental conditions.

Authors:Microsoft, :, Abdelrahman Abouelenin, Atabak Ashfaq, Adam Atkinson, Hany Awadalla, Nguyen Bach, Jianmin Bao, Alon Benhaim, Martin Cai, Vishrav Chaudhary, Congcong Chen, Dong Chen, Dongdong Chen, Junkun Chen, Weizhu Chen, Yen-Chun Chen, Yi-ling Chen, Qi Dai, Xiyang Dai, Ruchao Fan, Mei Gao, Min Gao, Amit Garg, Abhishek Goswami, Junheng Hao, Amr Hendy, Yuxuan Hu, Xin Jin, Mahmoud Khademi, Dongwoo Kim, Young Jin Kim, Gina Lee, Jinyu Li, Yunsheng Li, Chen Liang, Xihui Lin, Zeqi Lin, Mengchen Liu, Yang Liu, Gilsinia Lopez, Chong Luo, Piyush Madan, Vadim Mazalov, Arindam Mitra, Ali Mousavi, Anh Nguyen, Jing Pan, Daniel Perez-Becker, Jacob Platin, Thomas Portet, Kai Qiu, Bo Ren, Liliang Ren, Sambuddha Roy, Ning Shang, Yelong Shen, Saksham Singhal, Subhojit Som, Xia Song, Tetyana Sych, Praneetha Vaddamanu, Shuohang Wang, Yiming Wang, Zhenghao Wang, Haibin Wu, Haoran Xu, Weijian Xu, Yifan Yang, Ziyi Yang, Donghan Yu, Ishmam Zabir, Jianwen Zhang, Li Lyna Zhang, Yunan Zhang, Xiren Zhou
Title: Phi-4-Mini Technical Report: Compact yet Powerful Multimodal Language Models via Mixture-of-LoRAs
Abstract:
We introduce Phi-4-Mini and Phi-4-Multimodal, compact yet highly capable language and multimodal models. Phi-4-Mini is a 3.8-billion-parameter language model trained on high-quality web and synthetic data, significantly outperforming recent open-source models of similar size and matching the performance of models twice its size on math and coding tasks requiring complex reasoning. This achievement is driven by a carefully curated synthetic data recipe emphasizing high-quality math and coding datasets. Compared to its predecessor, Phi-3.5-Mini, Phi-4-Mini features an expanded vocabulary size of 200K tokens to better support multilingual applications, as well as group query attention for more efficient long-sequence generation. Phi-4-Multimodal is a multimodal model that integrates text, vision, and speech/audio input modalities into a single model. Its novel modality extension approach leverages LoRA adapters and modality-specific routers to allow multiple inference modes combining various modalities without interference. For example, it now ranks first in the OpenASR leaderboard to date, although the LoRA component of the speech/audio modality has just 460 million parameters. Phi-4-Multimodal supports scenarios involving (vision + language), (vision + speech), and (speech/audio) inputs, outperforming larger vision-language and speech-language models on a wide range of tasks. Additionally, we experiment to further train Phi-4-Mini to enhance its reasoning capabilities. Despite its compact 3.8-billion-parameter size, this experimental version achieves reasoning performance on par with or surpassing significantly larger models, including DeepSeek-R1-Distill-Qwen-7B and DeepSeek-R1-Distill-Llama-8B.
中文: Phi-4-Mini是一款紧凑型38亿参数语言模型,在数学和编程任务中表现出色;而Phi-4-Multimodal通过创新适配器整合文本、视觉和语音输入,在多模态任务中超越了许多大型模型。
English: Phi-4-Mini is a compact 3.8-billion-parameter language model excelling in math and coding tasks, while Phi-4-Multimodal integrates text, vision, and speech inputs using innovative adapters to outperform larger models across various benchmarks.

Authors:Teng Lin, Yizhang Zhu, Yuyu Luo, Nan Tang
Title: SRAG: Structured Retrieval-Augmented Generation for Multi-Entity Question Answering over Wikipedia Graph
Abstract:
Multi-entity question answering (MEQA) poses significant challenges for large language models (LLMs), which often struggle to consolidate scattered information across multiple documents. An example question might be "What is the distribution of IEEE Fellows among various fields of study?", which requires retrieving information from diverse sources e.g., Wikipedia pages. The effectiveness of current retrieval-augmented generation (RAG) methods is limited by the LLMs' capacity to aggregate insights from numerous pages. To address this gap, this paper introduces a structured RAG (SRAG) framework that systematically organizes extracted entities into relational tables (e.g., tabulating entities with schema columns like "name" and "field of study") and then apply table-based reasoning techniques. Our approach decouples retrieval and reasoning, enabling LLMs to focus on structured data analysis rather than raw text aggregation. Extensive experiments on Wikipedia-based multi-entity QA tasks demonstrate that SRAG significantly outperforms state-of-the-art long-context LLMs and RAG solutions, achieving a 29.6% improvement in accuracy. The results underscore the efficacy of structuring unstructured data to enhance LLMs' reasoning capabilities.
中文摘要:本文提出的结构化RAG框架通过将实体组织成关系表来增强大语言模型在多实体问答中的推理能力,相比现有方法实现了29.6%的准确率提升。
English Summary: This paper introduces a Structured RAG (SRAG) framework that organizes entities into relational tables to enhance LLMs' reasoning in multi-entity QA, achieving a 29.6% accuracy improvement over existing methods.

Authors:Xiongfei Su, Siyuan Li, Yuning Cui, Miao Cao, Yulun Zhang, Zheng Chen, Zongliang Wu, Zedong Wang, Yuanlong Zhang, Xin Yuan
Title: Prior-guided Hierarchical Harmonization Network for Efficient Image Dehazing
Abstract:
Image dehazing is a crucial task that involves the enhancement of degraded images to recover their sharpness and textures. While vision Transformers have exhibited impressive results in diverse dehazing tasks, their quadratic complexity and lack of dehazing priors pose significant drawbacks for real-world applications. In this paper, guided by triple priors, Bright Channel Prior (BCP), Dark Channel Prior (DCP), and Histogram Equalization (HE), we propose a \textit{P}rior-\textit{g}uided Hierarchical \textit{H}armonization Network (PGH$^2$Net) for image dehazing. PGH$^2$Net is built upon the UNet-like architecture with an efficient encoder and decoder, consisting of two module types: (1) Prior aggregation module that injects B/DCP and selects diverse contexts with gating attention. (2) Feature harmonization modules that subtract low-frequency components from spatial and channel aspects and learn more informative feature distributions to equalize the feature maps.
中文摘要:本文提出PGH²Net,一种基于三重先验(亮通道、暗通道和直方图均衡)的层次协调网络,通过UNet架构整合先验信息并协调特征,有效提升图像去雾效果。
English Summary: This paper introduces PGH²Net, a prior-guided hierarchical network that integrates triple priors—Bright Channel, Dark Channel, and Histogram Equalization—into a UNet-like architecture to efficiently enhance image dehazing by aggregating priors and harmonizing features.

Authors:Junsong Zhang, Chunyu Lin, Zhijie Shen, Lang Nie, Kang Liao, Yao Zhao
Title: Semi-Supervised 360 Layout Estimation with Panoramic Collaborative Perturbations
Abstract:
The performance of existing supervised layout estimation methods heavily relies on the quality of data annotations. However, obtaining large-scale and high-quality datasets remains a laborious and time-consuming challenge. To solve this problem, semi-supervised approaches are introduced to relieve the demand for expensive data annotations by encouraging the consistent results of unlabeled data with different perturbations. However, existing solutions merely employ vanilla perturbations, ignoring the characteristics of panoramic layout estimation. In contrast, we propose a novel semi-supervised method named SemiLayout360, which incorporates the priors of the panoramic layout and distortion through collaborative perturbations. Specifically, we leverage the panoramic layout prior to enhance the model's focus on potential layout boundaries. Meanwhile, we introduce the panoramic distortion prior to strengthen distortion awareness. Furthermore, to prevent intense perturbations from hindering model convergence and ensure the effectiveness of prior-based perturbations, we divide and reorganize them as panoramic collaborative perturbations. Our experimental results on three mainstream benchmarks demonstrate that the proposed method offers significant advantages over existing state-of-the-art (SoTA) solutions.
中文摘要:现有监督式布局估计方法依赖数据标注质量,但SemiLayout360通过全景布局和畸变先验的协同扰动提出半监督方法,在主流基准测试中显著优于现有最优解决方案。
English Summary: Existing supervised layout estimation methods depend on data annotation quality, but SemiLayout360 introduces a semi-supervised approach with panoramic layout and distortion priors through collaborative perturbations, outperforming state-of-the-art methods on benchmarks.

Authors:Jinyu Miao, Tuopu Wen, Kun Jiang, Kangan Qian, Zheng Fu, Yunlong Wang, Zhihuang Zhang, Mengmeng Yang, Jin Huang, Zhihua Zhong, Diange Yang
Title: RAVE: End-to-end Hierarchical Visual Localization with Rasterized and Vectorized HD map
Abstract:
Accurate localization serves as an important component in autonomous driving systems. Traditional rule-based localization involves many standalone modules, which is theoretically fragile and requires costly hyperparameter tuning, therefore sacrificing the accuracy and generalization. In this paper, we propose an end-to-end visual localization system, RAVE, in which the surrounding images are associated with the HD map data to estimate pose. To ensure high-quality observations for localization, a low-rank flow-based prior fusion module (FLORA) is developed to incorporate misaligned map prior into the perceived BEV features. Pursuing a balance among efficiency, interpretability, and accuracy, a hierarchical localization module is proposed, which efficiently estimates poses through a decoupled BEV neural matching-based pose solver (DEMA) using rasterized HD map, and then refines the estimation through a Transformer-based pose regressor (POET) using vectorized HD map. The experimental results demonstrate that our method can perform robust and accurate localization under varying environmental conditions while running efficiently.
中文: 本文提出RAVE,一种用于自动驾驶的端到端视觉定位系统,通过FLORA模块和分层定位方法结合DEMA与POET,利用高清地图数据与周围图像关联,实现在多变环境下高效、鲁棒且精确的位姿估计。
English: The paper introduces RAVE, an end-to-end visual localization system for autonomous driving that integrates surrounding images with HD map data using modules like FLORA and a hierarchical approach with DEMA and POET to achieve robust, accurate, and efficient pose estimation under diverse conditions.

Authors:Yao Yao, Yifei Yang, Xinbei Ma, Dongjie Yang, Zhuosheng Zhang, Zuchao Li, Hai Zhao
Title: How Deep is Love in LLMs' Hearts? Exploring Semantic Size in Human-like Cognition
Abstract:
How human cognitive abilities are formed has long captivated researchers. However, a significant challenge lies in developing meaningful methods to measure these complex processes. With the advent of large language models (LLMs), which now rival human capabilities in various domains, we are presented with a unique testbed to investigate human cognition through a new lens. Among the many facets of cognition, one particularly crucial aspect is the concept of semantic size, the perceived magnitude of both abstract and concrete words or concepts. This study seeks to investigate whether LLMs exhibit similar tendencies in understanding semantic size, thereby providing insights into the underlying mechanisms of human cognition. We begin by exploring metaphorical reasoning, comparing how LLMs and humans associate abstract words with concrete objects of varying sizes. Next, we examine LLMs' internal representations to evaluate their alignment with human cognitive processes. Our findings reveal that multi-modal training is crucial for LLMs to achieve more human-like understanding, suggesting that real-world, multi-modal experiences are similarly vital for human cognitive development. Lastly, we examine whether LLMs are influenced by attention-grabbing headlines with larger semantic sizes in a real-world web shopping scenario. The results show that multi-modal LLMs are more emotionally engaged in decision-making, but this also introduces potential biases, such as the risk of manipulation through clickbait headlines. Ultimately, this study offers a novel perspective on how LLMs interpret and internalize language, from the smallest concrete objects to the most profound abstract concepts like love. The insights gained not only improve our understanding of LLMs but also provide new avenues for exploring the cognitive abilities that define human intelligence.
中文: 本研究探讨大语言模型是否具备类似人类的语义尺寸理解能力,发现多模态训练能增强其与人类认知的契合度,但同时也带来易受点击诱饵影响等偏见。
English: This study investigates whether large language models (LLMs) exhibit human-like understanding of semantic size, revealing that multi-modal training enhances their cognitive alignment with humans while also introducing biases like susceptibility to clickbait.

Authors:Zhiyuan Xu, Yinuo Zhao, Kun Wu, Ning Liu, Junjie Ji, Zhengping Che, Chi Harold Liu, Jian Tang
Title: HACTS: a Human-As-Copilot Teleoperation System for Robot Learning
Abstract:
Teleoperation is essential for autonomous robot learning, especially in manipulation tasks that require human demonstrations or corrections. However, most existing systems only offer unilateral robot control and lack the ability to synchronize the robot's status with the teleoperation hardware, preventing real-time, flexible intervention. In this work, we introduce HACTS (Human-As-Copilot Teleoperation System), a novel system that establishes bilateral, real-time joint synchronization between a robot arm and teleoperation hardware. This simple yet effective feedback mechanism, akin to a steering wheel in autonomous vehicles, enables the human copilot to intervene seamlessly while collecting action-correction data for future learning. Implemented using 3D-printed components and low-cost, off-the-shelf motors, HACTS is both accessible and scalable. Our experiments show that HACTS significantly enhances performance in imitation learning (IL) and reinforcement learning (RL) tasks, boosting IL recovery capabilities and data efficiency, and facilitating human-in-the-loop RL. HACTS paves the way for more effective and interactive human-robot collaboration and data-collection, advancing the capabilities of robot manipulation.
中文: HACTS是一种创新的遥操作系统,通过建立机器人手臂与遥操作硬件间的双向实时同步,实现无缝人工干预,并显著提升模仿学习和强化学习任务中的机器人性能。
English: HACTS is a novel teleoperation system that enables bilateral, real-time synchronization between a robot arm and teleoperation hardware, allowing seamless human intervention and improving robot learning performance in imitation and reinforcement learning tasks.

Authors:Takeshi Noda, Chao Chen, Junsheng Zhou, Weiqi Zhang, Yu-Shen Liu, Zhizhong Han
Title: Learning Bijective Surface Parameterization for Inferring Signed Distance Functions from Sparse Point Clouds with Grid Deformation
Abstract:
Inferring signed distance functions (SDFs) from sparse point clouds remains a challenge in surface reconstruction. The key lies in the lack of detailed geometric information in sparse point clouds, which is essential for learning a continuous field. To resolve this issue, we present a novel approach that learns a dynamic deformation network to predict SDFs in an end-to-end manner. To parameterize a continuous surface from sparse points, we propose a bijective surface parameterization (BSP) that learns the global shape from local patches. Specifically, we construct a bijective mapping for sparse points from the parametric domain to 3D local patches, integrating patches into the global surface. Meanwhile, we introduce grid deformation optimization (GDO) into the surface approximation to optimize the deformation of grid points and further refine the parametric surfaces. Experimental results on synthetic and real scanned datasets demonstrate that our method significantly outperforms the current state-of-the-art methods. Project page: https://takeshie.github.io/Bijective-SDF
Chinese: 本文提出了一种新颖方法,通过学习动态变形网络和双射表面参数化从稀疏点云预测有符号距离函数,在表面重建方面显著优于现有技术。
English: This paper introduces a novel method that learns a dynamic deformation network and bijective surface parameterization to predict signed distance functions from sparse point clouds, significantly outperforming existing techniques in surface reconstruction.

Authors:Wei Tao, Bin Zhang, Xiaoyang Qu, Jiguang Wan, Jianzong Wang
Title: Cocktail: Chunk-Adaptive Mixed-Precision Quantization for Long-Context LLM Inference
Abstract:
Recently, large language models (LLMs) have been able to handle longer and longer contexts. However, a context that is too long may cause intolerant inference latency and GPU memory usage. Existing methods propose mixed-precision quantization to the key-value (KV) cache in LLMs based on token granularity, which is time-consuming in the search process and hardware inefficient during computation. This paper introduces a novel approach called Cocktail, which employs chunk-adaptive mixed-precision quantization to optimize the KV cache. Cocktail consists of two modules: chunk-level quantization search and chunk-level KV cache computation. Chunk-level quantization search determines the optimal bitwidth configuration of the KV cache chunks quickly based on the similarity scores between the corresponding context chunks and the query, maintaining the model accuracy. Furthermore, chunk-level KV cache computation reorders the KV cache chunks before quantization, avoiding the hardware inefficiency caused by mixed-precision quantization in inference computation. Extensive experiments demonstrate that Cocktail outperforms state-of-the-art KV cache quantization methods on various models and datasets.
中文: 本文提出Cocktail方法,通过块级自适应混合精度量化优化大语言模型的键值缓存,基于上下文块与查询的相似度快速确定最优位宽配置以保持模型精度,并通过重排缓存块避免硬件低效问题。
English: This paper introduces Cocktail, a method that uses chunk-adaptive mixed-precision quantization to optimize the key-value cache in large language models, enhancing efficiency and maintaining accuracy by quickly determining optimal bitwidth configurations and reordering chunks for hardware-friendly computation.

Authors:Bin Zhang, Xiaoyang Qu, Guokuan Li, Jiguang Wan, Jianzong Wang
Title: VisTa: Visual-contextual and Text-augmented Zero-shot Object-level OOD Detection
Abstract:
As object detectors are increasingly deployed as black-box cloud services or pre-trained models with restricted access to the original training data, the challenge of zero-shot object-level out-of-distribution (OOD) detection arises. This task becomes crucial in ensuring the reliability of detectors in open-world settings. While existing methods have demonstrated success in image-level OOD detection using pre-trained vision-language models like CLIP, directly applying such models to object-level OOD detection presents challenges due to the loss of contextual information and reliance on image-level alignment. To tackle these challenges, we introduce a new method that leverages visual prompts and text-augmented in-distribution (ID) space construction to adapt CLIP for zero-shot object-level OOD detection. Our method preserves critical contextual information and improves the ability to differentiate between ID and OOD objects, achieving competitive performance across different benchmarks.
中文: 本研究提出了一种利用视觉提示和文本增强的分布内空间构建的新方法,使CLIP模型能够适应零样本物体级分布外检测,有效保留上下文信息并在多个基准测试中提升检测性能。
English: This study introduces a novel method using visual prompts and text-augmented in-distribution space construction to adapt CLIP for zero-shot object-level out-of-distribution detection, effectively preserving context and enhancing detection performance across benchmarks.

Authors:Bingyan Xie, Yongpeng Wu, Yuxuan Shi, Biqian Feng, Wenjun Zhang, Jihong Park, Tony Q. S. Quek
Title: WVSC: Wireless Video Semantic Communication with Multi-frame Compensation
Abstract:
Existing wireless video transmission schemes directly conduct video coding in pixel level, while neglecting the inner semantics contained in videos. In this paper, we propose a wireless video semantic communication framework, abbreviated as WVSC, which integrates the idea of semantic communication into wireless video transmission scenarios. WVSC first encodes original video frames as semantic frames and then conducts video coding based on such compact representations, enabling the video coding in semantic level rather than pixel level. Moreover, to further reduce the communication overhead, a reference semantic frame is introduced to substitute motion vectors of each frame in common video coding methods. At the receiver, multi-frame compensation (MFC) is proposed to produce compensated current semantic frame with a multi-frame fusion attention module. With both the reference frame transmission and MFC, the bandwidth efficiency improves with satisfying video transmission performance. Experimental results verify the performance gain of WVSC over other DL-based methods e.g. DVSC about 1 dB and traditional schemes about 2 dB in terms of PSNR.
中文: 提出的WVSC框架通过将视频编码从像素级提升至语义级,采用参考语义帧和多帧补偿技术,在显著降低带宽需求的同时,其性能优于现有深度学习和传统方案约1-2 dB。
English: The proposed WVSC framework enables wireless video transmission by encoding videos at the semantic level rather than pixel level, utilizing reference semantic frames and multi-frame compensation to significantly reduce bandwidth while improving performance over existing methods.

Authors:Jinwei Li, Huan-ang Gao, Wenyi Li, Haohan Chi, Chenyu Liu, Chenxi Du, Yiqian Liu, Mingju Gao, Guiyu Zhang, Zongzheng Zhang, Li Yi, Yao Yao, Jingwei Zhao, Hongyang Li, Yikai Wang, Hao Zhao
Title: FB-4D: Spatial-Temporal Coherent Dynamic 3D Content Generation with Feature Banks
Abstract:
With the rapid advancements in diffusion models and 3D generation techniques, dynamic 3D content generation has become a crucial research area. However, achieving high-fidelity 4D (dynamic 3D) generation with strong spatial-temporal consistency remains a challenging task. Inspired by recent findings that pretrained diffusion features capture rich correspondences, we propose FB-4D, a novel 4D generation framework that integrates a Feature Bank mechanism to enhance both spatial and temporal consistency in generated frames. In FB-4D, we store features extracted from previous frames and fuse them into the process of generating subsequent frames, ensuring consistent characteristics across both time and multiple views. To ensure a compact representation, the Feature Bank is updated by a proposed dynamic merging mechanism. Leveraging this Feature Bank, we demonstrate for the first time that generating additional reference sequences through multiple autoregressive iterations can continuously improve generation performance. Experimental results show that FB-4D significantly outperforms existing methods in terms of rendering quality, spatial-temporal consistency, and robustness. It surpasses all multi-view generation tuning-free approaches by a large margin and achieves performance on par with training-based methods.
中文: FB-4D框架通过引入特征库机制,存储并融合先前帧的特征来增强动态3D生成中的时空一致性,在渲染质量上显著优于现有免调优方法,并达到与基于训练方法相当的性能。
English: The FB-4D framework introduces a Feature Bank mechanism that stores and merges features from previous frames to enhance spatial-temporal consistency in dynamic 3D generation, achieving superior rendering quality and outperforming existing tuning-free methods while matching training-based approaches.

Authors:Shijie Zhou, Hui Ren, Yijia Weng, Shuwang Zhang, Zhen Wang, Dejia Xu, Zhiwen Fan, Suya You, Zhangyang Wang, Leonidas Guibas, Achuta Kadambi
Title: Feature4X: Bridging Any Monocular Video to 4D Agentic AI with Versatile Gaussian Feature Fields
Abstract:
Recent advancements in 2D and multimodal models have achieved remarkable success by leveraging large-scale training on extensive datasets. However, extending these achievements to enable free-form interactions and high-level semantic operations with complex 3D/4D scenes remains challenging. This difficulty stems from the limited availability of large-scale, annotated 3D/4D or multi-view datasets, which are crucial for generalizable vision and language tasks such as open-vocabulary and prompt-based segmentation, language-guided editing, and visual question answering (VQA). In this paper, we introduce Feature4X, a universal framework designed to extend any functionality from 2D vision foundation model into the 4D realm, using only monocular video input, which is widely available from user-generated content. The "X" in Feature4X represents its versatility, enabling any task through adaptable, model-conditioned 4D feature field distillation. At the core of our framework is a dynamic optimization strategy that unifies multiple model capabilities into a single representation. Additionally, to the best of our knowledge, Feature4X is the first method to distill and lift the features of video foundation models (e.g., SAM2, InternVideo2) into an explicit 4D feature field using Gaussian Splatting. Our experiments showcase novel view segment anything, geometric and appearance scene editing, and free-form VQA across all time steps, empowered by LLMs in feedback loops. These advancements broaden the scope of agentic AI applications by providing a foundation for scalable, contextually and spatiotemporally aware systems capable of immersive dynamic 4D scene interaction.
中文: 当前二维和多模态模型虽在大型数据集上表现卓越,但受限于标注数据不足,难以实现复杂三维/四维场景的自由交互;为此,Feature4X框架应运而生,它通过单目视频和动态优化将二维视觉模型功能扩展至四维领域,支持多样化任务。
English: Recent 2D and multimodal models excel with large datasets but struggle with complex 3D/4D scene interactions due to limited annotated data, prompting the introduction of Feature4X, a universal framework that extends 2D vision model capabilities into 4D using monocular video and dynamic optimization for versatile tasks.

Authors:Yiran Zhang, Ruiyin Li, Peng Liang, Weisong Sun, Yang Liu
Title: Knowledge-Based Multi-Agent Framework for Automated Software Architecture Design
Abstract:
Architecture design is a critical step in software development. However, creating a high-quality architecture is often costly due to the significant need for human expertise and manual effort. Recently, agents built upon Large Language Models (LLMs) have achieved remarkable success in various software engineering tasks. Despite this progress, the use of agents to automate the architecture design process remains largely unexplored. To address this gap, we envision a Knowledge-based Multi-Agent Architecture Design (MAAD) framework. MAAD uses agents to simulate human roles in the traditional software architecture design process, thereby automating the design process. To empower these agents, MAAD incorporates knowledge extracted from three key sources: 1) existing system designs, 2) authoritative literature, and 3) architecture experts. By envisioning the MAAD framework, we aim to advance the full automation of application-level system development.
中文: MAAD框架利用基于大语言模型的智能体,通过模拟人类角色并整合来自系统设计、权威文献和专家的知识,实现软件架构设计的自动化,旨在推动系统开发的全自动化进程。
English: The MAAD framework utilizes LLM-based agents to automate software architecture design by simulating human roles and integrating knowledge from system designs, literature, and experts, aiming to advance full automation in system development.

Authors:Ke Ma, Jiaqi Tang, Bin Guo, Fan Dang, Sicong Liu, Zhui Zhu, Lei Wu, Cheng Fang, Ying-Cong Chen, Zhiwen Yu, Yunhao Liu
Title: SURGEON: Memory-Adaptive Fully Test-Time Adaptation via Dynamic Activation Sparsity
Abstract:
Despite the growing integration of deep models into mobile terminals, the accuracy of these models declines significantly due to various deployment interferences. Test-time adaptation (TTA) has emerged to improve the performance of deep models by adapting them to unlabeled target data online. Yet, the significant memory cost, particularly in resource-constrained terminals, impedes the effective deployment of most backward-propagation-based TTA methods. To tackle memory constraints, we introduce SURGEON, a method that substantially reduces memory cost while preserving comparable accuracy improvements during fully test-time adaptation (FTTA) without relying on specific network architectures or modifications to the original training procedure. Specifically, we propose a novel dynamic activation sparsity strategy that directly prunes activations at layer-specific dynamic ratios during adaptation, allowing for flexible control of learning ability and memory cost in a data-sensitive manner. Among this, two metrics, Gradient Importance and Layer Activation Memory, are considered to determine the layer-wise pruning ratios, reflecting accuracy contribution and memory efficiency, respectively. Experimentally, our method surpasses the baselines by not only reducing memory usage but also achieving superior accuracy, delivering SOTA performance across diverse datasets, architectures, and tasks.
Chinese: SURGEON是一种创新的测试时适应方法,通过基于梯度重要性和层内存效率的动态激活剪枝,在保持高精度的同时显著降低了内存消耗,适用于多种数据集和任务。
English: SURGEON is a novel test-time adaptation method that significantly cuts memory costs by dynamically pruning activations based on gradient importance and layer memory efficiency, while maintaining high accuracy across various datasets and tasks.

Authors:Franck Cappello, Allison Baker, Ebru Bozda, Martin Burtscher, Kyle Chard, Sheng Di, Paul Christopher O Grady, Peng Jiang, Shaomeng Li, Erik Lindahl, Peter Lindstrom, Magnus Lundborg, Kai Zhao, Xin Liang, Masaru Nagaso, Kento Sato, Amarjit Singh, Seung Woo Son, Dingwen Tao, Jiannan Tian, Robert Underwood, Kazutomo Yoshii, Danylo Lykov, Yuri Alexeev, Kyle Gerard Felker
Title: Lossy Compression of Scientific Data: Applications Constrains and Requirements
Abstract:
Increasing data volumes from scientific simulations and instruments (supercomputers, accelerators, telescopes) often exceed network, storage, and analysis capabilities. The scientific community's response to this challenge is scientific data reduction. Reduction can take many forms, such as triggering, sampling, filtering, quantization, and dimensionality reduction. This report focuses on a specific technique: lossy compression. Lossy compression retains all data points, leveraging correlations and controlled reduced accuracy. Quality constraints, especially for quantities of interest, are crucial for preserving scientific discoveries. User requirements also include compression ratio and speed. While many papers have been published on lossy compression techniques and reference datasets are shared by the community, there is a lack of detailed specifications of application needs that can guide lossy compression researchers and developers. This report fills this gap by reporting on the requirements and constraints of nine scientific applications covering a large spectrum of domains (climate, combustion, cosmology, fusion, light sources, molecular dynamics, quantum circuit simulation, seismology, and system logs). The report also details key lossy compression technologies (SZ, ZFP, MGARD, LC, SPERR, DCTZ, TEZip, LibPressio), discussing their history, principles, error control, hardware support, features, and impact. By presenting both application needs and compression technologies, the report aims to inspire new research to fill existing gaps.
Chinese: 本报告通过详细阐述九个科学领域的需求并分析关键有损压缩技术,旨在弥合应用需求与压缩技术之间的差距,激发新的研究方向。
English: This report addresses the gap between scientific applications' needs and lossy compression technologies by detailing requirements from nine domains and analyzing key compression methods to inspire future research.

Authors:Mingju Gao, Yike Pan, Huan-ang Gao, Zongzheng Zhang, Wenyi Li, Hao Dong, Hao Tang, Li Yi, Hao Zhao
Title: PartRM: Modeling Part-Level Dynamics with Large Cross-State Reconstruction Model
Abstract:
As interest grows in world models that predict future states from current observations and actions, accurately modeling part-level dynamics has become increasingly relevant for various applications. Existing approaches, such as Puppet-Master, rely on fine-tuning large-scale pre-trained video diffusion models, which are impractical for real-world use due to the limitations of 2D video representation and slow processing times. To overcome these challenges, we present PartRM, a novel 4D reconstruction framework that simultaneously models appearance, geometry, and part-level motion from multi-view images of a static object. PartRM builds upon large 3D Gaussian reconstruction models, leveraging their extensive knowledge of appearance and geometry in static objects. To address data scarcity in 4D, we introduce the PartDrag-4D dataset, providing multi-view observations of part-level dynamics across over 20,000 states. We enhance the model's understanding of interaction conditions with a multi-scale drag embedding module that captures dynamics at varying granularities. To prevent catastrophic forgetting during fine-tuning, we implement a two-stage training process that focuses sequentially on motion and appearance learning. Experimental results show that PartRM establishes a new state-of-the-art in part-level motion learning and can be applied in manipulation tasks in robotics. Our code, data, and models are publicly available to facilitate future research.
中文:PartRM提出了一种新颖的4D重建框架,通过多视角图像同时建模静态物体的外观、几何和部件级运动,克服了现有方法的局限性,在运动学习和机器人操作应用中实现了最先进的性能。
English: PartRM introduces a novel 4D reconstruction framework that overcomes limitations of existing methods by modeling appearance, geometry, and part-level motion from multi-view images, achieving state-of-the-art performance in motion learning and robotic manipulation applications.

Authors:Shujuan Li, Yu-Shen Liu, Zhizhong Han
Title: GaussianUDF: Inferring Unsigned Distance Functions through 3D Gaussian Splatting
Abstract:
Reconstructing open surfaces from multi-view images is vital in digitalizing complex objects in daily life. A widely used strategy is to learn unsigned distance functions (UDFs) by checking if their appearance conforms to the image observations through neural rendering. However, it is still hard to learn continuous and implicit UDF representations through 3D Gaussians splatting (3DGS) due to the discrete and explicit scene representation, i.e., 3D Gaussians. To resolve this issue, we propose a novel approach to bridge the gap between 3D Gaussians and UDFs. Our key idea is to overfit thin and flat 2D Gaussian planes on surfaces, and then, leverage the self-supervision and gradient-based inference to supervise unsigned distances in both near and far area to surfaces. To this end, we introduce novel constraints and strategies to constrain the learning of 2D Gaussians to pursue more stable optimization and more reliable self-supervision, addressing the challenges brought by complicated gradient field on or near the zero level set of UDFs. We report numerical and visual comparisons with the state-of-the-art on widely used benchmarks and real data to show our advantages in terms of accuracy, efficiency, completeness, and sharpness of reconstructed open surfaces with boundaries.
中文摘要:本研究提出了一种创新方法,通过优化表面上的二维高斯平面来弥合三维高斯泼溅与无符号距离函数之间的差距,利用新型约束和自监督策略实现了开放表面重建在精度和完整性上的显著提升。
English Summary: This study introduces a novel method that bridges 3D Gaussian splatting and unsigned distance functions by optimizing 2D Gaussian planes on surfaces, achieving superior reconstruction of open surfaces through innovative constraints and self-supervision.

Authors:Wenyuan Zhang, Yixiao Yang, Han Huang, Liang Han, Kanle Shi, Yu-Shen Liu, Zhizhong Han
Title: MonoInstance: Enhancing Monocular Priors via Multi-view Instance Alignment for Neural Rendering and Reconstruction
Abstract:
Monocular depth priors have been widely adopted by neural rendering in multi-view based tasks such as 3D reconstruction and novel view synthesis. However, due to the inconsistent prediction on each view, how to more effectively leverage monocular cues in a multi-view context remains a challenge. Current methods treat the entire estimated depth map indiscriminately, and use it as ground truth supervision, while ignoring the inherent inaccuracy and cross-view inconsistency in monocular priors. To resolve these issues, we propose MonoInstance, a general approach that explores the uncertainty of monocular depths to provide enhanced geometric priors for neural rendering and reconstruction. Our key insight lies in aligning each segmented instance depths from multiple views within a common 3D space, thereby casting the uncertainty estimation of monocular depths into a density measure within noisy point clouds. For high-uncertainty areas where depth priors are unreliable, we further introduce a constraint term that encourages the projected instances to align with corresponding instance masks on nearby views. MonoInstance is a versatile strategy which can be seamlessly integrated into various multi-view neural rendering frameworks. Our experimental results demonstrate that MonoInstance significantly improves the performance in both reconstruction and novel view synthesis under various benchmarks.
中文:MonoInstance通过在多视角下对齐实例深度并引入不确定性约束,有效解决了单眼深度先验的不一致性问题,从而显著提升了神经渲染与三维重建的性能。
English: MonoInstance enhances neural rendering and 3D reconstruction by addressing inconsistencies in monocular depth priors through uncertainty-based alignment of instance depths across multiple views and integrating a constraint for unreliable areas.

Authors:Wenyuan Zhang, Emily Yue-ting Jia, Junsheng Zhou, Baorui Ma, Kanle Shi, Yu-Shen Liu, Zhizhong Han
Title: NeRFPrior: Learning Neural Radiance Field as a Prior for Indoor Scene Reconstruction
Abstract:
Recently, it has shown that priors are vital for neural implicit functions to reconstruct high-quality surfaces from multi-view RGB images. However, current priors require large-scale pre-training, and merely provide geometric clues without considering the importance of color. In this paper, we present NeRFPrior, which adopts a neural radiance field as a prior to learn signed distance fields using volume rendering for surface reconstruction. Our NeRF prior can provide both geometric and color clues, and also get trained fast under the same scene without additional data. Based on the NeRF prior, we are enabled to learn a signed distance function (SDF) by explicitly imposing a multi-view consistency constraint on each ray intersection for surface inference. Specifically, at each ray intersection, we use the density in the prior as a coarse geometry estimation, while using the color near the surface as a clue to check its visibility from another view angle. For the textureless areas where the multi-view consistency constraint does not work well, we further introduce a depth consistency loss with confidence weights to infer the SDF. Our experimental results outperform the state-of-the-art methods under the widely used benchmarks.
中文摘要:NeRFPrior采用神经辐射场作为先验,通过体渲染学习符号距离场进行表面重建,提供几何和颜色线索,无需额外数据即可快速训练,并在多视角一致性和深度一致性损失下实现卓越性能。
English Summary: NeRFPrior introduces a neural radiance field as a prior that provides both geometric and color clues for surface reconstruction, enabling fast training and superior performance through multi-view consistency and depth consistency loss.

Authors:Adarsh Salagame, Sasank Potluri, Keshav Bharadwaj Vaidyanathan, Kruthika Gangaraju, Eric Sihite, Milad Ramezani, Alireza Ramezani
Title: Vision-Guided Loco-Manipulation with a Snake Robot
Abstract:
This paper presents the development and integration of a vision-guided loco-manipulation pipeline for Northeastern University's snake robot, COBRA. The system leverages a YOLOv8-based object detection model and depth data from an onboard stereo camera to estimate the 6-DOF pose of target objects in real time. We introduce a framework for autonomous detection and control, enabling closed-loop loco-manipulation for transporting objects to specified goal locations. Additionally, we demonstrate open-loop experiments in which COBRA successfully performs real-time object detection and loco-manipulation tasks.
中文摘要:本文介绍了为东北大学COBRA蛇形机器人开发的视觉引导运动操作系统,通过闭环控制实现自主目标检测与物体搬运,并在开环实验中验证了实时操作能力。
English Summary: This paper details the development of a vision-guided system for Northeastern University's COBRA snake robot, enabling real-time object pose estimation and autonomous loco-manipulation through closed-loop control and open-loop experiments.

Authors:Adarsh Salagame, Shashwat Pandya, Ioannis Mandralis, Eric Sihite, Alireza Ramezani, Morteza Gharib
Title: NMPC-based Unified Posture Manipulation and Thrust Vectoring for Fault Recovery
Abstract:
Multi-rotors face significant risks, as actuator failures at high altitudes can easily result in a crash and the robot's destruction. Therefore, rapid fault recovery in the event of an actuator failure is necessary for the fault-tolerant and safe operation of unmanned aerial robots. In this work, we present a fault recovery approach based on the unification of posture manipulation and thrust vectoring. The key contributions of this work are: 1) Derivation of two flight dynamics models (high-fidelity and reduced-order) that capture posture control and thrust vectoring. 2) Design of a controller based on Nonlinear Model Predictive Control (NMPC) and demonstration of fault recovery in simulation using a high-fidelity model of the Multi-Modal Mobility Morphobot (M4) in Simscape.
中文: 本研究提出了一种结合姿态操纵与推力矢量控制的故障恢复方法,通过非线性模型预测控制器和高保真仿真,确保多旋翼无人机在发生执行器故障后仍能安全运行。
English: This work presents a fault recovery approach for multi-rotors that unifies posture manipulation and thrust vectoring, using NMPC-based controllers and high-fidelity simulations to ensure safe operation after actuator failures.

Authors:Bojun Liu, Yangzhi Ma, Ao Luo, Li Li, Dong Liu
Title: Voxel-based Point Cloud Geometry Compression with Space-to-Channel Context
Abstract:
Voxel-based methods are among the most efficient for point cloud geometry compression, particularly with dense point clouds. However, they face limitations due to a restricted receptive field, especially when handling high-bit depth point clouds. To overcome this issue, we introduce a stage-wise Space-to-Channel (S2C) context model for both dense point clouds and low-level sparse point clouds. This model utilizes a channel-wise autoregressive strategy to effectively integrate neighborhood information at a coarse resolution. For high-level sparse point clouds, we further propose a level-wise S2C context model that addresses resolution limitations by incorporating Geometry Residual Coding (GRC) for consistent-resolution cross-level prediction. Additionally, we use the spherical coordinate system for its compact representation and enhance our GRC approach with a Residual Probability Approximation (RPA) module, which features a large kernel size. Experimental results show that our S2C context model not only achieves bit savings while maintaining or improving reconstruction quality but also reduces computational complexity compared to state-of-the-art voxel-based compression methods.
中文摘要:我们提出的分阶段和分层空间到通道上下文模型通过整合邻域信息和几何残差编码,克服了基于体素的点云压缩中感受野受限的问题,在保持重建质量的同时实现了更优的比特节省和更低的计算复杂度。
English Summary: Our proposed stage-wise and level-wise Space-to-Channel context models overcome receptive field limitations in voxel-based point cloud compression by integrating neighborhood information and geometry residual coding, achieving superior bit savings and reduced complexity while maintaining reconstruction quality.

Authors:Junyuan Gao, Yongpeng Wu, Giuseppe Caire, Wei Yang, H. Vincent Poor, Wenjun Zhang
Title: Unsourced Random Access in MIMO Quasi-Static Rayleigh Fading Channels: Finite Blocklength and Scaling Law Analyses
Abstract:
This paper considers the unsourced random access (URA) problem with a random and unknown number of active users in multiple-input multiple-output (MIMO) quasi-static Rayleigh fading channels. We derive non-asymptotic achievability bounds on the probability of incorrectly estimating the number of active users, and provide scaling laws on the gap between the estimated and true numbers of active users. We prove that the error probability reaches a plateau as the power $P$ and blocklength $n$ increase, whereas it decays exponentially with the number $L$ of receive antennas and eventually vanishes. Then, we explore the fundamental limits of URA by deriving non-asymptotic achievability bounds and converse bounds (including two single-user converse bounds and one multi-user ensemble converse bound) on the minimum energy-per-bit required by each active user to transmit $J$ bits with blocklength $n$ under misdetection and false-alarm constraints. Numerical results show that the extra required energy-per-bit due to the uncertainty in the number ${\rm{K}}_a$ of active users decreases as $L$ and $\mathbb{E}[{\rm{K}}_a]$ increase and the error requirement becomes milder. In the non-asymptotic regime, using codewords distributed on a sphere outperforms Gaussian random coding. Existing schemes are shown to exhibit a large gap to our bounds when the number of active users is large, calling for more advanced schemes that perform energy-efficiently in this case. In the asymptotic regime with $n\to\infty$, we establish scaling laws on the minimum required $P$ and $L$ to reliably support ${\rm{K}}_a$ active users as functions of $n$, which highlight the potential of MIMO in enabling low-cost communication and indicate that it is possible for the minimum required $P$ and $L$ to remain on the same order when the number of active users increases but stays below a threshold.
本文研究了多输入多输出系统中活跃用户数未知的无源随机接入问题,建立了性能界限并证明误码率随功率和块长度增加而趋于稳定,但随天线数量增加呈指数下降直至消失,同时揭示了在非渐近区域球面码字优于高斯随机编码。
This paper analyzes unsourced random access in MIMO systems with unknown active users, establishing performance bounds and demonstrating how error probability plateaus with power and blocklength but vanishes with more antennas, while also revealing that spherical codewords outperform Gaussian coding in non-asymptotic regimes.

Authors:Luigi Piccinelli, Christos Sakaridis, Mattia Segu, Yung-Hsu Yang, Siyuan Li, Wim Abbeloos, Luc Van Gool
Title: UniK3D: Universal Camera Monocular 3D Estimation
Abstract:
Monocular 3D estimation is crucial for visual perception. However, current methods fall short by relying on oversimplified assumptions, such as pinhole camera models or rectified images. These limitations severely restrict their general applicability, causing poor performance in real-world scenarios with fisheye or panoramic images and resulting in substantial context loss. To address this, we present UniK3D, the first generalizable method for monocular 3D estimation able to model any camera. Our method introduces a spherical 3D representation which allows for better disentanglement of camera and scene geometry and enables accurate metric 3D reconstruction for unconstrained camera models. Our camera component features a novel, model-independent representation of the pencil of rays, achieved through a learned superposition of spherical harmonics. We also introduce an angular loss, which, together with the camera module design, prevents the contraction of the 3D outputs for wide-view cameras. A comprehensive zero-shot evaluation on 13 diverse datasets demonstrates the state-of-the-art performance of UniK3D across 3D, depth, and camera metrics, with substantial gains in challenging large-field-of-view and panoramic settings, while maintaining top accuracy in conventional pinhole small-field-of-view domains. Code and models are available at github.com/lpiccinelli-eth/unik3d .
Chinese Summary: UniK3D提出了一种通用的单目3D估计方法,通过球面3D表示和学习型球谐函数突破现有方法局限,能在任意相机模型下实现精确三维重建,在13个不同数据集上展现出最先进的性能表现。
English Summary: UniK3D introduces a novel generalizable method for monocular 3D estimation that overcomes limitations of existing approaches by modeling any camera type through spherical 3D representation and learned spherical harmonics, achieving state-of-the-art performance across diverse datasets.

Authors:Yang Chen, Hui Wang, Shiyao Wang, Junyang Chen, Jiabei He, Jiaming Zhou, Xi Yang, Yequan Wang, Yonghua Lin, Yong Qin
Title: SeniorTalk: A Chinese Conversation Dataset with Rich Annotations for Super-Aged Seniors
Abstract:
While voice technologies increasingly serve aging populations, current systems exhibit significant performance gaps due to inadequate training data capturing elderly-specific vocal characteristics like presbyphonia and dialectal variations. The limited data available on super-aged individuals in existing elderly speech datasets, coupled with overly simple recording styles and annotation dimensions, exacerbates this issue. To address the critical scarcity of speech data from individuals aged 75 and above, we introduce SeniorTalk, a carefully annotated Chinese spoken dialogue dataset. This dataset contains 55.53 hours of speech from 101 natural conversations involving 202 participants, ensuring a strategic balance across gender, region, and age. Through detailed annotation across multiple dimensions, it can support a wide range of speech tasks. We perform extensive experiments on speaker verification, speaker diarization, speech recognition, and speech editing tasks, offering crucial insights for the development of speech technologies targeting this age group.
中文摘要:针对高龄人群语音数据匮乏的问题,SeniorTalk数据集通过收录202位75岁以上老人的55.53小时标注对话,为开发适老语音技术提供了关键支持,并通过多维度实验验证其应用价值。
English Summary: The SeniorTalk dataset addresses the scarcity of elderly speech data by providing 55.53 hours of annotated Chinese dialogues from individuals aged 75+, enabling improved voice technology development for aging populations through multi-dimensional experiments.

Authors:Markus Karmann, Peng-Tao Jiang, Bo Li, Onay Urfalioglu
Title: M2N2V2: Multi-Modal Unsupervised and Training-free Interactive Segmentation
Abstract:
We present Markov Map Nearest Neighbor V2 (M2N2V2), a novel and simple, yet effective approach which leverages depth guidance and attention maps for unsupervised and training-free point-prompt-based interactive segmentation. Following recent trends in supervised multimodal approaches, we carefully integrate depth as an additional modality to create novel depth-guided Markov-maps. Furthermore, we observe occasional segment size fluctuations in M2N2 during the interactive process, which can decrease the overall mIoU's. To mitigate this problem, we model the prompting as a sequential process and propose a novel adaptive score function which considers the previous segmentation and the current prompt point in order to prevent unreasonable segment size changes. Using Stable Diffusion 2 and Depth Anything V2 as backbones, we empirically show that our proposed M2N2V2 significantly improves the Number of Clicks (NoC) and mIoU compared to M2N2 in all datasets except those from the medical domain. Interestingly, our unsupervised approach achieves competitive results compared to supervised methods like SAM and SimpleClick in the more challenging DAVIS and HQSeg44K datasets in the NoC metric, reducing the gap between supervised and unsupervised methods.
中文: M2N2V2是一种无监督、无需训练的点提示交互式分割方法,通过引入深度引导和自适应评分函数来提升分割稳定性与性能,在多个数据集上取得了与监督方法相媲美的结果。
English: M2N2V2 is an unsupervised, training-free interactive segmentation method that integrates depth guidance and an adaptive score function to enhance segmentation stability and performance, achieving competitive results with supervised approaches on standard benchmarks.

Authors:Yanchen Luo, Zhiyuan Liu, Yi Zhao, Sihang Li, Hengxing Cai, Kenji Kawaguchi, Tat-Seng Chua, Yang Zhang, Xiang Wang
Title: Towards Unified and Lossless Latent Space for 3D Molecular Latent Diffusion Modeling
Abstract:
3D molecule generation is crucial for drug discovery and material science, requiring models to process complex multi-modalities, including atom types, chemical bonds, and 3D coordinates. A key challenge is integrating these modalities of different shapes while maintaining SE(3) equivariance for 3D coordinates. To achieve this, existing approaches typically maintain separate latent spaces for invariant and equivariant modalities, reducing efficiency in both training and sampling. In this work, we propose \textbf{U}nified Variational \textbf{A}uto-\textbf{E}ncoder for \textbf{3D} Molecular Latent Diffusion Modeling (\textbf{UAE-3D}), a multi-modal VAE that compresses 3D molecules into latent sequences from a unified latent space, while maintaining near-zero reconstruction error. This unified latent space eliminates the complexities of handling multi-modality and equivariance when performing latent diffusion modeling. We demonstrate this by employing the Diffusion Transformer--a general-purpose diffusion model without any molecular inductive bias--for latent generation. Extensive experiments on GEOM-Drugs and QM9 datasets demonstrate that our method significantly establishes new benchmarks in both \textit{de novo} and conditional 3D molecule generation, achieving leading efficiency and quality. On GEOM-Drugs, it reduces FCD by 72.6\% over the previous best result, while achieving over 70\% relative average improvements in geometric fidelity.
中文: 本研究提出的UAE-3D模型将多模态分子数据统一到单一潜空间,通过潜扩散建模实现了高效的3D分子生成,在质量和几何保真度上均取得显著突破,创造了新的性能标杆。
English: The proposed UAE-3D model unifies multi-modal molecular data into a single latent space, enabling efficient latent diffusion modeling that achieves state-of-the-art performance in 3D molecule generation with significant improvements in both quality and geometric fidelity.

Authors:Vaibhav Aggarwal, Ojasv Kamal, Abhinav Japesh, Zhijing Jin, Bernhard Schölkopf
Title: DARS: Dynamic Action Re-Sampling to Enhance Coding Agent Performance by Adaptive Tree Traversal
Abstract:
Large Language Models (LLMs) have revolutionized various domains, including natural language processing, data analysis, and software development, by enabling automation. In software engineering, LLM-powered coding agents have garnered significant attention due to their potential to automate complex development tasks, assist in debugging, and enhance productivity. However, existing approaches often struggle with sub-optimal decision-making, requiring either extensive manual intervention or inefficient compute scaling strategies. To improve coding agent performance, we present Dynamic Action Re-Sampling (DARS), a novel inference time compute scaling approach for coding agents, that is faster and more effective at recovering from sub-optimal decisions compared to baselines. While traditional agents either follow linear trajectories or rely on random sampling for scaling compute, our approach DARS works by branching out a trajectory at certain key decision points by taking an alternative action given the history of the trajectory and execution feedback of the previous attempt from that point. We evaluate our approach on SWE-Bench Lite benchmark, demonstrating that this scaling strategy achieves a pass@k score of 55% with Claude 3.5 Sonnet V2. Our framework achieves a pass@1 rate of 47%, outperforming state-of-the-art (SOTA) open-source frameworks.
中文摘要:大型语言模型通过自动化革新了软件工程,但现有编码代理常面临决策不佳的问题,而动态动作重采样(DARS)方法通过在关键决策点分支轨迹,能更快更有效地从不良决策中恢复,在SWE-Bench Lite基准测试中表现优异。
English Summary: Large Language Models have transformed software engineering by enabling automation, but existing coding agents often face sub-optimal decision-making issues, which the new Dynamic Action Re-Sampling (DARS) approach effectively addresses by branching trajectories at key decision points for faster and more efficient recovery.

Authors:Hao Li, Yubin Xiao, Ke Liang, Mengzhu Wang, Long Lan, Kenli Li, Xinwang Liu
Title: Let Synthetic Data Shine: Domain Reassembly and Soft-Fusion for Single Domain Generalization
Abstract:
Single Domain Generalization (SDG) aims to train models with consistent performance across diverse scenarios using data from a single source. While using latent diffusion models (LDMs) show promise in augmenting limited source data, we demonstrate that directly using synthetic data can be detrimental due to significant feature distribution discrepancies between synthetic and real target domains, leading to performance degradation. To address this issue, we propose Discriminative Domain Reassembly and Soft-Fusion (DRSF), a training framework leveraging synthetic data to improve model generalization. We employ LDMs to produce diverse pseudo-target domain samples and introduce two key modules to handle distribution bias. First, Discriminative Feature Decoupling and Reassembly (DFDR) module uses entropy-guided attention to recalibrate channel-level features, suppressing synthetic noise while preserving semantic consistency. Second, Multi-pseudo-domain Soft Fusion (MDSF) module uses adversarial training with latent-space feature interpolation, creating continuous feature transitions between domains. Extensive SDG experiments on object detection and semantic segmentation tasks demonstrate that DRSF achieves substantial performance gains with only marginal computational overhead. Notably, DRSF's plug-and-play architecture enables seamless integration with unsupervised domain adaptation paradigms, underscoring its broad applicability in addressing diverse and real-world domain challenges.
Chinese: 提出的DRSF框架利用潜在扩散模型生成多样化伪目标域,并通过判别性特征重校准和软融合模块缓解合成与真实数据的分布差异,在单域泛化任务中以微小计算开销实现显著性能提升。
English: The proposed DRSF framework leverages latent diffusion models to generate diverse pseudo-target domains and introduces discriminative feature recalibration and soft fusion modules to mitigate synthetic-to-real distribution discrepancies, achieving significant performance improvements in single domain generalization tasks with minimal computational cost.

Authors:Umang Bhatt, Sanyam Kapoor, Mihir Upadhyay, Ilia Sucholutsky, Francesco Quinzan, Katherine M. Collins, Adrian Weller, Andrew Gordon Wilson, Muhammad Bilal Zafar
Title: When Should We Orchestrate Multiple Agents?
Abstract:
Strategies for orchestrating the interactions between multiple agents, both human and artificial, can wildly overestimate performance and underestimate the cost of orchestration. We design a framework to orchestrate agents under realistic conditions, such as inference costs or availability constraints. We show theoretically that orchestration is only effective if there are performance or cost differentials between agents. We then empirically demonstrate how orchestration between multiple agents can be helpful for selecting agents in a simulated environment, picking a learning strategy in the infamous Rogers' Paradox from social science, and outsourcing tasks to other agents during a question-answer task in a user study.
中文: 多智能体协同策略常高估性能并低估成本,而新设计的框架仅在智能体存在性能或成本差异时有效,这通过模拟环境和用户研究得到了验证。
English: Orchestration strategies for multiple agents often misjudge performance and costs, but a new framework proves effective only when agents differ in performance or cost, as shown in simulations and user studies.

Authors:Lukas Schulthess, Philipp Mayer, Christian Vogt, Luca Benini, Michele Magno
Title: BodySense: An Expandable and Wearable-Sized Wireless Evaluation Platform for Human Body Communication
Abstract:
Wearable, wirelessly connected sensors have become a common part of daily life and have the potential to play a pivotal role in shaping the future of personalized healthcare. A key challenge in this evolution is designing long-lasting and unobtrusive devices. These design requirements inherently demand smaller batteries, inevitably increasing the need for energy-sensitive wireless communication interfaces. Capacitive Human Body Communication (HBC) is a promising, power-efficient alternative to traditional RF-based communication, enabling point-to-multipoint data and energy exchange. However, as this concept relies on capacitive coupling to the surrounding area, it is naturally influenced by uncontrollable environmental factors, making testing with classical setups particularly challenging. This work presents a customizable, wearable-sized, wireless evaluation platform for capacitive HBC, designed to enable realistic evaluation of wearable-to-wearable applications. Comparative measurements of channel gains were conducted using classical grid-connected and wireless Data Acquisition (DAQ) across various transmission distances within the frequency range of 4 MHz to 64 MHz and revealed an average overestimation of 18.15 dB over all investigated distances in the classical setup.
Chinese: 可穿戴传感器需要节能通信,电容式人体通信是一种有前景的替代方案,但测试因环境因素而具有挑战性;本研究推出无线评估平台,发现传统设置平均高估了18.15分贝的信道增益。
English: Wearable sensors require energy-efficient communication, and capacitive Human Body Communication offers a promising alternative, though testing is challenging due to environmental factors; this study introduces a wireless evaluation platform that reveals classical setups overestimate channel gains by an average of 18.15 dB.

Authors:Yaowei Li, Lingen Li, Zhaoyang Zhang, Xiaoyu Li, Guangzhi Wang, Hongxiang Li, Xiaodong Cun, Ying Shan, Yuexian Zou
Title: BlobCtrl: A Unified and Flexible Framework for Element-level Image Generation and Editing
Abstract:
Element-level visual manipulation is essential in digital content creation, but current diffusion-based methods lack the precision and flexibility of traditional tools. In this work, we introduce BlobCtrl, a framework that unifies element-level generation and editing using a probabilistic blob-based representation. By employing blobs as visual primitives, our approach effectively decouples and represents spatial location, semantic content, and identity information, enabling precise element-level manipulation. Our key contributions include: 1) a dual-branch diffusion architecture with hierarchical feature fusion for seamless foreground-background integration; 2) a self-supervised training paradigm with tailored data augmentation and score functions; and 3) controllable dropout strategies to balance fidelity and diversity. To support further research, we introduce BlobData for large-scale training and BlobBench for systematic evaluation. Experiments show that BlobCtrl excels in various element-level manipulation tasks while maintaining computational efficiency, offering a practical solution for precise and flexible visual content creation. Project page: https://liyaowei-stu.github.io/project/BlobCtrl/
中文: BlobCtrl提出了一种基于斑点的框架,通过双分支扩散模型和自监督训练解耦布局与外观,实现了精细的元素级图像编辑,在物体操作等任务中达到领先性能并保持高效性。
English: BlobCtrl introduces a blob-based framework that enables fine-grained, element-level image editing by decoupling layout and appearance through a dual-branch diffusion model and self-supervised training, achieving state-of-the-art performance in tasks like object manipulation while maintaining efficiency.

Authors:Yaowei Li, Lingen Li, Zhaoyang Zhang, Xiaoyu Li, Guangzhi Wang, Hongxiang Li, Xiaodong Cun, Ying Shan, Yuexian Zou
Title: BlobCtrl: Taming Controllable Blob for Element-level Image Editing
Abstract:
As user expectations for image editing continue to rise, the demand for flexible, fine-grained manipulation of specific visual elements presents a challenge for current diffusion-based methods. In this work, we present BlobCtrl, a framework for element-level image editing based on a probabilistic blob-based representation. Treating blobs as visual primitives, BlobCtrl disentangles layout from appearance, affording fine-grained, controllable object-level manipulation. Our key contributions are twofold: (1) an in-context dual-branch diffusion model that separates foreground and background processing, incorporating blob representations to explicitly decouple layout and appearance, and (2) a self-supervised disentangle-then-reconstruct training paradigm with an identity-preserving loss function, along with tailored strategies to efficiently leverage blob-image pairs. To foster further research, we introduce BlobData for large-scale training and BlobBench, a benchmark for systematic evaluation. Experimental results demonstrate that BlobCtrl achieves state-of-the-art performance in a variety of element-level editing tasks, such as object addition, removal, scaling, and replacement, while maintaining computational efficiency. Project Webpage: https://liyaowei-stu.github.io/project/BlobCtrl/
中文: BlobCtrl提出了一种基于斑点的框架,通过双分支扩散模型和自监督训练解耦布局与外观,实现了精细的元素级图像编辑,在物体操作等任务中达到领先性能并保持高效性。
English: BlobCtrl introduces a blob-based framework that enables fine-grained, element-level image editing by decoupling layout and appearance through a dual-branch diffusion model and self-supervised training, achieving state-of-the-art performance in tasks like object manipulation while maintaining efficiency.

Authors:Kailin Li, Zhenxin Li, Shiyi Lan, Yuan Xie, Zhizhong Zhang, Jiayi Liu, Zuxuan Wu, Zhiding Yu, Jose M. Alvarez
Title: Hydra-MDP++: Advancing End-to-End Driving via Expert-Guided Hydra-Distillation
Abstract:
Hydra-MDP++ introduces a novel teacher-student knowledge distillation framework with a multi-head decoder that learns from human demonstrations and rule-based experts. Using a lightweight ResNet-34 network without complex components, the framework incorporates expanded evaluation metrics, including traffic light compliance (TL), lane-keeping ability (LK), and extended comfort (EC) to address unsafe behaviors not captured by traditional NAVSIM-derived teachers. Like other end-to-end autonomous driving approaches, \hydra processes raw images directly without relying on privileged perception signals. Hydra-MDP++ achieves state-of-the-art performance by integrating these components with a 91.0% drive score on NAVSIM through scaling to a V2-99 image encoder, demonstrating its effectiveness in handling diverse driving scenarios while maintaining computational efficiency.
Chinese: Hydra-MDP++ 提出了一种新颖的师生知识蒸馏框架,采用多头解码器从人类演示和基于规则的专家中学习,通过轻量级ResNet-34网络在NAVSIM上实现了91.0%的驾驶分数,在保持计算效率的同时达到了最先进的性能。
English: Hydra-MDP++ presents a teacher-student knowledge distillation framework with a multi-head decoder that learns from human demonstrations and rule-based experts, achieving state-of-the-art performance with a 91.0% drive score on NAVSIM while maintaining computational efficiency through a lightweight ResNet-34 network.

Authors:Zhe Yang, Yi Huang, Yaqin Chen, Xiaoting Wu, Junlan Feng, Chao Deng
Title: Palette of Language Models: A Solver for Controlled Text Generation
Abstract:
Recent advancements in large language models have revolutionized text generation with their remarkable capabilities. These models can produce controlled texts that closely adhere to specific requirements when prompted appropriately. However, designing an optimal prompt to control multiple attributes simultaneously can be challenging. A common approach is to linearly combine single-attribute models, but this strategy often overlooks attribute overlaps and can lead to conflicts. Therefore, we propose a novel combination strategy inspired by the Law of Total Probability and Conditional Mutual Information Minimization on generative language models. This method has been adapted for single-attribute control scenario and is termed the Palette of Language Models due to its theoretical linkage between attribute strength and generation style, akin to blending colors on an artist's palette. Moreover, positive correlation and attribute enhancement are advanced as theoretical properties to guide a rational combination strategy design. We conduct experiments on both single control and multiple control settings, and achieve surpassing results.
Chinese: 近期大语言模型的进步实现了可控文本生成,但优化多属性提示仍具挑战性,因此提出了一种基于全概率定律和条件互信息最小化的新型组合策略,在实验中取得了优异成果。
English: Recent advancements in large language models enable controlled text generation, but optimizing prompts for multiple attributes remains challenging, leading to the proposal of a novel combination strategy inspired by the Law of Total Probability and Conditional Mutual Information Minimization that achieves superior results in experiments.

Authors:Sinuo Liu, Chenyang Lyu, Minghao Wu, Longyue Wang, Weihua Luo, Kaifu Zhang, Zifu Shang
Title: New Trends for Modern Machine Translation with Large Reasoning Models
Abstract:
Recent advances in Large Reasoning Models (LRMs), particularly those leveraging Chain-of-Thought reasoning (CoT), have opened brand new possibility for Machine Translation (MT). This position paper argues that LRMs substantially transformed traditional neural MT as well as LLMs-based MT paradigms by reframing translation as a dynamic reasoning task that requires contextual, cultural, and linguistic understanding and reasoning. We identify three foundational shifts: 1) contextual coherence, where LRMs resolve ambiguities and preserve discourse structure through explicit reasoning over cross-sentence and complex context or even lack of context; 2) cultural intentionality, enabling models to adapt outputs by inferring speaker intent, audience expectations, and socio-linguistic norms; 3) self-reflection, LRMs can perform self-reflection during the inference time to correct the potential errors in translation especially extremely noisy cases, showing better robustness compared to simply mapping X->Y translation. We explore various scenarios in translation including stylized translation, document-level translation and multimodal translation by showcasing empirical examples that demonstrate the superiority of LRMs in translation. We also identify several interesting phenomenons for LRMs for MT including auto-pivot translation as well as the critical challenges such as over-localisation in translation and inference efficiency. In conclusion, we think that LRMs redefine translation systems not merely as text converters but as multilingual cognitive agents capable of reasoning about meaning beyond the text. This paradigm shift reminds us to think of problems in translation beyond traditional translation scenarios in a much broader context with LRMs - what we can achieve on top of it.
中文摘要:大型推理模型通过将翻译重构为动态推理任务,在上下文连贯性、文化意图性和自我反思方面实现了根本性转变,使翻译系统成为能够进行跨语言意义推理的多语言认知代理。
English Summary: Large Reasoning Models (LRMs) are transforming machine translation by reframing it as a dynamic reasoning task that enhances contextual coherence, cultural intentionality, and self-reflection, positioning translation systems as multilingual cognitive agents.

Authors:Ziqi Jia, Junjie Li, Xiaoyang Qu, Jianzong Wang
Title: Enhancing Multi-Agent Systems via Reinforcement Learning with LLM-based Planner and Graph-based Policy
Abstract:
Multi-agent systems (MAS) have shown great potential in executing complex tasks, but coordination and safety remain significant challenges. Multi-Agent Reinforcement Learning (MARL) offers a promising framework for agent collaboration, but it faces difficulties in handling complex tasks and designing reward functions. The introduction of Large Language Models (LLMs) has brought stronger reasoning and cognitive abilities to MAS, but existing LLM-based systems struggle to respond quickly and accurately in dynamic environments. To address these challenges, we propose LLM-based Graph Collaboration MARL (LGC-MARL), a framework that efficiently combines LLMs and MARL. This framework decomposes complex tasks into executable subtasks and achieves efficient collaboration among multiple agents through graph-based coordination. Specifically, LGC-MARL consists of two main components: an LLM planner and a graph-based collaboration meta policy. The LLM planner transforms complex task instructions into a series of executable subtasks, evaluates the rationality of these subtasks using a critic model, and generates an action dependency graph. The graph-based collaboration meta policy facilitates communication and collaboration among agents based on the action dependency graph, and adapts to new task environments through meta-learning. Experimental results on the AI2-THOR simulation platform demonstrate the superior performance and scalability of LGC-MARL in completing various complex tasks.
中文: 提出的LGC-MARL框架将大语言模型与多智能体强化学习相结合,通过图协调机制将复杂任务分解为可执行子任务,在动态环境中展现出卓越的性能和扩展性。
English: The proposed LGC-MARL framework integrates Large Language Models with Multi-Agent Reinforcement Learning to decompose complex tasks into executable subtasks and enable efficient agent collaboration through graph-based coordination, demonstrating superior performance in dynamic environments.

Authors:Fan Yin, Philippe Laban, Xiangyu Peng, Yilun Zhou, Yixin Mao, Vaibhav Vats, Linnea Ross, Divyansh Agarwal, Caiming Xiong, Chien-Sheng Wu
Title: BingoGuard: LLM Content Moderation Tools with Risk Levels
Abstract:
Malicious content generated by large language models (LLMs) can pose varying degrees of harm. Although existing LLM-based moderators can detect harmful content, they struggle to assess risk levels and may miss lower-risk outputs. Accurate risk assessment allows platforms with different safety thresholds to tailor content filtering and rejection. In this paper, we introduce per-topic severity rubrics for 11 harmful topics and build BingoGuard, an LLM-based moderation system designed to predict both binary safety labels and severity levels. To address the lack of annotations on levels of severity, we propose a scalable generate-then-filter framework that first generates responses across different severity levels and then filters out low-quality responses. Using this framework, we create BingoGuardTrain, a training dataset with 54,897 examples covering a variety of topics, response severity, styles, and BingoGuardTest, a test set with 988 examples explicitly labeled based on our severity rubrics that enables fine-grained analysis on model behaviors on different severity levels. Our BingoGuard-8B, trained on BingoGuardTrain, achieves the state-of-the-art performance on several moderation benchmarks, including WildGuardTest and HarmBench, as well as BingoGuardTest, outperforming best public models, WildGuard, by 4.3\%. Our analysis demonstrates that incorporating severity levels into training significantly enhances detection performance and enables the model to effectively gauge the severity of harmful responses.
中文摘要:本文介绍了BingoGuard,一种基于大语言模型的审核系统,通过针对11个有害主题的严重程度标准和创新的训练框架,能够预测安全标签和风险等级,并在多个基准测试中实现了最优性能。
English Summary: This paper introduces BingoGuard, an LLM-based moderation system that uses per-topic severity rubrics to predict both safety labels and severity levels, achieving state-of-the-art performance through a novel training framework and dataset creation method.

Authors:Adarsh Salagame, Eric Sihite, Milad Ramezani, Alireza Ramezani
Title: Reduced-Order Model-Based Gait Generation for Snake Robot Locomotion using NMPC
Abstract:
This paper presents an optimization-based motion planning methodology for snake robots operating in constrained environments. By using a reduced-order model, the proposed approach simplifies the planning process, enabling the optimizer to autonomously generate gaits while constraining the robot's footprint within tight spaces. The method is validated through high-fidelity simulations that accurately model contact dynamics and the robot's motion. Key locomotion strategies are identified and further demonstrated through hardware experiments, including successful navigation through narrow corridors.
中文摘要:本文提出了一种基于优化的蛇形机器人运动规划方法,通过简化模型在受限空间中自主生成步态,并经过高精度仿真和包括穿越狭窄走廊在内的硬件实验验证。
English Summary: This paper introduces an optimization-based motion planning method for snake robots that uses a reduced-order model to simplify gait generation in confined spaces, validated through simulations and hardware experiments including corridor navigation.

Authors:Shengyao Zhuang, Xueguang Ma, Bevan Koopman, Jimmy Lin, Guido Zuccon
Title: Rank-R1: Enhancing Reasoning in LLM-based Document Rerankers via Reinforcement Learning
Abstract:
In this paper, we introduce Rank-R1, a novel LLM-based reranker that performs reasoning over both the user query and candidate documents before performing the ranking task. Existing document reranking methods based on large language models (LLMs) typically rely on prompting or fine-tuning LLMs to order or label candidate documents according to their relevance to a query. For Rank-R1, we use a reinforcement learning algorithm along with only a small set of relevance labels (without any reasoning supervision) to enhance the reasoning ability of LLM-based rerankers. Our hypothesis is that adding reasoning capabilities to the rerankers can improve their relevance assessement and ranking capabilities. Our experiments on the TREC DL and BRIGHT datasets show that Rank-R1 is highly effective, especially for complex queries. In particular, we find that Rank-R1 achieves effectiveness on in-domain datasets at par with that of supervised fine-tuning methods, but utilizing only 18\% of the training data used by the fine-tuning methods. We also find that the model largely outperforms zero-shot and supervised fine-tuning when applied to out-of-domain datasets featuring complex queries, especially when a 14B-size model is used. Finally, we qualitatively observe that Rank-R1's reasoning process improves the explainability of the ranking results, opening new opportunities for search engine results presentation and fruition.
中文: 本文提出Rank-R1,一种基于大语言模型的排序器,通过强化学习增强推理能力,在少量训练数据下显著提升复杂查询的排序效果和结果可解释性。
English: This paper presents Rank-R1, an LLM-based reranker enhanced with reinforcement learning to perform reasoning on queries and documents, which significantly improves ranking effectiveness and explainability, especially for complex queries with minimal training data.

Authors:Mohammadreza Malekabbasi, Tobias Pfandzelter, David Bermbach
Title: Umbilical Choir: Automated Live Testing for Edge-To-Cloud FaaS Applications
Abstract:
Application users react negatively to performance regressions or availability issues across software releases. To address this, modern cloud-based applications with their multiple daily releases rely on live testing techniques such as A/B testing or canary releases. In edge-to-cloud applications, however, which have similar problems, developers currently still have to hard-code custom live testing tooling as there is no general framework for edge-to-cloud live testing. With Umbilical Choir, we partially close this gap for serverless edge-to-cloud applications. Umbilical Choir is compatible with all Function-as-a-Service platforms and (extensively) supports various live testing techniques, including canary releases with various geo-aware strategies, A/B testing, and gradual roll-outs. We evaluate Umbilical Choir through a complex release scenario showcasing various live testing techniques in a mixed edge-cloud deployments and discuss different geo-aware strategies.
中文: 应用用户对软件更新中的性能问题反应消极,云应用采用A/B测试等实时测试方法,而边云应用缺乏通用框架;Umbilical Choir通过支持多种实时测试技术填补了这一空白,适用于无服务器边云应用。
English: Application users dislike performance issues in software updates, and while cloud apps use live testing methods like A/B testing, edge-to-cloud apps lack a general framework, which Umbilical Choir addresses by supporting various live testing techniques for serverless applications.

Authors:Yu Feng, Zheng Liu, Weikai Lin, Zihan Liu, Jingwen Leng, Minyi Guo, Zhezhi He, Jieru Zhao, Yuhao Zhu
Title: StreamGrid: Streaming Point Cloud Analytics via Compulsory Splitting and Deterministic Termination
Abstract:
Point clouds are increasingly important in intelligent applications, but frequent off-chip memory traffic in accelerators causes pipeline stalls and leads to high energy consumption. While conventional line buffer techniques can eliminate off-chip traffic, they cannot be directly applied to point clouds due to their inherent computation patterns. To address this, we introduce two techniques: compulsory splitting and deterministic termination, enabling fully-streaming processing. We further propose StreamGrid, a framework that integrates these techniques and automatically optimizes on-chip buffer sizes. Our evaluation shows StreamGrid reduces on-chip memory by 61.3\% and energy consumption by 40.5\% with marginal accuracy loss compared to the baselines without our techniques. Additionally, we achieve 10.0$\times$ speedup and 3.9$\times$ energy efficiency over state-of-the-art accelerators.
中文: StreamGrid框架通过强制分割和确定性终止技术实现点云全流处理,在几乎不影响精度的前提下将片上内存减少61.3%、能耗降低40.5%,并较先进加速器实现10倍速度提升。
English: StreamGrid introduces compulsory splitting and deterministic termination to enable fully-streaming point cloud processing, significantly reducing on-chip memory by 61.3% and energy consumption by 40.5% while achieving 10x speedup over existing accelerators.

Authors:Xiaotong Huang, He Zhu, Zihan Liu, Weikai Lin, Xiaohong Liu, Zhezhi He, Jingwen Leng, Minyi Guo, Yu Feng
Title: SeeLe: A Unified Acceleration Framework for Real-Time Gaussian Splatting
Abstract:
3D Gaussian Splatting (3DGS) has become a crucial rendering technique for many real-time applications. However, the limited hardware resources on today's mobile platforms hinder these applications, as they struggle to achieve real-time performance. In this paper, we propose SeeLe, a general framework designed to accelerate the 3DGS pipeline for resource-constrained mobile devices. Specifically, we propose two GPU-oriented techniques: hybrid preprocessing and contribution-aware rasterization. Hybrid preprocessing alleviates the GPU compute and memory pressure by reducing the number of irrelevant Gaussians during rendering. The key is to combine our view-dependent scene representation with online filtering. Meanwhile, contribution-aware rasterization improves the GPU utilization at the rasterization stage by prioritizing Gaussians with high contributions while reducing computations for those with low contributions. Both techniques can be seamlessly integrated into existing 3DGS pipelines with minimal fine-tuning. Collectively, our framework achieves 2.6$\times$ speedup and 32.3\% model reduction while achieving superior rendering quality compared to existing methods.
Chinese: 提出的SeeLe框架通过混合预处理和贡献感知光栅化技术,在移动设备上加速3D高斯泼溅渲染,实现了2.6倍加速和32.3%模型精简,同时保持卓越的渲染质量。
English: The proposed SeeLe framework accelerates 3D Gaussian Splatting on mobile devices through hybrid preprocessing and contribution-aware rasterization, achieving 2.6× speedup and 32.3% model reduction while maintaining high rendering quality.

Authors:Chengpeng Li, Mingfeng Xue, Zhenru Zhang, Jiaxi Yang, Beichen Zhang, Xiang Wang, Bowen Yu, Binyuan Hui, Junyang Lin, Dayiheng Liu
Title: START: Self-taught Reasoner with Tools
Abstract:
Large reasoning models (LRMs) like OpenAI-o1 and DeepSeek-R1 have demonstrated remarkable capabilities in complex reasoning tasks through the utilization of long Chain-of-thought (CoT). However, these models often suffer from hallucinations and inefficiencies due to their reliance solely on internal reasoning processes. In this paper, we introduce START (Self-Taught Reasoner with Tools), a novel tool-integrated long CoT reasoning LLM that significantly enhances reasoning capabilities by leveraging external tools. Through code execution, START is capable of performing complex computations, self-checking, exploring diverse methods, and self-debugging, thereby addressing the limitations of LRMs. The core innovation of START lies in its self-learning framework, which comprises two key techniques: 1) Hint-infer: We demonstrate that inserting artificially designed hints (e.g., ``Wait, maybe using Python here is a good idea.'') during the inference process of a LRM effectively stimulates its ability to utilize external tools without the need for any demonstration data. Hint-infer can also serve as a simple and effective sequential test-time scaling method; 2) Hint Rejection Sampling Fine-Tuning (Hint-RFT): Hint-RFT combines Hint-infer and RFT by scoring, filtering, and modifying the reasoning trajectories with tool invocation generated by a LRM via Hint-infer, followed by fine-tuning the LRM. Through this framework, we have fine-tuned the QwQ-32B model to achieve START. On PhD-level science QA (GPQA), competition-level math benchmarks (AMC23, AIME24, AIME25), and the competition-level code benchmark (LiveCodeBench), START achieves accuracy rates of 63.6%, 95.0%, 66.7%, 47.1%, and 47.3%, respectively. It significantly outperforms the base QwQ-32B and achieves performance comparable to the state-of-the-art open-weight model R1-Distill-Qwen-32B and the proprietary model o1-Preview.
Chinese: 本文提出START模型,通过提示推理和拒绝采样微调技术实现外部工具集成,有效解决大型推理模型的幻觉与低效问题,在多项基准测试中达到领先性能。
English: This paper introduces START, a self-learning reasoning model that integrates external tools through hint-infer and Hint-RFT techniques to overcome hallucinations and inefficiencies in large reasoning models, achieving state-of-the-art performance across multiple benchmarks.

Authors:Zhenghua Wang, Yiran Ding, Changze Lv, Zhibo Xu, Tianlong Li, Tianyuan Shi, Xiaoqing Zheng, Xuanjing Huang
Title: Layer-Specific Scaling of Positional Encodings for Superior Long-Context Modeling
Abstract:
Although large language models (LLMs) have achieved significant progress in handling long-context inputs, they still suffer from the ``lost-in-the-middle'' problem, where crucial information in the middle of the context is often underrepresented or lost. Our extensive experiments reveal that this issue may arise from the rapid long-term decay in Rotary Position Embedding (RoPE). To address this problem, we propose a layer-specific positional encoding scaling method that assigns distinct scaling factors to each layer, slowing down the decay rate caused by RoPE to make the model pay more attention to the middle context. A specially designed genetic algorithm is employed to efficiently select the optimal scaling factors for each layer by incorporating Bezier curves to reduce the search space. Through comprehensive experimentation, we demonstrate that our method significantly alleviates the ``lost-in-the-middle'' problem. Our approach results in an average accuracy improvement of up to 20% on the Key-Value Retrieval dataset. Furthermore, we show that layer-specific interpolation, as opposed to uniform interpolation across all layers, enhances the model's extrapolation capabilities when combined with PI and Dynamic-NTK positional encoding schemes.
中文: 本研究通过提出分层位置编码缩放方法,利用遗传算法优化各层缩放因子,有效缓解了大语言模型的“中间信息丢失”问题,显著增强了对中间上下文的关注,并将关键值检索准确率最高提升20%。
English: This study addresses the "lost-in-the-middle" problem in large language models by introducing a layer-specific positional encoding scaling method that uses a genetic algorithm to optimize scaling factors, significantly improving mid-context attention and boosting key-value retrieval accuracy by up to 20%.

Authors:Sicong Liu, Bin Guo, Shiyan Luo, Yuzhan Wang, Hao Luo, Cheng Fang, Yuan Xu, Ke Ma, Yao Li, Zhiwen Yu
Title: CrowdHMTware: A Cross-level Co-adaptation Middleware for Context-aware Mobile DL Deployment
Abstract:
There are many deep learning (DL) powered mobile and wearable applications today continuously and unobtrusively sensing the ambient surroundings to enhance all aspects of human lives.To enable robust and private mobile sensing, DL models are often deployed locally on resource-constrained mobile devices using techniques such as model compression or offloading.However, existing methods, either front-end algorithm level (i.e. DL model compression/partitioning) or back-end scheduling level (i.e. operator/resource scheduling), cannot be locally online because they require offline retraining to ensure accuracy or rely on manually pre-defined strategies, struggle with dynamic adaptability.The primary challenge lies in feeding back runtime performance from the back-end level to the front-end level optimization decision. Moreover, the adaptive mobile DL model porting middleware with cross-level co-adaptation is less explored, particularly in mobile environments with diversity and dynamics. In response, we introduce CrowdHMTware, a dynamic context-adaptive DL model deployment middleware for heterogeneous mobile devices. It establishes an automated adaptation loop between cross-level functional components, i.e. elastic inference, scalable offloading, and model-adaptive engine, enhancing scalability and adaptability. Experiments with four typical tasks across 15 platforms and a real-world case study demonstrate that CrowdHMTware can effectively scale DL model, offloading, and engine actions across diverse platforms and tasks. It hides run-time system issues from developers, reducing the required developer expertise.
中文: CrowdHMTware提出了一种动态情境自适应的中间件,通过在弹性推理、可扩展卸载和模型自适应引擎之间建立跨层级自动适配机制,实现在多样化移动平台上有效扩展深度学习模型部署,同时降低对开发人员专业知识的要求。
English: CrowdHMTware introduces a dynamic context-adaptive middleware that establishes automated cross-level adaptation between elastic inference, scalable offloading, and model-adaptive engine components, effectively scaling DL model deployment across diverse mobile platforms while reducing developer expertise requirements.

Authors:Ziqiang Cui, Yunpeng Weng, Xing Tang, Xiaokun Zhang, Dugang Liu, Shiwei Li, Peiyang Liu, Bowei He, Weihong Luo, Xiuqiang He, Chen Ma
Title: Semantic Retrieval Augmented Contrastive Learning for Sequential Recommendation
Abstract:
Sequential recommendation aims to model user preferences based on historical behavior sequences, which is crucial for various online platforms. Data sparsity remains a significant challenge in this area as most users have limited interactions and many items receive little attention. To mitigate this issue, contrastive learning has been widely adopted. By constructing positive sample pairs from the data itself and maximizing their agreement in the embedding space,it can leverage available data more effectively. Constructing reasonable positive sample pairs is crucial for the success of contrastive learning. However, current approaches struggle to generate reliable positive pairs as they either rely on representations learned from inherently sparse collaborative signals or use random perturbations which introduce significant uncertainty. To address these limitations, we propose a novel approach named Semantic Retrieval Augmented Contrastive Learning (SRA-CL), which leverages semantic information to improve the reliability of contrastive samples. SRA-CL comprises two main components: (1) Cross-Sequence Contrastive Learning via User Semantic Retrieval, which utilizes large language models (LLMs) to understand diverse user preferences and retrieve semantically similar users to form reliable positive samples through a learnable sample synthesis method; and (2) Intra-Sequence Contrastive Learning via Item Semantic Retrieval, which employs LLMs to comprehend items and retrieve similar items to perform semantic-based item substitution, thereby creating semantically consistent augmented views for contrastive learning. SRA-CL is plug-and-play and can be integrated into standard sequential recommendation models. Extensive experiments on four public datasets demonstrate the effectiveness and generalizability of the proposed approach.
中文: 序列推荐面临数据稀疏性挑战,对比学习通过构建正样本对来缓解此问题,但现有方法难以保证样本可靠性,因此提出SRA-CL方法,利用大语言模型的语义信息,通过跨序列和序列内对比学习提升样本质量。
English: Sequential recommendation faces data sparsity challenges, which contrastive learning addresses by creating positive sample pairs, but current methods struggle with reliability, leading to the proposed SRA-CL approach that uses semantic information from LLMs to enhance sample quality through cross-sequence and intra-sequence contrastive learning.

Authors:Xue Han, Qian Hu, Yitong Wang, Wenchun Gao, Lianlian Zhang, Qing Wang, Lijun Mei, Chao Deng, Junlan Feng
Title: Ticktack : Long Span Temporal Alignment of Large Language Models Leveraging Sexagenary Cycle Time Expression
Abstract:
Large language models (LLMs) suffer from temporal misalignment issues especially across long span of time. The issue arises from knowing that LLMs are trained on large amounts of data where temporal information is rather sparse over long times, such as thousands of years, resulting in insufficient learning or catastrophic forgetting by the LLMs. This paper proposes a methodology named "Ticktack" for addressing the LLM's long-time span misalignment in a yearly setting. Specifically, we first propose to utilize the sexagenary year expression instead of the Gregorian year expression employed by LLMs, achieving a more uniform distribution in yearly granularity. Then, we employ polar coordinates to model the sexagenary cycle of 60 terms and the year order within each term, with additional temporal encoding to ensure LLMs understand them. Finally, we present a temporal representational alignment approach for post-training LLMs that effectively distinguishes time points with relevant knowledge, hence improving performance on time-related tasks, particularly over a long period. We also create a long time span benchmark for evaluation. Experimental results prove the effectiveness of our proposal.
Chinese: 本文提出“Ticktack”方法,通过采用干支纪年表达和极坐标建模,解决大语言模型在长时间跨度上的时序错位问题,并利用时序表示对齐技术提升其在时间相关任务中的表现。
English: This paper introduces "Ticktack," a method that uses sexagenary year expressions and polar coordinates to address temporal misalignment in large language models over long periods, improving their performance on time-related tasks through post-training alignment.

Authors:Yunxiao Shi, Hong Cai, Amin Ansari, Fatih Porikli
Title: H3O: Hyper-Efficient 3D Occupancy Prediction with Heterogeneous Supervision
Abstract:
3D occupancy prediction has recently emerged as a new paradigm for holistic 3D scene understanding and provides valuable information for downstream planning in autonomous driving. Most existing methods, however, are computationally expensive, requiring costly attention-based 2D-3D transformation and 3D feature processing. In this paper, we present a novel 3D occupancy prediction approach, H3O, which features highly efficient architecture designs that incur a significantly lower computational cost as compared to the current state-of-the-art methods. In addition, to compensate for the ambiguity in ground-truth 3D occupancy labels, we advocate leveraging auxiliary tasks to complement the direct 3D supervision. In particular, we integrate multi-camera depth estimation, semantic segmentation, and surface normal estimation via differentiable volume rendering, supervised by corresponding 2D labels that introduces rich and heterogeneous supervision signals. We conduct extensive experiments on the Occ3D-nuScenes and SemanticKITTI benchmarks that demonstrate the superiority of our proposed H3O.
中文: H3O方法提出了一种高效的3D占据预测架构,通过集成深度估计和语义分割等辅助任务,在显著降低计算成本的同时提升了自动驾驶场景理解的准确性。
English: The H3O approach introduces an efficient architecture for 3D occupancy prediction in autonomous driving, significantly reducing computational costs while enhancing accuracy through auxiliary tasks like depth estimation and semantic segmentation.

Authors:Xiaomeng Zhu, Yuyang Li, Leiyao Cui, Pengfei Li, Huan-ang Gao, Yixin Zhu, Hao Zhao
Title: Afford-X: Generalizable and Slim Affordance Reasoning for Task-oriented Manipulation
Abstract:
Object affordance reasoning, the ability to infer object functionalities based on physical properties, is fundamental for task-oriented planning and activities in both humans and Artificial Intelligence (AI). This capability, required for planning and executing daily activities in a task-oriented manner, relies on commonsense knowledge of object physics and functionalities, extending beyond simple object recognition. Current computational models for affordance reasoning from perception lack generalizability, limiting their applicability in novel scenarios. Meanwhile, comprehensive Large Language Models (LLMs) with emerging reasoning capabilities are challenging to deploy on local devices for task-oriented manipulations. Here, we introduce LVIS-Aff, a large-scale dataset comprising 1,496 tasks and 119k images, designed to enhance the generalizability of affordance reasoning from perception. Utilizing this dataset, we develop Afford-X, an end-to-end trainable affordance reasoning model that incorporates Verb Attention and Bi-Fusion modules to improve multi-modal understanding. This model achieves up to a 12.1% performance improvement over the best-reported results from non-LLM methods, while also demonstrating a 1.2% enhancement compared to our previous conference paper. Additionally, it maintains a compact 187M parameter size and infers nearly 50 times faster than the GPT-4V API. Our work demonstrates the potential for efficient, generalizable affordance reasoning models that can be deployed on local devices for task-oriented manipulations. We showcase Afford-X's effectiveness in enabling task-oriented manipulations for robots across various tasks and environments, underscoring its efficiency and broad implications for advancing robotics and AI systems in real-world applications.
中文: 本研究提出了大规模数据集LVIS-Aff和高效模型Afford-X,显著提升了任务导向应用中的功能推理能力,在保持紧凑结构和快速推理的同时,性能优于现有方法,适用于本地部署。
English: The study introduces LVIS-Aff, a large-scale dataset, and Afford-X, an efficient model that significantly improves affordance reasoning for task-oriented applications, outperforming existing methods while being compact and fast for local deployment.

Authors:Xing Tang, Yunpeng Weng, Fuyuan Lyu, Dugang Liu, Xiuqiang He
Title: A Predict-Then-Optimize Customer Allocation Framework for Online Fund Recommendation
Abstract:
With the rapid growth of online investment platforms, funds can be distributed to individual customers online. The central issue is to match funds with potential customers under constraints. Most mainstream platforms adopt the recommendation formulation to tackle the problem. However, the traditional recommendation regime has its inherent drawbacks when applying the fund-matching problem with multiple constraints. In this paper, we model the fund matching under the allocation formulation. We design PTOFA, a Predict-Then-Optimize Fund Allocation framework. This data-driven framework consists of two stages, i.e., prediction and optimization, which aim to predict expected revenue based on customer behavior and optimize the impression allocation to achieve the maximum revenue under the necessary constraints, respectively. Extensive experiments on real-world datasets from an industrial online investment platform validate the effectiveness and efficiency of our solution. Additionally, the online A/B tests demonstrate PTOFA's effectiveness in the real-world fund recommendation scenario.
Chinese: 本文提出PTOFA预测优化双阶段框架,通过预测客户收益和优化资金分配,在约束条件下实现收益最大化,真实场景实验验证了其有效性。
English: This paper introduces PTOFA, a two-stage Predict-Then-Optimize framework that predicts customer revenue and optimizes fund allocation to maximize returns under constraints, with real-world experiments confirming its effectiveness.

Authors:Rylan Schaeffer, Joshua Kazdan, Alvan Caleb Arulandu, Sanmi Koyejo
Title: Position: Model Collapse Does Not Mean What You Think
Abstract:
The proliferation of AI-generated content online has fueled concerns over \emph{model collapse}, a degradation in future generative models' performance when trained on synthetic data generated by earlier models. Industry leaders, premier research journals and popular science publications alike have prophesied catastrophic societal consequences stemming from model collapse. In this position piece, we contend this widespread narrative fundamentally misunderstands the scientific evidence. We highlight that research on model collapse actually encompasses eight distinct and at times conflicting definitions of model collapse, and argue that inconsistent terminology within and between papers has hindered building a comprehensive understanding of model collapse. To assess how significantly different interpretations of model collapse threaten future generative models, we posit what we believe are realistic conditions for studying model collapse and then conduct a rigorous assessment of the literature's methodologies through this lens. While we leave room for reasonable disagreement, our analysis of research studies, weighted by how faithfully each study matches real-world conditions, leads us to conclude that certain predicted claims of model collapse rely on assumptions and conditions that poorly match real-world conditions, and in fact several prominent collapse scenarios are readily avoidable. Altogether, this position paper argues that model collapse has been warped from a nuanced multifaceted consideration into an oversimplified threat, and that the evidence suggests specific harms more likely under society's current trajectory have received disproportionately less attention.
中文: 本文质疑模型崩溃的灾难性叙事,指出定义混乱和研究条件脱离实际导致威胁被夸大,同时使社会更可能面临的其他AI风险未获足够重视。
English: This position paper challenges the oversimplified narrative of catastrophic model collapse, arguing that inconsistent definitions and unrealistic research conditions have exaggerated the threat while diverting attention from more plausible AI harms.

Authors:Zhenpeng Chen, Chong Wang, Weisong Sun, Guang Yang, Xuanzhe Liu, Jie M. Zhang, Yang Liu
Title: Promptware Engineering: Software Engineering for LLM Prompt Development
Abstract:
Large Language Models (LLMs) are increasingly integrated into software applications, with prompts serving as the primary 'programming' interface to guide their behavior. As a result, a new software paradigm, promptware, has emerged, using natural language prompts to interact with LLMs and enabling complex tasks without traditional coding. Unlike traditional software, which relies on formal programming languages and deterministic runtime environments, promptware is based on ambiguous, unstructured, and context-dependent natural language and operates on LLMs as runtime environments, which are probabilistic and non-deterministic. These fundamental differences introduce unique challenges in prompt development. In practice, prompt development is largely ad hoc and experimental, relying on a time-consuming trial-and-error process - a challenge we term the 'promptware crisis.' To address this, we propose promptware engineering, a new methodology that adapts established software engineering principles to the process of prompt development. Building on decades of success in traditional software engineering, we envision a systematic framework that includes prompt requirements engineering, design, implementation, testing, debugging, and evolution. Unlike traditional software engineering, our framework is specifically tailored to the unique characteristics of prompt development. This paper outlines a comprehensive roadmap for promptware engineering, identifying key research directions and offering actionable insights to advance LLM-based software development.
Chinese Summary: 提示软件工程是一种新方法,它将软件工程原则应用于大型语言模型的提示开发,以应对其独特挑战,并提供了一个系统化的框架来指导提示的创建和演进。
English Summary: Promptware engineering is a new methodology that adapts software engineering principles to address the challenges of developing prompts for Large Language Models, offering a systematic framework for their creation and evolution.

Authors:Zihan Liu, Xinhao Luo, Junxian Guo, Wentao Ni, Yangjie Zhou, Yue Guan, Cong Guo, Weihao Cui, Yu Feng, Minyi Guo, Yuhao Zhu, Minjia Zhang, Jingwen Leng, Chen Jin
Title: VQ-LLM: High-performance Code Generation for Vector Quantization Augmented LLM Inference
Abstract:
In this work, we design and implement VQ-LLM, an efficient fused Vector Quantization (VQ) kernel generation framework. We first introduce a software abstraction called codebook cache to optimize codebook access efficiency and support the integration of VQ with various computations. The codebook cache adaptively stores different entries across the GPU's memory hierarchy, including off-chip global memory, on-chip shared memory, and registers. Centered around the codebook cache, we design an efficient computation engine that optimizes memory traffic during computations involving codebooks. This compute engine adopts the codebook-centric dataflow and fusion optimizations. Additionally, we provide adaptive heuristics to tailor parameter selection in our optimizations to diverse VQ configurations. Our optimizations achieve an average latency reduction of 46.13% compared to unoptimized versions. Compared to existing open-source implementations, our methods decrease latency by 64.36% to 99.1%. A final comparison with state-of-the-art element-wise quantization methods like AWQ and KVQuant shows that our VQ-LLM is practically viable, achieving latencies close or even better latencies to those at equivalent bit-widths, potentially offering greater accuracy.
中文: 本文提出VQ-LLM框架,通过代码本缓存优化和内存层次自适应管理,结合计算融合技术大幅降低处理延迟,相比现有方法最高可减少99.1%的延迟,在保持精度的同时实现了与先进量化方法相当甚至更优的性能。
English: This paper introduces VQ-LLM, an efficient vector quantization framework that optimizes codebook access and computation through adaptive memory hierarchy utilization and fusion techniques, achieving significant latency reductions compared to unoptimized and existing methods while maintaining competitive performance with state-of-the-art quantization approaches.

Authors:Yumeng Song, Yu Gu, Tianyi Li, Yushuai Li, Christian S. Jensen, Ge Yu
Title: Quantifying Point Contributions: A Lightweight Framework for Efficient and Effective Query-Driven Trajectory Simplification
Abstract:
As large volumes of trajectory data accumulate, simplifying trajectories to reduce storage and querying costs is increasingly studied. Existing proposals face three main problems. First, they require numerous iterations to decide which GPS points to delete. Second, they focus only on the relationships between neighboring points (local information) while neglecting the overall structure (global information), reducing the global similarity between the simplified and original trajectories and making it difficult to maintain consistency in query results, especially for similarity-based queries. Finally, they fail to differentiate the importance of points with similar features, leading to suboptimal selection of points to retain the original trajectory information. We propose MLSimp, a novel Mutual Learning query-driven trajectory simplification framework that integrates two distinct models: GNN-TS, based on graph neural networks, and Diff-TS, based on diffusion models. GNN-TS evaluates the importance of a point according to its globality, capturing its correlation with the entire trajectory, and its uniqueness, capturing its differences from neighboring points. It also incorporates attention mechanisms in the GNN layers, enabling simultaneous data integration from all points within the same trajectory and refining representations, thus avoiding iterative processes. Diff-TS generates amplified signals to enable the retention of the most important points at low compression rates. Experiments involving eight baselines on three databases show that MLSimp reduces the simplification time by 42%--70% and improves query accuracy over simplified trajectories by up to 34.6%.
中文: MLSimp是一种互学习框架,融合图神经网络和扩散模型,通过全局和独特性评估轨迹点重要性,大幅缩短简化时间并提升查询精度。
English: MLSimp is a mutual learning framework that combines graph neural networks and diffusion models to efficiently simplify trajectories by evaluating point importance globally and uniquely, significantly reducing processing time and improving query accuracy.

Authors:Yu Yan, Sheng Sun, Zenghao Duan, Teli Liu, Min Liu, Zhiyi Yin, Jiangyu Lei, Qi Li
Title: from Benign import Toxic: Jailbreaking the Language Model via Adversarial Metaphors
Abstract:
Current studies have exposed the risk of Large Language Models (LLMs) generating harmful content by jailbreak attacks. However, they overlook that the direct generation of harmful content from scratch is more difficult than inducing LLM to calibrate benign content into harmful forms. In our study, we introduce a novel attack framework that exploits AdVersArial meTAphoR (AVATAR) to induce the LLM to calibrate malicious metaphors for jailbreaking. Specifically, to answer harmful queries, AVATAR adaptively identifies a set of benign but logically related metaphors as the initial seed. Then, driven by these metaphors, the target LLM is induced to reason and calibrate about the metaphorical content, thus jailbroken by either directly outputting harmful responses or calibrating residuals between metaphorical and professional harmful content. Experimental results demonstrate that AVATAR can effectively and transferable jailbreak LLMs and achieve a state-of-the-art attack success rate across multiple advanced LLMs.
Chinese: 当前研究揭示,尽管越狱攻击可使大语言模型生成有害内容,但更有效的方法是诱导其将良性内容转化为有害形式,如新型AVATAR框架通过利用对抗性隐喻实现高成功率所展示。
English: Current research reveals that while jailbreak attacks can make Large Language Models produce harmful content, a more effective method involves manipulating them to transform benign content into harmful forms, as demonstrated by the novel AVATAR framework which achieves high success rates by leveraging adversarial metaphors.

Authors:Jianfang Chen, Kai Zhang, Aoran Gan, Shiwei Tong, Shuanghong Shen, Qi Liu
Title: Enhancing Knowledge Graph Completion with Entity Neighborhood and Relation Context
Abstract:
Knowledge Graph Completion (KGC) aims to infer missing information in Knowledge Graphs (KGs) to address their inherent incompleteness. Traditional structure-based KGC methods, while effective, face significant computational demands and scalability challenges due to the need for dense embedding learning and scoring all entities in the KG for each prediction. Recent text-based approaches using language models like T5 and BERT have mitigated these issues by converting KG triples into text for reasoning. However, they often fail to fully utilize contextual information, focusing mainly on the neighborhood of the entity and neglecting the context of the relation. To address this issue, we propose KGC-ERC, a framework that integrates both types of context to enrich the input of generative language models and enhance their reasoning capabilities. Additionally, we introduce a sampling strategy to effectively select relevant context within input token constraints, which optimizes the utilization of contextual information and potentially improves model performance. Experiments on the Wikidata5M, Wiki27K, and FB15K-237-N datasets show that KGC-ERC outperforms or matches state-of-the-art baselines in predictive performance and scalability.
Chinese: KGC-ERC是一种新颖的知识图谱补全框架,通过将实体和关系上下文整合到生成式语言模型中,并采用高效采样策略优化上下文利用,在多个基准数据集上实现了卓越的预测性能和可扩展性。
English: KGC-ERC is a novel framework that enhances knowledge graph completion by integrating entity and relation contexts into generative language models, coupled with an efficient sampling strategy to optimize context usage, achieving superior performance and scalability on benchmark datasets.

Authors:Kexian Tang, Junyao Gao, Yanhong Zeng, Haodong Duan, Yanan Sun, Zhening Xing, Wenran Liu, Kaifeng Lyu, Kai Chen
Title: LEGO-Puzzles: How Good Are MLLMs at Multi-Step Spatial Reasoning?
Abstract:
Multi-step spatial reasoning entails understanding and reasoning about spatial relationships across multiple sequential steps, which is crucial for tackling complex real-world applications, such as robotic manipulation, autonomous navigation, and automated assembly. To assess how well current Multimodal Large Language Models (MLLMs) have acquired this fundamental capability, we introduce LEGO-Puzzles, a scalable benchmark designed to evaluate both spatial understanding and sequential reasoning in MLLMs through LEGO-based tasks. LEGO-Puzzles consists of 1,100 carefully curated visual question-answering (VQA) samples spanning 11 distinct tasks, ranging from basic spatial understanding to complex multi-step reasoning. Based on LEGO-Puzzles, we conduct a comprehensive evaluation of 20 state-of-the-art MLLMs and uncover significant limitations in their spatial reasoning capabilities: even the most powerful MLLMs can answer only about half of the test cases, whereas human participants achieve over 90% accuracy. Furthermore, based on LEGO-Puzzles, we design generation tasks to investigate whether MLLMs can transfer their spatial understanding and reasoning abilities to image generation. Our experiments show that only GPT-4o and Gemini-2.0-Flash exhibit a limited ability to follow these instructions, while other MLLMs either replicate the input image or generate completely irrelevant outputs. Overall, LEGO-Puzzles exposes critical deficiencies in existing MLLMs' spatial understanding and sequential reasoning capabilities, and underscores the need for further advancements in multimodal spatial reasoning.
中文: LEGO-Puzzles基准通过1,100个视觉任务评估多模态大模型的空间与序列推理能力,发现即使最优模型的表现也远逊于人类,且难以将此类能力迁移至图像生成任务。
English: The LEGO-Puzzles benchmark evaluates multimodal large language models' spatial and sequential reasoning through 1,100 visual tasks, revealing that even top models perform significantly worse than humans and struggle with transferring these skills to image generation.

Authors:Haiyu Zhang, Xinyuan Chen, Yaohui Wang, Xihui Liu, Yunhong Wang, Yu Qiao
Title: AccVideo: Accelerating Video Diffusion Model with Synthetic Dataset
Abstract:
Diffusion models have achieved remarkable progress in the field of video generation. However, their iterative denoising nature requires a large number of inference steps to generate a video, which is slow and computationally expensive. In this paper, we begin with a detailed analysis of the challenges present in existing diffusion distillation methods and propose a novel efficient method, namely AccVideo, to reduce the inference steps for accelerating video diffusion models with synthetic dataset. We leverage the pretrained video diffusion model to generate multiple valid denoising trajectories as our synthetic dataset, which eliminates the use of useless data points during distillation. Based on the synthetic dataset, we design a trajectory-based few-step guidance that utilizes key data points from the denoising trajectories to learn the noise-to-video mapping, enabling video generation in fewer steps. Furthermore, since the synthetic dataset captures the data distribution at each diffusion timestep, we introduce an adversarial training strategy to align the output distribution of the student model with that of our synthetic dataset, thereby enhancing the video quality. Extensive experiments demonstrate that our model achieves 8.5x improvements in generation speed compared to the teacher model while maintaining comparable performance. Compared to previous accelerating methods, our approach is capable of generating videos with higher quality and resolution, i.e., 5-seconds, 720x1280, 24fps.
中文: 本文提出AccVideo方法,通过利用去噪轨迹构建合成数据集并结合对抗训练,在保持高质量视频生成的同时,将视频扩散模型的推理速度提升8.5倍,能生成5秒时长、720x1280分辨率的高清视频。
English: This paper introduces AccVideo, an efficient distillation method that accelerates video diffusion models by using synthetic datasets from denoising trajectories and adversarial training, achieving 8.5x faster generation while maintaining high-quality, high-resolution video output.

Authors:Xunguang Wang, Wenxuan Wang, Zhenlan Ji, Zongjie Li, Pingchuan Ma, Daoyuan Wu, Shuai Wang
Title: STShield: Single-Token Sentinel for Real-Time Jailbreak Detection in Large Language Models
Abstract:
Large Language Models (LLMs) have become increasingly vulnerable to jailbreak attacks that circumvent their safety mechanisms. While existing defense methods either suffer from adaptive attacks or require computationally expensive auxiliary models, we present STShield, a lightweight framework for real-time jailbroken judgement. STShield introduces a novel single-token sentinel mechanism that appends a binary safety indicator to the model's response sequence, leveraging the LLM's own alignment capabilities for detection. Our framework combines supervised fine-tuning on normal prompts with adversarial training using embedding-space perturbations, achieving robust detection while preserving model utility. Extensive experiments demonstrate that STShield successfully defends against various jailbreak attacks, while maintaining the model's performance on legitimate queries. Compared to existing approaches, STShield achieves superior defense performance with minimal computational overhead, making it a practical solution for real-world LLM deployment.
中文摘要:STShield是一种轻量级框架,通过引入单令牌哨兵机制并结合对抗训练,能在保持模型性能的同时实时有效防御越狱攻击,且计算开销极小。
English Summary: STShield is a lightweight framework that uses a single-token sentinel mechanism and adversarial training to effectively defend against jailbreak attacks in real-time while maintaining model performance with minimal computational overhead.

Authors:Jichen Hu, Chen Yang, Zanwei Zhou, Jiemin Fang, Xiaokang Yang, Qi Tian, Wei Shen
Title: Dereflection Any Image with Diffusion Priors and Diversified Data
Abstract:
Reflection removal of a single image remains a highly challenging task due to the complex entanglement between target scenes and unwanted reflections. Despite significant progress, existing methods are hindered by the scarcity of high-quality, diverse data and insufficient restoration priors, resulting in limited generalization across various real-world scenarios. In this paper, we propose Dereflection Any Image, a comprehensive solution with an efficient data preparation pipeline and a generalizable model for robust reflection removal. First, we introduce a dataset named Diverse Reflection Removal (DRR) created by randomly rotating reflective mediums in target scenes, enabling variation of reflection angles and intensities, and setting a new benchmark in scale, quality, and diversity. Second, we propose a diffusion-based framework with one-step diffusion for deterministic outputs and fast inference. To ensure stable learning, we design a three-stage progressive training strategy, including reflection-invariant finetuning to encourage consistent outputs across varying reflection patterns that characterize our dataset. Extensive experiments show that our method achieves SOTA performance on both common benchmarks and challenging in-the-wild images, showing superior generalization across diverse real-world scenes.
中文摘要:本文提出了一种全面的反光消除解决方案,通过创建多样化数据集并采用扩散模型框架,结合渐进式训练策略,在各类真实场景中实现了最先进的泛化性能。
English Summary: This paper introduces a comprehensive reflection removal solution featuring a diverse dataset and a diffusion-based model that achieves state-of-the-art performance across various real-world scenarios through progressive training.

Authors:Mishal Fatima Minhas, Rachmad Vidya Wicaksana Putra, Falah Awwad, Osman Hasan, Muhammad Shafique
Title: Replay4NCL: An Efficient Memory Replay-based Methodology for Neuromorphic Continual Learning in Embedded AI Systems
Abstract:
Neuromorphic Continual Learning (NCL) paradigm leverages Spiking Neural Networks (SNNs) to enable continual learning (CL) capabilities for AI systems to adapt to dynamically changing environments. Currently, the state-of-the-art employ a memory replay-based method to maintain the old knowledge. However, this technique relies on long timesteps and compression-decompression steps, thereby incurring significant latency and energy overheads, which are not suitable for tightly-constrained embedded AI systems (e.g., mobile agents/robotics). To address this, we propose Replay4NCL, a novel efficient memory replay-based methodology for enabling NCL in embedded AI systems. Specifically, Replay4NCL compresses the latent data (old knowledge), then replays them during the NCL training phase with small timesteps, to minimize the processing latency and energy consumption. To compensate the information loss from reduced spikes, we adjust the neuron threshold potential and learning rate settings. Experimental results on the class-incremental scenario with the Spiking Heidelberg Digits (SHD) dataset show that Replay4NCL can preserve old knowledge with Top-1 accuracy of 90.43% compared to 86.22% from the state-of-the-art, while effectively learning new tasks, achieving 4.88x latency speed-up, 20% latent memory saving, and 36.43% energy saving. These results highlight the potential of our Replay4NCL methodology to further advances NCL capabilities for embedded AI systems.
中文: 提出的Replay4NCL方法通过压缩潜在数据和优化神经元参数,为嵌入式AI系统实现高效的神经形态持续学习,相比现有方法获得了更高准确率、更快处理速度和显著节能效果。
English: The proposed Replay4NCL method enables efficient neuromorphic continual learning for embedded AI systems by compressing latent data and optimizing neuron parameters, achieving higher accuracy, faster processing, and significant energy savings compared to existing approaches.

Authors:Yue Xing, Wensheng Gan, Qidi Chen, Philip S. Yu
Title: AI-Generated Content in Landscape Architecture: A Survey
Abstract:
Landscape design is a complex process that requires designers to engage in intricate planning, analysis, and decision-making. This process involves the integration and reconstruction of science, art, and technology. Traditional landscape design methods often rely on the designer's personal experience and subjective aesthetics, with design standards rooted in subjective perception. As a result, they lack scientific and objective evaluation criteria and systematic design processes. Data-driven artificial intelligence (AI) technology provides an objective and rational design process. With the rapid development of different AI technologies, AI-generated content (AIGC) has permeated various aspects of landscape design at an unprecedented speed, serving as an innovative design tool. This article aims to explore the applications and opportunities of AIGC in landscape design. AIGC can support landscape design in areas such as site research and analysis, design concepts and scheme generation, parametric design optimization, plant selection and visual simulation, construction management, and process optimization. However, AIGC also faces challenges in landscape design, including data quality and reliability, design expertise and judgment, technical challenges and limitations, site characteristics and sustainability, user needs and participation, the balance between technology and creativity, ethics, and social impact. Finally, this article provides a detailed outlook on the future development trends and prospects of AIGC in landscape design. Through in-depth research and exploration in this review, readers can gain a better understanding of the relevant applications, potential opportunities, and key challenges of AIGC in landscape design.
Chinese: 数据驱动的人工智能技术,特别是AIGC,为景观设计提供了客观系统的创新工具,在场地分析与方案生成等方面带来机遇,同时也面临数据可靠性及伦理考量等挑战。
English: Data-driven AI technology, particularly AIGC, offers innovative tools for objective and systematic landscape design processes, presenting opportunities in site analysis and design generation while facing challenges in data reliability and ethical considerations.

Authors:Shibo Jie, Yehui Tang, Kai Han, Zhi-Hong Deng, Jing Han
Title: SpeCache: Speculative Key-Value Caching for Efficient Generation of LLMs
Abstract:
Transformer-based large language models (LLMs) have already achieved remarkable results on long-text tasks, but the limited GPU memory (VRAM) resources struggle to accommodate the linearly growing demand for key-value (KV) cache as the sequence length increases, which has become a bottleneck for the application of LLMs on long sequences. Existing KV cache compression methods include eviction, merging, or quantization of the KV cache to reduce its size. However, compression results in irreversible information forgetting, potentially affecting the accuracy of subsequent decoding. In this paper, we propose SpeCache, which takes full advantage of the large and easily expandable CPU memory to offload the complete KV cache, and dynamically fetches KV pairs back in each decoding step based on their importance measured by low-bit KV cache copy in VRAM. To avoid inference latency caused by CPU-GPU communication, SpeCache speculatively predicts the KV pairs that the next token might attend to, allowing us to prefetch them before the next decoding step which enables parallelization of prefetching and computation. Experiments on LongBench and Needle-in-a-Haystack benchmarks verify that SpeCache effectively reduces VRAM usage while avoiding information forgetting for long sequences without re-training, even with a 10x high KV cache compression ratio.
中文: SpeCache通过将完整键值缓存卸载到CPU内存,并根据重要性动态预取关键对,有效解决了GPU内存限制问题,实现了无需重新训练的长序列高效处理且无信息丢失。
English: SpeCache addresses the GPU memory bottleneck in large language models by offloading the full key-value cache to CPU memory and dynamically prefetching essential pairs based on importance, enabling efficient long-sequence processing without information loss.

Authors:Zenghui Yuan, Jiawen Shi, Pan Zhou, Neil Zhenqiang Gong, Lichao Sun
Title: BadToken: Token-level Backdoor Attacks to Multi-modal Large Language Models
Abstract:
Multi-modal large language models (MLLMs) extend large language models (LLMs) to process multi-modal information, enabling them to generate responses to image-text inputs. MLLMs have been incorporated into diverse multi-modal applications, such as autonomous driving and medical diagnosis, via plug-and-play without fine-tuning. This deployment paradigm increases the vulnerability of MLLMs to backdoor attacks. However, existing backdoor attacks against MLLMs achieve limited effectiveness and stealthiness. In this work, we propose BadToken, the first token-level backdoor attack to MLLMs. BadToken introduces two novel backdoor behaviors: Token-substitution and Token-addition, which enable flexible and stealthy attacks by making token-level modifications to the original output for backdoored inputs. We formulate a general optimization problem that considers the two backdoor behaviors to maximize the attack effectiveness. We evaluate BadToken on two open-source MLLMs and various tasks. Our results show that our attack maintains the model's utility while achieving high attack success rates and stealthiness. We also show the real-world threats of BadToken in two scenarios, i.e., autonomous driving and medical diagnosis. Furthermore, we consider defenses including fine-tuning and input purification. Our results highlight the threat of our attack.
中文摘要:BadToken是针对多模态大语言模型的首个令牌级后门攻击方法,通过令牌替换和添加实现灵活隐蔽的攻击,在保持模型效用的同时达到高攻击成功率,对自动驾驶和医疗诊断等实际应用构成严重威胁。
English Summary: BadToken is a token-level backdoor attack method for multi-modal large language models that achieves high attack success and stealthiness through token manipulation while preserving model utility, posing significant threats to real-world applications like autonomous driving and medical diagnosis.

Authors:Jinyi Liu, Yan Zheng, Rong Cheng, Qiyu Wu, Wei Guo, Fei Ni, Hebin Liang, Yifu Yuan, Hangyu Mao, Fuzheng Zhang, Jianye Hao
Title: From Chaos to Order: The Atomic Reasoner Framework for Fine-grained Reasoning in Large Language Models
Abstract:
Recent advances in large language models (LLMs) have shown remarkable progress, yet their capacity for logical ``slow-thinking'' reasoning persists as a critical research frontier. Current inference scaling paradigms suffer from two fundamental constraints: fragmented thought flows compromising logical coherence, and intensively computational complexity that escalates with search space dimensions. To overcome these limitations, we present \textbf{Atomic Reasoner} (\textbf{AR}), a cognitive inference strategy that enables fine-grained reasoning through systematic atomic-level operations. AR decomposes the reasoning process into atomic cognitive units, employing a cognitive routing mechanism to dynamically construct reasoning representations and orchestrate inference pathways. This systematic methodology implements stepwise, structured cognition, which ensures logical coherence while significantly reducing cognitive load, effectively simulating the cognitive patterns observed in human deep thinking processes. Extensive experimental results demonstrate AR's superior reasoning capabilities without the computational burden of exhaustive solution searches, particularly excelling in linguistic logic puzzles. These findings substantiate AR's effectiveness in enhancing LLMs' capacity for robust, long-sequence logical reasoning and deliberation.
中文摘要:原子推理器(AR)策略通过将推理过程分解为原子认知单元,动态构建推理路径,从而提升大语言模型的逻辑连贯性并降低计算负担,在语言逻辑谜题中验证了其卓越性能。
English Summary: The Atomic Reasoner (AR) strategy enhances large language models' logical reasoning by decomposing processes into atomic cognitive units, improving coherence while reducing computational load, as validated in linguistic logic puzzles.

Authors:Chejian Xu, Jiawei Zhang, Zhaorun Chen, Chulin Xie, Mintong Kang, Yujin Potter, Zhun Wang, Zhuowen Yuan, Alexander Xiong, Zidi Xiong, Chenhui Zhang, Lingzhi Yuan, Yi Zeng, Peiyang Xu, Chengquan Guo, Andy Zhou, Jeffrey Ziwei Tan, Xuandong Zhao, Francesco Pinto, Zhen Xiang, Yu Gai, Zinan Lin, Dan Hendrycks, Bo Li, Dawn Song
Title: MMDT: Decoding the Trustworthiness and Safety of Multimodal Foundation Models
Abstract:
Multimodal foundation models (MMFMs) play a crucial role in various applications, including autonomous driving, healthcare, and virtual assistants. However, several studies have revealed vulnerabilities in these models, such as generating unsafe content by text-to-image models. Existing benchmarks on multimodal models either predominantly assess the helpfulness of these models, or only focus on limited perspectives such as fairness and privacy. In this paper, we present the first unified platform, MMDT (Multimodal DecodingTrust), designed to provide a comprehensive safety and trustworthiness evaluation for MMFMs. Our platform assesses models from multiple perspectives, including safety, hallucination, fairness/bias, privacy, adversarial robustness, and out-of-distribution (OOD) generalization. We have designed various evaluation scenarios and red teaming algorithms under different tasks for each perspective to generate challenging data, forming a high-quality benchmark. We evaluate a range of multimodal models using MMDT, and our findings reveal a series of vulnerabilities and areas for improvement across these perspectives. This work introduces the first comprehensive and unique safety and trustworthiness evaluation platform for MMFMs, paving the way for developing safer and more reliable MMFMs and systems. Our platform and benchmark are available at https://mmdecodingtrust.github.io/.
中文: 本文提出了首个统一平台MMDT,旨在从安全性、幻觉、公平性等多维度全面评估多模态基础模型的安全可信度,揭示了模型漏洞并为开发更可靠的系统铺平了道路。
English: This paper introduces MMDT, the first unified platform for comprehensively evaluating the safety and trustworthiness of multimodal foundation models across multiple dimensions such as safety, hallucination, and fairness, revealing vulnerabilities and paving the way for more reliable systems.

Authors:Yizhou Xu, Antoine Maillard, Lenka Zdeborová, Florent Krzakala
Title: Fundamental Limits of Matrix Sensing: Exact Asymptotics, Universality, and Applications
Abstract:
In the matrix sensing problem, one wishes to reconstruct a matrix from (possibly noisy) observations of its linear projections along given directions. We consider this model in the high-dimensional limit: while previous works on this model primarily focused on the recovery of low-rank matrices, we consider in this work more general classes of structured signal matrices with potentially large rank, e.g. a product of two matrices of sizes proportional to the dimension. We provide rigorous asymptotic equations characterizing the Bayes-optimal learning performance from a number of samples which is proportional to the number of entries in the matrix. Our proof is composed of three key ingredients: $(i)$ we prove universality properties to handle structured sensing matrices, related to the ''Gaussian equivalence'' phenomenon in statistical learning, $(ii)$ we provide a sharp characterization of Bayes-optimal learning in generalized linear models with Gaussian data and structured matrix priors, generalizing previously studied settings, and $(iii)$ we leverage previous works on the problem of matrix denoising. The generality of our results allow for a variety of applications: notably, we mathematically establish predictions obtained via non-rigorous methods from statistical physics in [ETB+24] regarding Bilinear Sequence Regression, a benchmark model for learning from sequences of tokens, and in [MTM+24] on Bayes-optimal learning in neural networks with quadratic activation function, and width proportional to the dimension.
中文摘要:本研究将矩阵感知从低秩恢复扩展到高秩结构化矩阵,通过样本量与矩阵元素数量成比例的情况,为贝叶斯最优学习性能提供了严格的渐近方程描述。
English Summary: This study extends matrix sensing beyond low-rank recovery to include high-rank structured matrices, providing rigorous asymptotic equations for Bayes-optimal performance with sample sizes proportional to matrix entries.

Authors:Hang Zhao, Hongru Li, Dongfang Xu, Shenghui Song, Khaled B. Letaief
Title: Multi-Modal Self-Supervised Semantic Communication
Abstract:
Semantic communication is emerging as a promising paradigm that focuses on the extraction and transmission of semantic meanings using deep learning techniques. While current research primarily addresses the reduction of semantic communication overhead, it often overlooks the training phase, which can incur significant communication costs in dynamic wireless environments. To address this challenge, we propose a multi-modal semantic communication system that leverages multi-modal self-supervised learning to enhance task-agnostic feature extraction. The proposed approach employs self-supervised learning during the pre-training phase to extract task-agnostic semantic features, followed by supervised fine-tuning for downstream tasks. This dual-phase strategy effectively captures both modality-invariant and modality-specific features while minimizing training-related communication overhead. Experimental results on the NYU Depth V2 dataset demonstrate that the proposed method significantly reduces training-related communication overhead while maintaining or exceeding the performance of existing supervised learning approaches. The findings underscore the advantages of multi-modal self-supervised learning in semantic communication, paving the way for more efficient and scalable edge inference systems.
中文: 提出的多模态语义通信系统采用自监督学习提取任务无关特征,在NYU Depth V2数据集上显著降低了训练开销,同时保持或超越了监督学习方法的性能。
English: The proposed multi-modal semantic communication system utilizes self-supervised learning to extract task-agnostic features, significantly reducing training overhead while maintaining or surpassing the performance of supervised methods on the NYU Depth V2 dataset.

Authors:Cheng Yuan, Zhening Liu, Jiashu Lv, Jiawei Shao, Yufei Jiang, Jun Zhang, Xuelong Li
Title: Task-Oriented Feature Compression for Multimodal Understanding via Device-Edge Co-Inference
Abstract:
With the rapid development of large multimodal models (LMMs), multimodal understanding applications are emerging. As most LMM inference requests originate from edge devices with limited computational capabilities, the predominant inference pipeline involves directly forwarding the input data to an edge server which handles all computations. However, this approach introduces high transmission latency due to limited uplink bandwidth of edge devices and significant computation latency caused by the prohibitive number of visual tokens, thus hindering delay-sensitive tasks and degrading user experience. To address this challenge, we propose a task-oriented feature compression (TOFC) method for multimodal understanding in a device-edge co-inference framework, where visual features are merged by clustering and encoded by a learnable and selective entropy model before feature projection. Specifically, we employ density peaks clustering based on K nearest neighbors to reduce the number of visual features, thereby minimizing both data transmission and computational complexity. Subsequently, a learnable entropy model with hyperprior is utilized to encode and decode merged features, further reducing transmission overhead. To enhance compression efficiency, multiple entropy models are adaptively selected based on the characteristics of the visual features, enabling a more accurate estimation of the probability distribution. Comprehensive experiments on seven visual question answering benchmarks validate the effectiveness of the proposed TOFC method. Results show that TOFC achieves up to 52% reduction in data transmission overhead and 63% reduction in system latency while maintaining identical task performance, compared with neural compression ELIC.
中文: 提出的面向任务特征压缩方法通过聚类和编码视觉特征,有效降低了多模态理解中的传输与计算延迟,在保持任务性能的同时显著提升了系统效率。
English: The proposed task-oriented feature compression method reduces transmission and computation latency in multimodal understanding by clustering and encoding visual features, achieving significant efficiency gains without compromising task performance.

Authors:Jiaqing Zhang, Miguel Contreras, Jessica Sena, Andrea Davidson, Yuanfang Ren, Ziyuan Guan, Tezcan Ozrazgat-Baslanti, Tyler J. Loftus, Subhash Nerella, Azra Bihorac, Parisa Rashidi
Title: MELON: Multimodal Mixture-of-Experts with Spectral-Temporal Fusion for Long-Term Mobility Estimation in Critical Care
Abstract:
Patient mobility monitoring in intensive care is critical for ensuring timely interventions and improving clinical outcomes. While accelerometry-based sensor data are widely adopted in training artificial intelligence models to estimate patient mobility, existing approaches face two key limitations highlighted in clinical practice: (1) modeling the long-term accelerometer data is challenging due to the high dimensionality, variability, and noise, and (2) the absence of efficient and robust methods for long-term mobility assessment. To overcome these challenges, we introduce MELON, a novel multimodal framework designed to predict 12-hour mobility status in the critical care setting. MELON leverages the power of a dual-branch network architecture, combining the strengths of spectrogram-based visual representations and sequential accelerometer statistical features. MELON effectively captures global and fine-grained mobility patterns by integrating a pre-trained image encoder for rich frequency-domain feature extraction and a Mixture-of-Experts encoder for sequence modeling. We trained and evaluated the MELON model on the multimodal dataset of 126 patients recruited from nine Intensive Care Units at the University of Florida Health Shands Hospital main campus in Gainesville, Florida. Experiments showed that MELON outperforms conventional approaches for 12-hour mobility status estimation with an overall area under the receiver operating characteristic curve (AUROC) of 0.82 (95\%, confidence interval 0.78-0.86). Notably, our experiments also revealed that accelerometer data collected from the wrist provides robust predictive performance compared with data from the ankle, suggesting a single-sensor solution that can reduce patient burden and lower deployment costs...
中文: MELON是一种新型多模态框架,通过结合基于频谱图的视觉特征和加速度计序列数据,克服了长期患者活动监测中的挑战,以0.82的AUROC实现了优异的12小时活动预测,并验证了腕部传感器的有效性。
English: MELON is a novel multimodal framework that overcomes challenges in long-term patient mobility monitoring by combining spectrogram-based visual features and sequential accelerometer data, achieving superior 12-hour mobility prediction with an AUROC of 0.82 and demonstrating wrist sensor efficacy.

Authors:Tianrui Pan, Lin Liu, Jie Liu, Xiaopeng Zhang, Jie Tang, Gangshan Wu, Qi Tian
Title: RASA: Replace Anyone, Say Anything -- A Training-Free Framework for Audio-Driven and Universal Portrait Video Editing
Abstract:
Portrait video editing focuses on modifying specific attributes of portrait videos, guided by audio or video streams. Previous methods typically either concentrate on lip-region reenactment or require training specialized models to extract keypoints for motion transfer to a new identity. In this paper, we introduce a training-free universal portrait video editing framework that provides a versatile and adaptable editing strategy. This framework supports portrait appearance editing conditioned on the changed first reference frame, as well as lip editing conditioned on varied speech, or a combination of both. It is based on a Unified Animation Control (UAC) mechanism with source inversion latents to edit the entire portrait, including visual-driven shape control, audio-driven speaking control, and inter-frame temporal control. Furthermore, our method can be adapted to different scenarios by adjusting the initial reference frame, enabling detailed editing of portrait videos with specific head rotations and facial expressions. This comprehensive approach ensures a holistic and flexible solution for portrait video editing. The experimental results show that our model can achieve more accurate and synchronized lip movements for the lip editing task, as well as more flexible motion transfer for the appearance editing task. Demo is available at https://alice01010101.github.io/RASA/.
中文: 本文提出了一种无需训练的通用人像视频编辑框架,通过统一动画控制机制支持外观与唇形编辑,实现了精确的唇语同步和灵活的运动迁移。
English: This paper presents a training-free universal portrait video editing framework that supports appearance and lip editing through a Unified Animation Control mechanism, achieving accurate lip synchronization and flexible motion transfer.

Authors:Jing Bi, Junjia Guo, Susan Liang, Guangyu Sun, Luchuan Song, Yunlong Tang, Jinxi He, Jiarui Wu, Ali Vosoughi, Chen Chen, Chenliang Xu
Title: VERIFY: A Benchmark of Visual Explanation and Reasoning for Investigating Multimodal Reasoning Fidelity
Abstract:
Visual reasoning is central to human cognition, enabling individuals to interpret and abstractly understand their environment. Although recent Multimodal Large Language Models (MLLMs) have demonstrated impressive performance across language and vision-language tasks, existing benchmarks primarily measure recognition-based skills and inadequately assess true visual reasoning capabilities. To bridge this critical gap, we introduce VERIFY, a benchmark explicitly designed to isolate and rigorously evaluate the visual reasoning capabilities of state-of-the-art MLLMs. VERIFY compels models to reason primarily from visual information, providing minimal textual context to reduce reliance on domain-specific knowledge and linguistic biases. Each problem is accompanied by a human-annotated reasoning path, making it the first to provide in-depth evaluation of model decision-making processes. Additionally, we propose novel metrics that assess visual reasoning fidelity beyond mere accuracy, highlighting critical imbalances in current model reasoning patterns. Our comprehensive benchmarking of leading MLLMs uncovers significant limitations, underscoring the need for a balanced and holistic approach to both perception and reasoning. For more teaser and testing, visit our project page (https://verify-eqh.pages.dev/).
中文: VERIFY基准通过隔离视觉线索和提供人工标注的推理路径,严格评估多模态大语言模型的视觉推理能力,揭示了当前模型在识别优势之外的显著局限性。
English: The VERIFY benchmark is introduced to rigorously evaluate visual reasoning in MLLMs by isolating visual cues and providing human-annotated reasoning paths, revealing significant limitations in current models despite their recognition strengths.

Authors:Vida Gholamiyan, Yaning Zhao, Wafa Labidi, Holger Boche, Christian Deppe
Title: Security and Privacy: Key Requirements for Molecular Communication in Medicine and Healthcare
Abstract:
Molecular communication (MC) is an emerging paradigm that enables data transmission through biochemical signals rather than traditional electromagnetic waves. This approach is particularly promising for environments where conventional wireless communication is impractical, such as within the human body. However, security and privacy pose significant challenges that must be addressed to ensure reliable communication. Moreover, MC is often event-triggered, making it logical to adopt goal-oriented communication strategies, similar to those used in message identification. This work explores secure identification strategies for MC, with a focus on the information-theoretic security of message identification over Poisson wiretap channels (DT-PWC).
中文摘要:本研究探索分子通信中的安全识别策略,重点研究泊松窃听信道在事件触发场景下消息识别的信息论安全性。
English Summary: This work investigates secure identification strategies for molecular communication, focusing on information-theoretic security for message transmission over Poisson wiretap channels in event-triggered scenarios.

Authors:Xueyang Zhou, Guiyao Tie, Guowen Zhang, Weidong Wang, Zhigang Zuo, Di Wu, Duanfeng Chu, Pan Zhou, Neil Zhenqiang Gong, Lichao Sun
Title: Exploring the Necessity of Reasoning in LLM-based Agent Scenarios
Abstract:
The rise of Large Reasoning Models (LRMs) signifies a paradigm shift toward advanced computational reasoning. Yet, this progress disrupts traditional agent frameworks, traditionally anchored by execution-oriented Large Language Models (LLMs). To explore this transformation, we propose the LaRMA framework, encompassing nine tasks across Tool Usage, Plan Design, and Problem Solving, assessed with three top LLMs (e.g., Claude3.5-sonnet) and five leading LRMs (e.g., DeepSeek-R1). Our findings address four research questions: LRMs surpass LLMs in reasoning-intensive tasks like Plan Design, leveraging iterative reflection for superior outcomes; LLMs excel in execution-driven tasks such as Tool Usage, prioritizing efficiency; hybrid LLM-LRM configurations, pairing LLMs as actors with LRMs as reflectors, optimize agent performance by blending execution speed with reasoning depth; and LRMs' enhanced reasoning incurs higher computational costs, prolonged processing, and behavioral challenges, including overthinking and fact-ignoring tendencies. This study fosters deeper inquiry into LRMs' balance of deep thinking and overthinking, laying a critical foundation for future agent design advancements.
中文摘要:LaRMA框架研究表明,大型推理模型在需要迭代反思的推理任务中优于大型语言模型,而后者在执行任务中表现更佳,混合配置能优化智能体性能,尽管推理模型存在计算成本高和行为偏差等挑战。
English Summary: The LaRMA framework reveals LRMs outperform LLMs in reasoning tasks through iterative reflection while LLMs excel in execution tasks, with hybrid configurations optimizing agent performance despite LRMs' higher computational costs and behavioral challenges.

Authors:Haiqin Cui, Yifu Yuan, Yan Zheng, Jianye Hao
Title: AhaRobot: A Low-Cost Open-Source Bimanual Mobile Manipulator for Embodied AI
Abstract:
Navigation and manipulation in open-world environments remain unsolved challenges in the Embodied AI. The high cost of commercial mobile manipulation robots significantly limits research in real-world scenes. To address this issue, we propose AhaRobot, a low-cost and fully open-source dual-arm mobile manipulation robot system with a hardware cost of only $1,000 (excluding optional computational resources), which is less than 1/15 of the cost of popular mobile robots. The AhaRobot system consists of three components: (1) a novel low-cost hardware architecture primarily composed of off-the-shelf components, (2) an optimized control solution to enhance operational precision integrating dual-motor backlash control and static friction compensation, and (3) a simple remote teleoperation method RoboPilot. We use handles to control the dual arms and pedals for whole-body movement. The teleoperation process is low-burden and easy to operate, much like piloting. RoboPilot is designed for remote data collection in embodied scenarios. Experimental results demonstrate that RoboPilot significantly enhances data collection efficiency in complex manipulation tasks, achieving a 30% increase compared to methods using 3D mouse and leader-follower systems. It also excels at completing extremely long-horizon tasks in one go. Furthermore, AhaRobot can be used to learn end-to-end policies and autonomously perform complex manipulation tasks, such as pen insertion and cleaning up the floor. We aim to build an affordable yet powerful platform to promote the development of embodied tasks on real devices, advancing more robust and reliable embodied AI. All hardware and software systems are available at https://aha-robot.github.io.
Chinese: AhaRobot是一款成本仅1000美元的低成本开源双臂移动操作机器人,集成了创新硬件、优化控制和RoboPilot远程操作系统,显著提升了具身AI场景中的数据采集效率和复杂任务执行能力。
English: The AhaRobot is a low-cost, open-source dual-arm mobile manipulation robot priced at $1,000, featuring innovative hardware, optimized control, and RoboPilot teleoperation to enhance data collection and task performance in embodied AI.

Authors:Miao Yu, Fanci Meng, Xinyun Zhou, Shilong Wang, Junyuan Mao, Linsey Pang, Tianlong Chen, Kun Wang, Xinfeng Li, Yongfeng Zhang, Bo An, Qingsong Wen
Title: A Survey on Trustworthy LLM Agents: Threats and Countermeasures
Abstract:
With the rapid evolution of Large Language Models (LLMs), LLM-based agents and Multi-agent Systems (MAS) have significantly expanded the capabilities of LLM ecosystems. This evolution stems from empowering LLMs with additional modules such as memory, tools, environment, and even other agents. However, this advancement has also introduced more complex issues of trustworthiness, which previous research focused solely on LLMs could not cover. In this survey, we propose the TrustAgent framework, a comprehensive study on the trustworthiness of agents, characterized by modular taxonomy, multi-dimensional connotations, and technical implementation. By thoroughly investigating and summarizing newly emerged attacks, defenses, and evaluation methods for agents and MAS, we extend the concept of Trustworthy LLM to the emerging paradigm of Trustworthy Agent. In TrustAgent, we begin by deconstructing and introducing various components of the Agent and MAS. Then, we categorize their trustworthiness into intrinsic (brain, memory, and tool) and extrinsic (user, agent, and environment) aspects. Subsequently, we delineate the multifaceted meanings of trustworthiness and elaborate on the implementation techniques of existing research related to these internal and external modules. Finally, we present our insights and outlook on this domain, aiming to provide guidance for future endeavors.
中文:TrustAgent框架通过将可信度分为内在和外在方面,全面研究基于大语言模型的智能体和多智能体系统的可信度,详细阐述攻击、防御和评估方法,从而将可信赖原则从大语言模型扩展到智能体领域。
English: The TrustAgent framework comprehensively addresses the trustworthiness of LLM-based agents and multi-agent systems by categorizing it into intrinsic and extrinsic aspects, detailing attacks, defenses, and evaluations to extend trustworthy principles from LLMs to agents.

Authors:Andrea E. Davidson, Jessica M. Ray, Ayush K. Patel, Yulia Strekalova Levites, Parisa Rashidi, Azra Bihorac
Title: An Iterative, User-Centered Design of a Clinical Decision Support System for Critical Care Assessments: Co-Design Sessions with ICU Clinical Providers
Abstract:
This study reports the findings of qualitative interview sessions conducted with ICU clinicians for the co-design of a system user interface of an artificial intelligence (AI)-driven clinical decision support (CDS) system. This system integrates medical record data with wearable sensor, video, and environmental data into a real-time dynamic model that quantifies patients' risk of clinical decompensation and risk of developing delirium, providing actionable alerts to augment clinical decision-making in the ICU setting. Co-design sessions were conducted as semi-structured focus groups and interviews with ICU clinicians, including physicians, mid-level practitioners, and nurses. Study participants were asked about their perceptions on AI-CDS systems, their system preferences, and were asked to provide feedback on the current user interface prototype. Session transcripts were qualitatively analyzed to identify key themes related to system utility, interface design features, alert preferences, and implementation considerations. Ten clinicians participated in eight sessions. The analysis identified five themes: (1) AI's computational utility, (2) workflow optimization, (3) effects on patient care, (4) technical considerations, and (5) implementation considerations. Clinicians valued the CDS system's multi-modal continuous monitoring and AI's capacity to process large volumes of data in real-time to identify patient risk factors and suggest action items. Participants underscored the system's unique value in detecting delirium and promoting non-pharmacological delirium prevention measures. The actionability and intuitive interpretation of the presented information was emphasized. ICU clinicians recognize the potential of an AI-driven CDS system for ICU delirium and acuity to improve patient outcomes and clinical workflows.
中文: 本研究通过ICU临床医生共同参与设计,开发了一种整合多模态数据的AI临床决策支持系统,用于实时评估患者病情恶化及谵妄风险,参与者强调该系统通过可操作警报和工作流程优化显著提升了护理质量。
English: This study engaged ICU clinicians in co-designing an AI-driven clinical decision support system that integrates multi-modal data for real-time risk assessment of patient decompensation and delirium, with participants highlighting its utility in enhancing care through actionable alerts and workflow optimization.

Authors:Yuanfang Ren, Andrea E. Davidson, Jiaqing Zhang, Miguel Contreras, Ayush K. Patel, Michelle Gumz, Tezcan Ozrazgat-Baslanti, Parisa Rashidi, Azra Bihorac
Title: Quantifying Circadian Desynchrony in ICU Patients and Its Association with Delirium
Abstract:
Background: Circadian desynchrony characterized by the misalignment between an individual's internal biological rhythms and external environmental cues, significantly affects various physiological processes and health outcomes. Quantifying circadian desynchrony often requires prolonged and frequent monitoring, and currently, an easy tool for this purpose is missing. Additionally, its association with the incidence of delirium has not been clearly explored. Methods: A prospective observational study was carried out in intensive care units (ICU) of a tertiary hospital. Circadian transcriptomics of blood monocytes from 86 individuals were collected on two consecutive days, although a second sample could not be obtained from all participants. Using two public datasets comprised of healthy volunteers, we replicated a model for determining internal circadian time. We developed an approach to quantify circadian desynchrony by comparing internal circadian time and external blood collection time. We applied the model and quantified circadian desynchrony index among ICU patients, and investigated its association with the incidence of delirium. Results: The replicated model for determining internal circadian time achieved comparable high accuracy. The quantified circadian desynchrony index was significantly higher among critically ill ICU patients compared to healthy subjects, with values of 10.03 hours vs 2.50-2.95 hours (p < 0.001). Most ICU patients had a circadian desynchrony index greater than 9 hours. Additionally, the index was lower in patients whose blood samples were drawn after 3pm, with values of 5.00 hours compared to 10.01-10.90 hours in other groups (p < 0.001)...
中文: 本研究开发了一种利用血液单核细胞转录组学量化ICU患者昼夜节律失调的方法,发现危重患者节律失调程度显著更高,且与下午采血时间相关。
English: This study developed a method to quantify circadian desynchrony in ICU patients using blood monocyte transcriptomics, revealing significantly higher desynchrony levels in critically ill patients and an association with afternoon blood collection times.

Authors:Ruiqi Zhang, Hao Zhu, Jingyi Zhao, Qi Zhang, Xun Cao, Zhan Ma
Title: Mitigating Ambiguities in 3D Classification with Gaussian Splatting
Abstract:
3D classification with point cloud input is a fundamental problem in 3D vision. However, due to the discrete nature and the insufficient material description of point cloud representations, there are ambiguities in distinguishing wire-like and flat surfaces, as well as transparent or reflective objects. To address these issues, we propose Gaussian Splatting (GS) point cloud-based 3D classification. We find that the scale and rotation coefficients in the GS point cloud help characterize surface types. Specifically, wire-like surfaces consist of multiple slender Gaussian ellipsoids, while flat surfaces are composed of a few flat Gaussian ellipsoids. Additionally, the opacity in the GS point cloud represents the transparency characteristics of objects. As a result, ambiguities in point cloud-based 3D classification can be mitigated utilizing GS point cloud as input. To verify the effectiveness of GS point cloud input, we construct the first real-world GS point cloud dataset in the community, which includes 20 categories with 200 objects in each category. Experiments not only validate the superiority of GS point cloud input, especially in distinguishing ambiguous objects, but also demonstrate the generalization ability across different classification methods.
中文摘要:高斯溅射点云通过尺度、旋转参数区分线状与平面结构,并利用不透明度表征物体透明度特性,有效解决了传统点云分类中的模糊性问题,实验基于首个真实世界数据集验证了其优越性和泛化能力。
English Summary: Gaussian Splatting point cloud representation enhances 3D classification by utilizing scale, rotation, and opacity parameters to resolve ambiguities in distinguishing wire-like structures, flat surfaces, and transparent objects, with experimental validation on a novel real-world dataset.

Authors:Shaobin Zhuang, Yiwei Guo, Yanbo Ding, Kunchang Li, Xinyuan Chen, Yaohui Wang, Fangyikang Wang, Ying Zhang, Chen Li, Yali Wang
Title: TimeStep Master: Asymmetrical Mixture of Timestep LoRA Experts for Versatile and Efficient Diffusion Models in Vision
Abstract:
Diffusion models have driven the advancement of vision generation over the past years. However, it is often difficult to apply these large models in downstream tasks, due to massive fine-tuning cost. Recently, Low-Rank Adaptation (LoRA) has been applied for efficient tuning of diffusion models. Unfortunately, the capabilities of LoRA-tuned diffusion models are limited, since the same LoRA is used for different timesteps of the diffusion process. To tackle this problem, we introduce a general and concise TimeStep Master (TSM) paradigm with two key fine-tuning stages. In the fostering stage (1-stage), we apply different LoRAs to fine-tune the diffusion model at different timestep intervals. This results in different TimeStep LoRA experts that can effectively capture different noise levels. In the assembling stage (2-stage), we design a novel asymmetrical mixture of TimeStep LoRA experts, via core-context collaboration of experts at multi-scale intervals. For each timestep, we leverage TimeStep LoRA expert within the smallest interval as the core expert without gating, and use experts within the bigger intervals as the context experts with time-dependent gating. Consequently, our TSM can effectively model the noise level via the expert in the finest interval, and adaptively integrate contexts from the experts of other scales, boosting the versatility of diffusion models. To show the effectiveness of our TSM paradigm, we conduct extensive experiments on three typical and popular LoRA-related tasks of diffusion models, including domain adaptation, post-pretraining, and model distillation. Our TSM achieves the state-of-the-art results on all these tasks, throughout various model structures (UNet, DiT and MM-DiT) and visual data modalities (Image, Video), showing its remarkable generalization capacity.
中文摘要:TimeStep Master (TSM) 范式通过分阶段培养时间步特定的LoRA专家并采用非对称多尺度协作进行集成,显著提升了扩散模型的微调效果,在多种任务和数据类型上均取得了领先性能。
English Summary: The TimeStep Master (TSM) paradigm enhances diffusion model fine-tuning by fostering timestep-specific LoRA experts and assembling them through asymmetrical multi-scale collaboration, achieving state-of-the-art results across diverse tasks and data types.

Authors:Ning Ding, Jing Han, Yuchuan Tian, Chao Xu, Kai Han, Yehui Tang
Title: Post-Training Quantization for Diffusion Transformer via Hierarchical Timestep Grouping
Abstract:
Diffusion Transformer (DiT) has now become the preferred choice for building image generation models due to its great generation capability. Unlike previous convolution-based UNet models, DiT is purely composed of a stack of transformer blocks, which renders DiT excellent in scalability like large language models. However, the growing model size and multi-step sampling paradigm bring about considerable pressure on deployment and inference. In this work, we propose a post-training quantization framework tailored for Diffusion Transforms to tackle these challenges. We firstly locate that the quantization difficulty of DiT mainly originates from the time-dependent channel-specific outliers. We propose a timestep-aware shift-and-scale strategy to smooth the activation distribution to reduce the quantization error. Secondly, based on the observation that activations of adjacent timesteps have similar distributions, we utilize a hierarchical clustering scheme to divide the denoising timesteps into multiple groups. We further design a re-parameterization scheme which absorbs the quantization parameters into nearby module to avoid redundant computations. Comprehensive experiments demonstrate that out PTQ method successfully quantize the Diffusion Transformer into 8-bit weight and 8-bit activation (W8A8) with state-of-the-art FiD score. And our method can further quantize DiT model into 4-bit weight and 8-bit activation (W4A8) without sacrificing generation quality.
中文: 本文提出了一种针对扩散变换器的训练后量化框架,通过时间感知策略和分层聚类解决量化难题,实现了高效的8位和4位权重量化,同时保持卓越的图像生成质量。
English: This paper introduces a post-training quantization framework for Diffusion Transformers (DiT) that addresses quantization challenges through a timestep-aware strategy and hierarchical clustering, enabling efficient 8-bit and 4-bit weight quantization while maintaining high image generation quality.

Authors:Yinuo Liu, Zenghui Yuan, Guiyao Tie, Jiawen Shi, Pan Zhou, Lichao Sun, Neil Zhenqiang Gong
Title: Poisoned-MRAG: Knowledge Poisoning Attacks to Multimodal Retrieval Augmented Generation
Abstract:
Multimodal retrieval-augmented generation (RAG) enhances the visual reasoning capability of vision-language models (VLMs) by dynamically accessing information from external knowledge bases. In this work, we introduce \textit{Poisoned-MRAG}, the first knowledge poisoning attack on multimodal RAG systems. Poisoned-MRAG injects a few carefully crafted image-text pairs into the multimodal knowledge database, manipulating VLMs to generate the attacker-desired response to a target query. Specifically, we formalize the attack as an optimization problem and propose two cross-modal attack strategies, dirty-label and clean-label, tailored to the attacker's knowledge and goals. Our extensive experiments across multiple knowledge databases and VLMs show that Poisoned-MRAG outperforms existing methods, achieving up to 98\% attack success rate with just five malicious image-text pairs injected into the InfoSeek database (481,782 pairs). Additionally, We evaluate 4 different defense strategies, including paraphrasing, duplicate removal, structure-driven mitigation, and purification, demonstrating their limited effectiveness and trade-offs against Poisoned-MRAG. Our results highlight the effectiveness and scalability of Poisoned-MRAG, underscoring its potential as a significant threat to multimodal RAG systems.
中文摘要:Poisoned-MRAG首次针对多模态RAG系统实施知识投毒攻击,仅注入五个恶意图文对即可实现高达98%的攻击成功率,同时现有防御措施均显示有限效果。
English Summary: Poisoned-MRAG introduces the first knowledge poisoning attack on multimodal RAG systems, achieving up to 98% attack success by injecting malicious image-text pairs while demonstrating limited effectiveness of existing defenses.

Authors:Guiyao Tie, Zeli Zhao, Dingjie Song, Fuyang Wei, Rong Zhou, Yurou Dai, Wen Yin, Zhejian Yang, Jiangyue Yan, Yao Su, Zhenhan Dai, Yifeng Xie, Yihan Cao, Lichao Sun, Pan Zhou, Lifang He, Hechang Chen, Yu Zhang, Qingsong Wen, Tianming Liu, Neil Zhenqiang Gong, Jiliang Tang, Caiming Xiong, Heng Ji, Philip S. Yu, Jianfeng Gao
Title: A Survey on Post-training of Large Language Models
Abstract:
The emergence of Large Language Models (LLMs) has fundamentally transformed natural language processing, making them indispensable across domains ranging from conversational systems to scientific exploration. However, their pre-trained architectures often reveal limitations in specialized contexts, including restricted reasoning capacities, ethical uncertainties, and suboptimal domain-specific performance. These challenges necessitate advanced post-training language models (PoLMs) to address these shortcomings, such as OpenAI-o1/o3 and DeepSeek-R1 (collectively known as Large Reasoning Models, or LRMs). This paper presents the first comprehensive survey of PoLMs, systematically tracing their evolution across five core paradigms: Fine-tuning, which enhances task-specific accuracy; Alignment, which ensures ethical coherence and alignment with human preferences; Reasoning, which advances multi-step inference despite challenges in reward design; Efficiency, which optimizes resource utilization amidst increasing complexity; Integration and Adaptation, which extend capabilities across diverse modalities while addressing coherence issues. Charting progress from ChatGPT's alignment strategies to DeepSeek-R1's innovative reasoning advancements, we illustrate how PoLMs leverage datasets to mitigate biases, deepen reasoning capabilities, and enhance domain adaptability. Our contributions include a pioneering synthesis of PoLM evolution, a structured taxonomy categorizing techniques and datasets, and a strategic agenda emphasizing the role of LRMs in improving reasoning proficiency and domain flexibility. As the first survey of its scope, this work consolidates recent PoLM advancements and establishes a rigorous intellectual framework for future research, fostering the development of LLMs that excel in precision, ethical robustness, and versatility across scientific and societal applications.
中文: 本文首次对后训练语言模型进行全面综述,系统追踪了其在微调、对齐、推理等五大范式中的演进,旨在解决大语言模型在专业领域中的推理局限与伦理问题,为构建更精准、鲁棒且多功能的AI系统建立研究框架。
English: This paper provides the first comprehensive survey of Post-training Language Models (PoLMs), systematically analyzing their evolution across five key paradigms to address limitations in reasoning, ethics, and domain-specific performance of Large Language Models, while establishing a framework for future research toward more precise and versatile AI systems.

Authors:Miguel Contreras, Jessica Sena, Andrea Davidson, Jiaqing Zhang, Tezcan Ozrazgat-Baslanti, Yuanfang Ren, Ziyuan Guan, Jeremy Balch, Tyler Loftus, Subhash Nerella, Azra Bihorac, Parisa Rashidi
Title: MANDARIN: Mixture-of-Experts Framework for Dynamic Delirium and Coma Prediction in ICU Patients: Development and Validation of an Acute Brain Dysfunction Prediction Model
Abstract:
Acute brain dysfunction (ABD) is a common, severe ICU complication, presenting as delirium or coma and leading to prolonged stays, increased mortality, and cognitive decline. Traditional screening tools like the Glasgow Coma Scale (GCS), Confusion Assessment Method (CAM), and Richmond Agitation-Sedation Scale (RASS) rely on intermittent assessments, causing delays and inconsistencies. In this study, we propose MANDARIN (Mixture-of-Experts Framework for Dynamic Delirium and Coma Prediction in ICU Patients), a 1.5M-parameter mixture-of-experts neural network to predict ABD in real-time among ICU patients. The model integrates temporal and static data from the ICU to predict the brain status in the next 12 to 72 hours, using a multi-branch approach to account for current brain status. The MANDARIN model was trained on data from 92,734 patients (132,997 ICU admissions) from 2 hospitals between 2008-2019 and validated externally on data from 11,719 patients (14,519 ICU admissions) from 15 hospitals and prospectively on data from 304 patients (503 ICU admissions) from one hospital in 2021-2024. Three datasets were used: the University of Florida Health (UFH) dataset, the electronic ICU Collaborative Research Database (eICU), and the Medical Information Mart for Intensive Care (MIMIC)-IV dataset. MANDARIN significantly outperforms the baseline neurological assessment scores (GCS, CAM, and RASS) for delirium prediction in both external (AUROC 75.5% CI: 74.2%-76.8% vs 68.3% CI: 66.9%-69.5%) and prospective (AUROC 82.0% CI: 74.8%-89.2% vs 72.7% CI: 65.5%-81.0%) cohorts, as well as for coma prediction (external AUROC 87.3% CI: 85.9%-89.0% vs 72.8% CI: 70.6%-74.9%, and prospective AUROC 93.4% CI: 88.5%-97.9% vs 67.7% CI: 57.7%-76.8%) with a 12-hour lead time. This tool has the potential to assist clinicians in decision-making by continuously monitoring the brain status of patients in the ICU.
中文: MANDARIN模型通过整合ICU患者的动态与静态数据,能提前12-72小时预测谵妄和昏迷,其预测准确性显著优于传统神经评估量表。
English: The MANDARIN model, a real-time neural network, significantly outperforms traditional screening tools in predicting acute brain dysfunction in ICU patients by integrating temporal and static data to forecast delirium and coma 12-72 hours in advance.

Authors:Orfeas Menis Mastromichalakis, Giorgos Filandrianos, Maria Symeonaki, Giorgos Stamou
Title: Assumed Identities: Quantifying Gender Bias in Machine Translation of Gender-Ambiguous Occupational Terms
Abstract:
Machine Translation (MT) systems frequently encounter gender-ambiguous occupational terms, where they must assign gender without explicit contextual cues. While individual translations in such cases may not be inherently biased, systematic patterns-such as consistently translating certain professions with specific genders-can emerge, reflecting and perpetuating societal stereotypes. This ambiguity challenges traditional instance-level single-answer evaluation approaches, as no single gold standard translation exists. To address this, we introduce GRAPE, a probability-based metric designed to evaluate gender bias by analyzing aggregated model responses. Alongside this, we present GAMBIT, a benchmarking dataset in English with gender-ambiguous occupational terms. Using GRAPE, we evaluate several MT systems and examine whether their gendered translations in Greek and French align with or diverge from societal stereotypes, real-world occupational gender distributions, and normative standards
中文摘要:本文提出了基于概率的GRAPE评估指标和GAMBIT基准数据集,用于分析机器翻译系统在处理性别模糊职业术语时的性别偏见问题,检验希腊语和法语翻译结果与社会刻板印象、实际职业性别分布及规范标准的一致性。
English Summary: The GRAPE metric and GAMBIT dataset are introduced to evaluate gender bias in machine translation systems when handling gender-ambiguous occupational terms, assessing whether translations in Greek and French align with stereotypes, real-world distributions, or normative standards.

Authors:Chengjin Li, Yuqian Chen, Nir A. Sochen, Wei Zhang, Carl-Fredrik Westin, Rathi Yogesh, Lauren J. O'Donnell, Ofer Pasternak, Fan Zhang
Title: DDCSR: A Novel End-to-End Deep Learning Framework for Cortical Surface Reconstruction from Diffusion MRI
Abstract:
Diffusion MRI (dMRI) plays a crucial role in studying brain white matter connectivity. Cortical surface reconstruction (CSR), including the inner whiter matter (WM) and outer pial surfaces, is one of the key tasks in dMRI analyses such as fiber tractography and multimodal MRI analysis. Existing CSR methods rely on anatomical T1-weighted data and map them into the dMRI space through inter-modality registration. However, due to the low resolution and image distortions of dMRI data, inter-modality registration faces significant challenges. This work proposes a novel end-to-end learning framework, DDCSR, which for the first time enables CSR directly from dMRI data. DDCSR consists of two major components, including: (1) an implicit learning module to predict a voxel-wise intermediate surface representation, and (2) an explicit learning module to predict the 3D mesh surfaces. Compared to several baseline and advanced CSR methods, we show that the proposed DDCSR can largely increase both accuracy and efficiency. Furthermore, we demonstrate a high generalization ability of DDCSR to data from different sources, despite the differences in dMRI acquisitions and populations.
中文: 提出的DDCSR框架通过端到端学习方法实现直接从扩散MRI数据重建皮层表面,在显著提升精度和效率的同时,展现出对不同数据源的强大泛化能力。
English: The proposed DDCSR framework enables direct cortical surface reconstruction from diffusion MRI data through an end-to-end learning approach, significantly improving accuracy and efficiency while demonstrating strong generalization across diverse datasets.

Authors:Huy Nguyen, Nhat Ho, Alessandro Rinaldo
Title: Convergence Rates for Softmax Gating Mixture of Experts
Abstract:
Mixture of experts (MoE) has recently emerged as an effective framework to advance the efficiency and scalability of machine learning models by softly dividing complex tasks among multiple specialized sub-models termed experts. Central to the success of MoE is an adaptive softmax gating mechanism which takes responsibility for determining the relevance of each expert to a given input and then dynamically assigning experts their respective weights. Despite its widespread use in practice, a comprehensive study on the effects of the softmax gating on the MoE has been lacking in the literature. To bridge this gap in this paper, we perform a convergence analysis of parameter estimation and expert estimation under the MoE equipped with the standard softmax gating or its variants, including a dense-to-sparse gating and a hierarchical softmax gating, respectively. Furthermore, our theories also provide useful insights into the design of sample-efficient expert structures. In particular, we demonstrate that it requires polynomially many data points to estimate experts satisfying our proposed \emph{strong identifiability} condition, namely a commonly used two-layer feed-forward network. In stark contrast, estimating linear experts, which violate the strong identifiability condition, necessitates exponentially many data points as a result of intrinsic parameter interactions expressed in the language of partial differential equations. All the theoretical results are substantiated with a rigorous guarantee.
Chinese: 专家混合模型通过将任务分配给专业子模型来提高机器学习效率,本研究分析了softmax门控机制的收敛性和样本效率,发现强可识别性专家(如双层网络)仅需多项式数据,而线性专家则需指数级数据。
English: Mixture of Experts (MoE) enhances machine learning efficiency by distributing tasks among specialized sub-models, with this study analyzing the convergence and sample efficiency of softmax gating mechanisms, revealing that strongly identifiable experts like two-layer networks require polynomial data while linear experts need exponential data.

Authors:Junyi Wang, Mubai Du, Ye Wu, Yijie Li, William M. Wells, Lauren J. O'Donnell, Fan Zhang
Title: A Novel Streamline-based diffusion MRI Tractography Registration Method with Probabilistic Keypoint Detection
Abstract:
Registration of diffusion MRI tractography is an essential step for analyzing group similarities and variations in the brain's white matter (WM). Streamline-based registration approaches can leverage the 3D geometric information of fiber pathways to enable spatial alignment after registration. Existing methods usually rely on the optimization of the spatial distances to identify the optimal transformation. However, such methods overlook point connectivity patterns within the streamline itself, limiting their ability to identify anatomical correspondences across tractography datasets. In this work, we propose a novel unsupervised approach using deep learning to perform streamline-based dMRI tractography registration. The overall idea is to identify corresponding keypoint pairs across subjects for spatial alignment of tractography datasets. We model tractography as point clouds to leverage the graph connectivity along streamlines. We propose a novel keypoint detection method for streamlines, framed as a probabilistic classification task to identify anatomically consistent correspondences across unstructured streamline sets. In the experiments, we compare several existing methods and show highly effective and efficient tractography registration performance.
中文: 本文提出了一种基于深度学习的无监督扩散MRI纤维束成像配准方法,通过将纤维束建模为点云并采用关键点检测技术实现跨被试的解剖一致性对齐,实验表明该方法相比现有方法具有更优的性能和效率。
English: This paper introduces an unsupervised deep learning method for diffusion MRI tractography registration that models tractography as point clouds and uses keypoint detection to achieve anatomically consistent alignment across subjects, demonstrating superior performance and efficiency compared to existing approaches.

Authors:Xiulong Yuan, Hongtao Xu, Wenting Shen, Ang Wang, Xiafei Qiu, Jie Zhang, Yuqiong Liu, Bowen Yu, Junyang Lin, Mingzhen Li, Weile Jia, Yong Li, Wei Lin
Title: Efficient Long Context Fine-tuning with Chunk Flow
Abstract:
Long context fine-tuning of large language models(LLMs) involves training on datasets that are predominantly composed of short sequences and a small proportion of longer sequences. However, existing approaches overlook this long-tail distribution and employ training strategies designed specifically for long sequences. Moreover, these approaches also fail to address the challenges posed by variable sequence lengths during distributed training, such as load imbalance in data parallelism and severe pipeline bubbles in pipeline parallelism. These issues lead to suboptimal training performance and poor GPU resource utilization. To tackle these problems, we propose a chunk-centric training method named ChunkFlow. ChunkFlow reorganizes input sequences into uniformly sized chunks by consolidating short sequences and splitting longer ones. This approach achieves optimal computational efficiency and balance among training inputs. Additionally, ChunkFlow incorporates a state-aware chunk scheduling mechanism to ensure that the peak memory usage during training is primarily determined by the chunk size rather than the maximum sequence length in the dataset. Integrating this scheduling mechanism with existing pipeline scheduling algorithms further enhances the performance of distributed training. Experimental results demonstrate that, compared with Megatron-LM, ChunkFlow can be up to 4.53x faster in the long context fine-tuning of LLMs. Furthermore, we believe that ChunkFlow serves as an effective solution for a broader range of scenarios, such as long context continual pre-training, where datasets contain variable-length sequences.
中文:提出的ChunkFlow方法通过将变长序列重组为统一分块并采用状态感知调度机制,在长上下文大语言模型微调中实现了最优计算效率与内存控制,相比现有方法最高可提速4.53倍。
English: The proposed ChunkFlow method reorganizes variable-length sequences into uniform chunks and employs state-aware scheduling to optimize computational efficiency and memory usage, achieving up to 4.53x faster long-context fine-tuning for LLMs compared to existing approaches.

Authors:Md Abrar Jahin, Soudeep Shahriar, M. F. Mridha, Md. Jakir Hossen, Nilanjan Dey
Title: Soybean Disease Detection via Interpretable Hybrid CNN-GNN: Integrating MobileNetV2 and GraphSAGE with Cross-Modal Attention
Abstract:
Soybean leaf disease detection is critical for agricultural productivity but faces challenges due to visually similar symptoms and limited interpretability in conventional methods. While Convolutional Neural Networks (CNNs) excel in spatial feature extraction, they often neglect inter-image relational dependencies, leading to misclassifications. This paper proposes an interpretable hybrid Sequential CNN-Graph Neural Network (GNN) framework that synergizes MobileNetV2 for localized feature extraction and GraphSAGE for relational modeling. The framework constructs a graph where nodes represent leaf images, with edges defined by cosine similarity-based adjacency matrices and adaptive neighborhood sampling. This design captures fine-grained lesion features and global symptom patterns, addressing inter-class similarity challenges. Cross-modal interpretability is achieved via Grad-CAM and Eigen-CAM visualizations, generating heatmaps to highlight disease-influential regions. Evaluated on a dataset of ten soybean leaf diseases, the model achieves $97.16\%$ accuracy, surpassing standalone CNNs ($\le95.04\%$) and traditional machine learning models ($\le77.05\%$). Ablation studies validate the sequential architecture's superiority over parallel or single-model configurations. With only 2.3 million parameters, the lightweight MobileNetV2-GraphSAGE combination ensures computational efficiency, enabling real-time deployment in resource-constrained environments. The proposed approach bridges the gap between accurate classification and practical applicability, offering a robust, interpretable tool for agricultural diagnostics while advancing CNN-GNN integration in plant pathology research.
中文: 本文提出了一种可解释的混合顺序CNN-GNN框架,通过结合MobileNetV2的特征提取和GraphSAGE的关系建模,在大豆叶片病害分类中达到97.16%的准确率,并利用Grad-CAM和Eigen-CAM提供可视化解释。
EN: This paper introduces an interpretable hybrid Sequential CNN-GNN framework that combines MobileNetV2 for feature extraction and GraphSAGE for relational modeling, achieving 97.16% accuracy in soybean disease classification while providing visual explanations through Grad-CAM and Eigen-CAM.

Authors:Md Abrar Jahin, Shahriar Soudeep, Fahmid Al Farid, M. F. Mridha, Raihan Kabir, Md Rashedul Islam, Hezerul Abdul Karim
Title: CAGN-GAT Fusion: A Hybrid Contrastive Attentive Graph Neural Network for Network Intrusion Detection
Abstract:
Cybersecurity threats are growing, making network intrusion detection essential. Traditional machine learning models remain effective in resource-limited environments due to their efficiency, requiring fewer parameters and less computational time. However, handling short and highly imbalanced datasets remains challenging. In this study, we propose the fusion of a Contrastive Attentive Graph Network and Graph Attention Network (CAGN-GAT Fusion) and benchmark it against 15 other models, including both Graph Neural Networks (GNNs) and traditional ML models. Our evaluation is conducted on four benchmark datasets (KDD-CUP-1999, NSL-KDD, UNSW-NB15, and CICIDS2017) using a short and proportionally imbalanced dataset with a constant size of 5000 samples to ensure fairness in comparison. Results show that CAGN-GAT Fusion demonstrates stable and competitive accuracy, recall, and F1-score, even though it does not achieve the highest performance in every dataset. Our analysis also highlights the impact of adaptive graph construction techniques, including small changes in connections (edge perturbation) and selective hiding of features (feature masking), improving detection performance. The findings confirm that GNNs, particularly CAGN-GAT Fusion, are robust and computationally efficient, making them well-suited for resource-constrained environments. Future work will explore GraphSAGE layers and multiview graph construction techniques to further enhance adaptability and detection accuracy.
中文: 本研究提出的CAGN-GAT融合模型在网络入侵检测中展现出稳定且高效的性能,在数据不平衡情况下优于传统方法,特别适合资源受限的环境。
English: The proposed CAGN-GAT Fusion model demonstrates robust and computationally efficient performance in network intrusion detection, outperforming traditional methods on imbalanced datasets while remaining suitable for resource-constrained environments.

Authors:You Shen, Zhipeng Zhang, Xinyang Li, Yansong Qu, Yu Lin, Shengchuan Zhang, Liujuan Cao
Title: Evolving High-Quality Rendering and Reconstruction in a Unified Framework with Contribution-Adaptive Regularization
Abstract:
Representing 3D scenes from multiview images is a core challenge in computer vision and graphics, which requires both precise rendering and accurate reconstruction. Recently, 3D Gaussian Splatting (3DGS) has garnered significant attention for its high-quality rendering and fast inference speed. Yet, due to the unstructured and irregular nature of Gaussian point clouds, ensuring accurate geometry reconstruction remains difficult. Existing methods primarily focus on geometry regularization, with common approaches including primitive-based and dual-model frameworks. However, the former suffers from inherent conflicts between rendering and reconstruction, while the latter is computationally and storage-intensive. To address these challenges, we propose CarGS, a unified model leveraging Contribution-adaptive regularization to achieve simultaneous, high-quality rendering and surface reconstruction. The essence of our framework is learning adaptive contribution for Gaussian primitives by squeezing the knowledge from geometry regularization into a compact MLP. Additionally, we introduce a geometry-guided densification strategy with clues from both normals and Signed Distance Fields (SDF) to improve the capability of capturing high-frequency details. Our design improves the mutual learning of the two tasks, meanwhile its unified structure does not require separate models as in dual-model based approaches, guaranteeing efficiency. Extensive experiments demonstrate the ability to achieve state-of-the-art (SOTA) results in both rendering fidelity and reconstruction accuracy while maintaining real-time speed and minimal storage size.
Chinese: CarGS提出了一种采用贡献自适应正则化的统一模型和几何引导的致密化策略,实现了高质量渲染与精确表面重建的同步进行,在保真度和效率上均超越现有方法,同时保持实时性能。
English: CarGS introduces a unified model with Contribution-adaptive regularization and a geometry-guided densification strategy to achieve simultaneous high-quality rendering and accurate surface reconstruction, outperforming existing methods in both fidelity and efficiency while maintaining real-time performance.

Authors:Xinliang Zhou, Chenyu Liu, Zhisheng Chen, Kun Wang, Yi Ding, Ziyu Jia, Qingsong Wen
Title: Brain Foundation Models: A Survey on Advancements in Neural Signal Processing and Brain Discovery
Abstract:
Brain foundation models (BFMs) have emerged as a transformative paradigm in computational neuroscience, offering a revolutionary framework for processing diverse neural signals across different brain-related tasks. These models leverage large-scale pre-training techniques, allowing them to generalize effectively across multiple scenarios, tasks, and modalities, thus overcoming the traditional limitations faced by conventional artificial intelligence (AI) approaches in understanding complex brain data. By tapping into the power of pretrained models, BFMs provide a means to process neural data in a more unified manner, enabling advanced analysis and discovery in the field of neuroscience. In this survey, we define BFMs for the first time, providing a clear and concise framework for constructing and utilizing these models in various applications. We also examine the key principles and methodologies for developing these models, shedding light on how they transform the landscape of neural signal processing. This survey presents a comprehensive review of the latest advancements in BFMs, covering the most recent methodological innovations, novel views of application areas, and challenges in the field. Notably, we highlight the future directions and key challenges that need to be addressed to fully realize the potential of BFMs. These challenges include improving the quality of brain data, optimizing model architecture for better generalization, increasing training efficiency, and enhancing the interpretability and robustness of BFMs in real-world applications.
中文: 脑基础模型(BFMs)作为计算神经科学中的变革性范式,通过大规模预训练技术统一处理多种神经信号,但在数据质量、模型优化和可解释性等方面仍需应对关键挑战以实现其潜力。
English: Brain foundation models (BFMs) represent a transformative approach in computational neuroscience, leveraging large-scale pre-training to process neural data across tasks and modalities, while facing challenges in data quality, model optimization, and interpretability for future advancement.

Authors:Claudio Cicconetti, Marco Conti, Andrea Passarella
Title: Uncoordinated Access to Serverless Computing in MEC Systems for IoT
Abstract:
Edge computing is a promising solution to enable low-latency IoT applications, by shifting computation from remote data centers to local devices, less powerful but closer to the end user devices. However, this creates the challenge on how to best assign clients to edge nodes offering compute capabilities. So far, two antithetical architectures are proposed: centralized resource orchestration or distributed overlay. In this work we explore a third way, called uncoordinated access, which consists in letting every device exploring multiple opportunities, to opportunistically embrace the heterogeneity of network and load conditions towards diverse edge nodes. In particular, our contribution is intended for emerging serverless IoT applications, which do not have a state on the edge nodes executing tasks. We model the proposed system as a set of M/M/1 queues and show that it achieves a smaller jitter delay than single edge node allocation. Furthermore, we compare uncoordinated access with state-of-the-art centralized and distributed alternatives in testbed experiments under more realistic conditions. Based on the results, our proposed approach, which requires a tiny fraction of the complexity of the alternatives in both the device and network components, is very effective in using the network resources, while incurring only a small penalty in terms of increased compute load and high percentiles of delay.
Chinese: 边缘计算通过将计算任务转移至本地设备来降低延迟,本文提出的无协调访问方法在减少抖动延迟方面优于集中式和分布式方案,且复杂度更低。
English: Edge computing reduces latency by shifting computation to local devices, and this paper proposes an uncoordinated access approach that outperforms centralized and distributed methods by minimizing jitter delay with lower complexity.

Authors:Lorenzo Tronchin, Tommy Löfstedt, Paolo Soda, Valerio Guarrasi
Title: Beyond a Single Mode: GAN Ensembles for Diverse Medical Data Generation
Abstract:
The advancement of generative AI, particularly in medical imaging, confronts the trilemma of ensuring high fidelity, diversity, and efficiency in synthetic data generation. While Generative Adversarial Networks (GANs) have shown promise across various applications, they still face challenges like mode collapse and insufficient coverage of real data distributions. This work explores the use of GAN ensembles to overcome these limitations, specifically in the context of medical imaging. By solving a multi-objective optimisation problem that balances fidelity and diversity, we propose a method for selecting an optimal ensemble of GANs tailored for medical data. The selected ensemble is capable of generating diverse synthetic medical images that are representative of true data distributions and computationally efficient. Each model in the ensemble brings a unique contribution, ensuring minimal redundancy. We conducted a comprehensive evaluation using three distinct medical datasets, testing 22 different GAN architectures with various loss functions and regularisation techniques. By sampling models at different training epochs, we crafted 110 unique configurations. The results highlight the capability of GAN ensembles to enhance the quality and utility of synthetic medical images, thereby improving the efficacy of downstream tasks such as diagnostic modelling.
中文摘要:本研究通过多目标优化平衡保真度与多样性,提出采用GAN集成方法生成高质量合成医学图像以提升诊断模型效能,并在三个医学数据集上通过110种配置验证了其有效性。
English Summary: This study proposes using GAN ensembles optimized through multi-objective balancing of fidelity and diversity to generate high-quality synthetic medical images that enhance diagnostic modeling, as validated across three medical datasets with 110 configurations.

Authors:Ngoc Dung Huynh, Mohamed Reda Bouadjenek, Imran Razzak, Hakim Hacid, Sunil Aryal
Title: SVLA: A Unified Speech-Vision-Language Assistant with Multimodal Reasoning and Speech Generation
Abstract:
Large vision and language models show strong performance in tasks like image captioning, visual question answering, and retrieval. However, challenges remain in integrating speech, text, and vision into a unified model, especially for spoken tasks. Speech generation methods vary (some produce speech directly), others through text (but their impact on quality is unclear). Evaluation often relies on automatic speech recognition, which may introduce bias. We propose SVLA, a unified speech vision language model based on a transformer architecture that handles multimodal inputs and outputs. We train it on 38.2 million speech text image examples, including 64.1 hours of synthetic speech. We also introduce Speech VQA Accuracy, a new metric for evaluating spoken responses. SVLA improves multimodal understanding and generation by better combining speech, vision, and language.
中文: 大型视觉语言模型在图像描述等任务中表现出色,但在整合语音方面存在挑战,因此提出了基于Transformer的统一模型SVLA,该模型通过3820万多模态样本训练,提升了语音、视觉和语言的理解与生成能力。
English: Large vision-language models excel in tasks like image captioning but struggle with integrating speech, leading to the proposal of SVLA, a unified transformer-based model trained on 38.2 million multimodal examples that enhances understanding and generation across speech, vision, and language.

Authors:Yixiu Liu, Zehui He, Yuyuan Li, Zhongxuan Han, Chaochao Chen, Xiaolin Zheng
Title: Reproducibility Companion Paper:In-processing User Constrained Dominant Sets for User-Oriented Fairness in Recommender Systems
Abstract:
In this paper, we reproduce experimental results presented in our earlier work titled "In-processing User Constrained Dominant Sets for User-Oriented Fairness in Recommender Systems" that was presented in the proceeding of the 31st ACM International Conference on Multimedia.This work aims to verify the effectiveness of our previously proposed method and provide guidance for reproducibility. We present detailed descriptions of our preprocessed datasets, the structure of our source code, configuration file settings, experimental environment, and the reproduced experimental results.
中文: 本文重现了我们先前关于用户约束主导集研究的实验结果,以验证方法的有效性并提供可复现性指导,详细描述了数据集、代码结构、配置设置及实验结果。
English: This paper reproduces the experimental results from our prior work on user-constrained dominant sets to verify the method's effectiveness and provide reproducibility guidance, detailing datasets, code structure, configurations, and results.

Authors:Yuyuan Li, Junjie Fang, Chaochao Chen, Xiaolin Zheng, Yizhao Zhang, Zhongxuan Han
Title: Reproducibility Companion Paper: Making Users Indistinguishable: Attribute-wise Unlearning in Recommender Systems
Abstract:
In this paper, we reproduce the experimental results presented in our previous work titled "Making Users Indistinguishable: Attribute-wise Unlearning in Recommender Systems," which was published in the proceedings of the 31st ACM International Conference on Multimedia. This paper aims to validate the effectiveness of our proposed method and help others reproduce our experimental results. We provide detailed descriptions of our preprocessed datasets, source code structure, configuration file settings, experimental environment, and reproduced experimental results.
中文: 本文成功复现了先前关于推荐系统中属性级遗忘方法的研究结果,验证了该方法的有效性,并为可复现性提供了完整的实现细节。
English: This paper successfully reproduces the experimental results from the authors' prior work on attribute-wise unlearning in recommender systems, validating the method's effectiveness and providing comprehensive implementation details for reproducibility.

Authors:Ruining Li, Chuanxia Zheng, Christian Rupprecht, Andrea Vedaldi
Title: DSO: Aligning 3D Generators with Simulation Feedback for Physical Soundness
Abstract:
Most 3D object generators prioritize aesthetic quality, often neglecting the physical constraints necessary for practical applications. One such constraint is that a 3D object should be self-supporting, i.e., remain balanced under gravity. Previous approaches to generating stable 3D objects relied on differentiable physics simulators to optimize geometry at test time, which is slow, unstable, and prone to local optima. Inspired by the literature on aligning generative models with external feedback, we propose Direct Simulation Optimization (DSO). This framework leverages feedback from a (non-differentiable) simulator to increase the likelihood that the 3D generator directly outputs stable 3D objects. We construct a dataset of 3D objects labeled with stability scores obtained from the physics simulator. This dataset enables fine-tuning of the 3D generator using the stability score as an alignment metric, via direct preference optimization (DPO) or direct reward optimization (DRO) - a novel objective we introduce to align diffusion models without requiring pairwise preferences. Our experiments demonstrate that the fine-tuned feed-forward generator, using either the DPO or DRO objective, is significantly faster and more likely to produce stable objects than test-time optimization. Notably, the DSO framework functions even without any ground-truth 3D objects for training, allowing the 3D generator to self-improve by automatically collecting simulation feedback on its own outputs.
中文: 提出的直接仿真优化(DSO)框架通过物理模拟器反馈微调3D生成器,显著提升了物体稳定性,比测试时优化方法更快更可靠。
English: The proposed Direct Simulation Optimization (DSO) framework enhances 3D object stability by fine-tuning generators with physics simulator feedback, achieving faster and more reliable results than test-time optimization methods.

Authors:Long Gao, Yunhe Zhang, Langkun Chen, Yan Jiang, Weiying Xie, Yunsong Li
Title: Hyperspectral Adapter for Object Tracking based on Hyperspectral Video
Abstract:
Object tracking based on hyperspectral video attracts increasing attention to the rich material and motion information in the hyperspectral videos. The prevailing hyperspectral methods adapt pretrained RGB-based object tracking networks for hyperspectral tasks by fine-tuning the entire network on hyperspectral datasets, which achieves impressive results in challenging scenarios. However, the performance of hyperspectral trackers is limited by the loss of spectral information during the transformation, and fine-tuning the entire pretrained network is inefficient for practical applications. To address the issues, a new hyperspectral object tracking method, hyperspectral adapter for tracking (HyA-T), is proposed in this work. The hyperspectral adapter for the self-attention (HAS) and the hyperspectral adapter for the multilayer perceptron (HAM) are proposed to generate the adaption information and to transfer the multi-head self-attention (MSA) module and the multilayer perceptron (MLP) in pretrained network for the hyperspectral object tracking task by augmenting the adaption information into the calculation of the MSA and MLP. Additionally, the hyperspectral enhancement of input (HEI) is proposed to augment the original spectral information into the input of the tracking network. The proposed methods extract spectral information directly from the hyperspectral images, which prevent the loss of the spectral information. Moreover, only the parameters in the proposed methods are fine-tuned, which is more efficient than the existing methods. Extensive experiments were conducted on four datasets with various spectral bands, verifing the effectiveness of the proposed methods. The HyA-T achieves state-of-the-art performance on all the datasets.
中文: 提出的HyA-T方法通过专用适配器和输入增强技术保留高光谱目标跟踪中的光谱信息,无需完整网络微调即可高效实现最优性能。
English: The proposed HyA-T method introduces specialized adapters and input enhancement to preserve spectral information in hyperspectral object tracking, achieving state-of-the-art performance efficiently without full network fine-tuning.

Authors:Ziren Gong, Fabio Tosi, Youmin Zhang, Stefano Mattoccia, Matteo Poggi
Title: HS-SLAM: Hybrid Representation with Structural Supervision for Improved Dense SLAM
Abstract:
NeRF-based SLAM has recently achieved promising results in tracking and reconstruction. However, existing methods face challenges in providing sufficient scene representation, capturing structural information, and maintaining global consistency in scenes emerging significant movement or being forgotten. To this end, we present HS-SLAM to tackle these problems. To enhance scene representation capacity, we propose a hybrid encoding network that combines the complementary strengths of hash-grid, tri-planes, and one-blob, improving the completeness and smoothness of reconstruction. Additionally, we introduce structural supervision by sampling patches of non-local pixels rather than individual rays to better capture the scene structure. To ensure global consistency, we implement an active global bundle adjustment (BA) to eliminate camera drifts and mitigate accumulative errors. Experimental results demonstrate that HS-SLAM outperforms the baselines in tracking and reconstruction accuracy while maintaining the efficiency required for robotics.
中文:HS-SLAM通过融合混合编码网络提升场景表示能力,采用非局部像素块采样进行结构监督以优化结构捕捉,并实施主动全局光束法平差确保全局一致性,在保持效率的同时显著提高了跟踪与重建精度。
English: HS-SLAM enhances NeRF-based SLAM by integrating a hybrid encoding network for improved scene representation, structural supervision through patch sampling for better structural capture, and active global bundle adjustment to maintain global consistency, achieving superior tracking and reconstruction accuracy efficiently.

Authors:Shuo Liu, Minghui Xu, Tianyi Sun, Xiuzhen Cheng
Title: Asynchronous BFT Consensus Made Wireless
Abstract:
Asynchronous Byzantine fault-tolerant (BFT) consensus protocols, known for their robustness in unpredictable environments without relying on timing assumptions, are becoming increasingly vital for wireless applications. While these protocols have proven effective in wired networks, their adaptation to wireless environments presents significant challenges. Asynchronous BFT consensus, characterized by its N parallel consensus components (e.g., asynchronous Byzantine agreement, reliable broadcast), suffers from high message complexity, leading to network congestion and inefficiency, especially in resource-constrained wireless networks. Asynchronous Byzantine agreement (ABA) protocols, a foundational component of asynchronous BFT, require careful balancing of message complexity and cryptographic overhead to achieve efficient implementation in wireless settings. Additionally, the absence of dedicated testbeds for asynchronous wireless BFT consensus protocols hinders development and performance evaluation. To address these challenges, we propose a consensus batching protocol (ConsensusBatcher), which supports both vertical and horizontal batching of multiple parallel consensus components. We leverage ConsensusBatcher to adapt three asynchronous BFT consensus protocols (HoneyBadgerBFT, BEAT, and Dumbo) from wired networks to resource-constrained wireless networks. To evaluate the performance of ConsensusBatcher-enabled consensus protocols in wireless environments, we develop and open-source a testbed for deployment and performance assessment of these protocols. Using this testbed, we demonstrate that ConsensusBatcher-based consensus reduces latency by 48% to 59% and increases throughput by 48% to 62% compared to baseline consensus protocols.
中文: 提出的ConsensusBatcher协议通过批处理并行组件优化了无线网络中的异步BFT共识,在开源测试平台上实现延迟降低48-59%、吞吐量提升48-62%的显著性能改进。
English: The proposed ConsensusBatcher protocol enhances asynchronous BFT consensus for wireless networks by batching parallel components, significantly reducing latency by 48-59% and boosting throughput by 48-62% while providing an open-source testbed for evaluation.

Authors:Yiran Cheng, Ting Zhang, Lwin Khin Shar, Zhe Lang, David Lo, Shichao Lv, Dongliang Fang, Zhiqiang Shi, Limin Sun
Title: Fixseeker: An Empirical Driven Graph-based Approach for Detecting Silent Vulnerability Fixes in Open Source Software
Abstract:
Open source software vulnerabilities pose significant security risks to downstream applications. While vulnerability databases provide valuable information for mitigation, many security patches are released silently in new commits of OSS repositories without explicit indications of their security impact. This makes it challenging for software maintainers and users to detect and address these vulnerability fixes. There are a few approaches for detecting vulnerability-fixing commits (VFCs) but most of these approaches leverage commit messages, which would miss silent VFCs. On the other hand, there are some approaches for detecting silent VFCs based on code change patterns but they often fail to adequately characterize vulnerability fix patterns, thereby lacking effectiveness. For example, some approaches analyze each hunk in known VFCs, in isolation, to learn vulnerability fix patterns; but vulnerabiliy fixes are often associated with multiple hunks, in which cases correlations of code changes across those hunks are essential for characterizing the vulnerability fixes. To address these problems, we first conduct a large-scale empirical study on 11,900 VFCs across six programming languages, in which we found that over 70% of VFCs involve multiple hunks with various types of correlations. Based on our findings, we propose Fixseeker, a graph-based approach that extracts the various correlations between code changes at the hunk level to detect silent vulnerability fixes. Our evaluation demonstrates that Fixseeker outperforms state-of-the-art approaches across multiple programming languages, achieving a high F1 score of 0.8404 on average in balanced datasets and consistently improving F1 score, AUC-ROC and AUC-PR scores by 32.40%, 1.55% and 8.24% on imbalanced datasets. Our evaluation also indicates the generality of Fixseeker across different repository sizes and commit complexities.
中文: 本研究提出了Fixseeker,一种基于图的方法,通过分析多个代码块之间的关联来有效检测无提示的漏洞修复提交,在多种编程语言中显著优于现有方法。
English: The study introduces Fixseeker, a graph-based method that effectively detects silent vulnerability-fixing commits by analyzing correlations between code changes across multiple hunks, significantly outperforming existing approaches in various programming languages.

Authors:Gemini Robotics Team, Saminda Abeyruwan, Joshua Ainslie, Jean-Baptiste Alayrac, Montserrat Gonzalez Arenas, Travis Armstrong, Ashwin Balakrishna, Robert Baruch, Maria Bauza, Michiel Blokzijl, Steven Bohez, Konstantinos Bousmalis, Anthony Brohan, Thomas Buschmann, Arunkumar Byravan, Serkan Cabi, Ken Caluwaerts, Federico Casarini, Oscar Chang, Jose Enrique Chen, Xi Chen, Hao-Tien Lewis Chiang, Krzysztof Choromanski, David D'Ambrosio, Sudeep Dasari, Todor Davchev, Coline Devin, Norman Di Palo, Tianli Ding, Adil Dostmohamed, Danny Driess, Yilun Du, Debidatta Dwibedi, Michael Elabd, Claudio Fantacci, Cody Fong, Erik Frey, Chuyuan Fu, Marissa Giustina, Keerthana Gopalakrishnan, Laura Graesser, Leonard Hasenclever, Nicolas Heess, Brandon Hernaez, Alexander Herzog, R. Alex Hofer, Jan Humplik, Atil Iscen, Mithun George Jacob, Deepali Jain, Ryan Julian, Dmitry Kalashnikov, M. Emre Karagozler, Stefani Karp, Chase Kew, Jerad Kirkland, Sean Kirmani, Yuheng Kuang, Thomas Lampe, Antoine Laurens, Isabel Leal, Alex X. Lee, Tsang-Wei Edward Lee, Jacky Liang, Yixin Lin, Sharath Maddineni, Anirudha Majumdar, Assaf Hurwitz Michaely, Robert Moreno, Michael Neunert, Francesco Nori, Carolina Parada, Emilio Parisotto, Peter Pastor, Acorn Pooley, Kanishka Rao, Krista Reymann, Dorsa Sadigh, Stefano Saliceti, Pannag Sanketi, Pierre Sermanet, Dhruv Shah, Mohit Sharma, Kathryn Shea, Charles Shu, Vikas Sindhwani, Sumeet Singh, Radu Soricut, Jost Tobias Springenberg, Rachel Sterneck, Razvan Surdulescu, Jie Tan, Jonathan Tompson, Vincent Vanhoucke, Jake Varley, Grace Vesom, Giulia Vezzani, Oriol Vinyals, Ayzaan Wahid, Stefan Welker, Paul Wohlhart, Fei Xia, Ted Xiao, Annie Xie, Jinyu Xie, Peng Xu, Sichun Xu, Ying Xu, Zhuo Xu, Yuxiang Yang, Rui Yao, Sergey Yaroshenko, Wenhao Yu, Wentao Yuan, Jingwei Zhang, Tingnan Zhang, Allan Zhou, Yuxiang Zhou
Title: Gemini Robotics: Bringing AI into the Physical World
Abstract:
Recent advancements in large multimodal models have led to the emergence of remarkable generalist capabilities in digital domains, yet their translation to physical agents such as robots remains a significant challenge. This report introduces a new family of AI models purposefully designed for robotics and built upon the foundation of Gemini 2.0. We present Gemini Robotics, an advanced Vision-Language-Action (VLA) generalist model capable of directly controlling robots. Gemini Robotics executes smooth and reactive movements to tackle a wide range of complex manipulation tasks while also being robust to variations in object types and positions, handling unseen environments as well as following diverse, open vocabulary instructions. We show that with additional fine-tuning, Gemini Robotics can be specialized to new capabilities including solving long-horizon, highly dexterous tasks, learning new short-horizon tasks from as few as 100 demonstrations and adapting to completely novel robot embodiments. This is made possible because Gemini Robotics builds on top of the Gemini Robotics-ER model, the second model we introduce in this work. Gemini Robotics-ER (Embodied Reasoning) extends Gemini's multimodal reasoning capabilities into the physical world, with enhanced spatial and temporal understanding. This enables capabilities relevant to robotics including object detection, pointing, trajectory and grasp prediction, as well as multi-view correspondence and 3D bounding box predictions. We show how this novel combination can support a variety of robotics applications. We also discuss and address important safety considerations related to this new class of robotics foundation models. The Gemini Robotics family marks a substantial step towards developing general-purpose robots that realizes AI's potential in the physical world.
中文:Gemini Robotics系列推出了专为物理智能体设计的先进视觉-语言-动作模型,能够实现稳健操作并适应新型任务,同时解决了机器人应用中的安全性问题。
English: The Gemini Robotics family introduces advanced Vision-Language-Action models designed for physical agents, enabling robust manipulation and adaptation to novel tasks while addressing safety in robotics applications.

Authors:Haoyu Fu, Diankun Zhang, Zongchuang Zhao, Jianfeng Cui, Dingkang Liang, Chong Zhang, Dingyuan Zhang, Hongwei Xie, Bing Wang, Xiang Bai
Title: ORION: A Holistic End-to-End Autonomous Driving Framework by Vision-Language Instructed Action Generation
Abstract:
End-to-end (E2E) autonomous driving methods still struggle to make correct decisions in interactive closed-loop evaluation due to limited causal reasoning capability. Current methods attempt to leverage the powerful understanding and reasoning abilities of Vision-Language Models (VLMs) to resolve this dilemma. However, the problem is still open that few VLMs for E2E methods perform well in the closed-loop evaluation due to the gap between the semantic reasoning space and the purely numerical trajectory output in the action space. To tackle this issue, we propose ORION, a holistic E2E autonomous driving framework by vision-language instructed action generation. ORION uniquely combines a QT-Former to aggregate long-term history context, a Large Language Model (LLM) for driving scenario reasoning, and a generative planner for precision trajectory prediction. ORION further aligns the reasoning space and the action space to implement a unified E2E optimization for both visual question-answering (VQA) and planning tasks. Our method achieves an impressive closed-loop performance of 77.74 Driving Score (DS) and 54.62% Success Rate (SR) on the challenge Bench2Drive datasets, which outperforms state-of-the-art (SOTA) methods by a large margin of 14.28 DS and 19.61% SR.
Chinese: 当前端到端自动驾驶系统在交互式闭环评估中因果推理能力不足,而提出的ORION框架通过融合视觉语言模型与轨迹预测,在Bench2Drive数据集上实现了超越现有最佳方法的性能表现。
English: Current end-to-end autonomous driving systems struggle with causal reasoning in interactive closed-loop evaluations, but the proposed ORION framework overcomes this by integrating vision-language models with trajectory prediction to achieve state-of-the-art performance on Bench2Drive datasets.

Authors:Bokai Cao, Xueyuan Lin, Yiyan Qi, Chengjin Xu, Cehao Yang, Jian Guo
Title: Financial Wind Tunnel: A Retrieval-Augmented Market Simulator
Abstract:
Market simulator tries to create high-quality synthetic financial data that mimics real-world market dynamics, which is crucial for model development and robust assessment. Despite continuous advancements in simulation methodologies, market fluctuations vary in terms of scale and sources, but existing frameworks often excel in only specific tasks. To address this challenge, we propose Financial Wind Tunnel (FWT), a retrieval-augmented market simulator designed to generate controllable, reasonable, and adaptable market dynamics for model testing. FWT offers a more comprehensive and systematic generative capability across different data frequencies. By leveraging a retrieval method to discover cross-sectional information as the augmented condition, our diffusion-based simulator seamlessly integrates both macro- and micro-level market patterns. Furthermore, our framework allows the simulation to be controlled with wide applicability, including causal generation through "what-if" prompts or unprecedented cross-market trend synthesis. Additionally, we develop an automated optimizer for downstream quantitative models, using stress testing of simulated scenarios via FWT to enhance returns while controlling risks. Experimental results demonstrate that our approach enables the generalizable and reliable market simulation, significantly improve the performance and adaptability of downstream models, particularly in highly complex and volatile market conditions. Our code and data sample is available at https://anonymous.4open.science/r/fwt_-E852
Chinese: 金融风洞(FWT)是一种检索增强的市场模拟器,通过整合宏观和微观市场模式生成可控且适应性强的合成金融数据,显著提升了下游量化模型在波动市场中的性能和可靠性。
English: The Financial Wind Tunnel (FWT) is a retrieval-augmented market simulator that generates controllable and adaptable synthetic financial data by integrating macro- and micro-level market patterns, significantly improving the performance and reliability of downstream quantitative models in volatile conditions.

Authors:Li Zhang, Chaochao Chen, Zhongxuan Han, Qiyong Zhong, Xiaolin Zheng
Title: LoGoFair: Post-Processing for Local and Global Fairness in Federated Learning
Abstract:
Federated learning (FL) has garnered considerable interest for its capability to learn from decentralized data sources. Given the increasing application of FL in decision-making scenarios, addressing fairness issues across different sensitive groups (e.g., female, male) in FL is crucial. Current research often focuses on facilitating fairness at each client's data (local fairness) or within the entire dataset across all clients (global fairness). However, existing approaches that focus exclusively on either local or global fairness fail to address two key challenges: (\textbf{CH1}) Under statistical heterogeneity, global fairness does not imply local fairness, and vice versa. (\textbf{CH2}) Achieving fairness under model-agnostic setting. To tackle the aforementioned challenges, this paper proposes a novel post-processing framework for achieving both Local and Global Fairness in the FL context, namely LoGoFair. To address CH1, LoGoFair endeavors to seek the Bayes optimal classifier under local and global fairness constraints, which strikes the optimal accuracy-fairness balance in the probabilistic sense. To address CH2, LoGoFair employs a model-agnostic federated post-processing procedure that enables clients to collaboratively optimize global fairness while ensuring local fairness, thereby achieving the optimal fair classifier within FL. Experimental results on three real-world datasets further illustrate the effectiveness of the proposed LoGoFair framework.
Chinese: 本文提出LoGoFair这一新型联邦学习框架,通过寻求贝叶斯最优分类器并采用模型无关的后处理方法,同时解决局部公平性和全局公平性挑战。
English: This paper introduces LoGoFair, a novel federated learning framework that addresses both local and global fairness challenges by seeking the Bayes optimal classifier under fairness constraints and employing a model-agnostic post-processing approach.

Authors:Paul Engstler, Aleksandar Shtedritski, Iro Laina, Christian Rupprecht, Andrea Vedaldi
Title: SynCity: Training-Free Generation of 3D Worlds
Abstract:
We address the challenge of generating 3D worlds from textual descriptions. We propose SynCity, a training- and optimization-free approach, which leverages the geometric precision of pre-trained 3D generative models and the artistic versatility of 2D image generators to create large, high-quality 3D spaces. While most 3D generative models are object-centric and cannot generate large-scale worlds, we show how 3D and 2D generators can be combined to generate ever-expanding scenes. Through a tile-based approach, we allow fine-grained control over the layout and the appearance of scenes. The world is generated tile-by-tile, and each new tile is generated within its world-context and then fused with the scene. SynCity generates compelling and immersive scenes that are rich in detail and diversity.
中文: SynCity提出了一种无需训练的优化方法,通过结合预训练3D生成模型的几何精度与2D图像生成器的艺术多样性,采用分块生成策略从文本描述构建细节丰富的大规模3D场景。
English: SynCity presents a training-free method that combines pre-trained 3D generative models with 2D image generators through a tile-based approach to create expansive, high-quality 3D worlds from text descriptions.

Authors:Wali Ullah Khan, Chandan Kumar Sheemar, Eva Lagunas, Symeon Chatzinotas
Title: Enhancing Physical Layer Security in Cognitive Radio-Enabled NTNs with Beyond Diagonal RIS
Abstract:
Beyond diagonal reconfigurable intelligent surfaces (BD-RIS) have emerged as a transformative technology for enhancing wireless communication by intelligently manipulating the propagation environment. This paper explores the potential of BD-RIS in improving cognitive radio enabled multilayer non-terrestrial networks (NTNs). It is assumed that a high-altitude platform station (HAPS) has set up the primary network, while an uncrewed aerial vehicle (UAV) establishes the secondary network in the HAPS footprint. We formulate a joint optimization problem to maximize the secrecy rate by optimizing BD-RIS phase shifts and the secondary transmitter power allocation while controlling the interference temperature from the secondary network to the primary network. To solve this problem efficiently, we decouple the original problem into two sub-problems, which are solved iteratively by relying on alternating optimization. Simulation results demonstrate the effectiveness of BD-RIS in cognitive radio-enabled multilayer NTNs to accommodate the secondary network while satisfying the constraints imposed from the primary network.
中文: 超对角可重构智能表面技术通过优化相位偏移和功率分配,在认知无线电多层非地面网络中有效提升保密速率,同时确保次级网络对主网络的干扰得到控制。
English: BD-RIS technology enhances cognitive radio-enabled multilayer non-terrestrial networks by optimizing phase shifts and power allocation to maximize secrecy rates while controlling interference between primary and secondary networks.

Authors:Valerio Guarrasi, Francesco Di Feola, Rebecca Restivo, Lorenzo Tronchin, Paolo Soda
Title: Whole-Body Image-to-Image Translation for a Virtual Scanner in a Healthcare Digital Twin
Abstract:
Generating positron emission tomography (PET) images from computed tomography (CT) scans via deep learning offers a promising pathway to reduce radiation exposure and costs associated with PET imaging, improving patient care and accessibility to functional imaging. Whole-body image translation presents challenges due to anatomical heterogeneity, often limiting generalized models. We propose a framework that segments whole-body CT images into four regions-head, trunk, arms, and legs-and uses district-specific Generative Adversarial Networks (GANs) for tailored CT-to-PET translation. Synthetic PET images from each region are stitched together to reconstruct the whole-body scan. Comparisons with a baseline non-segmented GAN and experiments with Pix2Pix and CycleGAN architectures tested paired and unpaired scenarios. Quantitative evaluations at district, whole-body, and lesion levels demonstrated significant improvements with our district-specific GANs. Pix2Pix yielded superior metrics, ensuring precise, high-quality image synthesis. By addressing anatomical heterogeneity, this approach achieves state-of-the-art results in whole-body CT-to-PET translation. This methodology supports healthcare Digital Twins by enabling accurate virtual PET scans from CT data, creating virtual imaging representations to monitor, predict, and optimize health outcomes.
中文: 该框架将全身CT图像分割为四个解剖区域并采用分区专用生成对抗网络进行CT到PET的转换,通过解决解剖异质性难题实现了最优的全身图像转换效果,为医疗数字孪生提供精确的虚拟PET扫描能力。
English: The proposed framework segments whole-body CT scans into four anatomical regions and employs region-specific GANs for CT-to-PET translation, achieving state-of-the-art results by addressing anatomical heterogeneity and enabling virtual PET imaging for healthcare Digital Twins.

Authors:Francesco Di Feola, Ludovica Pompilio, Cecilia Assolito, Valerio Guarrasi, Paolo Soda
Title: Texture-Aware StarGAN for CT data harmonisation
Abstract:
Computed Tomography (CT) plays a pivotal role in medical diagnosis; however, variability across reconstruction kernels hinders data-driven approaches, such as deep learning models, from achieving reliable and generalized performance. To this end, CT data harmonization has emerged as a promising solution to minimize such non-biological variances by standardizing data across different sources or conditions. In this context, Generative Adversarial Networks (GANs) have proved to be a powerful framework for harmonization, framing it as a style-transfer problem. However, GAN-based approaches still face limitations in capturing complex relationships within the images, which are essential for effective harmonization. In this work, we propose a novel texture-aware StarGAN for CT data harmonization, enabling one-to-many translations across different reconstruction kernels. Although the StarGAN model has been successfully applied in other domains, its potential for CT data harmonization remains unexplored. Furthermore, our approach introduces a multi-scale texture loss function that embeds texture information across different spatial and angular scales into the harmonization process, effectively addressing kernel-induced texture variations. We conducted extensive experimentation on a publicly available dataset, utilizing a total of 48667 chest CT slices from 197 patients distributed over three different reconstruction kernels, demonstrating the superiority of our method over the baseline StarGAN.
中文: 本研究提出了一种具有多尺度纹理损失函数的纹理感知StarGAN模型,用于改进不同重建核间的CT数据协调,在应对核诱导纹理变异方面展现出优于基准方法的性能。
English: This study introduces a texture-aware StarGAN model with a multi-scale texture loss function to enhance CT data harmonization across different reconstruction kernels, demonstrating superior performance over baseline methods in handling kernel-induced texture variations.

Authors:Haoyu Chen, Xiaojie Xu, Wenbo Li, Jingjing Ren, Tian Ye, Songhua Liu, Ying-Cong Chen, Lei Zhu, Xinchao Wang
Title: POSTA: A Go-to Framework for Customized Artistic Poster Generation
Abstract:
Poster design is a critical medium for visual communication. Prior work has explored automatic poster design using deep learning techniques, but these approaches lack text accuracy, user customization, and aesthetic appeal, limiting their applicability in artistic domains such as movies and exhibitions, where both clear content delivery and visual impact are essential. To address these limitations, we present POSTA: a modular framework powered by diffusion models and multimodal large language models (MLLMs) for customized artistic poster generation. The framework consists of three modules. Background Diffusion creates a themed background based on user input. Design MLLM then generates layout and typography elements that align with and complement the background style. Finally, to enhance the poster's aesthetic appeal, ArtText Diffusion applies additional stylization to key text elements. The final result is a visually cohesive and appealing poster, with a fully modular process that allows for complete customization. To train our models, we develop the PosterArt dataset, comprising high-quality artistic posters annotated with layout, typography, and pixel-level stylized text segmentation. Our comprehensive experimental analysis demonstrates POSTA's exceptional controllability and design diversity, outperforming existing models in both text accuracy and aesthetic quality.
中文摘要:POSTA是一个基于扩散模型和多模态大语言模型的模块化框架,能够生成具有更高文本准确性、美学吸引力和用户定制化的艺术海报,在视觉质量和设计多样性方面优于现有方法。
English Summary: POSTA is a modular framework using diffusion models and MLLMs to generate customized artistic posters with enhanced text accuracy, aesthetic appeal, and user control, outperforming existing methods in visual quality and design flexibility.

Authors:Thien Duc Hua, Mohammadali Mohammadi, Hien Quoc Ngo, Michail Matthaiou
Title: SWIPT in Cell-Free Massive MIMO Using Stacked Intelligent Metasurfaces
Abstract:
We investigate the integration of stacked intelligent metasurfaces (SIMs) into cell-free massive multiple input multiple output (CF-mMIMO) system to enhance the simultaneous wireless information and power transfer (SWIPT) performance. Closed-form expressions for the spectral efficiency (SE) of the information-decoding receivers (IRs) and the average sum of harvested energy (sum-HE) at the energy-harvesting receivers (ERs) in the novel system model are derived to subsequently formulate a maximum total average sum-HE problem under a minimum SE threshold per each IR. This problem jointly optimizes the SIM phase-shift (PS) configuration and access points' (APs) power allocation, relying on long-term statistical channel state information (CSI). This non-convex problem is then transformed into more tractable forms. Then, efficient algorithms are proposed, including a layer-by-layer heuristic method for SIMs PS configuration that prioritizes sum-HE for the ERs and a successive convex approximation (SCA)-based power allocation scheme to improve the achievable SE for the IRs. Numerical results show that our proposed algorithms achieve an almost 7-fold sum-HE gain as we increase the number of SIM layers, while the proposed power allocation (PPA) scheme often gains up to 40% in terms of the achievable minimum SE, compared to the equal power allocation.
中文: 本研究将堆叠智能超表面集成到无蜂窝大规模MIMO系统中,以提升无线信息和能量同传性能,所提出的高效算法在能量收集和频谱效率方面均实现了显著增益。
English: This study integrates stacked intelligent metasurfaces into cell-free massive MIMO systems to enhance simultaneous wireless information and power transfer, proposing efficient algorithms that achieve significant gains in both harvested energy and spectral efficiency.

Authors:Luca Collini, Andrew Hennessee, Ramesh Karri, Siddharth Garg
Title: Can Reasoning Models Reason about Hardware? An Agentic HLS Perspective
Abstract:
Recent Large Language Models (LLMs) such as OpenAI o3-mini and DeepSeek-R1 use enhanced reasoning through Chain-of-Thought (CoT). Their potential in hardware design, which relies on expert-driven iterative optimization, remains unexplored. This paper investigates whether reasoning LLMs can address challenges in High-Level Synthesis (HLS) design space exploration and optimization. During HLS, engineers manually define pragmas/directives to balance performance and resource constraints. We propose an LLM-based optimization agentic framework that automatically restructures code, inserts pragmas, and identifies optimal design points via feedback from HLs tools and access to integer-linear programming (ILP) solvers. Experiments compare reasoning models against conventional LLMs on benchmarks using success rate, efficiency, and design quality (area/latency) metrics, and provide the first-ever glimpse into the CoTs produced by a powerful open-source reasoning model like DeepSeek-R1.
中文:本文研究具备推理能力的大语言模型如何通过代码重构、指令插入及反馈驱动搜索,实现高层次综合设计的自动化优化。
English: This paper explores the application of reasoning-enhanced Large Language Models for automating High-Level Synthesis design optimization through code restructuring, pragma insertion, and feedback-driven search.

Authors:Zhihao Zeng, Ziquan Fang, Yuting Huang, Lu Chen, Yunjun Gao
Title: Effective and Efficient Cross-City Traffic Knowledge Transfer: A Privacy-Preserving Perspective
Abstract:
Traffic prediction targets forecasting future traffic conditions using historical traffic data, serving a critical role in urban computing and transportation management. To mitigate the scarcity of traffic data while maintaining data privacy, numerous Federated Traffic Knowledge Transfer (FTT) approaches have been developed, which use transfer learning and federated learning to transfer traffic knowledge from data-rich cities to data-scarce cities, enhancing traffic prediction capabilities for the latter. However, current FTT approaches face challenges such as privacy leakage, cross-city data distribution discrepancies, low data quality, and inefficient knowledge transfer, limiting their privacy protection, effectiveness, robustness, and efficiency in real-world applications. To this end, we propose FedTT, an effective, efficient, and privacy-aware cross-city traffic knowledge transfer framework that transforms the traffic data domain from the data-rich cities and trains traffic models using the transformed data for the data-scarce cities. First, to safeguard data privacy, we propose a traffic secret transmission method that securely transmits and aggregates traffic domain-transformed data from source cities using a lightweight secret aggregation approach. Second, to mitigate the impact of traffic data distribution discrepancies on model performance, we introduce a traffic domain adapter to uniformly transform traffic data from the source cities' domains to that of the target city. Third, to improve traffic data quality, we design a traffic view imputation method to fill in and predict missing traffic data. Finally, to enhance transfer efficiency, FedTT is equipped with a federated parallel training method that enables the simultaneous training of multiple modules. Extensive experiments using 4 real-life datasets demonstrate that FedTT outperforms the 14 state-of-the-art baselines.
Chinese: 交通预测对城市管理至关重要,FedTT框架通过安全数据转换和联邦训练解决了跨城市知识转移中的隐私、数据差异及效率问题,在真实测试中优于现有方法。
English: Traffic prediction is crucial for urban management, and the proposed FedTT framework addresses privacy, data discrepancies, and efficiency issues in cross-city knowledge transfer through secure data transformation and federated training, outperforming existing methods in real-world tests.

Authors:Shiyuan Yang, Zheng Gu, Liang Hou, Xin Tao, Pengfei Wan, Xiaodong Chen, Jing Liao
Title: MTV-Inpaint: Multi-Task Long Video Inpainting
Abstract:
Video inpainting involves modifying local regions within a video, ensuring spatial and temporal consistency. Most existing methods focus primarily on scene completion (i.e., filling missing regions) and lack the capability to insert new objects into a scene in a controllable manner. Fortunately, recent advancements in text-to-video (T2V) diffusion models pave the way for text-guided video inpainting. However, directly adapting T2V models for inpainting remains limited in unifying completion and insertion tasks, lacks input controllability, and struggles with long videos, thereby restricting their applicability and flexibility. To address these challenges, we propose MTV-Inpaint, a unified multi-task video inpainting framework capable of handling both traditional scene completion and novel object insertion tasks. To unify these distinct tasks, we design a dual-branch spatial attention mechanism in the T2V diffusion U-Net, enabling seamless integration of scene completion and object insertion within a single framework. In addition to textual guidance, MTV-Inpaint supports multimodal control by integrating various image inpainting models through our proposed image-to-video (I2V) inpainting mode. Additionally, we propose a two-stage pipeline that combines keyframe inpainting with in-between frame propagation, enabling MTV-Inpaint to effectively handle long videos with hundreds of frames. Extensive experiments demonstrate that MTV-Inpaint achieves state-of-the-art performance in both scene completion and object insertion tasks. Furthermore, it demonstrates versatility in derived applications such as multi-modal inpainting, object editing, removal, image object brush, and the ability to handle long videos. Project page: https://mtv-inpaint.github.io/.
中文:MTV-Inpaint是一个统一的视频修复框架,通过双分支空间注意力机制整合场景补全和对象插入,支持多模态控制并能高效处理长视频。
English: MTV-Inpaint is a unified video inpainting framework that integrates scene completion and object insertion through a dual-branch spatial attention mechanism, supporting multimodal control and efficient long-video processing.

Authors:Anirban Chandra, Marius Koch, Suraj Pawar, Aniruddha Panda, Kamyar Azizzadenesheli, Jeroen Snippe, Faruk O. Alpak, Farah Hariri, Clement Etienam, Pandu Devarakota, Anima Anandkumar, Detlef Hohl
Title: Fourier Neural Operator based surrogates for $CO_2$ storage in realistic geologies
Abstract:
This study aims to develop surrogate models for accelerating decision making processes associated with carbon capture and storage (CCS) technologies. Selection of sub-surface $CO_2$ storage sites often necessitates expensive and involved simulations of $CO_2$ flow fields. Here, we develop a Fourier Neural Operator (FNO) based model for real-time, high-resolution simulation of $CO_2$ plume migration. The model is trained on a comprehensive dataset generated from realistic subsurface parameters and offers $O(10^5)$ computational acceleration with minimal sacrifice in prediction accuracy. We also explore super-resolution experiments to improve the computational cost of training the FNO based models. Additionally, we present various strategies for improving the reliability of predictions from the model, which is crucial while assessing actual geological sites. This novel framework, based on NVIDIA's Modulus library, will allow rapid screening of sites for CCS. The discussed workflows and strategies can be applied to other energy solutions like geothermal reservoir modeling and hydrogen storage. Our work scales scientific machine learning models to realistic 3D systems that are more consistent with real-life subsurface aquifers/reservoirs, paving the way for next-generation digital twins for subsurface CCS applications.
中文: 本研究开发了一种基于傅里叶神经算子的模型,通过实现实时高分辨率CO₂羽流模拟,以最小精度损失大幅提升计算速度,从而加速碳捕集与封存的决策过程。
English: This research develops a Fourier Neural Operator-based model to accelerate carbon capture and storage decision-making by enabling real-time, high-resolution CO₂ plume simulations with significant computational speedup and minimal accuracy loss.

Authors:Wali Ullah Khan, Chandan Kumar Sheemar, Eva Lagunas, Symeon Chatzinotas
Title: Beyond Diagonal RIS Enhanced Cognitive Radio Enabled Multilayer Non-Terrestrial Networks
Abstract:
Beyond diagonal reconfigurable intelligent surfaces (BD-RIS) have emerged as a transformative technology for enhancing wireless communication by intelligently manipulating the propagation environment. Its interconnected elements offer enhanced control over signal redirection, making it a promising solution for integrated terrestrial and non-terrestrial networks (NTNs). This paper explores the potential of BD-RIS in improving cognitive radio enabled multilayer non-terrestrial networks. We formulate a joint optimization problem that maximizes the achievable spectral efficiency by optimizing BD-RIS phase shifts and secondary transmitter power allocation while controlling the interference temperature from the secondary network to the primary network. To solve this problem efficiently, we decouple the original problem and propose a novel solution based on an alternating optimization approach. Simulation results demonstrate the effectiveness of BD-RIS in cognitive radio enabled multilayer NTNs.
Chinese: 本文展示了BD-RIS技术在认知无线电多层非地面网络中的潜力,通过联合优化相位偏移和功率分配,有效提升了频谱效率并控制了对主网络的干扰。
English: BD-RIS technology enhances wireless communication by intelligently manipulating the propagation environment, with this paper demonstrating its effectiveness in cognitive radio-enabled multilayer non-terrestrial networks through joint optimization of phase shifts and power allocation.

Authors:Yufan Deng, Xun Guo, Yizhi Wang, Jacob Zhiyuan Fang, Angtian Wang, Shenghai Yuan, Yiding Yang, Bo Liu, Haibin Huang, Chongyang Ma
Title: CINEMA: Coherent Multi-Subject Video Generation via MLLM-Based Guidance
Abstract:
Video generation has witnessed remarkable progress with the advent of deep generative models, particularly diffusion models. While existing methods excel in generating high-quality videos from text prompts or single images, personalized multi-subject video generation remains a largely unexplored challenge. This task involves synthesizing videos that incorporate multiple distinct subjects, each defined by separate reference images, while ensuring temporal and spatial consistency. Current approaches primarily rely on mapping subject images to keywords in text prompts, which introduces ambiguity and limits their ability to model subject relationships effectively. In this paper, we propose CINEMA, a novel framework for coherent multi-subject video generation by leveraging Multimodal Large Language Model (MLLM). Our approach eliminates the need for explicit correspondences between subject images and text entities, mitigating ambiguity and reducing annotation effort. By leveraging MLLM to interpret subject relationships, our method facilitates scalability, enabling the use of large and diverse datasets for training. Furthermore, our framework can be conditioned on varying numbers of subjects, offering greater flexibility in personalized content creation. Through extensive evaluations, we demonstrate that our approach significantly improves subject consistency, and overall video coherence, paving the way for advanced applications in storytelling, interactive media, and personalized video generation.
中文: 本文提出CINEMA框架,通过利用多模态大语言模型从独立参考图像生成连贯的多主体视频,无需建立文本与主体的显式对应关系,显著提升了主体一致性和视频连贯性。
English: The paper introduces CINEMA, a novel framework that leverages Multimultimodal Large Language Models to generate coherent multi-subject videos from separate reference images without requiring explicit text-subject correspondences, significantly improving subject consistency and video coherence.

Authors:Dongping Li, Tielong Cai, Tianci Tang, Wenhao Chai, Katherine Rose Driggs-Campbell, Gaoang Wang
Title: EMMOE: A Comprehensive Benchmark for Embodied Mobile Manipulation in Open Environments
Abstract:
Developing autonomous home robots controlled by natural language has long been a pursuit of humanity. While advancements in large language models (LLMs) and embodied intelligence make this goal closer, several challenges persist: the lack of a unified benchmark for more complex robot tasks, limited evaluation methods and metrics, data incompatibility between LLMs and mobile manipulation trajectories. To address these issues, we propose Embodied Mobile Manipulation in Open Environments (EMMOE), a benchmark that requires agents to interpret user instructions and execute long-horizon everyday tasks in continuous space. EMMOE seamlessly integrates high-level and low-level embodied tasks into a unified framework, along with three new metrics for more diverse assessment. Additionally, we collect~\dataset, which features in various task attributes, detailed process annotations, re-plans after failures, and two sub-datasets for LLM training. Furthermore, we design~\model, a sophisticated agent system consists of LLM with Direct Preference Optimization (DPO), light weighted navigation and manipulation models, and multiple error detection mechanisms. Finally, we demonstrate~\model's performance and evaluations of different models and policies.
中文摘要:作者提出了EMMOE基准,通过整合连续环境中的复杂任务、新评估指标和用于大语言模型训练的数据集,解决了自主家庭机器人面临的挑战,并展示了其先进智能体系统的优越性能。
English Summary: The authors introduce EMMOE, a benchmark addressing challenges in autonomous home robots by integrating complex tasks in continuous environments with new evaluation metrics and a dataset for LLM training, alongside a sophisticated agent system that demonstrates improved performance.

Authors:Ding Zhong, Xu Zheng, Chenfei Liao, Yuanhuiyi Lyu, Jialei Chen, Shengyang Wu, Linfeng Zhang, Xuming Hu
Title: OmniSAM: Omnidirectional Segment Anything Model for UDA in Panoramic Semantic Segmentation
Abstract:
Segment Anything Model 2 (SAM2) has emerged as a strong base model in various pinhole imaging segmentation tasks. However, when applying it to $360^\circ$ domain, the significant field-of-view (FoV) gap between pinhole ($70^\circ \times 70^\circ$) and panoramic images ($180^\circ \times 360^\circ$) poses unique challenges. Two major concerns for this application includes 1) inevitable distortion and object deformation brought by the large FoV disparity between domains; 2) the lack of pixel-level semantic understanding that the original SAM2 cannot provide. To address these issues, we propose a novel OmniSAM framework, which makes the first attempt to apply SAM2 for panoramic semantic segmentation. Specifically, to bridge the first gap, OmniSAM first divides the panorama into sequences of patches. These patches are then treated as image sequences in similar manners as in video segmentation tasks. We then leverage the SAM2's memory mechanism to extract cross-patch correspondences that embeds the cross-FoV dependencies, improving feature continuity and the prediction consistency along mask boundaries. For the second gap, OmniSAM fine-tunes the pretrained image encoder and reutilize the mask decoder for semantic prediction. An FoV-based prototypical adaptation module with dynamic pseudo label update mechanism is also introduced to facilitate the alignment of memory and backbone features, thereby improving model generalization ability across different sizes of source models. Extensive experimental results demonstrate that OmniSAM outperforms the state-of-the-art methods by large margins, e.g., 79.06% (+10.22%) on SPin8-to-SPan8, 62.46% (+6.58%) on CS13-to-DP13.
中文:OmniSAM框架通过将全景图分割为序列块并利用SAM2的记忆机制来解决失真和特征连续性问题,同时微调组件并引入原型适应模块以增强语义理解和泛化能力,在多个数据集上实现了显著的性能提升。
English: The OmniSAM framework adapts the Segment Anything Model 2 (SAM2) for panoramic semantic segmentation by dividing panoramas into sequential patches and leveraging SAM2's memory mechanism to address distortion and feature continuity issues, while fine-tuning components and introducing a prototypical adaptation module to enhance semantic understanding and generalization, achieving state-of-the-art performance improvements across multiple datasets.

Authors:Tri Le, Toan Nguyen, Quang Tran, Quang Nguyen, Baoru Huang, Hoan Nguyen, Minh Nhat Vu, Tung D. Ta, Anh Nguyen
Title: RoboDesign1M: A Large-scale Dataset for Robot Design Understanding
Abstract:
Robot design is a complex and time-consuming process that requires specialized expertise. Gaining a deeper understanding of robot design data can enable various applications, including automated design generation, retrieving example designs from text, and developing AI-powered design assistants. While recent advancements in foundation models present promising approaches to addressing these challenges, progress in this field is hindered by the lack of large-scale design datasets. In this paper, we introduce RoboDesign1M, a large-scale dataset comprising 1 million samples. Our dataset features multimodal data collected from scientific literature, covering various robotics domains. We propose a semi-automated data collection pipeline, enabling efficient and diverse data acquisition. To assess the effectiveness of RoboDesign1M, we conduct extensive experiments across multiple tasks, including design image generation, visual question answering about designs, and design image retrieval. The results demonstrate that our dataset serves as a challenging new benchmark for design understanding tasks and has the potential to advance research in this field. RoboDesign1M will be released to support further developments in AI-driven robotic design automation.
中文: 本文提出的RoboDesign1M是一个包含百万样本的大规模多模态数据集,旨在解决设计数据匮乏的问题,通过图像生成与检索等任务推动AI驱动的机器人设计自动化研究发展。
English: RoboDesign1M is introduced as a large-scale multimodal dataset of 1 million samples to overcome the scarcity of design data, enabling advancements in AI-driven robotic design automation through tasks like image generation and retrieval.

Authors:Chenfei Liao, Xu Zheng, Yuanhuiyi Lyu, Haiwei Xue, Yihong Cao, Jiawen Wang, Kailun Yang, Xuming Hu
Title: MemorySAM: Memorize Modalities and Semantics with Segment Anything Model 2 for Multi-modal Semantic Segmentation
Abstract:
Research has focused on Multi-Modal Semantic Segmentation (MMSS), where pixel-wise predictions are derived from multiple visual modalities captured by diverse sensors. Recently, the large vision model, Segment Anything Model 2 (SAM2), has shown strong zero-shot segmentation performance on both images and videos. When extending SAM2 to MMSS, two issues arise: 1. How can SAM2 be adapted to multi-modal data? 2. How can SAM2 better understand semantics? Inspired by cross-frame correlation in videos, we propose to treat multi-modal data as a sequence of frames representing the same scene. Our key idea is to ''memorize'' the modality-agnostic information and 'memorize' the semantics related to the targeted scene. To achieve this, we apply SAM2's memory mechanisms across multi-modal data to capture modality-agnostic features. Meanwhile, to memorize the semantic knowledge, we propose a training-only Semantic Prototype Memory Module (SPMM) to store category-level prototypes across training for facilitating SAM2's transition from instance to semantic segmentation. A prototypical adaptation loss is imposed between global and local prototypes iteratively to align and refine SAM2's semantic understanding. Extensive experimental results demonstrate that our proposed MemorySAM outperforms SoTA methods by large margins on both synthetic and real-world benchmarks (65.38% on DELIVER, 52.88% on MCubeS). Source code will be made publicly available.
中文: 本研究提出了MemorySAM,将Segment Anything Model 2 (SAM2)应用于多模态语义分割,通过记忆机制捕捉模态无关特征,并采用语义原型记忆模块提升语义理解能力,在多个基准测试中取得了最先进的性能。
English: This study introduces MemorySAM, an adaptation of the Segment Anything Model 2 (SAM2) for Multi-Modal Semantic Segmentation, which utilizes memory mechanisms to capture modality-agnostic features and a Semantic Prototype Memory Module to enhance semantic understanding, achieving state-of-the-art performance on benchmarks.

Authors:Altaf Allah Abbassi, Leuson Da Silva, Amin Nikanjam, Foutse Khomh
Title: A Taxonomy of Inefficiencies in LLM-Generated Python Code
Abstract:
Large Language Models (LLMs) are widely adopted for automated code generation with promising results. Although prior research has assessed LLM-generated code and identified various quality issues -- such as redundancy, poor maintainability, and sub-optimal performance a systematic understanding and categorization of these inefficiencies remain unexplored. Without such knowledge, practitioners struggle to optimize LLM-generated code for real-world applications, limiting its adoption. This study can also guide improving code LLMs, enhancing the quality and efficiency of code generation. Therefore, in this study, we empirically investigate inefficiencies in LLM-generated code by state-of-the-art models, i.e., CodeLlama, DeepSeek-Coder, and CodeGemma. To do so, we analyze 492 generated code snippets in the HumanEval++ dataset. We then construct a taxonomy of inefficiencies in LLM-generated code that includes 5 categories General Logic, Performance, Readability, Maintainability, and Errors) and 19 subcategories of inefficiencies. We then validate the proposed taxonomy through an online survey with 58 LLM practitioners and researchers. Our study indicates that logic and performance-related inefficiencies are the most popular, relevant, and frequently co-occur and impact overall code quality inefficiency. Our taxonomy provides a structured basis for evaluating the quality LLM-generated code and guiding future research to improve code generation efficiency.
中文摘要:本研究系统调查了大语言模型生成代码中的低效问题,构建了包含5大类19子类的分类体系,为评估代码质量和指导未来研究提供了结构化框架。
English Summary: This study systematically investigates inefficiencies in LLM-generated code, developing a comprehensive taxonomy of 5 categories and 19 subcategories that provides a structured framework for evaluating code quality and guiding future improvements.

Authors:Xudong Lu, Yinghao Chen, Renshou Wu, Haohao Gao, Xi Chen, Xue Yang, Xiangyu Zhao, Aojun Zhou, Fangyuan Li, Yafei Wen, Xiaoxin Chen, Shuai Ren, Hongsheng Li
Title: GenieBlue: Integrating both Linguistic and Multimodal Capabilities for Large Language Models on Mobile Devices
Abstract:
Recent advancements in Multimodal Large Language Models (MLLMs) have enabled their deployment on mobile devices. However, challenges persist in maintaining strong language capabilities and ensuring hardware compatibility, both of which are crucial for user experience and practical deployment efficiency. In our deployment process, we observe that existing MLLMs often face performance degradation on pure language tasks, and the current NPU platforms on smartphones do not support the MoE architecture, which is commonly used to preserve pure language capabilities during multimodal training. To address these issues, we systematically analyze methods to maintain pure language capabilities during the training of MLLMs, focusing on both training data and model architecture aspects. Based on these analyses, we propose GenieBlue, an efficient MLLM structural design that integrates both linguistic and multimodal capabilities for LLMs on mobile devices. GenieBlue freezes the original LLM parameters during MLLM training to maintain pure language capabilities. It acquires multimodal capabilities by duplicating specific transformer blocks for full fine-tuning and integrating lightweight LoRA modules. This approach preserves language capabilities while achieving comparable multimodal performance through extensive training. Deployed on smartphone NPUs, GenieBlue demonstrates efficiency and practicality for applications on mobile devices.
中文: 针对移动设备上多模态大语言模型在保持语言能力和硬件兼容性方面的挑战,GenieBlue通过冻结原始LLM参数、复制特定Transformer模块并集成轻量级LoRA模块,在智能手机NPU上实现了语言能力保留与多模态功能的高效协同。
English: Recent MLLM deployments on mobile devices face challenges in maintaining language capabilities and hardware compatibility, addressed by GenieBlue, which freezes original LLM parameters and integrates duplicated transformer blocks with LoRA modules to preserve linguistic skills while enabling multimodal functions efficiently on smartphone NPUs.

Authors:Yuyou Zhang, Yihang Yao, Shiqi Liu, Yaru Niu, Changyi Lin, Yuxiang Yang, Wenhao Yu, Tingnan Zhang, Jie Tan, Ding Zhao
Title: QuietPaw: Learning Quadrupedal Locomotion with Versatile Noise Preference Alignment
Abstract:
When operating at their full capacity, quadrupedal robots can produce loud footstep noise, which can be disruptive in human-centered environments like homes, offices, and hospitals. As a result, balancing locomotion performance with noise constraints is crucial for the successful real-world deployment of quadrupedal robots. However, achieving adaptive noise control is challenging due to (a) the trade-off between agility and noise minimization, (b) the need for generalization across diverse deployment conditions, and (c) the difficulty of effectively adjusting policies based on noise requirements. We propose QuietPaw, a framework incorporating our Conditional Noise-Constrained Policy (CNCP), a constrained learning-based algorithm that enables flexible, noise-aware locomotion by conditioning policy behavior on noise-reduction levels. We leverage value representation decomposition in the critics, disentangling state representations from condition-dependent representations and this allows a single versatile policy to generalize across noise levels without retraining while improving the Pareto trade-off between agility and noise reduction. We validate our approach in simulation and the real world, demonstrating that CNCP can effectively balance locomotion performance and noise constraints, achieving continuously adjustable noise reduction.
中文:提出的QuietPaw框架通过条件噪声约束策略,采用价值表征解耦方法,使四足机器人能够在保持运动性能的同时实现可调节的降噪效果。
English: The proposed QuietPaw framework with Conditional Noise-Constrained Policy enables quadrupedal robots to achieve adaptable noise reduction while maintaining locomotion performance through a novel value representation decomposition method.

Authors:Yuhao Wu, Yushi Bai, Zhiqing Hu, Shangqing Tu, Ming Shan Hee, Juanzi Li, Roy Ka-Wei Lee
Title: Shifting Long-Context LLMs Research from Input to Output
Abstract:
Recent advancements in long-context Large Language Models (LLMs) have primarily concentrated on processing extended input contexts, resulting in significant strides in long-context comprehension. However, the equally critical aspect of generating long-form outputs has received comparatively less attention. This paper advocates for a paradigm shift in NLP research toward addressing the challenges of long-output generation. Tasks such as novel writing, long-term planning, and complex reasoning require models to understand extensive contexts and produce coherent, contextually rich, and logically consistent extended text. These demands highlight a critical gap in current LLM capabilities. We underscore the importance of this under-explored domain and call for focused efforts to develop foundational LLMs tailored for generating high-quality, long-form outputs, which hold immense potential for real-world applications.
中文: 本文主张将NLP研究重点转向长文本生成领域,指出当前大语言模型在生成连贯长内容方面的能力缺口,呼吁开发专注于高质量长文本输出的基础模型。
English: This paper calls for shifting NLP research focus toward long-output generation in LLMs, highlighting the current gap in producing coherent extended texts and advocating for developing models specialized in high-quality long-form content.

Authors:Zhengyao Gu, Henry Peng Zou, Yankai Chen, Aiwei Liu, Weizhi Zhang, Philip S. Yu
Title: Scaling Laws for Many-Shot In-Context Learning with Self-Generated Annotations
Abstract:
The high cost of obtaining high-quality annotated data for in-context learning (ICL) has motivated the development of methods that use self-generated annotations in place of ground-truth labels. While these approaches have shown promising results in few-shot settings, they generally do not scale to many-shot scenarios. In this work, we study ICL with self-generated examples using a framework analogous to traditional semi-supervised learning, consisting of annotation generation, demonstration selection, and in-context inference. Within this framework, we propose a simple baseline that outperforms ground-truth ICL in zero-shot, few-shot, and many-shot settings. Notably, we observe a scaling law with this baseline, where optimal performance is achieved with more than 1,000 demonstrations. To fully exploit the many-shot capabilities of semi-supervised ICL, we introduce IterPSD, an iterative annotation approach that integrates iterative refinement and curriculum pseudo-labeling techniques from semi-supervised learning, yielding up to 6.8% additional gains on classification tasks.
中文摘要:本研究提出了一种利用自生成标注的半监督上下文学习框架,其基线方法优于真实标注方法,并引入名为IterPSD的迭代技术,在分类任务上实现了高达6.8%的性能提升。
English Summary: This study introduces a semi-supervised framework for in-context learning using self-generated annotations, proposing a baseline method that surpasses ground-truth approaches and an iterative technique called IterPSD that achieves up to 6.8% improvement on classification tasks.

Authors:Weizhi Zhang, Liangwei Yang, Wooseong Yang, Henry Peng Zou, Yuqing Liu, Ke Xu, Sourav Medya, Philip S. Yu
Title: LLMInit: A Free Lunch from Large Language Models for Selective Initialization of Recommendation
Abstract:
Collaborative filtering models, particularly graph-based approaches, have demonstrated strong performance in capturing user-item interactions for recommendation systems. However, they continue to struggle in cold-start and data-sparse scenarios. The emergence of large language models (LLMs) like GPT and LLaMA presents new possibilities for enhancing recommendation performance, especially in cold-start settings. Despite their promise, LLMs pose challenges related to scalability and efficiency due to their high computational demands and limited ability to model complex user-item relationships effectively. In this work, we introduce a novel perspective on leveraging LLMs for CF model initialization. Through experiments, we uncover an embedding collapse issue when scaling CF models to larger embedding dimensions. To effectively harness large-scale LLM embeddings, we propose innovative selective initialization strategies utilizing random, uniform, and variance-based index sampling. Our comprehensive evaluation on multiple real-world datasets demonstrates significant performance gains across various CF models while maintaining a lower computational cost compared to existing LLM-based recommendation approaches.
中文摘要:本研究提出利用大语言模型嵌入初始化协同过滤模型的新方法,通过选择性初始化策略解决嵌入塌陷问题,在显著提升性能的同时有效降低计算成本。
English Summary: This study introduces a novel method for initializing collaborative filtering models using large language model embeddings, addressing embedding collapse through selective initialization strategies that significantly improve performance while reducing computational costs.

Authors:Ting Zhang, Chengran Yang, Yindu Su, Martin Weyssow, Hung Nguyen, Tan Bui, Hong Jin Kang, Yikun Li, Eng Lieh Ouh, Lwin Khin Shar, David Lo
Title: Benchmarking Large Language Models for Multi-Language Software Vulnerability Detection
Abstract:
Recent advancements in generative AI have led to the widespread adoption of large language models (LLMs) in software engineering, addressing numerous long-standing challenges. However, a comprehensive study examining the capabilities of LLMs in software vulnerability detection (SVD), a crucial aspect of software security, is currently lacking. Existing research primarily focuses on evaluating LLMs using C/C++ datasets. It typically explores only one or two strategies among prompt engineering, instruction tuning, and sequence classification fine-tuning for open-source LLMs. Consequently, there is a significant knowledge gap regarding the effectiveness of diverse LLMs in detecting vulnerabilities across various programming languages. To address this knowledge gap, we present a comprehensive empirical study evaluating the performance of LLMs on the SVD task. We have compiled a comprehensive dataset comprising 8,260 vulnerable functions in Python, 7,505 in Java, and 28,983 in JavaScript. We assess five open-source LLMs using multiple approaches, including prompt engineering, instruction tuning, and sequence classification fine-tuning. These LLMs are benchmarked against five fine-tuned small language models and two open-source static application security testing tools. Furthermore, we explore two avenues to improve LLM performance on SVD: a) Data perspective: Retraining models using downsampled balanced datasets. b) Model perspective: Investigating ensemble learning methods that combine predictions from multiple LLMs. Our comprehensive experiments demonstrate that SVD remains a challenging task for LLMs. This study provides a thorough understanding of the role of LLMs in SVD and offers practical insights for future advancements in leveraging generative AI to enhance software security practices.
Recent generative AI advancements have expanded LLMs' use in software engineering, yet a thorough evaluation of their effectiveness in software vulnerability detection across multiple programming languages and strategies is missing, prompting this comprehensive empirical study.
English Summary:

Authors:Deyu Bo, Songhua Liu, Xinchao Wang
Title: Understanding Dataset Distillation via Spectral Filtering
Abstract:
Dataset distillation (DD) has emerged as a promising approach to compress datasets and speed up model training. However, the underlying connections among various DD methods remain largely unexplored. In this paper, we introduce UniDD, a spectral filtering framework that unifies diverse DD objectives. UniDD interprets each DD objective as a specific filter function that affects the eigenvalues of the feature-feature correlation (FFC) matrix and modulates the frequency components of the feature-label correlation (FLC) matrix. In this way, UniDD reveals that the essence of DD fundamentally lies in matching frequency-specific features. Moreover, according to the filter behaviors, we classify existing methods into low-frequency matching and high-frequency matching, encoding global texture and local details, respectively. However, existing methods rely on fixed filter functions throughout distillation, which cannot capture the low- and high-frequency information simultaneously. To address this limitation, we further propose Curriculum Frequency Matching (CFM), which gradually adjusts the filter parameter to cover both low- and high-frequency information of the FFC and FLC matrices. Extensive experiments on small-scale datasets, such as CIFAR-10/100, and large-scale datasets, including ImageNet-1K, demonstrate the superior performance of CFM over existing baselines and validate the practicality of UniDD.
中文摘要:本文提出UniDD谱过滤框架,通过将数据集蒸馏方法解释为频率滤波器来统一它们,并进一步提出课程频率匹配(CFM)方法动态调整滤波器参数以同时捕获低频和高频信息,在多个数据集上验证了其优越性能。
English Summary: This paper introduces UniDD, a spectral filtering framework that unifies dataset distillation methods by interpreting them as frequency filters, and proposes Curriculum Frequency Matching (CFM) to dynamically adjust filter parameters for capturing both low- and high-frequency information, demonstrating superior performance across various datasets.

Authors:Haowen Pan, Xiaozhi Wang, Yixin Cao, Zenglin Shi, Xun Yang, Juanzi Li, Meng Wang
Title: Precise Localization of Memories: A Fine-grained Neuron-level Knowledge Editing Technique for LLMs
Abstract:
Knowledge editing aims to update outdated information in Large Language Models (LLMs). A representative line of study is locate-then-edit methods, which typically employ causal tracing to identify the modules responsible for recalling factual knowledge about entities. However, we find these methods are often sensitive only to changes in the subject entity, leaving them less effective at adapting to changes in relations. This limitation results in poor editing locality, which can lead to the persistence of irrelevant or inaccurate facts, ultimately compromising the reliability of LLMs. We believe this issue arises from the insufficient precision of knowledge localization. To address this, we propose a Fine-grained Neuron-level Knowledge Editing (FiNE) method that enhances editing locality without affecting overall success rates. By precisely identifying and modifying specific neurons within feed-forward networks, FiNE significantly improves knowledge localization and editing. Quantitative experiments demonstrate that FiNE efficiently achieves better overall performance compared to existing techniques, providing new insights into the localization and modification of knowledge within LLMs.
中文: FiNE方法通过精确定位和修改前馈网络中的特定神经元,有效提升了大型语言模型的知识编辑定位能力和准确性,同时不影响整体性能。
English: The FiNE method enhances knowledge editing in LLMs by precisely targeting and modifying specific neurons, improving localization and accuracy without compromising overall performance.

Authors:Quanyu Dai, Jiaren Xiao, Zhaocheng Du, Jieming Zhu, Chengxiao Luo, Xiao-Ming Wu, Zhenhua Dong
Title: MCNet: Monotonic Calibration Networks for Expressive Uncertainty Calibration in Online Advertising
Abstract:
In online advertising, uncertainty calibration aims to adjust a ranking model's probability predictions to better approximate the true likelihood of an event, e.g., a click or a conversion. However, existing calibration approaches may lack the ability to effectively model complex nonlinear relations, consider context features, and achieve balanced performance across different data subsets. To tackle these challenges, we introduce a novel model called Monotonic Calibration Networks, featuring three key designs: a monotonic calibration function (MCF), an order-preserving regularizer, and a field-balance regularizer. The nonlinear MCF is capable of naturally modeling and universally approximating the intricate relations between uncalibrated predictions and the posterior probabilities, thus being much more expressive than existing methods. MCF can also integrate context features using a flexible model architecture, thereby achieving context awareness. The order-preserving and field-balance regularizers promote the monotonic relationship between adjacent bins and the balanced calibration performance on data subsets, respectively. Experimental results on both public and industrial datasets demonstrate the superior performance of our method in generating well-calibrated probability predictions.
中文: 提出的单调校准网络通过非线性单调校准函数增强表达能力,结合上下文特征,并采用保持顺序和领域平衡的正则化器,有效解决了现有方法在复杂非线性关系建模和子集平衡性方面的不足,在多个数据集上实现了更优的校准效果。
English: The proposed Monotonic Calibration Networks address limitations in existing calibration methods by incorporating a nonlinear monotonic calibration function for enhanced expressiveness, context-aware feature integration, and dual regularizers to ensure monotonicity and balanced performance across data subsets, demonstrating superior calibration on diverse datasets.

Authors:Anwesa Choudhuri, Zhongpai Gao, Meng Zheng, Benjamin Planche, Terrence Chen, Ziyan Wu
Title: PolypSegTrack: Unified Foundation Model for Colonoscopy Video Analysis
Abstract:
Early detection, accurate segmentation, classification and tracking of polyps during colonoscopy are critical for preventing colorectal cancer. Many existing deep-learning-based methods for analyzing colonoscopic videos either require task-specific fine-tuning, lack tracking capabilities, or rely on domain-specific pre-training. In this paper, we introduce PolypSegTrack, a novel foundation model that jointly addresses polyp detection, segmentation, classification and unsupervised tracking in colonoscopic videos. Our approach leverages a novel conditional mask loss, enabling flexible training across datasets with either pixel-level segmentation masks or bounding box annotations, allowing us to bypass task-specific fine-tuning. Our unsupervised tracking module reliably associates polyp instances across frames using object queries, without relying on any heuristics. We leverage a robust vision foundation model backbone that is pre-trained unsupervisedly on natural images, thereby removing the need for domain-specific pre-training. Extensive experiments on multiple polyp benchmarks demonstrate that our method significantly outperforms existing state-of-the-art approaches in detection, segmentation, classification, and tracking.
Chinese: PolypSegTrack是一种新型基础模型,可在结肠镜视频中联合实现息肉检测、分割、分类和无监督跟踪,无需任务特定微调或领域特定预训练,并在多个基准测试中显著优于现有方法。
English: PolypSegTrack is a novel foundation model that integrates polyp detection, segmentation, classification, and unsupervised tracking in colonoscopic videos, eliminating the need for task-specific fine-tuning or domain-specific pre-training while outperforming existing methods across multiple benchmarks.

Authors:Nikai Du, Zhennan Chen, Shan Gao, Zhizhou Chen, Xi Chen, Zhengkai Jiang, Jian Yang, Ying Tai
Title: TextCrafter: Accurately Rendering Multiple Texts in Complex Visual Scenes
Abstract:
This paper explores the task of Complex Visual Text Generation (CVTG), which centers on generating intricate textual content distributed across diverse regions within visual images. In CVTG, image generation models often rendering distorted and blurred visual text or missing some visual text. To tackle these challenges, we propose TextCrafter, a novel multi-visual text rendering method. TextCrafter employs a progressive strategy to decompose complex visual text into distinct components while ensuring robust alignment between textual content and its visual carrier. Additionally, it incorporates a token focus enhancement mechanism to amplify the prominence of visual text during the generation process. TextCrafter effectively addresses key challenges in CVTG tasks, such as text confusion, omissions, and blurriness. Moreover, we present a new benchmark dataset, CVTG-2K, tailored to rigorously evaluate the performance of generative models on CVTG tasks. Extensive experiments demonstrate that our method surpasses state-of-the-art approaches.
Chinese: 本文提出TextCrafter这一复杂视觉文本生成新方法,通过分解文本组件和增强对齐机制解决模糊与遗漏问题,并建立CVTG-2K基准数据集验证其超越现有技术的优越性能。
English: This paper introduces TextCrafter, a novel method for Complex Visual Text Generation that decomposes text components and enhances alignment to address issues like blurriness and omissions, while also proposing the CVTG-2K benchmark to validate its superior performance over existing approaches.

Authors:Yuan He, Bailan He, Zifeng Ding, Alisia Lupidi, Yuqicheng Zhu, Shuo Chen, Caiqi Zhang, Jiaoyan Chen, Yunpu Ma, Volker Tresp, Ian Horrocks
Title: Supposedly Equivalent Facts That Aren't? Entity Frequency in Pre-training Induces Asymmetry in LLMs
Abstract:
Understanding and mitigating hallucinations in Large Language Models (LLMs) is crucial for ensuring reliable content generation. While previous research has primarily focused on "when" LLMs hallucinate, our work explains "why" and directly links model behaviour to the pre-training data that forms their prior knowledge. Specifically, we demonstrate that an asymmetry exists in the recognition of logically equivalent facts, which can be attributed to frequency discrepancies of entities appearing as subjects versus objects. Given that most pre-training datasets are inaccessible, we leverage the fully open-source OLMo series by indexing its Dolma dataset to estimate entity frequencies. Using relational facts (represented as triples) from Wikidata5M, we construct probing datasets to isolate this effect. Our experiments reveal that facts with a high-frequency subject and a low-frequency object are better recognised than their inverse, despite their logical equivalence. The pattern reverses in low-to-high frequency settings, and no statistically significant asymmetry emerges when both entities are high-frequency. These findings highlight the influential role of pre-training data in shaping model predictions and provide insights for inferring the characteristics of pre-training data in closed or partially closed LLMs.
中文摘要:本研究揭示了大型语言模型产生幻觉的原因在于预训练数据中实体作为主语与宾语的频率差异导致对逻辑等价事实的识别不对称,为推断封闭或半封闭模型中训练数据的特性提供了新视角。
English Summary: This research uncovers that hallucinations in LLMs stem from asymmetrical recognition of logically equivalent facts due to frequency disparities of entities as subjects versus objects in pre-training data, offering insights into inferring characteristics of inaccessible training datasets.

Authors:Yuhan Liu, Yixiong Zou, Yuhua Li, Ruixuan Li
Title: The Devil is in Low-Level Features for Cross-Domain Few-Shot Segmentation
Abstract:
Cross-Domain Few-Shot Segmentation (CDFSS) is proposed to transfer the pixel-level segmentation capabilities learned from large-scale source-domain datasets to downstream target-domain datasets, with only a few annotated images per class. In this paper, we focus on a well-observed but unresolved phenomenon in CDFSS: for target domains, particularly those distant from the source domain, segmentation performance peaks at the very early epochs, and declines sharply as the source-domain training proceeds. We delve into this phenomenon for an interpretation: low-level features are vulnerable to domain shifts, leading to sharper loss landscapes during the source-domain training, which is the devil of CDFSS. Based on this phenomenon and interpretation, we further propose a method that includes two plug-and-play modules: one to flatten the loss landscapes for low-level features during source-domain training as a novel sharpness-aware minimization method, and the other to directly supplement target-domain information to the model during target-domain testing by low-level-based calibration. Extensive experiments on four target datasets validate our rationale and demonstrate that our method surpasses the state-of-the-art method in CDFSS signifcantly by 3.71% and 5.34% average MIoU in 1-shot and 5-shot scenarios, respectively.
中文: 跨域少样本分割因域偏移影响低级特征导致性能下降,为此我们提出了两个即插即用模块,在1样本和5样本场景下显著提升了分割精度。
English: Cross-Domain Few-Shot Segmentation faces performance decline due to domain shifts affecting low-level features, prompting the proposal of two plug-and-play modules that significantly enhance segmentation accuracy in both 1-shot and 5-shot settings.

Authors:Yulu Han, Ziye Jia, Sijie He, Yu Zhang, Qihui Wu
Title: CNN+Transformer Based Anomaly Traffic Detection in UAV Networks for Emergency Rescue
Abstract:
The unmanned aerial vehicle (UAV) network has gained significant attentions in recent years due to its various applications. However, the traffic security becomes the key threatening public safety issue in an emergency rescue system due to the increasing vulnerability of UAVs to cyber attacks in environments with high heterogeneities. Hence, in this paper, we propose a novel anomaly traffic detection architecture for UAV networks based on the software-defined networking (SDN) framework and blockchain technology. Specifically, SDN separates the control and data plane to enhance the network manageability and security. Meanwhile, the blockchain provides decentralized identity authentication and data security records. Beisdes, a complete security architecture requires an effective mechanism to detect the time-series based abnormal traffic. Thus, an integrated algorithm combining convolutional neural networks (CNNs) and Transformer (CNN+Transformer) for anomaly traffic detection is developed, which is called CTranATD. Finally, the simulation results show that the proposed CTranATD algorithm is effective and outperforms the individual CNN, Transformer, and LSTM algorithms for detecting anomaly traffic.
中文: 本文提出了一种基于软件定义网络和区块链技术的无人机网络异常流量检测架构,并开发了名为CTranATD的CNN+Transformer融合算法,实验证明该算法在异常流量检测方面优于传统方法。
English: This paper introduces a novel anomaly traffic detection architecture for UAV networks, combining SDN and blockchain technologies with a CNN+Transformer algorithm called CTranATD, which demonstrates superior performance in detecting cyber threats compared to existing methods.

Authors:Hongye Cao, Fan Feng, Jing Huo, Shangdong Yang, Meng Fang, Tianpei Yang, Yang Gao
Title: Model-Based Offline Reinforcement Learning with Adversarial Data Augmentation
Abstract:
Model-based offline Reinforcement Learning (RL) constructs environment models from offline datasets to perform conservative policy optimization. Existing approaches focus on learning state transitions through ensemble models, rollouting conservative estimation to mitigate extrapolation errors. However, the static data makes it challenging to develop a robust policy, and offline agents cannot access the environment to gather new data. To address these challenges, we introduce Model-based Offline Reinforcement learning with AdversariaL data augmentation (MORAL). In MORAL, we replace the fixed horizon rollout by employing adversaria data augmentation to execute alternating sampling with ensemble models to enrich training data. Specifically, this adversarial process dynamically selects ensemble models against policy for biased sampling, mitigating the optimistic estimation of fixed models, thus robustly expanding the training data for policy optimization. Moreover, a differential factor is integrated into the adversarial process for regularization, ensuring error minimization in extrapolations. This data-augmented optimization adapts to diverse offline tasks without rollout horizon tuning, showing remarkable applicability. Extensive experiments on D4RL benchmark demonstrate that MORAL outperforms other model-based offline RL methods in terms of policy learning and sample efficiency.
中文摘要:MORAL通过对抗性数据增强技术动态扩展离线强化学习的训练数据,利用集成模型交替采样提升策略鲁棒性,在D4RL基准测试中显著优于现有模型基础离线强化学习方法。
English Summary: MORAL introduces adversarial data augmentation in model-based offline RL to dynamically enrich training data through ensemble model sampling, improving policy robustness and outperforming existing methods on the D4RL benchmark.

Authors:Gianluca Giacomelli, Simone Formentin, Victor G. Lopez, Matthias A. Müller, Valentina Breschi
Title: Insights into the explainability of Lasso-based DeePC for nonlinear systems
Abstract:
Data-enabled Predictive Control (DeePC) has recently gained the spotlight as an easy-to-use control technique that allows for constraint handling while relying on raw data only. Initially proposed for linear time-invariant systems, several DeePC extensions are now available to cope with nonlinear systems. Nonetheless, these solutions mainly focus on ensuring the controller's effectiveness, overlooking the explainability of the final result. As a step toward explaining the outcome of DeePC for the control of nonlinear systems, in this paper, we focus on analyzing the earliest and simplest DeePC approach proposed to cope with nonlinearities in the controlled system, using a Lasso regularization. Our theoretical analysis highlights that the decisions undertaken by DeePC with Lasso regularization are unexplainable, as control actions are determined by data incoherent with the system's local behavior. This result is true even when the available input/output samples are grouped according to the different operating conditions explored during data collection. Our numerical study confirms these findings, highlighting the benefits of data grouping in terms of performance while showing that explainability remains a challenge in control design via DeePC.
中文: 数据驱动预测控制(DeePC)是一种易于使用的非线性系统控制技术,但采用Lasso正则化时,由于数据与系统局部行为不一致,其决策不可解释,即使数据分组也无济于事。
English: Data-enabled Predictive Control (DeePC) is an easy-to-use control technique for nonlinear systems, but its decisions with Lasso regularization are unexplainable due to data incoherence with local system behavior, even with grouped data.

Authors:Matthias Bentert, Fedor V. Fomin, Petr A. Golovach, M. S. Ramanujan, Saket Saurabh
Title: When Distances Lie: Euclidean Embeddings in the Presence of Outliers and Distance Violations
Abstract:
Distance geometry explores the properties of distance spaces that can be exactly represented as the pairwise Euclidean distances between points in $\mathbb{R}^d$ ($d \geq 1$), or equivalently, distance spaces that can be isometrically embedded in $\mathbb{R}^d$. In this work, we investigate whether a distance space can be isometrically embedded in $\mathbb{R}^d$ after applying a limited number of modifications. Specifically, we focus on two types of modifications: outlier deletion (removing points) and distance modification (adjusting distances between points). The central problem, Euclidean Embedding Editing (EEE), asks whether an input distance space on $n$ points can be transformed, using at most $k$ modifications, into a space that is isometrically embeddable in $\mathbb{R}^d$. We present several fixed-parameter tractable (FPT) and approximation algorithms for this problem. Our first result is an algorithm that solves EEE in time $(dk)^{\mathcal{O}(d+k)} + n^{\mathcal{O}(1)}$. The core subroutine of this algorithm, which is of independent interest, is a polynomial-time method for compressing the input distance space into an equivalent instance of EEE with $\mathcal{O}((dk)^2)$ points. For the special but important case of EEE where only outlier deletions are allowed, we improve the parameter dependence of the FPT algorithm and obtain a running time of $\min\{(d+3)^k, 2^{d+k}\} \cdot n^{\mathcal{O}(1)}$. Additionally, we provide an FPT-approximation algorithm for this problem, which outputs a set of at most $2 \cdot {\rm OPT}$ outliers in time $2^d \cdot n^{\mathcal{O}(1)}$. This 2-approximation algorithm improves upon the previous $(3+\varepsilon)$-approximation algorithm by Sidiropoulos, Wang, and Wang [SODA '17]. Furthermore, we complement our algorithms with hardness results motivating our choice of parameterizations.
中文: 本研究探讨欧几里得嵌入编辑问题,旨在确定距离空间能否通过有限次修改(如删除异常点或调整距离)实现到ℝᵈ的等距嵌入,并提出了具有更高效率的固定参数可解算法与近似算法。
English: This study investigates the Euclidean Embedding Editing problem, which determines whether a distance space can be made isometrically embeddable in ℝᵈ through limited modifications like outlier deletion or distance adjustments, presenting FPT and approximation algorithms with improved efficiency.

Authors:Luca Zanella, Massimiliano Mancini, Willi Menapace, Sergey Tulyakov, Yiming Wang, Elisa Ricci
Title: Can Text-to-Video Generation help Video-Language Alignment?
Abstract:
Recent video-language alignment models are trained on sets of videos, each with an associated positive caption and a negative caption generated by large language models. A problem with this procedure is that negative captions may introduce linguistic biases, i.e., concepts are seen only as negatives and never associated with a video. While a solution would be to collect videos for the negative captions, existing databases lack the fine-grained variations needed to cover all possible negatives. In this work, we study whether synthetic videos can help to overcome this issue. Our preliminary analysis with multiple generators shows that, while promising on some tasks, synthetic videos harm the performance of the model on others. We hypothesize this issue is linked to noise (semantic and visual) in the generated videos and develop a method, SynViTA, that accounts for those. SynViTA dynamically weights the contribution of each synthetic video based on how similar its target caption is w.r.t. the real counterpart. Moreover, a semantic consistency loss makes the model focus on fine-grained differences across captions, rather than differences in video appearance. Experiments show that, on average, SynViTA improves over existing methods on VideoCon test sets and SSv2-Temporal, SSv2-Events, and ATP-Hard benchmarks, being a first promising step for using synthetic videos when learning video-language models.
Recent video-language models face linguistic bias from negative captions, which SynViTA addresses by using dynamically weighted synthetic videos and a semantic consistency loss to improve alignment without relying on scarce real video data.
English Summary:

Authors:Jiadong Tang, Yu Gao, Dianyi Yang, Liqi Yan, Yufeng Yue, Yi Yang
Title: DroneSplat: 3D Gaussian Splatting for Robust 3D Reconstruction from In-the-Wild Drone Imagery
Abstract:
Drones have become essential tools for reconstructing wild scenes due to their outstanding maneuverability. Recent advances in radiance field methods have achieved remarkable rendering quality, providing a new avenue for 3D reconstruction from drone imagery. However, dynamic distractors in wild environments challenge the static scene assumption in radiance fields, while limited view constraints hinder the accurate capture of underlying scene geometry. To address these challenges, we introduce DroneSplat, a novel framework designed for robust 3D reconstruction from in-the-wild drone imagery. Our method adaptively adjusts masking thresholds by integrating local-global segmentation heuristics with statistical approaches, enabling precise identification and elimination of dynamic distractors in static scenes. We enhance 3D Gaussian Splatting with multi-view stereo predictions and a voxel-guided optimization strategy, supporting high-quality rendering under limited view constraints. For comprehensive evaluation, we provide a drone-captured 3D reconstruction dataset encompassing both dynamic and static scenes. Extensive experiments demonstrate that DroneSplat outperforms both 3DGS and NeRF baselines in handling in-the-wild drone imagery.
Chinese: DroneSplat是一种创新框架,通过自适应消除动态干扰物并在有限视角下优化几何捕捉,显著提升了无人机图像的三维重建效果,在野外环境中优于现有方法。
English: DroneSplat is a novel framework that enhances 3D reconstruction from drone imagery by adaptively removing dynamic distractors and improving geometry capture under limited views, outperforming existing methods in wild environments.

Authors:Yujia Zheng, Yang Liu, Jiaxiong Yao, Yingyao Hu, Kun Zhang
Title: Nonparametric Factor Analysis and Beyond
Abstract:
Nearly all identifiability results in unsupervised representation learning inspired by, e.g., independent component analysis, factor analysis, and causal representation learning, rely on assumptions of additive independent noise or noiseless regimes. In contrast, we study the more general case where noise can take arbitrary forms, depend on latent variables, and be non-invertibly entangled within a nonlinear function. We propose a general framework for identifying latent variables in the nonparametric noisy settings. We first show that, under suitable conditions, the generative model is identifiable up to certain submanifold indeterminacies even in the presence of non-negligible noise. Furthermore, under the structural or distributional variability conditions, we prove that latent variables of the general nonlinear models are identifiable up to trivial indeterminacies. Based on the proposed theoretical framework, we have also developed corresponding estimation methods and validated them in various synthetic and real-world settings. Interestingly, our estimate of the true GDP growth from alternative measurements suggests more insightful information on the economies than official reports. We expect our framework to provide new insight into how both researchers and practitioners deal with latent variables in real-world scenarios.
中文: 本研究提出了一个在任意噪声非线性模型中识别潜在变量的通用框架,证明了特定条件下的可识别性,并通过GDP增长估计等实际应用验证了其现实价值。
English: This study introduces a general framework for identifying latent variables in nonlinear models with arbitrary noise, proving identifiability under specific conditions and demonstrating its practical value through real-world applications like GDP growth estimation.

Authors:Xinlong Zhai, Chunchen Wang, Ruijia Wang, Jiazheng Kang, Shujie Li, Boyu Chen, Tengfei Ma, Zikai Zhou, Cheng Yang, Chuan Shi
Title: Blend the Separated: Mixture of Synergistic Experts for Data-Scarcity Drug-Target Interaction Prediction
Abstract:
Drug-target interaction prediction (DTI) is essential in various applications including drug discovery and clinical application. There are two perspectives of input data widely used in DTI prediction: Intrinsic data represents how drugs or targets are constructed, and extrinsic data represents how drugs or targets are related to other biological entities. However, any of the two perspectives of input data can be scarce for some drugs or targets, especially for those unpopular or newly discovered. Furthermore, ground-truth labels for specific interaction types can also be scarce. Therefore, we propose the first method to tackle DTI prediction under input data and/or label scarcity. To make our model functional when only one perspective of input data is available, we design two separate experts to process intrinsic and extrinsic data respectively and fuse them adaptively according to different samples. Furthermore, to make the two perspectives complement each other and remedy label scarcity, two experts synergize with each other in a mutually supervised way to exploit the enormous unlabeled data. Extensive experiments on 3 real-world datasets under different extents of input data scarcity and/or label scarcity demonstrate our model outperforms states of the art significantly and steadily, with a maximum improvement of 53.53%. We also test our model without any data scarcity and it still outperforms current methods.
中文: 该方法通过设计两个分别处理内在和外在数据的专家模块,在相互监督下协同互补并利用未标记数据,有效解决了输入数据和标签稀缺情况下的药物-靶点相互作用预测问题,实验证明其性能显著优于现有方法。
English: The proposed method addresses drug-target interaction prediction under conditions of scarce input data and labels by designing two experts that process intrinsic and extrinsic data separately, synergizing through mutual supervision to complement each other and leverage unlabeled data, achieving significant performance improvements in experiments.

Authors:Arindam Dutta, Meng Zheng, Zhongpai Gao, Benjamin Planche, Anwesha Choudhuri, Terrence Chen, Amit K. Roy-Chowdhury, Ziyan Wu
Title: CHROME: Clothed Human Reconstruction with Occlusion-Resilience and Multiview-Consistency from a Single Image
Abstract:
Reconstructing clothed humans from a single image is a fundamental task in computer vision with wide-ranging applications. Although existing monocular clothed human reconstruction solutions have shown promising results, they often rely on the assumption that the human subject is in an occlusion-free environment. Thus, when encountering in-the-wild occluded images, these algorithms produce multiview inconsistent and fragmented reconstructions. Additionally, most algorithms for monocular 3D human reconstruction leverage geometric priors such as SMPL annotations for training and inference, which are extremely challenging to acquire in real-world applications. To address these limitations, we propose CHROME: Clothed Human Reconstruction with Occlusion-Resilience and Multiview-ConsistEncy from a Single Image, a novel pipeline designed to reconstruct occlusion-resilient 3D humans with multiview consistency from a single occluded image, without requiring either ground-truth geometric prior annotations or 3D supervision. Specifically, CHROME leverages a multiview diffusion model to first synthesize occlusion-free human images from the occluded input, compatible with off-the-shelf pose control to explicitly enforce cross-view consistency during synthesis. A 3D reconstruction model is then trained to predict a set of 3D Gaussians conditioned on both the occluded input and synthesized views, aligning cross-view details to produce a cohesive and accurate 3D representation. CHROME achieves significant improvements in terms of both novel view synthesis (upto 3 db PSNR) and geometric reconstruction under challenging conditions.
中文: CHROME是一种新颖的流程,无需几何先验即可从单张遮挡图像重建3D着装人体,它利用多视角扩散模型生成一致的无遮挡视图,并通过3D高斯模型实现精确重建。
English: CHROME is a novel pipeline that reconstructs 3D clothed humans from a single occluded image without needing geometric priors, using a multiview diffusion model to generate consistent occlusion-free views and 3D Gaussians for accurate reconstruction.

Authors:Hao Liang, Zhipeng Dong, Kaixin Chen, Jiyuan Guo, Yufeng Yue, Yi Yang, Mengyin Fu
Title: ChatStitch: Visualizing Through Structures via Surround-View Unsupervised Deep Image Stitching with Collaborative LLM-Agents
Abstract:
Surround-view perception has garnered significant attention for its ability to enhance the perception capabilities of autonomous driving vehicles through the exchange of information with surrounding cameras. However, existing surround-view perception systems are limited by inefficiencies in unidirectional interaction pattern with human and distortions in overlapping regions exponentially propagating into non-overlapping areas. To address these challenges, this paper introduces ChatStitch, a surround-view human-machine co-perception system capable of unveiling obscured blind spot information through natural language commands integrated with external digital assets. To dismantle the unidirectional interaction bottleneck, ChatStitch implements a cognitively grounded closed-loop interaction multi-agent framework based on Large Language Models. To suppress distortion propagation across overlapping boundaries, ChatStitch proposes SV-UDIS, a surround-view unsupervised deep image stitching method under the non-global-overlapping condition. We conducted extensive experiments on the UDIS-D, MCOV-SLAM open datasets, and our real-world dataset. Specifically, our SV-UDIS method achieves state-of-the-art performance on the UDIS-D dataset for 3, 4, and 5 image stitching tasks, with PSNR improvements of 9\%, 17\%, and 21\%, and SSIM improvements of 8\%, 18\%, and 26\%, respectively.
Chinese: 本文提出ChatStitch环视人机协同感知系统,通过基于大语言模型的闭环交互框架打破单向交互瓶颈,并创新性地提出SV-UDIS非全局重叠条件下的无监督深度图像拼接方法,在多个数据集上实现了最先进的性能表现。
English: This paper presents ChatStitch, a surround-view human-machine co-perception system that overcomes limitations in existing systems by implementing a closed-loop interaction framework using Large Language Models and introducing SV-UDIS, a novel unsupervised deep image stitching method that achieves state-of-the-art performance on multiple datasets.

Authors:Hang Li, Xiao Wang, Bevan Koopman, Guido Zuccon
Title: Pseudo Relevance Feedback is Enough to Close the Gap Between Small and Large Dense Retrieval Models
Abstract:
Scaling dense retrievers to larger large language model (LLM) backbones has been a dominant strategy for improving their retrieval effectiveness. However, this has substantial cost implications: larger backbones require more expensive hardware (e.g. GPUs with more memory) and lead to higher indexing and querying costs (latency, energy consumption). In this paper, we challenge this paradigm by introducing PromptPRF, a feature-based pseudo-relevance feedback (PRF) framework that enables small LLM-based dense retrievers to achieve effectiveness comparable to much larger models. PromptPRF uses LLMs to extract query-independent, structured and unstructured features (e.g., entities, summaries, chain-of-thought keywords, essay) from top-ranked documents. These features are generated offline and integrated into dense query representations via prompting, enabling efficient retrieval without additional training. Unlike prior methods such as GRF, which rely on online, query-specific generation and sparse retrieval, PromptPRF decouples feedback generation from query processing and supports dense retrievers in a fully zero-shot setting. Experiments on TREC DL and BEIR benchmarks demonstrate that PromptPRF consistently improves retrieval effectiveness and offers favourable cost-effectiveness trade-offs. We further present ablation studies to understand the role of positional feedback and analyse the interplay between feature extractor size, PRF depth, and model performance. Our findings demonstrate that with effective PRF design, scaling the retriever is not always necessary, narrowing the gap between small and large models while reducing inference cost.
中文: PromptPRF是一种基于特征的伪相关反馈框架,通过离线生成文档特征集成到密集检索中,使小型模型无需扩大规模即可达到与大型模型相当的检索效果。
English: PromptPRF is a pseudo-relevance feedback framework that enhances small dense retrievers by incorporating offline-generated document features, achieving effectiveness comparable to larger models without scaling up LLM backbones.

Authors:Yuhao Qiu, Shuyan Bai, Tingfa Xu, Peifu Liu, Haolin Qin, Jianan Li
Title: HSOD-BIT-V2: A New Challenging Benchmarkfor Hyperspectral Salient Object Detection
Abstract:
Salient Object Detection (SOD) is crucial in computer vision, yet RGB-based methods face limitations in challenging scenes, such as small objects and similar color features. Hyperspectral images provide a promising solution for more accurate Hyperspectral Salient Object Detection (HSOD) by abundant spectral information, while HSOD methods are hindered by the lack of extensive and available datasets. In this context, we introduce HSOD-BIT-V2, the largest and most challenging HSOD benchmark dataset to date. Five distinct challenges focusing on small objects and foreground-background similarity are designed to emphasize spectral advantages and real-world complexity. To tackle these challenges, we propose Hyper-HRNet, a high-resolution HSOD network. Hyper-HRNet effectively extracts, integrates, and preserves effective spectral information while reducing dimensionality by capturing the self-similar spectral features. Additionally, it conveys fine details and precisely locates object contours by incorporating comprehensive global information and detailed object saliency representations. Experimental analysis demonstrates that Hyper-HRNet outperforms existing models, especially in challenging scenarios.
Chinese: 高光谱显著目标检测(HSOD)通过引入迄今最大的基准数据集HSOD-BIT-V2和Hyper-HRNet高分辨率网络得以提升,该网络有效利用光谱信息,在小目标和前景背景相似等复杂场景中显著提高了检测精度。
English: Hyperspectral Salient Object Detection (HSOD) is enhanced by the introduction of HSOD-BIT-V2, the largest benchmark dataset, and Hyper-HRNet, a high-resolution network that effectively utilizes spectral information to improve detection accuracy, particularly in challenging scenarios involving small objects and similar foreground-background features.

Authors:Ruibo Wang, Mustafa A. Kishk, Mohamed-Slim Alouini
Title: Modeling and Analysis of Non-Terrestrial Networks by Spherical Stochastic Geometry
Abstract:
Non-terrestrial networks (NTNs) are anticipated to be indispensable in extending coverage and enabling global communication access in next-generation wireless networks. With the extensive deployment of non-terrestrial platforms, evaluating the performance of NTN-enabled communication systems becomes a challenging task. Spherical stochastic geometry (SG) is a recently proposed analytical framework that has garnered increasing attention. Due to its suitability for modeling large-scale dynamic topologies and its ability to provide an analytical framework for interference analysis and low-complexity performance evaluation, spherical SG has been widely applied in NTN performance analysis. This paper surveys the modeling and analysis of NTN networks based on spherical SG. We begin by introducing the spherical SG framework, detailing its history and development. Next, we categorize existing spherical SG models into three types based on orbital modeling methods and provide algorithm implementations for common models. Furthermore, we investigate the accuracy and necessity of spherical modeling through case studies. On the topology level, concepts such as association strategy, central angle, zenith angle, contact angle, and availability probability are introduced, with simple derivations provided. On the channel level, we detail the modeling of large-scale fading, small-scale fading, and beam gain for different channel links. Finally, we discuss several advanced topics that have not been fully explored but have strong motivation and research potential, and we predict future research directions.
中文: 球面随机几何为建模大规模动态拓扑和评估非地面网络性能提供了分析框架,本综述详细介绍了其应用、模型分类及未来研究方向。
English: Spherical stochastic geometry provides an analytical framework for modeling large-scale dynamic topologies and evaluating the performance of non-terrestrial networks, with this survey detailing its applications, model classifications, and future research directions.

Authors:Dazhou Guo, Zhanghexuan Ji, Yanzhou Su, Dandan Zheng, Heng Guo, Puyang Wang, Ke Yan, Yirui Wang, Qinji Yu, Zi Li, Minfeng Xu, Jianfeng Zhang, Haoshen Li, Jia Ge, Tsung-Ying Ho, Bing-Shen Huang, Tashan Ai, Kuaile Zhao, Na Shen, Qifeng Wang, Yun Bian, Tingyu Wu, Peng Du, Hua Zhang, Feng-Ming Kong, Alan L. Yuille, Cher Heng Tan, Chunyan Miao, Perry J. Pickhardt, Senxiang Yan, Ronald M. Summers, Le Lu, Dakai Jin, Xianghua Ye
Title: A Continual Learning-driven Model for Accurate and Generalizable Segmentation of Clinically Comprehensive and Fine-grained Whole-body Anatomies in CT
Abstract:
Precision medicine in the quantitative management of chronic diseases and oncology would be greatly improved if the Computed Tomography (CT) scan of any patient could be segmented, parsed and analyzed in a precise and detailed way. However, there is no such fully annotated CT dataset with all anatomies delineated for training because of the exceptionally high manual cost, the need for specialized clinical expertise, and the time required to finish the task. To this end, we proposed a novel continual learning-driven CT model that can segment complete anatomies presented using dozens of previously partially labeled datasets, dynamically expanding its capacity to segment new ones without compromising previously learned organ knowledge. Existing multi-dataset approaches are not able to dynamically segment new anatomies without catastrophic forgetting and would encounter optimization difficulty or infeasibility when segmenting hundreds of anatomies across the whole range of body regions. Our single unified CT segmentation model, CL-Net, can highly accurately segment a clinically comprehensive set of 235 fine-grained whole-body anatomies. Composed of a universal encoder, multiple optimized and pruned decoders, CL-Net is developed using 13,952 CT scans from 20 public and 16 private high-quality partially labeled CT datasets of various vendors, different contrast phases, and pathologies. Extensive evaluation demonstrates that CL-Net consistently outperforms the upper limit of an ensemble of 36 specialist nnUNets trained per dataset with the complexity of 5% model size and significantly surpasses the segmentation accuracy of recent leading Segment Anything-style medical image foundation models by large margins. Our continual learning-driven CL-Net model would lay a solid foundation to facilitate many downstream tasks of oncology and chronic diseases using the most widely adopted CT imaging.
中文:提出的CL-Net模型采用持续学习技术,能够从多个部分标注的CT数据集中精确分割235个全身解剖结构,克服了现有方法的局限性,并以更高效率显著超越了专业模型的性能。
English: The proposed CL-Net model uses continual learning to accurately segment 235 whole-body anatomies from multiple partially labeled CT datasets, overcoming limitations of existing methods and significantly outperforming specialized models with greater efficiency.

Authors:Dipesh Tamboli, Souradip Chakraborty, Aditya Malusare, Biplab Banerjee, Amrit Singh Bedi, Vaneet Aggarwal
Title: BalancedDPO: Adaptive Multi-Metric Alignment
Abstract:
Text-to-image (T2I) diffusion models have made remarkable advancements, yet aligning them with diverse preferences remains a persistent challenge. Current methods often optimize single metrics or depend on narrowly curated datasets, leading to overfitting and limited generalization across key visual quality metrics. We present BalancedDPO, a novel extension of Direct Preference Optimization (DPO) that addresses these limitations by simultaneously aligning T2I diffusion models with multiple metrics, including human preference, CLIP score, and aesthetic quality. Our key novelty lies in aggregating consensus labels from diverse metrics in the preference distribution space as compared to existing reward mixing approaches, enabling robust and scalable multi-metric alignment while maintaining the simplicity of the standard DPO pipeline that we refer to as BalancedDPO. Our evaluations on the Pick-a-Pic, PartiPrompt and HPD datasets show that BalancedDPO achieves state-of-the-art results, outperforming existing approaches across all major metrics. BalancedDPO improves the average win rates by 15%, 7.1%, and 10.3% on Pick-a-pic, PartiPrompt and HPD, respectively, from the DiffusionDPO.
Chinese: BalancedDPO是直接偏好优化的创新扩展,通过同时对齐文本到图像扩散模型的多个指标,在关键数据集上实现了最优性能,相比现有方法显著提高了胜率。
English: BalancedDPO is a novel extension of Direct Preference Optimization that aligns text-to-image diffusion models with multiple metrics simultaneously, achieving state-of-the-art performance across key datasets by improving win rates significantly over existing methods.

Authors:Li Zheng, Hao Fei, Ting Dai, Zuquan Peng, Fei Li, Huisheng Ma, Chong Teng, Donghong Ji
Title: Multi-Granular Multimodal Clue Fusion for Meme Understanding
Abstract:
With the continuous emergence of various social media platforms frequently used in daily life, the multimodal meme understanding (MMU) task has been garnering increasing attention. MMU aims to explore and comprehend the meanings of memes from various perspectives by performing tasks such as metaphor recognition, sentiment analysis, intention detection, and offensiveness detection. Despite making progress, limitations persist due to the loss of fine-grained metaphorical visual clue and the neglect of multimodal text-image weak correlation. To overcome these limitations, we propose a multi-granular multimodal clue fusion model (MGMCF) to advance MMU. Firstly, we design an object-level semantic mining module to extract object-level image feature clues, achieving fine-grained feature clue extraction and enhancing the model's ability to capture metaphorical details and semantics. Secondly, we propose a brand-new global-local cross-modal interaction model to address the weak correlation between text and images. This model facilitates effective interaction between global multimodal contextual clues and local unimodal feature clues, strengthening their representations through a bidirectional cross-modal attention mechanism. Finally, we devise a dual-semantic guided training strategy to enhance the model's understanding and alignment of multimodal representations in the semantic space. Experiments conducted on the widely-used MET-MEME bilingual dataset demonstrate significant improvements over state-of-the-art baselines. Specifically, there is an 8.14% increase in precision for offensiveness detection task, and respective accuracy enhancements of 3.53%, 3.89%, and 3.52% for metaphor recognition, sentiment analysis, and intention detection tasks. These results, underpinned by in-depth analyses, underscore the effectiveness and potential of our approach for advancing MMU.
Chinese: 提出的多粒度多模态线索融合模型(MGMCF)通过提取细粒度视觉特征和增强图文关联性,解决了多模态表情包理解中的现有局限,在MET-MEME数据集上的多项任务中实现了显著性能提升。
English: The proposed multi-granular multimodal clue fusion model (MGMCF) addresses limitations in multimodal meme understanding by extracting fine-grained visual features and enhancing text-image correlation, achieving significant performance improvements across multiple tasks on the MET-MEME dataset.

Authors:Wei Lai, Tianyu Ding, ren dongdong, Lei Wang, Jing Huo, Yang Gao, Wenbin Li
Title: Robust Dataset Distillation by Matching Adversarial Trajectories
Abstract:
Dataset distillation synthesizes compact datasets that enable models to achieve performance comparable to training on the original large-scale datasets. However, existing distillation methods overlook the robustness of the model, resulting in models that are vulnerable to adversarial attacks when trained on distilled data. To address this limitation, we introduce the task of ``robust dataset distillation", a novel paradigm that embeds adversarial robustness into the synthetic datasets during the distillation process. We propose Matching Adversarial Trajectories (MAT), a method that integrates adversarial training into trajectory-based dataset distillation. MAT incorporates adversarial samples during trajectory generation to obtain robust training trajectories, which are then used to guide the distillation process. As experimentally demonstrated, even through natural training on our distilled dataset, models can achieve enhanced adversarial robustness while maintaining competitive accuracy compared to existing distillation methods. Our work highlights robust dataset distillation as a new and important research direction and provides a strong baseline for future research to bridge the gap between efficient training and adversarial robustness.
中文摘要:本文提出鲁棒数据集蒸馏的新范式,通过MAT方法将对抗训练融入蒸馏过程,使模型在保持高精度的同时显著提升对抗鲁棒性,为高效训练与安全性的统一开辟了新方向。
English Summary: This paper introduces robust dataset distillation, a novel approach that embeds adversarial robustness into synthetic datasets through the proposed MAT method, enabling models trained on distilled data to achieve both high accuracy and enhanced resilience against adversarial attacks.

Authors:Chao Zhou, Changsheng You, Beixiong Zheng, Xiaodan Shao, Rui Zhang
Title: Rotatable Antennas for Integrated Sensing and Communications
Abstract:
In this letter, we propose to deploy rotatable antennas (RAs) at the base station (BS) to enhance both communication and sensing (C&S) performances, by exploiting a new spatial degree-of-freedom (DoF) offered by array rotation. Specifically, we formulate a multi-objective optimization problem to simultaneously maximize the sum-rate of multiple communication users and minimize the Cramér-Rao bound (CRB) for target angle estimation, by jointly optimizing the transmit beamforming vectors and the array rotation angle at the BS. To solve this problem, we first equivalently decompose it into two subproblems, corresponding to an inner problem for beamforming optimization and an outer problem for array rotation optimization. Although these two subproblems are non-convex, we obtain their high-quality solutions by applying the block coordinate descent (BCD) technique and one-dimensional exhaustive search, respectively. Moreover, we show that for the communication-only case, RAs provide an additional rotation gain to improve communication performance; while for the sensing-only case, the equivalent spatial aperture can be enlarged by RAs for achieving higher sensing accuracy. Finally, numerical results are presented to showcase the performance gains of RAs over fixed-rotation antennas in integrated sensing and communications (ISAC).
中文: 本文提出在基站部署可旋转天线,通过联合优化波束成形和阵列旋转角度来提升通信与感知性能,数值结果表明其相比固定天线具有显著优势。
English: This letter proposes using rotatable antennas at base stations to enhance communication and sensing by optimizing beamforming and rotation angles, achieving performance gains over fixed antennas through multi-objective optimization and numerical validation.

Authors:Jiaqi Sun, Yujia Zheng, Xinshuai Dong, Haoyue Dai, Kun Zhang
Title: Type Information-Assisted Self-Supervised Knowledge Graph Denoising
Abstract:
Knowledge graphs serve as critical resources supporting intelligent systems, but they can be noisy due to imperfect automatic generation processes. Existing approaches to noise detection often rely on external facts, logical rule constraints, or structural embeddings. These methods are often challenged by imperfect entity alignment, flexible knowledge graph construction, and overfitting on structures. In this paper, we propose to exploit the consistency between entity and relation type information for noise detection, resulting a novel self-supervised knowledge graph denoising method that avoids those problems. We formalize type inconsistency noise as triples that deviate from the majority with respect to type-dependent reasoning along the topological structure. Specifically, we first extract a compact representation of a given knowledge graph via an encoder that models the type dependencies of triples. Then, the decoder reconstructs the original input knowledge graph based on the compact representation. It is worth noting that, our proposal has the potential to address the problems of knowledge graph compression and completion, although this is not our focus. For the specific task of noise detection, the discrepancy between the reconstruction results and the input knowledge graph provides an opportunity for denoising, which is facilitated by the type consistency embedded in our method. Experimental validation demonstrates the effectiveness of our approach in detecting potential noise in real-world data.
知识图谱常含有噪声,我们提出的新型自监督方法通过利用实体与关系间的类型一致性来检测错误,有效避免了实体对齐不完善和结构过拟合等问题。
Knowledge graphs are often noisy, and our new self-supervised method detects errors by leveraging type consistency between entities and relations, avoiding issues like imperfect alignment and structural overfitting.

Authors:Alberto Caron, Vasilios Mavroudis, Chris Hicks
Title: Towards Causal Model-Based Policy Optimization
Abstract:
Real-world decision-making problems are often marked by complex, uncertain dynamics that can shift or break under changing conditions. Traditional Model-Based Reinforcement Learning (MBRL) approaches learn predictive models of environment dynamics from queried trajectories and then use these models to simulate rollouts for policy optimization. However, such methods do not account for the underlying causal mechanisms that govern the environment, and thus inadvertently capture spurious correlations, making them sensitive to distributional shifts and limiting their ability to generalize. The same naturally holds for model-free approaches. In this work, we introduce Causal Model-Based Policy Optimization (C-MBPO), a novel framework that integrates causal learning into the MBRL pipeline to achieve more robust, explainable, and generalizable policy learning algorithms. Our approach centers on first inferring a Causal Markov Decision Process (C-MDP) by learning a local Structural Causal Model (SCM) of both the state and reward transition dynamics from trajectories gathered online. C-MDPs differ from classic MDPs in that we can decompose causal dependencies in the environment dynamics via specifying an associated Causal Bayesian Network. C-MDPs allow for targeted interventions and counterfactual reasoning, enabling the agent to distinguish between mere statistical correlations and causal relationships. The learned SCM is then used to simulate counterfactual on-policy transitions and rewards under hypothetical actions (or ``interventions"), thereby guiding policy optimization more effectively. The resulting policy learned by C-MBPO can be shown to be robust to a class of distributional shifts that affect spurious, non-causal relationships in the dynamics. We demonstrate this through some simple experiments involving near and far OOD dynamics drifts.
中文摘要:C-MBPO是一个将因果学习融入基于模型的强化学习的新框架,通过从轨迹中推断因果马尔可夫决策过程,利用反事实推理和干预区分因果关系与伪相关,从而实现更鲁棒的政策优化。
English Summary: C-MBPO is a novel framework that integrates causal learning into Model-Based Reinforcement Learning by inferring a Causal Markov Decision Process, enabling robust policy optimization through counterfactual reasoning and interventions to distinguish causal relationships from spurious correlations.

Authors:Zhe Xu, Daoyuan Chen, Zhenqing Ling, Yaliang Li, Ying Shen
Title: MindGYM: What Matters in Question Synthesis for Thinking-Centric Fine-Tuning?
Abstract:
Large foundation models face challenges in acquiring transferable, structured thinking abilities, especially when supervised with rigid templates or crowd-annotated instruction datasets. Unlike prior approaches, we focus on a thinking-centric data synthesis paradigm that enables models to evolve through self-generated, cognitively guided data. We propose MindGYM, a structured and scalable framework for question synthesis, composed of: (1) Cognitive Thinking Process Injection, which infuses high-level reasoning objectives to shape the model's synthesis behavior; (2) Seed Single-Hop Question Synthesis, generating atomic questions from diverse semantic types to encourage broader thinking; and (3) Challenging Multi-Hop QA Synthesis, composing more complex multi-hop questions based on QA seeds for deeper reasoning. Detailed analysis shows that synthetic data generated by our method achieves 16.7% higher average quality and 67.91% lower quality variance compared to baseline sources, highlighting that both high-quality and self-contained data are essential for effective, thinking-oriented fine-tuning. MindGYM improves performance on six reasoning benchmarks, achieving gains of up to 16% on MathVision using only 400 data samples, and generalizable improvements across different model sizes and architectures. MindGYM underscores the viability of self-challenging mechanisms in refining large model capabilities while minimizing human intervention and resource demands. Code and data are released to promote data-centric research into self-evolving foundation models driven by their internal reasoning capabilities.
中文: MindGYM提出了一种以思维为中心的数据合成框架,通过自生成、认知引导的数据增强大型基础模型的推理能力,仅用少量数据和资源即在多项推理基准上取得显著性能提升。
English: MindGYM introduces a thinking-centric data synthesis framework that enhances large foundation models' reasoning abilities through self-generated, cognitively guided data, achieving significant performance gains on reasoning benchmarks with minimal data and resources.

Authors:Jordan Vice, Naveed Akhtar, Richard Hartley, Ajmal Mian
Title: Exploring Bias in over 100 Text-to-Image Generative Models
Abstract:
We investigate bias trends in text-to-image generative models over time, focusing on the increasing availability of models through open platforms like Hugging Face. While these platforms democratize AI, they also facilitate the spread of inherently biased models, often shaped by task-specific fine-tuning. Ensuring ethical and transparent AI deployment requires robust evaluation frameworks and quantifiable bias metrics. To this end, we assess bias across three key dimensions: (i) distribution bias, (ii) generative hallucination, and (iii) generative miss-rate. Analyzing over 100 models, we reveal how bias patterns evolve over time and across generative tasks. Our findings indicate that artistic and style-transferred models exhibit significant bias, whereas foundation models, benefiting from broader training distributions, are becoming progressively less biased. By identifying these systemic trends, we contribute a large-scale evaluation corpus to inform bias research and mitigation strategies, fostering more responsible AI development. Keywords: Bias, Ethical AI, Text-to-Image, Generative Models, Open-Source Models
中文: 本研究分析了文本到图像生成模型的偏见演变趋势,发现艺术和风格转换模型存在显著偏见,而基础模型偏见逐渐减少,并提出了大规模评估语料库以促进伦理人工智能发展。
English: This study examines the evolving bias in text-to-image generative models, revealing that artistic and style-transferred models show significant bias while foundation models are becoming less biased, and proposes a large-scale evaluation corpus to support ethical AI development.

Authors:Zhongpai Gao, Benjamin Planche, Meng Zheng, Anwesa Choudhuri, Terrence Chen, Ziyan Wu
Title: 7DGS: Unified Spatial-Temporal-Angular Gaussian Splatting
Abstract:
Real-time rendering of dynamic scenes with view-dependent effects remains a fundamental challenge in computer graphics. While recent advances in Gaussian Splatting have shown promising results separately handling dynamic scenes (4DGS) and view-dependent effects (6DGS), no existing method unifies these capabilities while maintaining real-time performance. We present 7D Gaussian Splatting (7DGS), a unified framework representing scene elements as seven-dimensional Gaussians spanning position (3D), time (1D), and viewing direction (3D). Our key contribution is an efficient conditional slicing mechanism that transforms 7D Gaussians into view- and time-conditioned 3D Gaussians, maintaining compatibility with existing 3D Gaussian Splatting pipelines while enabling joint optimization. Experiments demonstrate that 7DGS outperforms prior methods by up to 7.36 dB in PSNR while achieving real-time rendering (401 FPS) on challenging dynamic scenes with complex view-dependent effects. The project page is: https://gaozhongpai.github.io/7dgs/.
Chinese: 7D高斯泼溅(7DGS)提出了一个统一框架,通过七维高斯表示和高效条件切片机制,在保持实时渲染(401 FPS)的同时,对具有复杂视角相关效果的动态场景实现了比现有方法最高提升7.36 dB的PSNR性能。
English: 7D Gaussian Splatting (7DGS) introduces a unified framework that efficiently renders dynamic scenes with view-dependent effects in real-time, outperforming previous methods by up to 7.36 dB in PSNR while achieving 401 FPS.

Authors:Qinji Yu, Yirui Wang, Ke Yan, Dandan Zheng, Dashan Ai, Dazhou Guo, Zhanghexuan Ji, Yanzhou Su, Yun Bian, Na Shen, Xiaowei Ding, Le Lu, Xianghua Ye, Dakai Jin
Title: From Slices to Sequences: Autoregressive Tracking Transformer for Cohesive and Consistent 3D Lymph Node Detection in CT Scans
Abstract:
Lymph node (LN) assessment is an essential task in the routine radiology workflow, providing valuable insights for cancer staging, treatment planning and beyond. Identifying scatteredly-distributed and low-contrast LNs in 3D CT scans is highly challenging, even for experienced clinicians. Previous lesion and LN detection methods demonstrate effectiveness of 2.5D approaches (i.e, using 2D network with multi-slice inputs), leveraging pretrained 2D model weights and showing improved accuracy as compared to separate 2D or 3D detectors. However, slice-based 2.5D detectors do not explicitly model inter-slice consistency for LN as a 3D object, requiring heuristic post-merging steps to generate final 3D LN instances, which can involve tuning a set of parameters for each dataset. In this work, we formulate 3D LN detection as a tracking task and propose LN-Tracker, a novel LN tracking transformer, for joint end-to-end detection and 3D instance association. Built upon DETR-based detector, LN-Tracker decouples transformer decoder's query into the track and detection groups, where the track query autoregressively follows previously tracked LN instances along the z-axis of a CT scan. We design a new transformer decoder with masked attention module to align track query's content to the context of current slice, meanwhile preserving detection query's high accuracy in current slice. An inter-slice similarity loss is introduced to encourage cohesive LN association between slices. Extensive evaluation on four lymph node datasets shows LN-Tracker's superior performance, with at least 2.7% gain in average sensitivity when compared to other top 3D/2.5D detectors. Further validation on public lung nodule and prostate tumor detection tasks confirms the generalizability of LN-Tracker as it achieves top performance on both tasks.
中文摘要:LN-Tracker是一种创新的基于Transformer的方法,将三维淋巴结检测构建为跟踪任务,通过端到端的联合检测与三维实例关联,在多个医学影像数据集上实现了优越性能。
English Summary: LN-Tracker is a novel transformer-based method that frames 3D lymph node detection as a tracking task, enabling joint end-to-end detection and 3D instance association with superior performance across multiple medical imaging datasets.

Authors:Meng Zheng, Jiajin Zhang, Benjamin Planche, Zhongpai Gao, Terrence Chen, Ziyan Wu
Title: Anatomy-Aware Conditional Image-Text Retrieval
Abstract:
Image-Text Retrieval (ITR) finds broad applications in healthcare, aiding clinicians and radiologists by automatically retrieving relevant patient cases in the database given the query image and/or report, for more efficient clinical diagnosis and treatment, especially for rare diseases. However conventional ITR systems typically only rely on global image or text representations for measuring patient image/report similarities, which overlook local distinctiveness across patient cases. This often results in suboptimal retrieval performance. In this paper, we propose an Anatomical Location-Conditioned Image-Text Retrieval (ALC-ITR) framework, which, given a query image and the associated suspicious anatomical region(s), aims to retrieve similar patient cases exhibiting the same disease or symptoms in the same anatomical region. To perform location-conditioned multimodal retrieval, we learn a medical Relevance-Region-Aligned Vision Language (RRA-VL) model with semantic global-level and region-/word-level alignment to produce generalizable, well-aligned multi-modal representations. Additionally, we perform location-conditioned contrastive learning to further utilize cross-pair region-level contrastiveness for improved multi-modal retrieval. We show that our proposed RRA-VL achieves state-of-the-art localization performance in phase-grounding tasks, and satisfying multi-modal retrieval performance with or without location conditioning. Finally, we thoroughly investigate the generalizability and explainability of our proposed ALC-ITR system in providing explanations and preliminary diagnosis reports given retrieved patient cases (conditioned on anatomical regions), with proper off-the-shelf LLM prompts.
中文: 本文提出了一种基于解剖位置条件的图文检索框架,通过全局与区域视觉-文本特征对齐来改进医疗病例检索,在定位和多模态检索任务中均实现了优异性能。
English: This paper introduces an Anatomical Location-Conditioned Image-Text Retrieval framework that enhances medical case retrieval by aligning global and regional visual-textual features, achieving superior localization and multimodal retrieval performance.

Authors:Haosen Zhang, Jiahao Huang, Yinzhe Wu, Congren Dai, Fanwen Wang, Zhenxuan Zhang, Guang Yang
Title: Lightweight Hypercomplex MRI Reconstruction: A Generalized Kronecker-Parameterized Approach
Abstract:
Magnetic Resonance Imaging (MRI) is crucial for clinical diagnostics but is hindered by prolonged scan times. Current deep learning models enhance MRI reconstruction but are often memory-intensive and unsuitable for resource-limited systems. This paper introduces a lightweight MRI reconstruction model leveraging Kronecker-Parameterized Hypercomplex Neural Networks to achieve high performance with reduced parameters. By integrating Kronecker-based modules, including Kronecker MLP, Kronecker Window Attention, and Kronecker Convolution, the proposed model efficiently extracts spatial features while preserving representational power. We introduce Kronecker U-Net and Kronecker SwinMR, which maintain high reconstruction quality with approximately 50% fewer parameters compared to existing models. Experimental evaluation on the FastMRI dataset demonstrates competitive PSNR, SSIM, and LPIPS metrics, even at high acceleration factors (8x and 16x), with no significant performance drop. Additionally, Kronecker variants exhibit superior generalization and reduced overfitting on limited datasets, facilitating efficient MRI reconstruction on hardware-constrained systems. This approach sets a new benchmark for parameter-efficient medical imaging models.
中文: 本文提出了一种基于克罗内克参数化超复数神经网络的轻量级MRI重建模型,在参数减少约50%的情况下仍保持高质量重建性能,并在硬件受限系统中展现出卓越的泛化能力。
English: This paper introduces a lightweight MRI reconstruction model using Kronecker-Parameterized Hypercomplex Neural Networks, achieving high performance with 50% fewer parameters while maintaining competitive reconstruction quality and superior generalization on hardware-constrained systems.

Authors:Yue Gao, Hong-Xing Yu, Bo Zhu, Jiajun Wu
Title: FluidNexus: 3D Fluid Reconstruction and Prediction from a Single Video
Abstract:
We study reconstructing and predicting 3D fluid appearance and velocity from a single video. Current methods require multi-view videos for fluid reconstruction. We present FluidNexus, a novel framework that bridges video generation and physics simulation to tackle this task. Our key insight is to synthesize multiple novel-view videos as references for reconstruction. FluidNexus consists of two key components: (1) a novel-view video synthesizer that combines frame-wise view synthesis with video diffusion refinement for generating realistic videos, and (2) a physics-integrated particle representation coupling differentiable simulation and rendering to simultaneously facilitate 3D fluid reconstruction and prediction. To evaluate our approach, we collect two new real-world fluid datasets featuring textured backgrounds and object interactions. Our method enables dynamic novel view synthesis, future prediction, and interaction simulation from a single fluid video. Project website: https://yuegao.me/FluidNexus.
中文:FluidNexus提出了一种创新框架,通过从单段视频合成多视角参考视频并结合物理模拟,实现了三维流体重建与预测,并在新收集的真实场景数据集中得到验证。
English: FluidNexus introduces a novel framework that synthesizes multiple novel-view videos from a single input video and integrates physics simulation to enable 3D fluid reconstruction and prediction, validated on newly collected real-world datasets.

Authors:Xihan Wang, Dianyi Yang, Yu Gao, Yufeng Yue, Yi Yang, Mengyin Fu
Title: GaussianGraph: 3D Gaussian-based Scene Graph Generation for Open-world Scene Understanding
Abstract:
Recent advancements in 3D Gaussian Splatting(3DGS) have significantly improved semantic scene understanding, enabling natural language queries to localize objects within a scene. However, existing methods primarily focus on embedding compressed CLIP features to 3D Gaussians, suffering from low object segmentation accuracy and lack spatial reasoning capabilities. To address these limitations, we propose GaussianGraph, a novel framework that enhances 3DGS-based scene understanding by integrating adaptive semantic clustering and scene graph generation. We introduce a "Control-Follow" clustering strategy, which dynamically adapts to scene scale and feature distribution, avoiding feature compression and significantly improving segmentation accuracy. Additionally, we enrich scene representation by integrating object attributes and spatial relations extracted from 2D foundation models. To address inaccuracies in spatial relationships, we propose 3D correction modules that filter implausible relations through spatial consistency verification, ensuring reliable scene graph construction. Extensive experiments on three datasets demonstrate that GaussianGraph outperforms state-of-the-art methods in both semantic segmentation and object grounding tasks, providing a robust solution for complex scene understanding and interaction.
中文: GaussianGraph通过融合自适应语义聚类与场景图生成技术,结合三维校正模块,显著提升了3D高斯溅射在语义分割和空间推理上的性能,为复杂场景理解提供了可靠解决方案。
English: GaussianGraph enhances 3D Gaussian Splatting by integrating adaptive semantic clustering and scene graph generation with 3D correction, significantly improving segmentation accuracy and spatial reasoning for robust scene understanding.

Authors:Elizabeth Bates, Chris Hicks, Vasilios Mavroudis
Title: Less is more? Rewards in RL for Cyber Defence
Abstract:
The last few years have seen an explosion of interest in autonomous cyber defence agents based on deep reinforcement learning. Such agents are typically trained in a cyber gym environment, also known as a cyber simulator, at least 32 of which have already been built. Most, if not all cyber gyms provide dense "scaffolded" reward functions which combine many penalties or incentives for a range of (un)desirable states and costly actions. Whilst dense rewards help alleviate the challenge of exploring complex environments, yielding seemingly effective strategies from relatively few environment steps; they are also known to bias the solutions an agent can find, potentially towards suboptimal solutions. This is especially a problem in complex cyber environments where policy weaknesses may not be noticed until exploited by an adversary. In this work we set out to evaluate whether sparse reward functions might enable training more effective cyber defence agents. Towards this goal we first break down several evaluation limitations in existing work by proposing a ground truth evaluation score that goes beyond the standard RL paradigm used to train and evaluate agents. By adapting a well-established cyber gym to accommodate our methodology and ground truth score, we propose and evaluate two sparse reward mechanisms and compare them with a typical dense reward. Our evaluation considers a range of network sizes, from 2 to 50 nodes, and both reactive and proactive defensive actions. Our results show that sparse rewards, particularly positive reinforcement for an uncompromised network state, enable the training of more effective cyber defence agents. Furthermore, we show that sparse rewards provide more stable training than dense rewards, and that both effectiveness and training stability are robust to a variety of cyber environment considerations.
中文: 最新研究表明,相较于传统的密集奖励函数,稀疏奖励机制(特别是对未受攻击网络状态的正面强化)能够在不同网络规模和攻防策略下训练出更有效且更稳定的网络防御智能体。
English: Recent research demonstrates that sparse reward functions, especially positive reinforcement for maintaining an uncompromised network, enable training more effective and stable cyber defense agents across various network sizes and defensive actions compared to traditional dense rewards.

Authors:Abdul Basit, Nouhaila Innan, Muhammad Haider Asif, Minghao Shao, Muhammad Kashif, Alberto Marchisio, Muhammad Shafique
Title: PennyLang: Pioneering LLM-Based Quantum Code Generation with a Novel PennyLane-Centric Dataset
Abstract:
Large Language Models (LLMs) offer powerful capabilities in code generation, natural language understanding, and domain-specific reasoning. Their application to quantum software development remains limited, in part because of the lack of high-quality datasets both for LLM training and as dependable knowledge sources. To bridge this gap, we introduce PennyLang, an off-the-shelf, high-quality dataset of 3,347 PennyLane-specific quantum code samples with contextual descriptions, curated from textbooks, official documentation, and open-source repositories. Our contributions are threefold: (1) the creation and open-source release of PennyLang, a purpose-built dataset for quantum programming with PennyLane; (2) a framework for automated quantum code dataset construction that systematizes curation, annotation, and formatting to maximize downstream LLM usability; and (3) a baseline evaluation of the dataset across multiple open-source models, including ablation studies, all conducted within a retrieval-augmented generation (RAG) pipeline. Using PennyLang with RAG substantially improves performance: for example, Qwen 7B's success rate rises from 8.7% without retrieval to 41.7% with full-context augmentation, and LLaMa 4 improves from 78.8% to 84.8%, while also reducing hallucinations and enhancing quantum code correctness. Moving beyond Qiskit-focused studies, we bring LLM-based tools and reproducible methods to PennyLane for advancing AI-assisted quantum development.
中文: PennyLang数据集填补了量子软件开发中的空白,提供了3,347个高质量的PennyLane代码样本,通过检索增强生成显著提升了大语言模型的性能并减少了错误。
English: The PennyLang dataset bridges the gap in quantum software development by providing 3,347 high-quality PennyLane code samples, significantly enhancing LLM performance through retrieval-augmented generation and reducing errors.

Authors:Dianyi Yang, Yu Gao, Xihan Wang, Yufeng Yue, Yi Yang, Mengyin Fu
Title: OpenGS-SLAM: Open-Set Dense Semantic SLAM with 3D Gaussian Splatting for Object-Level Scene Understanding
Abstract:
Recent advancements in 3D Gaussian Splatting have significantly improved the efficiency and quality of dense semantic SLAM. However, previous methods are generally constrained by limited-category pre-trained classifiers and implicit semantic representation, which hinder their performance in open-set scenarios and restrict 3D object-level scene understanding. To address these issues, we propose OpenGS-SLAM, an innovative framework that utilizes 3D Gaussian representation to perform dense semantic SLAM in open-set environments. Our system integrates explicit semantic labels derived from 2D foundational models into the 3D Gaussian framework, facilitating robust 3D object-level scene understanding. We introduce Gaussian Voting Splatting to enable fast 2D label map rendering and scene updating. Additionally, we propose a Confidence-based 2D Label Consensus method to ensure consistent labeling across multiple views. Furthermore, we employ a Segmentation Counter Pruning strategy to improve the accuracy of semantic scene representation. Extensive experiments on both synthetic and real-world datasets demonstrate the effectiveness of our method in scene understanding, tracking, and mapping, achieving 10 times faster semantic rendering and 2 times lower storage costs compared to existing methods. Project page: https://young-bit.github.io/opengs-github.github.io/.
中文摘要:OpenGS-SLAM提出了一种基于3D高斯表示的开集语义SLAM框架,通过集成2D基础模型的显式语义标签,实现了比现有方法快10倍的语义渲染速度和降低50%的存储成本。
English Summary: OpenGS-SLAM introduces an open-set semantic SLAM framework using 3D Gaussian representation with explicit semantic labels from 2D foundational models, achieving faster semantic rendering and lower storage costs than existing methods.

Authors:Yujie Qin, Mustafa A. Kishk, Mohamed-Slim Alouini
Title: Velocity-Aware Statistical Analysis of Peak AoI for Ground and Aerial Users
Abstract:
In this paper, we present a framework to analyze the impact of user velocity on the distribution of the peak age-of-information (PAoI) for both ground and aerial users by using the dominant interferer-based approximation. We first approximate the SINR meta distribution for the uplink transmission using the distances between the serving base station (BS) and each of the user of interest and the dominant interfering user, which is the interferer that provides the strongest average received power at the tagged BS. We then analyze the spatio-temporal correlation coefficient of the conditional success probability by studying the correlation between the aforementioned two distances. Finally, we choose PAoI as a performance metric to showcase how spatio-temporal correlation or user velocity affect system performance. Our results reveal that ground users exhibit higher spatio-temporal correlations compared to aerial users, resulting in a more pronounced impact of velocity on system performance, such as joint probability of the conditional success probability and distribution of PAoI. Furthermore, our work demonstrates that the dominant interferer-based approximation for the SINR meta distribution delivers good matching performance in complex scenarios, such as Nakagami-m fading model, and it can also be effectively utilized in computing spatio-temporal correlation, as this approximation is derived from the distances to the serving BS and the dominant interferer.
中文: 本文通过主导干扰源近似框架分析用户速度对峰值信息时效分布的影响,结果表明地面用户比空中用户具有更强的时空相关性,其系统性能受速度影响更为显著。
English: This paper introduces a framework using dominant interferer-based approximation to analyze how user velocity affects the peak age-of-information distribution, revealing that ground users experience stronger spatio-temporal correlations and greater velocity impact than aerial users.

Authors:Yinqian Sun, Feifei Zhao, Mingyang Lv, Yi Zeng
Title: Spiking World Model with Multi-Compartment Neurons for Model-based Reinforcement Learning
Abstract:
Brain-inspired spiking neural networks (SNNs) have garnered significant research attention in algorithm design and perception applications. However, their potential in the decision-making domain, particularly in model-based reinforcement learning, remains underexplored. The difficulty lies in the need for spiking neurons with long-term temporal memory capabilities, as well as network optimization that can integrate and learn information for accurate predictions. The dynamic dendritic information integration mechanism of biological neurons brings us valuable insights for addressing these challenges. In this study, we propose a multi-compartment neuron model capable of nonlinearly integrating information from multiple dendritic sources to dynamically process long sequential inputs. Based on this model, we construct a Spiking World Model (Spiking-WM), to enable model-based deep reinforcement learning (DRL) with SNNs. We evaluated our model using the DeepMind Control Suite, demonstrating that Spiking-WM outperforms existing SNN-based models and achieves performance comparable to artificial neural network (ANN)-based world models employing Gated Recurrent Units (GRUs). Furthermore, we assess the long-term memory capabilities of the proposed model in speech datasets, including SHD, TIMIT, and LibriSpeech 100h, showing that our multi-compartment neuron model surpasses other SNN-based architectures in processing long sequences. Our findings underscore the critical role of dendritic information integration in shaping neuronal function, emphasizing the importance of cooperative dendritic processing in enhancing neural computation.
中文摘要:本研究提出了一种多室神经元模型,通过动态树突信息整合实现了基于脉冲神经网络的强化学习,在控制任务和长序列处理中表现出优于现有模型的性能,并验证了树突协同处理对神经计算的重要作用。
English Summary: This study introduces a multi-compartment spiking neuron model that enables effective model-based reinforcement learning through dynamic dendritic information integration, demonstrating superior performance in control tasks and long-sequence processing compared to existing spiking neural networks.

Authors:Zhangcun Yan, Jianqing Li, Peng Hang, Jian Sun
Title: OnSiteVRU: A High-Resolution Trajectory Dataset for High-Density Vulnerable Road Users
Abstract:
With the acceleration of urbanization and the growth of transportation demands, the safety of vulnerable road users (VRUs, such as pedestrians and cyclists) in mixed traffic flows has become increasingly prominent, necessitating high-precision and diverse trajectory data to support the development and optimization of autonomous driving systems. However, existing datasets fall short in capturing the diversity and dynamics of VRU behaviors, making it difficult to meet the research demands of complex traffic environments. To address this gap, this study developed the OnSiteVRU datasets, which cover a variety of scenarios, including intersections, road segments, and urban villages. These datasets provide trajectory data for motor vehicles, electric bicycles, and human-powered bicycles, totaling approximately 17,429 trajectories with a precision of 0.04 seconds. The datasets integrate both aerial-view natural driving data and onboard real-time dynamic detection data, along with environmental information such as traffic signals, obstacles, and real-time maps, enabling a comprehensive reconstruction of interaction events. The results demonstrate that VRU\_Data outperforms traditional datasets in terms of VRU density and scene coverage, offering a more comprehensive representation of VRU behavioral characteristics. This provides critical support for traffic flow modeling, trajectory prediction, and autonomous driving virtual testing. The dataset is publicly available for download at: https://www.kaggle.com/datasets/zcyan2/mixed-traffic-trajectory-dataset-in-from-shanghai.
中文摘要:OnSiteVRU数据集通过提供高精度、多场景的轨迹数据,弥补了弱势道路使用者行为数据多样性和动态性不足的问题,为自动驾驶研究和交通建模提供了更全面的支持。
English Summary: The OnSiteVRU dataset addresses the lack of diverse and dynamic trajectory data for vulnerable road users by providing high-precision, multi-scenario trajectory data that enhances autonomous driving research and traffic modeling.

Authors:Lyuye Zhang, Jiahui Wu, Chengwei Liu, Kaixuan Li, Xiaoyu Sun, Lida Zhao, Chong Wang, Yang Liu
Title: Fixing Outside the Box: Uncovering Tactics for Open-Source Security Issue Management
Abstract:
In the rapidly evolving landscape of software development, addressing security vulnerabilities in open-source software (OSS) has become critically important. However, existing research and tools from both academia and industry mainly relied on limited solutions, such as vulnerable version adjustment and adopting patches, to handle identified vulnerabilities. However, far more flexible and diverse countermeasures have been actively adopted in the open-source communities. A holistic empirical study is needed to explore the prevalence, distribution, preferences, and effectiveness of these diverse strategies. To this end, in this paper, we conduct a comprehensive study on the taxonomy of vulnerability remediation tactics (RT) in OSS projects and investigate their pros and cons. This study addresses this oversight by conducting a comprehensive empirical analysis of 21,187 issues from GitHub, aiming to understand the range and efficacy of remediation tactics within the OSS community. We developed a hierarchical taxonomy of 44 distinct RT and evaluated their effectiveness and costs. Our findings highlight a significant reliance on community-driven strategies, like using alternative libraries and bypassing vulnerabilities, 44% of which are currently unsupported by cutting-edge tools. Additionally, this research exposes the community's preferences for certain fixing approaches by analyzing their acceptance and the reasons for rejection. It also underscores a critical gap in modern vulnerability databases, where 54% of CVEs lack fixing suggestions, a gap that can be significantly mitigated by leveraging the 93% of actionable solutions provided through GitHub issues.
中文: 本研究通过对21,187个GitHub问题进行实证分析,建立了44种漏洞修复策略的分类体系,发现44%的社区驱动策略未被现有工具支持,54%的CVE缺乏修复建议,而GitHub问题提供了93%可操作的解决方案。
English: This study conducts a comprehensive empirical analysis of 21,187 GitHub issues to develop a taxonomy of 44 vulnerability remediation tactics, revealing that 44% of community-driven strategies are unsupported by current tools and 54% of CVEs lack fixing suggestions, while GitHub issues provide 93% actionable solutions.

Authors:Hongwei Zheng, Han Li, Wenrui Dai, Ziyang Zheng, Chenglin Li, Junni Zou, Hongkai Xiong
Title: HiPART: Hierarchical Pose AutoRegressive Transformer for Occluded 3D Human Pose Estimation
Abstract:
Existing 2D-to-3D human pose estimation (HPE) methods struggle with the occlusion issue by enriching information like temporal and visual cues in the lifting stage. In this paper, we argue that these methods ignore the limitation of the sparse skeleton 2D input representation, which fundamentally restricts the 2D-to-3D lifting and worsens the occlusion issue. To address these, we propose a novel two-stage generative densification method, named Hierarchical Pose AutoRegressive Transformer (HiPART), to generate hierarchical 2D dense poses from the original sparse 2D pose. Specifically, we first develop a multi-scale skeleton tokenization module to quantize the highly dense 2D pose into hierarchical tokens and propose a Skeleton-aware Alignment to strengthen token connections. We then develop a Hierarchical AutoRegressive Modeling scheme for hierarchical 2D pose generation. With generated hierarchical poses as inputs for 2D-to-3D lifting, the proposed method shows strong robustness in occluded scenarios and achieves state-of-the-art performance on the single-frame-based 3D HPE. Moreover, it outperforms numerous multi-frame methods while reducing parameter and computational complexity and can also complement them to further enhance performance and robustness.
中文: 现有2D转3D人体姿态估计方法在提升阶段通过丰富时间和视觉信息来应对遮挡问题,但忽视了稀疏2D骨架输入的根本限制,这既制约了性能又加剧了遮挡难题。
English: Current 2D-to-3D human pose estimation methods face occlusion challenges by enhancing temporal and visual information during lifting, but overlook the limitations of sparse 2D skeleton inputs, which fundamentally restrict performance and exacerbate occlusion issues.

Authors:Jinxu Lin, Linwei Tao, Minjing Dong, Chang Xu
Title: Uncertainty Weighted Gradients for Model Calibration
Abstract:
Model calibration is essential for ensuring that the predictions of deep neural networks accurately reflect true probabilities in real-world classification tasks. However, deep networks often produce over-confident or under-confident predictions, leading to miscalibration. Various methods have been proposed to address this issue by designing effective loss functions for calibration, such as focal loss. In this paper, we analyze its effectiveness and provide a unified loss framework of focal loss and its variants, where we mainly attribute their superiority in model calibration to the loss weighting factor that estimates sample-wise uncertainty. Based on our analysis, existing loss functions fail to achieve optimal calibration performance due to two main issues: including misalignment during optimization and insufficient precision in uncertainty estimation. Specifically, focal loss cannot align sample uncertainty with gradient scaling and the single logit cannot indicate the uncertainty. To address these issues, we reformulate the optimization from the perspective of gradients, which focuses on uncertain samples. Meanwhile, we propose using the Brier Score as the loss weight factor, which provides a more accurate uncertainty estimation via all the logits. Extensive experiments on various models and datasets demonstrate that our method achieves state-of-the-art (SOTA) performance.
中文: 本文分析了现有校准损失函数的不足,提出了一种统一框架,通过基于梯度的优化关注不确定样本,并采用Brier评分作为权重因子来精确估计不确定性,实验表明该方法达到了最优性能。
English: This paper identifies limitations in existing calibration loss functions like focal loss and proposes a unified framework that reformulates optimization through gradient focus on uncertain samples and uses the Brier Score for precise uncertainty estimation, achieving state-of-the-art performance.

Authors:Lyuye Zhang, Chengwei Liu, Jiahui Wu, Shiyang Zhang, Chengyue Liu, Zhengzi Xu, Sen Chen, Yang Liu
Title: Drop the Golden Apples: Identifying Third-Party Reuse by DB-Less Software Composition Analysis
Abstract:
The prevalent use of third-party libraries (TPLs) in modern software development introduces significant security and compliance risks, necessitating the implementation of Software Composition Analysis (SCA) to manage these threats. However, the accuracy of SCA tools heavily relies on the quality of the integrated feature database to cross-reference with user projects. While under the circumstance of the exponentially growing of open-source ecosystems and the integration of large models into software development, it becomes even more challenging to maintain a comprehensive feature database for potential TPLs. To this end, after referring to the evolution of LLM applications in terms of external data interactions, we propose the first framework of DB-Less SCA, to get rid of the traditional heavy database and embrace the flexibility of LLMs to mimic the manual analysis of security analysts to retrieve identical evidence and confirm the identity of TPLs by supportive information from the open Internet. Our experiments on two typical scenarios, native library identification for Android and copy-based TPL reuse for C/C++, especially on artifacts that are not that underappreciated, have demonstrated the favorable future for implementing database-less strategies in SCA.
中文摘要:提出的无数据库软件成分分析框架通过利用大语言模型自动检索网络证据来识别第三方库,在安卓和C/C++场景的实验中展现了替代传统特征数据库的可行性。
English Summary: The proposed DB-Less SCA framework eliminates reliance on traditional feature databases by leveraging LLMs to automatically identify third-party libraries through online evidence retrieval, showing promising results in Android and C/C++ scenarios.

Authors:Haicheng Liao, Hanlin Kong, Bin Rao, Bonan Wang, Chengyue Wang, Guyang Yu, Yuming Huang, Ruru Tang, Chengzhong Xu, Zhenning Li
Title: SafeCast: Risk-Responsive Motion Forecasting for Autonomous Vehicles
Abstract:
Accurate motion forecasting is essential for the safety and reliability of autonomous driving (AD) systems. While existing methods have made significant progress, they often overlook explicit safety constraints and struggle to capture the complex interactions among traffic agents, environmental factors, and motion dynamics. To address these challenges, we present SafeCast, a risk-responsive motion forecasting model that integrates safety-aware decision-making with uncertainty-aware adaptability. SafeCast is the first to incorporate the Responsibility-Sensitive Safety (RSS) framework into motion forecasting, encoding interpretable safety rules--such as safe distances and collision avoidance--based on traffic norms and physical principles. To further enhance robustness, we introduce the Graph Uncertainty Feature (GUF), a graph-based module that injects learnable noise into Graph Attention Networks, capturing real-world uncertainties and enhancing generalization across diverse scenarios. We evaluate SafeCast on four real-world benchmark datasets--Next Generation Simulation (NGSIM), Highway Drone (HighD), ApolloScape, and the Macao Connected Autonomous Driving (MoCAD)--covering highway, urban, and mixed-autonomy traffic environments. Our model achieves state-of-the-art (SOTA) accuracy while maintaining a lightweight architecture and low inference latency, underscoring its potential for real-time deployment in safety-critical AD systems.
Chinese: SafeCast是一种风险响应型运动预测模型,通过整合责任敏感安全框架和图不确定性特征模块,提升了自动驾驶的安全性与适应性,在多个真实世界数据集上实现了最先进的性能。
English: SafeCast is a risk-responsive motion forecasting model that integrates the Responsibility-Sensitive Safety framework and a Graph Uncertainty Feature module to enhance safety and adaptability in autonomous driving, achieving state-of-the-art performance across multiple real-world datasets.

Authors:Shuze Wang, Yunpeng Mei, Hongjie Cao, Yetian Yuan, Gang Wang, Jian Sun, Jie Chen
Title: Robust Offline Imitation Learning Through State-level Trajectory Stitching
Abstract:
Imitation learning (IL) has proven effective for enabling robots to acquire visuomotor skills through expert demonstrations. However, traditional IL methods are limited by their reliance on high-quality, often scarce, expert data, and suffer from covariate shift. To address these challenges, recent advances in offline IL have incorporated suboptimal, unlabeled datasets into the training. In this paper, we propose a novel approach to enhance policy learning from mixed-quality offline datasets by leveraging task-relevant trajectory fragments and rich environmental dynamics. Specifically, we introduce a state-based search framework that stitches state-action pairs from imperfect demonstrations, generating more diverse and informative training trajectories. Experimental results on standard IL benchmarks and real-world robotic tasks showcase that our proposed method significantly improves both generalization and performance.
Chinese Summary: 本文提出了一种新颖的离线模仿学习方法,通过拼接混合质量数据集中的轨迹片段来增强策略学习,显著提升了机器人任务中的泛化能力和性能表现。
English Summary: This paper introduces a novel offline imitation learning approach that enhances policy learning by stitching trajectory fragments from mixed-quality datasets, significantly improving generalization and performance in robotic tasks.

Authors:Songsong Yu, Yuxin Chen, Zhongang Qi, Zeke Xie, Yifan Wang, Lijun Wang, Ying Shan, Huchuan Lu
Title: Mono2Stereo: A Benchmark and Empirical Study for Stereo Conversion
Abstract:
With the rapid proliferation of 3D devices and the shortage of 3D content, stereo conversion is attracting increasing attention. Recent works introduce pretrained Diffusion Models (DMs) into this task. However, due to the scarcity of large-scale training data and comprehensive benchmarks, the optimal methodologies for employing DMs in stereo conversion and the accurate evaluation of stereo effects remain largely unexplored. In this work, we introduce the Mono2Stereo dataset, providing high-quality training data and benchmark to support in-depth exploration of stereo conversion. With this dataset, we conduct an empirical study that yields two primary findings. 1) The differences between the left and right views are subtle, yet existing metrics consider overall pixels, failing to concentrate on regions critical to stereo effects. 2) Mainstream methods adopt either one-stage left-to-right generation or warp-and-inpaint pipeline, facing challenges of degraded stereo effect and image distortion respectively. Based on these findings, we introduce a new evaluation metric, Stereo Intersection-over-Union, which prioritizes disparity and achieves a high correlation with human judgments on stereo effect. Moreover, we propose a strong baseline model, harmonizing the stereo effect and image quality simultaneously, and notably surpassing current mainstream methods. Our code and data will be open-sourced to promote further research in stereo conversion. Our models are available at mono2stereo-bench.github.io.
中文摘要:本研究通过推出Mono2Stereo数据集,揭示了现有立体转换评估指标和方法的不足,并提出新的评估指标与基线模型,显著超越了当前主流方法的性能。
English Summary: This study addresses stereo conversion challenges by introducing the Mono2Stereo dataset and revealing limitations in current metrics and methods, proposing a new evaluation metric and baseline model that significantly outperforms existing approaches.

Authors:Zhaojun Nan, Yunchu Han, Sheng Zhou, Zhisheng Niu
Title: Robust DNN Partitioning and Resource Allocation Under Uncertain Inference Time
Abstract:
In edge intelligence systems, deep neural network (DNN) partitioning and data offloading can provide real-time task inference for resource-constrained mobile devices. However, the inference time of DNNs is typically uncertain and cannot be precisely determined in advance, presenting significant challenges in ensuring timely task processing within deadlines. To address the uncertain inference time, we propose a robust optimization scheme to minimize the total energy consumption of mobile devices while meeting task probabilistic deadlines. The scheme only requires the mean and variance information of the inference time, without any prediction methods or distribution functions. The problem is formulated as a mixed-integer nonlinear programming (MINLP) that involves jointly optimizing the DNN model partitioning and the allocation of local CPU/GPU frequencies and uplink bandwidth. To tackle the problem, we first decompose the original problem into two subproblems: resource allocation and DNN model partitioning. Subsequently, the two subproblems with probability constraints are equivalently transformed into deterministic optimization problems using the chance-constrained programming (CCP) method. Finally, the convex optimization technique and the penalty convex-concave procedure (PCCP) technique are employed to obtain the optimal solution of the resource allocation subproblem and a stationary point of the DNN model partitioning subproblem, respectively. The proposed algorithm leverages real-world data from popular hardware platforms and is evaluated on widely used DNN models. Extensive simulations show that our proposed algorithm effectively addresses the inference time uncertainty with probabilistic deadline guarantees while minimizing the energy consumption of mobile devices.
中文: 本文提出一种边缘智能系统的鲁棒优化方案,仅需推理时间的均值和方差信息,通过联合优化DNN模型划分与资源分配,在满足概率截止时间约束的同时最小化移动设备能耗。
English: This paper introduces a robust optimization scheme for edge intelligence systems that minimizes mobile device energy consumption while meeting probabilistic task deadlines by jointly optimizing DNN partitioning and resource allocation, using only the mean and variance of uncertain inference times without requiring distribution functions.

Authors:Junhao Wu, Yixin Yang, Chengxiang Jin, Silu Mu, Xiaolei Qian, Jiajun Zhou, Shanqing Yu, Qi Xuan
Title: Unveiling Latent Information in Transaction Hashes: Hypergraph Learning for Ethereum Ponzi Scheme Detection
Abstract:
With the widespread adoption of Ethereum, financial frauds such as Ponzi schemes have become increasingly rampant in the blockchain ecosystem, posing significant threats to the security of account assets. Existing Ethereum fraud detection methods typically model account transactions as graphs, but this approach primarily focuses on binary transactional relationships between accounts, failing to adequately capture the complex multi-party interaction patterns inherent in Ethereum. To address this, we propose a hypergraph modeling method for the Ponzi scheme detection method in Ethereum, called HyperDet. Specifically, we treat transaction hashes as hyperedges that connect all the relevant accounts involved in a transaction. Additionally, we design a two-step hypergraph sampling strategy to significantly reduce computational complexity. Furthermore, we introduce a dual-channel detection module, including the hypergraph detection channel and the hyper-homo graph detection channel, to be compatible with existing detection methods. Experimental results show that, compared to traditional homogeneous graph-based methods, the hyper-homo graph detection channel achieves significant performance improvements, demonstrating the superiority of hypergraph in Ponzi scheme detection. This research offers innovations for modeling complex relationships in blockchain data.
中文摘要:该研究提出HyperDet,一种基于超图的以太坊庞氏骗局检测方法,通过将交易建模为超边来捕捉复杂的多方交互模式,并利用双通道检测模块显著优于传统图方法。
English Summary: The study introduces HyperDet, a hypergraph-based method for detecting Ponzi schemes on Ethereum, which models transactions as hyperedges to capture complex multi-party interactions and outperforms traditional graph approaches through a dual-channel detection module.

Authors:Tao Meng, Shuo Shan, Hongen Shao, Yuntao Shou, Wei Ai, Keqin Li
Title: SE-GNN: Seed Expanded-Aware Graph Neural Network with Iterative Optimization for Semi-supervised Entity Alignment
Abstract:
Entity alignment aims to use pre-aligned seed pairs to find other equivalent entities from different knowledge graphs (KGs) and is widely used in graph fusion-related fields. However, as the scale of KGs increases, manually annotating pre-aligned seed pairs becomes difficult. Existing research utilizes entity embeddings obtained by aggregating single structural information to identify potential seed pairs, thus reducing the reliance on pre-aligned seed pairs. However, due to the structural heterogeneity of KGs, the quality of potential seed pairs obtained using only a single structural information is not ideal. In addition, although existing research improves the quality of potential seed pairs through semi-supervised iteration, they underestimate the impact of embedding distortion produced by noisy seed pairs on the alignment effect. In order to solve the above problems, we propose a seed expanded-aware graph neural network with iterative optimization for semi-supervised entity alignment, named SE-GNN. First, we utilize the semantic attributes and structural features of entities, combined with a conditional filtering mechanism, to obtain high-quality initial potential seed pairs. Next, we designed a local and global awareness mechanism. It introduces initial potential seed pairs and combines local and global information to obtain a more comprehensive entity embedding representation, which alleviates the impact of KGs structural heterogeneity and lays the foundation for the optimization of initial potential seed pairs. Then, we designed the threshold nearest neighbor embedding correction strategy. It combines the similarity threshold and the bidirectional nearest neighbor method as a filtering mechanism to select iterative potential seed pairs and also uses an embedding correction strategy to eliminate the embedding distortion.
中文摘要:提出的SE-GNN模型通过融合语义属性和结构特征生成高质量初始潜在种子对,并采用带嵌入校正的迭代优化机制,有效缓解知识图谱结构异构性并消除噪声影响。
English Summary: The proposed SE-GNN model enhances entity alignment by integrating semantic attributes and structural features to generate high-quality potential seed pairs, while employing iterative optimization with embedding correction to mitigate structural heterogeneity and noise impact.

Authors:Nan Jiang, Hongjie Li, Ziye Yuan, Zimo He, Yixin Chen, Tengyu Liu, Yixin Zhu, Siyuan Huang
Title: Dynamic Motion Blending for Versatile Motion Editing
Abstract:
Text-guided motion editing enables high-level semantic control and iterative modifications beyond traditional keyframe animation. Existing methods rely on limited pre-collected training triplets, which severely hinders their versatility in diverse editing scenarios. We introduce MotionCutMix, an online data augmentation technique that dynamically generates training triplets by blending body part motions based on input text. While MotionCutMix effectively expands the training distribution, the compositional nature introduces increased randomness and potential body part incoordination. To model such a rich distribution, we present MotionReFit, an auto-regressive diffusion model with a motion coordinator. The auto-regressive architecture facilitates learning by decomposing long sequences, while the motion coordinator mitigates the artifacts of motion composition. Our method handles both spatial and temporal motion edits directly from high-level human instructions, without relying on additional specifications or Large Language Models. Through extensive experiments, we show that MotionReFit achieves state-of-the-art performance in text-guided motion editing.
中文摘要:MotionReFit通过自回归扩散模型与运动协调器,实现了无需外部规范或大型语言模型的最先进文本引导运动编辑,动态生成训练数据并解决运动不协调问题。
English Summary: MotionReFit, an auto-regressive diffusion model with a motion coordinator, achieves state-of-the-art text-guided motion editing by dynamically generating training data and addressing motion incoordination without relying on external specifications or Large Language Models.

Authors:Abdulhamid Abubakar, Hamidatu Abdulkadir, Ibrahim Rabiu Abdullahi, Abubakar Auwal Khalid, Ahmad Mustapha Wali, Amina Aminu Umar, Maryam Bala, Sani Abdullahi Sani, Ibrahim Said Ahmad, Shamsuddeen Hassan Muhammad, Idris Abdulmumin, Vukosi Marivate
Title: HausaNLP at SemEval-2025 Task 2: Entity-Aware Fine-tuning vs. Prompt Engineering in Entity-Aware Machine Translation
Abstract:
This paper presents our findings for SemEval 2025 Task 2, a shared task on entity-aware machine translation (EA-MT). The goal of this task is to develop translation models that can accurately translate English sentences into target languages, with a particular focus on handling named entities, which often pose challenges for MT systems. The task covers 10 target languages with English as the source. In this paper, we describe the different systems we employed, detail our results, and discuss insights gained from our experiments.
中文: 本文阐述了SemEval 2025任务二中关于实体感知机器翻译的系统方法、实验结果与研究启示,重点探讨了英语到10种目标语言的命名实体精准翻译问题。
English: This paper details the systems, results, and insights from SemEval 2025 Task 2 on entity-aware machine translation, which focuses on accurately translating English sentences into 10 target languages with special attention to named entities.

Authors:Ibrahim Said Ahmad, Shiran Dudy, Tadesse Destaw Belay, Idris Abdulmumin, Seid Muhie Yimam, Shamsuddeen Hassan Muhammad, Kenneth Church
Title: Exploring Cultural Nuances in Emotion Perception Across 15 African Languages
Abstract:
Understanding how emotions are expressed across languages is vital for building culturally-aware and inclusive NLP systems. However, emotion expression in African languages is understudied, limiting the development of effective emotion detection tools in these languages. In this work, we present a cross-linguistic analysis of emotion expression in 15 African languages. We examine four key dimensions of emotion representation: text length, sentiment polarity, emotion co-occurrence, and intensity variations. Our findings reveal diverse language-specific patterns in emotional expression -- with Somali texts typically longer, while others like IsiZulu and Algerian Arabic show more concise emotional expression. We observe a higher prevalence of negative sentiment in several Nigerian languages compared to lower negativity in languages like IsiXhosa. Further, emotion co-occurrence analysis demonstrates strong cross-linguistic associations between specific emotion pairs (anger-disgust, sadness-fear), suggesting universal psychological connections. Intensity distributions show multimodal patterns with significant variations between language families; Bantu languages display similar yet distinct profiles, while Afroasiatic languages and Nigerian Pidgin demonstrate wider intensity ranges. These findings highlight the need for language-specific approaches to emotion detection while identifying opportunities for transfer learning across related languages.
中文: 对15种非洲语言的跨语言研究表明,不同语言在文本长度、情感极性、情绪共现和强度方面呈现独特的情感表达模式,强调需要针对特定语言开发情感检测模型,同时发现相关语言间存在迁移学习的可能性。
English: This cross-linguistic study of 15 African languages reveals distinct emotional expression patterns across text length, sentiment polarity, emotion co-occurrence, and intensity, emphasizing the necessity for language-specific emotion detection models while identifying transfer learning potential among related languages.

Authors:Brian R. Bartoldson, Siddarth Venkatraman, James Diffenderfer, Moksh Jain, Tal Ben-Nun, Seanie Lee, Minsu Kim, Johan Obando-Ceron, Yoshua Bengio, Bhavya Kailkhura
Title: Trajectory Balance with Asynchrony: Decoupling Exploration and Learning for Fast, Scalable LLM Post-Training
Abstract:
Reinforcement learning (RL) is a critical component of large language model (LLM) post-training. However, existing on-policy algorithms used for post-training are inherently incompatible with the use of experience replay buffers, which can be populated scalably by distributed off-policy actors to enhance exploration as compute increases. We propose efficiently obtaining this benefit of replay buffers via Trajectory Balance with Asynchrony (TBA), a massively scalable LLM RL system. In contrast to existing approaches, TBA uses a larger fraction of compute on search, constantly generating off-policy data for a central replay buffer. A training node simultaneously samples data from this buffer based on reward or recency to update the policy using Trajectory Balance (TB), a diversity-seeking RL objective introduced for GFlowNets. TBA offers three key advantages: (1) decoupled training and search, speeding up training wall-clock time by 4x or more; (2) improved diversity through large-scale off-policy sampling; and (3) scalable search for sparse reward settings. On mathematical reasoning, preference-tuning, and automated red-teaming (diverse and representative post-training tasks), TBA produces speed and performance improvements over strong baselines.
中文摘要:提出的异步轨迹平衡(TBA)系统通过解耦训练与搜索,利用离策略数据生成和经验回放机制,实现了大语言模型强化学习的大规模扩展,在数学推理、偏好调优等后训练任务中取得了显著的速度与性能提升。
English Summary: The proposed Trajectory Balance with Asynchrony (TBA) system enables scalable reinforcement learning for large language models by decoupling training and search through off-policy data generation and experience replay, achieving significant speed and performance gains across diverse post-training tasks.

Authors:Chak Lam Shek, Amrit Singh Bedi, Anjon Basak, Ellen Novoseller, Nick Waytowich, Priya Narayanan, Dinesh Manocha, Pratap Tokekar
Title: Learning Multi-Robot Coordination through Locality-Based Factorized Multi-Agent Actor-Critic Algorithm
Abstract:
In this work, we present a novel cooperative multi-agent reinforcement learning method called \textbf{Loc}ality based \textbf{Fac}torized \textbf{M}ulti-Agent \textbf{A}ctor-\textbf{C}ritic (Loc-FACMAC). Existing state-of-the-art algorithms, such as FACMAC, rely on global reward information, which may not accurately reflect the quality of individual robots' actions in decentralized systems. We integrate the concept of locality into critic learning, where strongly related robots form partitions during training. Robots within the same partition have a greater impact on each other, leading to more precise policy evaluation. Additionally, we construct a dependency graph to capture the relationships between robots, facilitating the partitioning process. This approach mitigates the curse of dimensionality and prevents robots from using irrelevant information. Our method improves existing algorithms by focusing on local rewards and leveraging partition-based learning to enhance training efficiency and performance. We evaluate the performance of Loc-FACMAC in three environments: Hallway, Multi-cartpole, and Bounded-Cooperative-Navigation. We explore the impact of partition sizes on the performance and compare the result with baseline MARL algorithms such as LOMAQ, FACMAC, and QMIX. The experiments reveal that, if the locality structure is defined properly, Loc-FACMAC outperforms these baseline algorithms up to 108\%, indicating that exploiting the locality structure in the actor-critic framework improves the MARL performance.
中文: 本文提出Loc-FACMAC方法,通过将强关联的机器人划分为分区并在评论家学习中利用局部奖励,有效提升了多智能体强化学习的训练效率和性能,实验表明其在适当定义局部结构时性能优于基线算法最高达108%。
English: This paper introduces Loc-FACMAC, a cooperative multi-agent reinforcement learning method that incorporates locality into critic learning by forming partitions of strongly related robots and using local rewards to enhance training efficiency and performance, outperforming baseline algorithms by up to 108% in experiments.

Authors:Tadesse Destaw Belay, Israel Abebe Azime, Ibrahim Said Ahmad, David Ifeoluwa Adelani, Idris Abdulmumin, Abinew Ali Ayele, Shamsuddeen Hassan Muhammad, Seid Muhie Yimam
Title: AfroXLMR-Social: Adapting Pre-trained Language Models for African Languages Social Media Text
Abstract:
Language models built from various sources are the foundation of today's NLP progress. However, for many low-resource languages, the diversity of domains is often limited, more biased to a religious domain, which impacts their performance when evaluated on distant and rapidly evolving domains such as social media. Domain adaptive pre-training (DAPT) and task-adaptive pre-training (TAPT) are popular techniques to reduce this bias through continual pre-training for BERT-based models, but they have not been explored for African multilingual encoders. In this paper, we explore DAPT and TAPT continual pre-training approaches for African languages social media domain. We introduce AfriSocial, a large-scale social media and news domain corpus for continual pre-training on several African languages. Leveraging AfriSocial, we show that DAPT consistently improves performance (from 1% to 30% F1 score) on three subjective tasks: sentiment analysis, multi-label emotion, and hate speech classification, covering 19 languages. Similarly, leveraging TAPT on the data from one task enhances performance on other related tasks. For example, training with unlabeled sentiment data (source) for a fine-grained emotion classification task (target) improves the baseline results by an F1 score ranging from 0.55% to 15.11%. Combining these two methods (i.e. DAPT + TAPT) further improves the overall performance. The data and model resources are available at HuggingFace.
Chinese: 通过利用新引入的AfriSocial语料库,领域自适应和任务自适应预训练方法显著提升了非洲多语言模型在社交媒体任务上的表现。
English: Domain adaptive and task adaptive pre-training methods significantly enhance the performance of African multilingual language models on social media tasks by leveraging the newly introduced AfriSocial corpus.

Authors:Shulei Wang, Wang Lin, Hai Huang, Hanting Wang, Sihang Cai, WenKang Han, Tao Jin, Jingyuan Chen, Jiacheng Sun, Jieming Zhu, Zhou Zhao
Title: Towards Transformer-Based Aligned Generation with Self-Coherence Guidance
Abstract:
We introduce a novel, training-free approach for enhancing alignment in Transformer-based Text-Guided Diffusion Models (TGDMs). Existing TGDMs often struggle to generate semantically aligned images, particularly when dealing with complex text prompts or multi-concept attribute binding challenges. Previous U-Net-based methods primarily optimized the latent space, but their direct application to Transformer-based architectures has shown limited effectiveness. Our method addresses these challenges by directly optimizing cross-attention maps during the generation process. Specifically, we introduce Self-Coherence Guidance, a method that dynamically refines attention maps using masks derived from previous denoising steps, ensuring precise alignment without additional training. To validate our approach, we constructed more challenging benchmarks for evaluating coarse-grained attribute binding, fine-grained attribute binding, and style binding. Experimental results demonstrate the superior performance of our method, significantly surpassing other state-of-the-art methods across all evaluated tasks. Our code is available at https://scg-diffusion.github.io/scg-diffusion.
中文: 本文提出了一种无需训练的"自相干引导"方法,通过动态优化生成过程中的交叉注意力图,显著提升了基于Transformer的文本引导扩散模型的对齐效果,在多种属性和风格绑定任务上均超越了现有最优方法。
English: This paper presents a training-free method called Self-Coherence Guidance that enhances alignment in Transformer-based Text-Guided Diffusion Models by dynamically optimizing cross-attention maps during generation, achieving superior performance across various attribute and style binding tasks.

Authors:Yiming Cui, Shiyu Fang, Peng Hang, Jian Sun
Title: A Vehicle-Infrastructure Multi-layer Cooperative Decision-making Framework
Abstract:
Autonomous driving has entered the testing phase, but due to the limited decision-making capabilities of individual vehicle algorithms, safety and efficiency issues have become more apparent in complex scenarios. With the advancement of connected communication technologies, autonomous vehicles equipped with connectivity can leverage vehicle-to-vehicle (V2V) and vehicle-to-infrastructure (V2I) communications, offering a potential solution to the decision-making challenges from individual vehicle's perspective. We propose a multi-level vehicle-infrastructure cooperative decision-making framework for complex conflict scenarios at unsignalized intersections. First, based on vehicle states, we define a method for quantifying vehicle impacts and their propagation relationships, using accumulated impact to group vehicles through motif-based graph clustering. Next, within and between vehicle groups, a pass order negotiation process based on Large Language Models (LLM) is employed to determine the vehicle passage order, resulting in planned vehicle actions. Simulation results from ablation experiments show that our approach reduces negotiation complexity and ensures safer, more efficient vehicle passage at intersections, aligning with natural decision-making logic.
中文: 提出的多层次车路协同决策框架通过影响量化与基于大语言模型的协商,降低了无信号灯路口的通行复杂性,提高了安全性和效率。
English: The proposed multi-level vehicle-infrastructure cooperative decision-making framework uses impact quantification and LLM-based negotiation to reduce complexity and enhance safety and efficiency at unsignalized intersections.

Authors:Xiaoou Liu, Tiejin Chen, Longchao Da, Chacha Chen, Zhen Lin, Hua Wei
Title: Uncertainty Quantification and Confidence Calibration in Large Language Models: A Survey
Abstract:
Large Language Models (LLMs) excel in text generation, reasoning, and decision-making, enabling their adoption in high-stakes domains such as healthcare, law, and transportation. However, their reliability is a major concern, as they often produce plausible but incorrect responses. Uncertainty quantification (UQ) enhances trustworthiness by estimating confidence in outputs, enabling risk mitigation and selective prediction. However, traditional UQ methods struggle with LLMs due to computational constraints and decoding inconsistencies. Moreover, LLMs introduce unique uncertainty sources, such as input ambiguity, reasoning path divergence, and decoding stochasticity, that extend beyond classical aleatoric and epistemic uncertainty. To address this, we introduce a new taxonomy that categorizes UQ methods based on computational efficiency and uncertainty dimensions (input, reasoning, parameter, and prediction uncertainty). We evaluate existing techniques, assess their real-world applicability, and identify open challenges, emphasizing the need for scalable, interpretable, and robust UQ approaches to enhance LLM reliability.
中文: 大语言模型在文本生成和推理方面表现出色,但其可靠性存疑,常产生看似合理却错误的回答,需通过改进的不确定性量化方法应对输入模糊性、推理路径差异等独特挑战,以提升关键应用中的可信度。
English: Large Language Models (LLMs) face reliability issues due to producing plausible but incorrect responses, requiring improved uncertainty quantification methods to address unique challenges like input ambiguity and reasoning divergence for enhanced trustworthiness in critical applications.

Authors:Yifei Zhou, Song Jiang, Yuandong Tian, Jason Weston, Sergey Levine, Sainbayar Sukhbaatar, Xian Li
Title: SWEET-RL: Training Multi-Turn LLM Agents on Collaborative Reasoning Tasks
Abstract:
Large language model (LLM) agents need to perform multi-turn interactions in real-world tasks. However, existing multi-turn RL algorithms for optimizing LLM agents fail to perform effective credit assignment over multiple turns while leveraging the generalization capabilities of LLMs and it remains unclear how to develop such algorithms. To study this, we first introduce a new benchmark, ColBench, where an LLM agent interacts with a human collaborator over multiple turns to solve realistic tasks in backend programming and frontend design. Building on this benchmark, we propose a novel RL algorithm, SWEET-RL (RL with Step-WisE Evaluation from Training-time information), that uses a carefully designed optimization objective to train a critic model with access to additional training-time information. The critic provides step-level rewards for improving the policy model. Our experiments demonstrate that SWEET-RL achieves a 6% absolute improvement in success and win rates on ColBench compared to other state-of-the-art multi-turn RL algorithms, enabling Llama-3.1-8B to match or exceed the performance of GPT4-o in realistic collaborative content creation.
中文:大语言模型代理需优化多轮交互,为此提出的SWEET-RL算法通过创新训练目标和评论家模型,在ColBench基准测试中显著提升了任务成功率与胜率。
English: Large language model agents require effective multi-turn interaction optimization, which is addressed by the proposed SWEET-RL algorithm using a novel training objective and critic model to achieve significant performance improvements on the new ColBench benchmark.

Authors:Alessandro Traspadini, Anay Ajit Deshpande, Marco Giordani, Chinmay Mahabal, Takayuki Shimizu, Michele Zorzi
Title: Sensing-Based Beamformed Resource Allocation in Standalone Millimeter-Wave Vehicular Networks
Abstract:
In 3GPP New Radio (NR) Vehicle-to-Everything (V2X), the new standard for next-generation vehicular networks, vehicles can autonomously select sidelink resources for data transmission, which permits network operations without cellular coverage. However, standalone resource allocation is uncoordinated, and is complicated by the high mobility of the nodes that may introduce unforeseen channel collisions (e.g., when a transmitting vehicle changes path) or free up resources (e.g., when a vehicle moves outside of the communication area). Moreover, unscheduled resource allocation is prone to the hidden node and exposed node problems, which are particularly critical considering directional transmissions. In this paper, we implement and demonstrate a new channel access scheme for NR V2X in Frequency Range 2 (FR2), i.e., at millimeter wave (mmWave) frequencies, based on directional and beamformed transmissions along with Sidelink Control Information (SCI) to select resources for transmission. We prove via simulation that this approach can reduce the probability of collision for resource allocation, compared to a baseline solution that does not configure SCI transmissions.
中文: 本文针对3GPP NR V2X在毫米波频段提出了一种新的信道接入方案,通过方向性传输和侧链路控制信息优化资源选择,仿真证明相比未配置SCI的基准方案能有效降低资源分配冲突概率。
English: This paper introduces a new channel access scheme for 3GPP NR V2X in millimeter wave frequencies that utilizes directional transmissions and Sidelink Control Information to enhance resource selection, demonstrating through simulations a reduced collision probability compared to baseline methods without SCI configuration.

Authors:Junfeng Ni, Yu Liu, Ruijie Lu, Zirui Zhou, Song-Chun Zhu, Yixin Chen, Siyuan Huang
Title: Decompositional Neural Scene Reconstruction with Generative Diffusion Prior
Abstract:
Decompositional reconstruction of 3D scenes, with complete shapes and detailed texture of all objects within, is intriguing for downstream applications but remains challenging, particularly with sparse views as input. Recent approaches incorporate semantic or geometric regularization to address this issue, but they suffer significant degradation in underconstrained areas and fail to recover occluded regions. We argue that the key to solving this problem lies in supplementing missing information for these areas. To this end, we propose DP-Recon, which employs diffusion priors in the form of Score Distillation Sampling (SDS) to optimize the neural representation of each individual object under novel views. This provides additional information for the underconstrained areas, but directly incorporating diffusion prior raises potential conflicts between the reconstruction and generative guidance. Therefore, we further introduce a visibility-guided approach to dynamically adjust the per-pixel SDS loss weights. Together these components enhance both geometry and appearance recovery while remaining faithful to input images. Extensive experiments across Replica and ScanNet++ demonstrate that our method significantly outperforms SOTA methods. Notably, it achieves better object reconstruction under 10 views than the baselines under 100 views. Our method enables seamless text-based editing for geometry and appearance through SDS optimization and produces decomposed object meshes with detailed UV maps that support photorealistic Visual effects (VFX) editing. The project page is available at https://dp-recon.github.io/.
中文: 提出的DP-Recon方法通过引入扩散先验的分数蒸馏采样和可见性引导损失调整,显著提升了稀疏视角下的三维场景重建质量,在改善几何与纹理恢复的同时实现了先进的编辑功能。
English: The proposed DP-Recon method enhances 3D scene reconstruction from sparse views by integrating diffusion priors through Score Distillation Sampling and a visibility-guided loss adjustment, significantly improving geometry and texture recovery while enabling advanced editing capabilities.

Authors:Babangida Sani, Aakansha Soy, Sukairaj Hafiz Imam, Ahmad Mustapha, Lukman Jibril Aliyu, Idris Abdulmumin, Ibrahim Said Ahmad, Shamsuddeen Hassan Muhammad
Title: Who Wrote This? Identifying Machine vs Human-Generated Text in Hausa
Abstract:
The advancement of large language models (LLMs) has allowed them to be proficient in various tasks, including content generation. However, their unregulated usage can lead to malicious activities such as plagiarism and generating and spreading fake news, especially for low-resource languages. Most existing machine-generated text detectors are trained on high-resource languages like English, French, etc. In this study, we developed the first large-scale detector that can distinguish between human- and machine-generated content in Hausa. We scrapped seven Hausa-language media outlets for the human-generated text and the Gemini-2.0 flash model to automatically generate the corresponding Hausa-language articles based on the human-generated article headlines. We fine-tuned four pre-trained Afri-centric models (AfriTeVa, AfriBERTa, AfroXLMR, and AfroXLMR-76L) on the resulting dataset and assessed their performance using accuracy and F1-score metrics. AfroXLMR achieved the highest performance with an accuracy of 99.23% and an F1 score of 99.21%, demonstrating its effectiveness for Hausa text detection. Our dataset is made publicly available to enable further research.
Chinese: 本研究通过微调非洲中心模型开发了首个大规模豪萨语文本检测器,其中AfroXLMR以99.23%的准确率表现最优,填补了低资源语言检测空白并公开了数据集。
English: This study developed the first large-scale detector for distinguishing human- and machine-generated Hausa text by fine-tuning Afri-centric models, with AfroXLMR achieving 99.23% accuracy, while addressing the gap in low-resource language detection and making the dataset publicly available.

Authors:Ruijie Lu, Yixin Chen, Yu Liu, Jiaxiang Tang, Junfeng Ni, Diwen Wan, Gang Zeng, Siyuan Huang
Title: TACO: Taming Diffusion for in-the-wild Video Amodal Completion
Abstract:
Humans can infer complete shapes and appearances of objects from limited visual cues, relying on extensive prior knowledge of the physical world. However, completing partially observable objects while ensuring consistency across video frames remains challenging for existing models, especially for unstructured, in-the-wild videos. This paper tackles the task of Video Amodal Completion (VAC), which aims to generate the complete object consistently throughout the video given a visual prompt specifying the object of interest. Leveraging the rich, consistent manifolds learned by pre-trained video diffusion models, we propose a conditional diffusion model, TACO, that repurposes these manifolds for VAC. To enable its effective and robust generalization to challenging in-the-wild scenarios, we curate a large-scale synthetic dataset with multiple difficulty levels by systematically imposing occlusions onto un-occluded videos. Building on this, we devise a progressive fine-tuning paradigm that starts with simpler recovery tasks and gradually advances to more complex ones. We demonstrate TACO's versatility on a wide range of in-the-wild videos from Internet, as well as on diverse, unseen datasets commonly used in autonomous driving, robotic manipulation, and scene understanding. Moreover, we show that TACO can be effectively applied to various downstream tasks like object reconstruction and pose estimation, highlighting its potential to facilitate physical world understanding and reasoning. Our project page is available at https://jason-aplp.github.io/TACO.
中文: 本文提出TACO条件扩散模型,通过利用预训练视频扩散模型的丰富流形,在非结构化真实场景视频中实现一致的视频全貌补全,并在多种数据集与下游任务中展现出卓越的泛化能力。
English: This paper introduces TACO, a conditional diffusion model that leverages pre-trained video diffusion models to achieve consistent video amodal completion for objects in unstructured, in-the-wild videos, demonstrating strong generalization across diverse datasets and downstream applications.

Authors:Jiaqi Jin, Siwei Wang, Zhibin Dong, Xihong Yang, Xinwang Liu, En Zhu, Kunlun He
Title: Deep Incomplete Multi-view Clustering with Distribution Dual-Consistency Recovery Guidance
Abstract:
Multi-view clustering leverages complementary representations from diverse sources to enhance performance. However, real-world data often suffer incomplete cases due to factors like privacy concerns and device malfunctions. A key challenge is effectively utilizing available instances to recover missing views. Existing methods frequently overlook the heterogeneity among views during recovery, leading to significant distribution discrepancies between recovered and true data. Additionally, many approaches focus on cross-view correlations, neglecting insights from intra-view reliable structure and cross-view clustering structure. To address these issues, we propose BURG, a novel method for incomplete multi-view clustering with distriBution dUal-consistency Recovery Guidance. We treat each sample as a distinct category and perform cross-view distribution transfer to predict the distribution space of missing views. To compensate for the lack of reliable category information, we design a dual-consistency guided recovery strategy that includes intra-view alignment guided by neighbor-aware consistency and cross-view alignment guided by prototypical consistency. Extensive experiments on benchmarks demonstrate the superiority of BURG in the incomplete multi-view scenario.
中文摘要:提出的BURG方法通过分布双重一致性恢复指导解决不完整多视图聚类问题,有效处理视图异质性并利用视图内和视图间结构信息来提升性能。
English Summary: The proposed BURG method addresses incomplete multi-view clustering by introducing distribution dual-consistency recovery guidance, which effectively handles view heterogeneity and leverages both intra-view and cross-view structural information to improve performance.

Authors:Jiarui Sun, Chin-Chia Michael Yeh, Yujie Fan, Xin Dai, Xiran Fan, Zhimeng Jiang, Uday Singh Saini, Vivian Lai, Junpeng Wang, Huiyuan Chen, Zhongfang Zhuang, Yan Zheng, Girish Chowdhary
Title: Towards Efficient Large Scale Spatial-Temporal Time Series Forecasting via Improved Inverted Transformers
Abstract:
Time series forecasting at scale presents significant challenges for modern prediction systems, particularly when dealing with large sets of synchronized series, such as in a global payment network. In such systems, three key challenges must be overcome for accurate and scalable predictions: 1) emergence of new entities, 2) disappearance of existing entities, and 3) the large number of entities present in the data. The recently proposed Inverted Transformer (iTransformer) architecture has shown promising results by effectively handling variable entities. However, its practical application in large-scale settings is limited by quadratic time and space complexity ($O(N^2)$) with respect to the number of entities $N$. In this paper, we introduce EiFormer, an improved inverted transformer architecture that maintains the adaptive capabilities of iTransformer while reducing computational complexity to linear scale ($O(N)$). Our key innovation lies in restructuring the attention mechanism to eliminate redundant computations without sacrificing model expressiveness. Additionally, we incorporate a random projection mechanism that not only enhances efficiency but also improves prediction accuracy through better feature representation. Extensive experiments on the public LargeST benchmark dataset and a proprietary large-scale time series dataset demonstrate that EiFormer significantly outperforms existing methods in both computational efficiency and forecasting accuracy. Our approach enables practical deployment of transformer-based forecasting in industrial applications where handling time series at scale is essential.
中文摘要:EiFormer模型通过优化注意力机制和引入随机投影特征增强,将倒置Transformer的二次计算复杂度降至线性,实现了在大规模时间序列预测中既高效又精准的工业级应用。
English Summary: The EiFormer model enhances the Inverted Transformer by reducing its quadratic complexity to linear, enabling scalable and accurate time series forecasting through an optimized attention mechanism and random projection feature enhancement.

Authors:Guanrong Li, Kuo Tian, Jinnan Qi, Qinghan Fu, Zhen Wu, Xinyu Dai
Title: Harmonizing Large Language Models with Collaborative Behavioral Signals for Conversational Recommendation
Abstract:
Conversational recommendation frameworks have gained prominence as a dynamic paradigm for delivering personalized suggestions via interactive dialogues. The incorporation of advanced language understanding techniques has substantially improved the dialogue fluency of such systems. However, while modern language models demonstrate strong proficiency in interpreting user preferences articulated through natural conversation, they frequently encounter challenges in effectively utilizing collective behavioral patterns - a crucial element for generating relevant suggestions. To mitigate this limitation, this work presents a novel probabilistic framework that synergizes behavioral patterns with conversational interactions through latent preference modeling. The proposed method establishes a dual-channel alignment mechanism where implicit preference representations learned from collective user interactions serve as a connecting mechanism between behavioral data and linguistic expressions. Specifically, the framework first derives latent preference representations through established collaborative filtering techniques, then employs these representations to jointly refine both the linguistic preference expressions and behavioral patterns through an adaptive fusion process. Comprehensive evaluations across multiple benchmark datasets demonstrate the superior performance of the proposed approach compared to various state-of-the-art baseline methods, particularly in aligning conversational interactions with collaborative behavioral signals.
Chinese: 本研究提出了一种新颖的概率框架,通过潜在偏好建模将行为模式与会话交互相融合,在多个基准测试中展现出对话与协同信号对齐方面的卓越性能。
English: This study introduces a novel probabilistic framework that integrates behavioral patterns with conversational interactions through latent preference modeling, demonstrating superior performance in aligning dialogue with collaborative signals across multiple benchmarks.

Authors:Lingteng Qiu, Xiaodong Gu, Peihao Li, Qi Zuo, Weichao Shen, Junfei Zhang, Kejie Qiu, Weihao Yuan, Guanying Chen, Zilong Dong, Liefeng Bo
Title: LHM: Large Animatable Human Reconstruction Model from a Single Image in Seconds
Abstract:
Animatable 3D human reconstruction from a single image is a challenging problem due to the ambiguity in decoupling geometry, appearance, and deformation. Recent advances in 3D human reconstruction mainly focus on static human modeling, and the reliance of using synthetic 3D scans for training limits their generalization ability. Conversely, optimization-based video methods achieve higher fidelity but demand controlled capture conditions and computationally intensive refinement processes. Motivated by the emergence of large reconstruction models for efficient static reconstruction, we propose LHM (Large Animatable Human Reconstruction Model) to infer high-fidelity avatars represented as 3D Gaussian splatting in a feed-forward pass. Our model leverages a multimodal transformer architecture to effectively encode the human body positional features and image features with attention mechanism, enabling detailed preservation of clothing geometry and texture. To further boost the face identity preservation and fine detail recovery, we propose a head feature pyramid encoding scheme to aggregate multi-scale features of the head regions. Extensive experiments demonstrate that our LHM generates plausible animatable human in seconds without post-processing for face and hands, outperforming existing methods in both reconstruction accuracy and generalization ability.
中文摘要:提出的LHM模型通过多模态变换器和3D高斯溅射技术,无需后处理即可高效重建高保真可动画三维人体,在重建精度和泛化能力上均优于现有方法。
English Summary: The proposed LHM model efficiently reconstructs high-fidelity animatable 3D human avatars using 3D Gaussian splatting and a multimodal transformer, outperforming existing methods in accuracy and generalization without post-processing.

Authors:Siyin Wang, Zhaoye Fei, Qinyuan Cheng, Shiduo Zhang, Panpan Cai, Jinlan Fu, Xipeng Qiu
Title: World Modeling Makes a Better Planner: Dual Preference Optimization for Embodied Task Planning
Abstract:
Recent advances in large vision-language models (LVLMs) have shown promise for embodied task planning, yet they struggle with fundamental challenges like dependency constraints and efficiency. Existing approaches either solely optimize action selection or leverage world models during inference, overlooking the benefits of learning to model the world as a way to enhance planning capabilities. We propose Dual Preference Optimization (D$^2$PO), a new learning framework that jointly optimizes state prediction and action selection through preference learning, enabling LVLMs to understand environment dynamics for better planning. To automatically collect trajectories and stepwise preference data without human annotation, we introduce a tree search mechanism for extensive exploration via trial-and-error. Extensive experiments on VoTa-Bench demonstrate that our D$^2$PO-based method significantly outperforms existing methods and GPT-4o when applied to Qwen2-VL (7B), LLaVA-1.6 (7B), and LLaMA-3.2 (11B), achieving superior task success rates with more efficient execution paths.
中文: 提出的双重偏好优化(D$^2$PO)框架通过联合优化状态预测和行动选择的偏好学习,显著提升了大视觉语言模型在具身任务规划中的性能,在效率和成功率上大幅优于现有方法。
English: The proposed Dual Preference Optimization (D$^2$PO) framework enhances large vision-language models' embodied task planning by jointly optimizing state prediction and action selection through automated preference learning, significantly outperforming existing methods in efficiency and success rates.

Authors:Bowen Wang, Matteo Zecchin, Osvaldo Simeone
Title: Mirror Online Conformal Prediction with Intermittent Feedback
Abstract:
Online conformal prediction enables the runtime calibration of a pre-trained artificial intelligence model using feedback on its performance. Calibration is achieved through set predictions that are updated via online rules so as to ensure long-term coverage guarantees. While recent research has demonstrated the benefits of incorporating prior knowledge into the calibration process, this has come at the cost of replacing coverage guarantees with less tangible regret guarantees based on the quantile loss. This work introduces intermittent mirror online conformal prediction (IM-OCP), a novel runtime calibration framework that integrates prior knowledge, operates under potentially intermittent feedback, and features minimal memory complexity. IM-OCP guarantees long-term coverage and sub-linear regret, both of which hold deterministically for any given data sequence and in expectation with respect to the intermittent feedback.
中文: 在线保形预测通过在线规则更新集合预测来实时校准AI模型以确保长期覆盖保证,本研究提出的IM-OCP新框架整合了先验知识、应对间歇性反馈,并同时保证覆盖率和次线性遗憾。
English: Online conformal prediction allows real-time calibration of AI models through set predictions updated by online rules to ensure long-term coverage, and this work introduces IM-OCP, a novel framework that incorporates prior knowledge, handles intermittent feedback, and guarantees both coverage and sub-linear regret.

Authors:Guanchen Li, Yixing Xu, Zeping Li, Ji Liu, Xuanwu Yin, Dong Li, Emad Barsoum
Title: Týr-the-Pruner: Unlocking Accurate 50% Structural Pruning for LLMs via Global Sparsity Distribution Optimization
Abstract:
Structural pruning enhances hardware-agnostic inference efficiency for large language models (LLMs) but often struggles to maintain performance. Local pruning performs efficient layer-by-layer compression but ignores global topology. Global pruning has the potential to find the optimal solution although resource-intensive. However, existing methods tend to rank structural saliency uniformly, ignoring inter-structure dependencies and failing to achieve end-to-end optimization. To address these limitations, we propose Týr-the-Pruner, an efficient end-to-end search-based global structural pruning framework. This framework constructs a supernet by repeatedly applying local pruning across a range of sparsity ratios to each layer in an LLM, with the core goal of determining the optimal sparsity distribution under a target overall sparsity ratio. Concretely, we introduce an effective local pruning and an expectation error accumulation approach to improve supernet construction. Furthermore, we employ an iterative prune-and-search strategy with coarse-to-fine sparsity granularity to ensure efficient search convergence. Experimental results show that Týr-the-Pruner achieves state-of-the-art structural pruning, retaining 97% of the dense model's performance while removing a challenging 50% of Llama-3.1-70B's parameters.
Chinese: Týr-the-Pruner是一种高效的端到端全局结构化剪枝框架,通过优化各层稀疏度分布,在移除Llama-3.1-70B模型50%参数的同时仍保持97%的原始性能,实现了最先进的剪枝效果。
English: Týr-the-Pruner is an efficient end-to-end global structural pruning framework that optimizes sparsity distribution across layers, achieving state-of-the-art results by retaining 97% performance while removing 50% of Llama-3.1-70B's parameters.

Authors:Weiquan Wang, Jun Xiao, Yueting Zhuang, Long Chen
Title: Physics-Aware Human-Object Rendering from Sparse Views via 3D Gaussian Splatting
Abstract:
Rendering realistic human-object interactions (HOIs) from sparse-view inputs is challenging due to occlusions and incomplete observations, yet crucial for various real-world applications. Existing methods always struggle with either low rendering qualities (\eg, visual fidelity and physically plausible HOIs) or high computational costs. To address these limitations, we propose HOGS (Human-Object Rendering via 3D Gaussian Splatting), a novel framework for efficient and physically plausible HOI rendering from sparse views. Specifically, HOGS combines 3D Gaussian Splatting with a physics-aware optimization process. It incorporates a Human Pose Refinement module for accurate pose estimation and a Sparse-View Human-Object Contact Prediction module for efficient contact region identification. This combination enables coherent joint rendering of human and object Gaussians while enforcing physically plausible interactions. Extensive experiments on the HODome dataset demonstrate that HOGS achieves superior rendering quality, efficiency, and physical plausibility compared to existing methods. We further show its extensibility to hand-object grasp rendering tasks, presenting its broader applicability to articulated object interactions.
中文: HOGS是一种新颖框架,通过结合3D高斯泼溅与物理感知优化,能从稀疏视角高效渲染出真实且物理合理的人-物交互效果,在渲染质量和性能上均优于现有方法。
English: HOGS is a novel framework that combines 3D Gaussian Splatting with physics-aware optimization to efficiently render realistic and physically plausible human-object interactions from sparse views, achieving superior quality and performance over existing methods.

Authors:Unnikrishnan Kunnath Ganesan, Giuseppe Durisi, Matteo Zecchin, Petar Popovski, Osvaldo Simeone
Title: Online Conformal Compression for Zero-Delay Communication with Distortion Guarantees
Abstract:
We investigate a lossy source compression problem in which both the encoder and decoder are equipped with a pre-trained sequence predictor. We propose an online lossy compression scheme that, under a 0-1 loss distortion function, ensures a deterministic, per-sequence upper bound on the distortion (outage) level for any time instant. The outage guarantees apply irrespective of any assumption on the distribution of the sequences to be encoded or on the quality of the predictor at the encoder and decoder. The proposed method, referred to as online conformal compression (OCC), is built upon online conformal prediction--a novel method for constructing confidence intervals for arbitrary predictors. Numerical results show that OCC achieves a compression rate comparable to that of an idealized scheme in which the encoder, with hindsight, selects the optimal subset of symbols to describe to the decoder, while satisfying the overall outage constraint.
Chinese: 本文提出了一种在线保形压缩方法,在无损预测器的基础上确保有损源压缩的确定性失真保证,不受序列分布或预测器质量影响,实现了接近最优的压缩率。
English: This paper introduces an online conformal compression method that ensures deterministic distortion guarantees for lossy source compression using pre-trained predictors, regardless of sequence distribution or predictor quality, achieving near-optimal compression rates.

Authors:Guanrong Li, Haolin Yang, Xinyu Liu, Zhen Wu, Xinyu Dai
Title: Counterfactual Language Reasoning for Explainable Recommendation Systems
Abstract:
Explainable recommendation systems leverage transparent reasoning to foster user trust and improve decision-making processes. Current approaches typically decouple recommendation generation from explanation creation, violating causal precedence principles where explanatory factors should logically precede outcomes. This paper introduces a novel framework integrating structural causal models with large language models to establish causal consistency in recommendation pipelines. Our methodology enforces explanation factors as causal antecedents to recommendation predictions through causal graph construction and counterfactual adjustment. We particularly address the confounding effect of item popularity that distorts personalization signals in explanations, developing a debiasing mechanism that disentangles genuine user preferences from conformity bias. Through comprehensive experiments across multiple recommendation scenarios, we demonstrate that CausalX achieves superior performance in recommendation accuracy, explanation plausibility, and bias mitigation compared to baselines.
中文摘要:本文提出CausalX创新框架,通过结合结构因果模型与大语言模型,将解释因素作为推荐预测的因果前提来保证因果一致性,并有效消除商品流行度对个性化解释的干扰。
English Summary: This paper introduces CausalX, a novel framework that integrates structural causal models with large language models to ensure causal consistency in explainable recommendations by treating explanation factors as causal antecedents to predictions and mitigating item popularity bias.

Authors:Fan Yin, Zifeng Wang, I-Hung Hsu, Jun Yan, Ke Jiang, Yanfei Chen, Jindong Gu, Long T. Le, Kai-Wei Chang, Chen-Yu Lee, Hamid Palangi, Tomas Pfister
Title: Magnet: Multi-turn Tool-use Data Synthesis and Distillation via Graph Translation
Abstract:
Large language models (LLMs) have exhibited the ability to effectively utilize external tools to address user queries. However, their performance may be limited in complex, multi-turn interactions involving users and multiple tools. To address this, we propose Magnet, a principled framework for synthesizing high-quality training trajectories to enhance the function calling capability of large language model agents in multi-turn conversations with humans. The framework is based on automatic and iterative translations from a function signature path to a sequence of queries and executable function calls. We model the complicated function interactions in multi-turn cases with graph and design novel node operations to build reliable signature paths. Motivated by context distillation, when guiding the generation of positive and negative trajectories using a teacher model, we provide reference function call sequences as positive hints in context and contrastive, incorrect function calls as negative hints. Experiments show that training with the positive trajectories with supervised fine-tuning and preference optimization against negative trajectories, our 14B model, Magnet-14B-mDPO, obtains 68.01 on BFCL-v3 and 73.30 on ToolQuery, surpassing the performance of the teacher model Gemini-1.5-pro-002 by a large margin in function calling.
中文: Magnet框架通过基于图的建模和对比学习生成训练轨迹,增强大语言模型在多轮对话中的函数调用能力,其140亿参数模型在基准测试中大幅超越Gemini-1.5-pro-002。
English: The Magnet framework enhances LLMs' multi-turn function calling by generating training trajectories through graph-based modeling and contrastive learning, with its 14B model significantly outperforming Gemini-1.5-pro-002 in benchmarks.

Authors:Haicheng Liao, Hanlin Kong, Bonan Wang, Chengyue Wang, Wang Ye, Zhengbing He, Chengzhong Xu, Zhenning Li
Title: CoT-Drive: Efficient Motion Forecasting for Autonomous Driving with LLMs and Chain-of-Thought Prompting
Abstract:
Accurate motion forecasting is crucial for safe autonomous driving (AD). This study proposes CoT-Drive, a novel approach that enhances motion forecasting by leveraging large language models (LLMs) and a chain-of-thought (CoT) prompting method. We introduce a teacher-student knowledge distillation strategy to effectively transfer LLMs' advanced scene understanding capabilities to lightweight language models (LMs), ensuring that CoT-Drive operates in real-time on edge devices while maintaining comprehensive scene understanding and generalization capabilities. By leveraging CoT prompting techniques for LLMs without additional training, CoT-Drive generates semantic annotations that significantly improve the understanding of complex traffic environments, thereby boosting the accuracy and robustness of predictions. Additionally, we present two new scene description datasets, Highway-Text and Urban-Text, designed for fine-tuning lightweight LMs to generate context-specific semantic annotations. Comprehensive evaluations of five real-world datasets demonstrate that CoT-Drive outperforms existing models, highlighting its effectiveness and efficiency in handling complex traffic scenarios. Overall, this study is the first to consider the practical application of LLMs in this field. It pioneers the training and use of a lightweight LLM surrogate for motion forecasting, setting a new benchmark and showcasing the potential of integrating LLMs into AD systems.
Chinese: CoT-Drive提出了一种创新的自动驾驶运动预测方法,通过结合大语言模型的思维链提示和知识蒸馏技术,在边缘设备上实现实时运行,同时利用增强的交通场景语义理解显著提升预测精度。
English: CoT-Drive introduces a novel motion forecasting method for autonomous driving that utilizes large language models with chain-of-thought prompting and knowledge distillation to enable real-time operation on edge devices while improving prediction accuracy through enhanced semantic understanding of traffic scenes.

Authors:Longchao Da, Tiejin Chen, Zhuoheng Li, Shreyas Bachiraju, Huaiyuan Yao, Li Li, Yushun Dong, Xiyang Hu, Zhengzhong Tu, Dongjie Wang, Yue Zhao, Ben Zhou, Ram Pendyala, Benjamin Stabler, Yezhou Yang, Xuesong Zhou, Hua Wei
Title: Generative AI in Transportation Planning: A Survey
Abstract:
The integration of generative artificial intelligence (GenAI) into transportation planning has the potential to revolutionize tasks such as demand forecasting, infrastructure design, policy evaluation, and traffic simulation. However, there is a critical need for a systematic framework to guide the adoption of GenAI in this interdisciplinary domain. In this survey, we, a multidisciplinary team of researchers spanning computer science and transportation engineering, present the first comprehensive framework for leveraging GenAI in transportation planning. Specifically, we introduce a new taxonomy that categorizes existing applications and methodologies into two perspectives: transportation planning tasks and computational techniques. From the transportation planning perspective, we examine the role of GenAI in automating descriptive, predictive, generative, simulation, and explainable tasks to enhance mobility systems. From the computational perspective, we detail advancements in data preparation, domain-specific fine-tuning, and inference strategies, such as retrieval-augmented generation and zero-shot learning tailored to transportation applications. Additionally, we address critical challenges, including data scarcity, explainability, bias mitigation, and the development of domain-specific evaluation frameworks that align with transportation goals like sustainability, equity, and system efficiency. This survey aims to bridge the gap between traditional transportation planning methodologies and modern AI techniques, fostering collaboration and innovation. By addressing these challenges and opportunities, we seek to inspire future research that ensures ethical, equitable, and impactful use of generative AI in transportation planning.
中文: 本调查首次提出将生成式人工智能融入交通规划的综合框架,涵盖关键应用、计算技术及数据稀缺等挑战,旨在弥合传统方法与现代AI的差距,推动符合伦理且具影响力的创新。
English: This survey introduces the first comprehensive framework for integrating generative AI into transportation planning, addressing key applications, computational techniques, and challenges like data scarcity and bias to bridge traditional methods with modern AI for ethical and impactful innovation.

Authors:Zhiyuan Ning, Zaitian Wang, Ran Zhang, Ping Xu, Kunpeng Liu, Pengyang Wang, Wei Ju, Pengfei Wang, Yuanchun Zhou, Erik Cambria, Chong Chen
Title: Deep Cut-informed Graph Embedding and Clustering
Abstract:
Graph clustering aims to divide the graph into different clusters. The recently emerging deep graph clustering approaches are largely built on graph neural networks (GNN). However, GNN is designed for general graph encoding and there is a common issue of representation collapse in existing GNN-based deep graph clustering algorithms. We attribute two main reasons for such issues: (i) the inductive bias of GNN models: GNNs tend to generate similar representations for proximal nodes. Since graphs often contain a non-negligible amount of inter-cluster links, the bias results in error message passing and leads to biased clustering; (ii) the clustering guided loss function: most traditional approaches strive to make all samples closer to pre-learned cluster centers, which causes a degenerate solution assigning all data points to a single label thus making all samples similar and less discriminative. To address these challenges, we investigate graph clustering from a graph cut perspective and propose an innovative and non-GNN-based Deep Cut-informed Graph embedding and Clustering framework, namely DCGC. This framework includes two modules: (i) cut-informed graph encoding; (ii) self-supervised graph clustering via optimal transport. For the encoding module, we derive a cut-informed graph embedding objective to fuse graph structure and attributes by minimizing their joint normalized cut. For the clustering module, we utilize the optimal transport theory to obtain the clustering assignments, which can balance the guidance of "proximity to the pre-learned cluster center". With the above two tailored designs, DCGC is more suitable for the graph clustering task, which can effectively alleviate the problem of representation collapse and achieve better performance. We conduct extensive experiments to demonstrate that our method is simple but effective compared with benchmarks.
中文: 基于图神经网络的深度图聚类方法常因归纳偏差和聚类损失函数导致表示塌陷,为此提出的DCGC框架采用割信息编码和最优传输聚类,有效缓解了这一问题并提升了性能。
English: Deep graph clustering methods based on GNNs often suffer from representation collapse due to inductive bias and clustering loss functions, prompting the development of DCGC—a non-GNN framework using cut-informed encoding and optimal transport clustering to enhance performance.

Authors:Francisco de Arriba-Pérez, Silvia García-Méndez, Fátima Leal, Benedita Malheiro, Juan C Burguillo
Title: Identification and explanation of disinformation in wiki data streams
Abstract:
Social media platforms, increasingly used as news sources for varied data analytics, have transformed how information is generated and disseminated. However, the unverified nature of this content raises concerns about trustworthiness and accuracy, potentially negatively impacting readers' critical judgment due to disinformation. This work aims to contribute to the automatic data quality validation field, addressing the rapid growth of online content on wiki pages. Our scalable solution includes stream-based data processing with feature engineering, feature analysis and selection, stream-based classification, and real-time explanation of prediction outcomes. The explainability dashboard is designed for the general public, who may need more specialized knowledge to interpret the model's prediction. Experimental results on two datasets attain approximately 90 % values across all evaluation metrics, demonstrating robust and competitive performance compared to works in the literature. In summary, the system assists editors by reducing their effort and time in detecting disinformation.
Chinese: 本研究开发了一个可扩展的系统,通过流式数据处理和可解释性仪表盘自动验证维基页面的数据质量,帮助公众识别虚假信息,各项评估指标达到约90%的准确率,有效减轻编辑人员的工作负担。
English: This study develops a scalable system for automatically validating data quality on wiki pages, using stream-based processing and an explainable dashboard to help the public detect disinformation with about 90% accuracy across metrics, thereby reducing editors' workload.

Authors:Jungho Lee, Donghyeong Kim, Dogyoon Lee, Suhwan Cho, Minhyeok Lee, Wonjoon Lee, Taeoh Kim, Dongyoon Wee, Sangyoun Lee
Title: CoMoGaussian: Continuous Motion-Aware Gaussian Splatting from Motion-Blurred Images
Abstract:
3D Gaussian Splatting (3DGS) has gained significant attention due to its high-quality novel view rendering, motivating research to address real-world challenges. A critical issue is the camera motion blur caused by movement during exposure, which hinders accurate 3D scene reconstruction. In this study, we propose CoMoGaussian, a Continuous Motion-Aware Gaussian Splatting that reconstructs precise 3D scenes from motion-blurred images while maintaining real-time rendering speed. Considering the complex motion patterns inherent in real-world camera movements, we predict continuous camera trajectories using neural ordinary differential equations (ODEs). To ensure accurate modeling, we employ rigid body transformations, preserving the shape and size of the object but rely on the discrete integration of sampled frames. To better approximate the continuous nature of motion blur, we introduce a continuous motion refinement (CMR) transformation that refines rigid transformations by incorporating additional learnable parameters. By revisiting fundamental camera theory and leveraging advanced neural ODE techniques, we achieve precise modeling of continuous camera trajectories, leading to improved reconstruction accuracy. Extensive experiments demonstrate state-of-the-art performance both quantitatively and qualitatively on benchmark datasets, which include a wide range of motion blur scenarios, from moderate to extreme blur.
Chinese: 本研究提出CoMoGaussian方法,通过连续运动感知的高斯泼溅和神经常微分方程,从运动模糊图像中精确重建3D场景,在实时渲染中取得了最先进的性能。
English: This study introduces CoMoGaussian, a method that uses continuous motion-aware Gaussian splatting and neural ODEs to reconstruct accurate 3D scenes from motion-blurred images, achieving state-of-the-art results in real-time rendering.

Authors:Jindong Jiang, Xiuyu Li, Zhijian Liu, Muyang Li, Guo Chen, Zhiqi Li, De-An Huang, Guilin Liu, Zhiding Yu, Kurt Keutzer, Sungjin Ahn, Jan Kautz, Hongxu Yin, Yao Lu, Song Han, Wonmin Byeon
Title: STORM: Token-Efficient Long Video Understanding for Multimodal LLMs
Abstract:
Recent advances in video-based multimodal large language models (Video-LLMs) have significantly improved video understanding by processing videos as sequences of image frames. However, many existing methods treat frames independently in the vision backbone, lacking explicit temporal modeling, which limits their ability to capture dynamic patterns and efficiently handle long videos. To address these limitations, we introduce STORM (Spatiotemporal TOken Reduction for Multimodal LLMs), a novel architecture incorporating a dedicated temporal encoder between the image encoder and the LLM. Our temporal encoder leverages the Mamba State Space Model to integrate temporal information into image tokens, generating enriched representations that preserve inter-frame dynamics across the entire video sequence. This enriched encoding not only enhances video reasoning capabilities but also enables effective token reduction strategies, including test-time sampling and training-based temporal and spatial pooling, substantially reducing computational demands on the LLM without sacrificing key temporal information. By integrating these techniques, our approach simultaneously reduces training and inference latency while improving performance, enabling efficient and robust video understanding over extended temporal contexts. Extensive evaluations show that STORM achieves state-of-the-art results across various long video understanding benchmarks (more than 5% improvement on MLVU and LongVideoBench) while reducing the computation costs by up to $8\times$ and the decoding latency by 2.4-2.9$\times$ for the fixed numbers of input frames. Project page is available at https://research.nvidia.com/labs/lpr/storm
Chinese: STORM提出了一种新颖的视频理解架构,通过集成基于Mamba状态空间模型的时间编码器来增强帧间动态表征,在长视频基准测试中实现最优性能,同时将计算成本降低高达8倍、延迟减少2.4-2.9倍。
English: STORM introduces a novel video understanding architecture that integrates a temporal encoder using the Mamba State Space Model to enrich frame representations with temporal dynamics, achieving state-of-the-art performance on long video benchmarks while reducing computational costs by up to 8 times and latency by 2.4-2.9 times.

Authors:Suhwan Cho, Seunghoon Lee, Minhyeok Lee, Jungho Lee, Sangyoun Lee
Title: Find First, Track Next: Decoupling Identification and Propagation in Referring Video Object Segmentation
Abstract:
Referring video object segmentation aims to segment and track a target object in a video using a natural language prompt. Existing methods typically fuse visual and textual features in a highly entangled manner, processing multi-modal information together to generate per-frame masks. However, this approach often struggles with ambiguous target identification, particularly in scenes with multiple similar objects, and fails to ensure consistent mask propagation across frames. To address these limitations, we introduce FindTrack, an efficient decoupled framework that separates target identification from mask propagation. FindTrack first adaptively selects a key frame by balancing segmentation confidence and vision-text alignment, establishing a robust reference for the target object. This reference is then utilized by a dedicated propagation module to track and segment the object across the entire video. By decoupling these processes, FindTrack effectively reduces ambiguities in target association and enhances segmentation consistency. FindTrack significantly outperforms all existing methods on public benchmarks, demonstrating its superiority.
中文: FindTrack提出一种解耦框架,将目标识别与掩码传播分离,有效减少歧义并提升分割一致性,在公开基准测试中显著优于现有方法。
English: FindTrack introduces a decoupled framework that separates target identification from mask propagation, effectively reducing ambiguities and enhancing segmentation consistency to outperform existing methods on benchmarks.

Authors:Haiduo Huang, Fuwei Yang, Dong Li, Ji Liu, Lu Tian, Jinzhang Peng, Pengju Ren, Emad Barsoum
Title: Partial Convolution Meets Visual Attention
Abstract:
Designing an efficient and effective neural network has remained a prominent topic in computer vision research. Depthwise onvolution (DWConv) is widely used in efficient CNNs or ViTs, but it needs frequent memory access during inference, which leads to low throughput. FasterNet attempts to introduce partial convolution (PConv) as an alternative to DWConv but compromises the accuracy due to underutilized channels. To remedy this shortcoming and consider the redundancy between feature map channels, we introduce a novel Partial visual ATtention mechanism (PAT) that can efficiently combine PConv with visual attention. Our exploration indicates that the partial attention mechanism can completely replace the full attention mechanism and reduce model parameters and FLOPs. Our PAT can derive three types of blocks: Partial Channel-Attention block (PAT_ch), Partial Spatial-Attention block (PAT_sp) and Partial Self-Attention block (PAT_sf). First, PAT_ch integrates the enhanced Gaussian channel attention mechanism to infuse global distribution information into the untouched channels of PConv. Second, we introduce the spatial-wise attention to the MLP layer to further improve model accuracy. Finally, we replace PAT_ch in the last stage with the self-attention mechanism to extend the global receptive field. Building upon PAT, we propose a novel hybrid network family, named PATNet, which achieves superior top-1 accuracy and inference speed compared to FasterNet on ImageNet-1K classification and excel in both detection and segmentation on the COCO dataset. Particularly, our PATNet-T2 achieves 1.3% higher accuracy than FasterNet-T2, while exhibiting 25% higher GPU throughput and 24% lower CPU latency.
中文摘要:提出的部分视觉注意力(PAT)机制将部分卷积与视觉注意力相结合,构建的PATNet混合网络在ImageNet-1K和COCO数据集上实现了比FasterNet更优的精度与推理速度。
English Summary: The proposed Partial visual ATtention (PAT) mechanism combines partial convolution with visual attention to create PATNet, a hybrid network that outperforms FasterNet in accuracy and speed on ImageNet-1K and COCO datasets.

Authors:Jianqiao Chen, Nan Ma, Wenkai Liu, Xiaodong Xu, Ping Zhang
Title: Continual Learning-Aided Super-Resolution Scheme for Channel Reconstruction and Generalization in OFDM Systems
Abstract:
Channel reconstruction and generalization capability are of equal importance for developing channel estimation schemes within deep learning (DL) framework. In this paper, we exploit a novel DL-based scheme for efficient OFDM channel estimation where the neural networks for channel reconstruction and generalization are respectively designed. For the former, we propose a dual-attention-aided super-resolution neural network (DA-SRNN) to map the channels at pilot positions to the whole time-frequency channels. Specifically, the channel-spatial attention mechanism is first introduced to sequentially infer attention maps along two separate dimensions corresponding to two types of underlying channel correlations, and then the lightweight SR module is developed for efficient channel reconstruction. For the latter, we introduce continual learning (CL)-aided training strategies to make the neural network adapt to different channel distributions. Specifically, the elastic weight consolidation (EWC) is introduced as the regularization term in regard to loss function of channel reconstruction, which can constrain the direction and space of updating the important weights of neural networks among different channel distributions. Meanwhile, the corresponding training process is provided in detail. By evaluating under 3rd Generation Partnership Project (3GPP) channel models, numerical results verify the superiority of the proposed channel estimation scheme with significantly improved channel reconstruction and generalization performance over counterparts.
中文摘要:本文提出了一种基于双重注意力的超分辨率神经网络进行OFDM信道重建,并采用持续学习策略增强对不同信道分布的泛化能力,在3GPP标准信道模型下的测试表明该方案显著优于现有方法。
English Summary: This paper proposes a dual-attention-aided super-resolution neural network for OFDM channel reconstruction and incorporates continual learning strategies to enhance generalization across varying channel distributions, demonstrating superior performance in 3GPP evaluations.

Authors:Haojun Chen, Minghao Liu, Chengdong Ma, Xiaojian Ma, Zailin Ma, Huimin Wu, Yuanpei Chen, Yifan Zhong, Mingzhi Wang, Qing Li, Yaodong Yang
Title: Falcon: Fast Visuomotor Policies via Partial Denoising
Abstract:
Diffusion policies are widely adopted in complex visuomotor tasks for their ability to capture multimodal action distributions. However, the multiple sampling steps required for action generation significantly harm real-time inference efficiency, which limits their applicability in real-time decision-making scenarios. Existing acceleration techniques either require retraining or degrade performance under low sampling steps. Here we propose Falcon, which mitigates this speed-performance trade-off and achieves further acceleration. The core insight is that visuomotor tasks exhibit sequential dependencies between actions. Falcon leverages this by reusing partially denoised actions from historical information rather than sampling from Gaussian noise at each step. By integrating current observations, Falcon reduces sampling steps while preserving performance. Importantly, Falcon is a training-free algorithm that can be applied as a plug-in to further improve decision efficiency on top of existing acceleration techniques. We validated Falcon in 48 simulated environments and 2 real-world robot experiments. demonstrating a 2-7x speedup with negligible performance degradation, offering a promising direction for efficient visuomotor policy design.
中文: Falcon是一种无需训练的算法,通过复用历史部分去噪动作来加速视觉运动任务中的扩散策略,在仿真和真实实验中实现了2-7倍加速且性能几乎无损。
English: Falcon is a training-free algorithm that accelerates diffusion policies in visuomotor tasks by reusing historical partially denoised actions, achieving 2-7x speedup with minimal performance loss across simulations and real-world tests.

Authors:Boyuan Wang, Xiaofeng Wang, Chaojun Ni, Guosheng Zhao, Zhiqin Yang, Zheng Zhu, Muyang Zhang, Yukun Zhou, Xinze Chen, Guan Huang, Lihong Liu, Xingang Wang
Title: HumanDreamer: Generating Controllable Human-Motion Videos via Decoupled Generation
Abstract:
Human-motion video generation has been a challenging task, primarily due to the difficulty inherent in learning human body movements. While some approaches have attempted to drive human-centric video generation explicitly through pose control, these methods typically rely on poses derived from existing videos, thereby lacking flexibility. To address this, we propose HumanDreamer, a decoupled human video generation framework that first generates diverse poses from text prompts and then leverages these poses to generate human-motion videos. Specifically, we propose MotionVid, the largest dataset for human-motion pose generation. Based on the dataset, we present MotionDiT, which is trained to generate structured human-motion poses from text prompts. Besides, a novel LAMA loss is introduced, which together contribute to a significant improvement in FID by 62.4%, along with respective enhancements in R-precision for top1, top2, and top3 by 41.8%, 26.3%, and 18.3%, thereby advancing both the Text-to-Pose control accuracy and FID metrics. Our experiments across various Pose-to-Video baselines demonstrate that the poses generated by our method can produce diverse and high-quality human-motion videos. Furthermore, our model can facilitate other downstream tasks, such as pose sequence prediction and 2D-3D motion lifting.
中文:HumanDreamer提出解耦框架,先通过基于MotionVid数据集训练的MotionDiT从文本生成多样化人体姿态,再驱动高质量人体运动视频生成,在FID和R-precision指标上显著提升,并能支持姿态序列预测等下游任务。
English: HumanDreamer introduces a decoupled framework that first generates diverse human poses from text prompts using MotionDiT trained on the MotionVid dataset, then produces high-quality human-motion videos with improved FID and R-precision metrics, enabling versatile downstream applications.

Authors:Chaojian Li, Sixu Li, Linrui Jiang, Jingqun Zhang, Yingyan Celine Lin
Title: Uni-Render: A Unified Accelerator for Real-Time Rendering Across Diverse Neural Renderers
Abstract:
Recent advancements in neural rendering technologies and their supporting devices have paved the way for immersive 3D experiences, significantly transforming human interaction with intelligent devices across diverse applications. However, achieving the desired real-time rendering speeds for immersive interactions is still hindered by (1) the lack of a universal algorithmic solution for different application scenarios and (2) the dedication of existing devices or accelerators to merely specific rendering pipelines. To overcome this challenge, we have developed a unified neural rendering accelerator that caters to a wide array of typical neural rendering pipelines, enabling real-time and on-device rendering across different applications while maintaining both efficiency and compatibility. Our accelerator design is based on the insight that, although neural rendering pipelines vary and their algorithm designs are continually evolving, they typically share common operators, predominantly executing similar workloads. Building on this insight, we propose a reconfigurable hardware architecture that can dynamically adjust dataflow to align with specific rendering metric requirements for diverse applications, effectively supporting both typical and the latest hybrid rendering pipelines. Benchmarking experiments and ablation studies on both synthetic and real-world scenes demonstrate the effectiveness of the proposed accelerator. The proposed unified accelerator stands out as the first solution capable of achieving real-time neural rendering across varied representative pipelines on edge devices, potentially paving the way for the next generation of neural graphics applications.
中文: 作者开发了一种统一的神经渲染加速器,通过利用通用算子和可重构架构支持多种渲染流程,实现了在边缘设备上针对不同应用的高效实时渲染。
English: The authors have developed a unified neural rendering accelerator that supports diverse pipelines by leveraging common operators and a reconfigurable architecture, enabling real-time, efficient rendering on edge devices for various applications.

Authors:Guanqiao Qu, Qian Chen, Xianhao Chen, Kaibin Huang, Yuguang Fang
Title: PartialLoading: User Scheduling and Bandwidth Allocation for Parameter-sharing Edge Inference
Abstract:
By provisioning inference offloading services, edge inference drives the rapid growth of AI applications at the network edge. However, achieving high task throughput with stringent latency requirements remains a significant challenge. To address this issue, we develop a parameter-sharing AI model loading (PartialLoading) framework for multi-user edge inference, which exploits two key insights: 1) the majority of latency arises from loading AI models into server GPU memory, and 2) different AI models can share a significant number of parameters, for which redundant loading should be avoided. Towards this end, we formulate a joint multi-user scheduling and spectrum bandwidth allocation problem to maximize task throughput by exploiting shared parameter blocks across models. The intuition is to judiciously schedule user requests to reuse the shared parameter blocks between consecutively loaded models, thereby reducing model loading time substantially. To facilitate solution finding, we decouple the problem into two sub-problems, i.e., user scheduling and bandwidth allocation, showing that solving them sequentially is equivalent to solving the original problem. Due to the NP-hardness of the problem, we first study an important special case called the "bottom-layer-sharing" case, where AI models share some bottom layers within clusters, and design a dynamic programming-based algorithm to obtain the optimal solution in polynomial time. For the general case, where shared parameter blocks appear at arbitrary positions within AI models, we propose a greedy heuristic to obtain the sub-optimal solution efficiently. Simulation results demonstrate that the proposed framework significantly improves task throughput under deadline constraints compared with user scheduling without exploiting parameter sharing.
中文摘要:提出的PartialLoading框架通过优化多用户调度和带宽分配,利用AI模型间的共享参数减少加载时间,从而在截止期限约束下显著提升了边缘推理的任务吞吐量。
English Summary: The proposed PartialLoading framework enhances edge inference task throughput by optimizing multi-user scheduling and bandwidth allocation to exploit shared parameters across AI models, thereby reducing model loading delays.

Authors:Ryuta Nagahama, Weiwei Wan, Zhengtao Hu, Kensuke Harada
Title: Bimanual Regrasp Planning and Control for Active Reduction of Object Pose Uncertainty
Abstract:
Precisely grasping an object is a challenging task due to pose uncertainties. Conventional methods have used cameras and fixtures to reduce object uncertainty. They are effective but require intensive preparation, such as designing jigs based on the object geometry and calibrating cameras with high-precision tools fabricated using lasers. In this study, we propose a method to reduce the uncertainty of the position and orientation of a grasped object without using a fixture or a camera. Our method is based on the concept that the flat finger pads of a parallel gripper can reduce uncertainty along its opening/closing direction through flat surface contact. Three orthogonal grasps by parallel grippers with flat finger pads collectively constrain an object's position and orientation to a unique state. Guided by the concepts, we develop a regrasp planning and admittance control approach that sequentially finds and leverages three orthogonal grasps of two robotic arms to actively reduce uncertainties in the object pose. We evaluated the proposed method on different initial object uncertainties and verified that it had good repeatability. The deviation levels of the experimental trials were on the same order of magnitude as those of an optical tracking system, demonstrating strong relative inference performance.
中文: 本研究提出一种利用平行夹具的平面指垫通过三次正交抓取来降低物体位姿不确定性的方法,无需夹具或相机,且重复性良好,性能接近光学跟踪系统。
English: This study introduces a method using parallel grippers with flat finger pads to reduce object pose uncertainty through three orthogonal grasps, eliminating the need for fixtures or cameras and achieving high repeatability comparable to optical tracking systems.

Authors:Xinyi Yuan, Weiwei Wan, Kensuke Harada
Title: IKSel: Selecting Good Seed Joint Values for Fast Numerical Inverse Kinematics Iterations
Abstract:
This paper revisits the numerical inverse kinematics (IK) problem, leveraging modern computational resources and refining the seed selection process to develop a solver that is competitive with analytical-based methods. The proposed seed selection strategy consists of three key stages: (1) utilizing a K-Dimensional Tree (KDTree) to identify seed candidates based on workspace proximity, (2) sorting candidates by joint space adjustment and attempting numerical iterations with the one requiring minimal adjustment, and (3) re-selecting the most distant joint configurations for new attempts in case of failures. The joint space adjustment-based seed selection increases the likelihood of rapid convergence, while the re-attempt strategy effectively helps circumvent local minima and joint limit constraints. Comparison results with both traditional numerical solvers and learning-based methods demonstrate the strengths of the proposed approach in terms of success rate, time efficiency, and accuracy. Additionally, we conduct detailed ablation studies to analyze the effects of various parameters and solver settings, providing practical insights for customization and optimization. The proposed method consistently exhibits high success rates and computational efficiency. It is suitable for time-sensitive applications.
本文提出了一种改进的数值逆运动学求解器,通过工作空间邻近性和关节空间调整优化种子选择,实现了高成功率和实时应用的高效性。
This paper introduces an enhanced numerical inverse kinematics solver that improves seed selection through workspace proximity and joint space adjustments, achieving high success rates and efficiency for real-time applications.

Authors:Qingqing Zhao, Yao Lu, Moo Jin Kim, Zipeng Fu, Zhuoyang Zhang, Yecheng Wu, Zhaoshuo Li, Qianli Ma, Song Han, Chelsea Finn, Ankur Handa, Ming-Yu Liu, Donglai Xiang, Gordon Wetzstein, Tsung-Yi Lin
Title: CoT-VLA: Visual Chain-of-Thought Reasoning for Vision-Language-Action Models
Abstract:
Vision-language-action models (VLAs) have shown potential in leveraging pretrained vision-language models and diverse robot demonstrations for learning generalizable sensorimotor control. While this paradigm effectively utilizes large-scale data from both robotic and non-robotic sources, current VLAs primarily focus on direct input--output mappings, lacking the intermediate reasoning steps crucial for complex manipulation tasks. As a result, existing VLAs lack temporal planning or reasoning capabilities. In this paper, we introduce a method that incorporates explicit visual chain-of-thought (CoT) reasoning into vision-language-action models (VLAs) by predicting future image frames autoregressively as visual goals before generating a short action sequence to achieve these goals. We introduce CoT-VLA, a state-of-the-art 7B VLA that can understand and generate visual and action tokens. Our experimental results demonstrate that CoT-VLA achieves strong performance, outperforming the state-of-the-art VLA model by 17% in real-world manipulation tasks and 6% in simulation benchmarks. Project website: https://cot-vla.github.io/
中文: 本文提出的CoT-VLA模型通过自回归预测未来图像帧的视觉思维链推理机制,显著提升了视觉语言动作模型在复杂操作任务中的性能,在真实世界和仿真环境中分别超越现有最优模型17%和6%。
English: The paper introduces CoT-VLA, a vision-language-action model that enhances manipulation tasks by incorporating explicit visual chain-of-thought reasoning through autoregressive prediction of future image frames, achieving significant performance improvements over existing models.

Authors:Robin Dietrich, Tobias Fischer, Nicolai Waniek, Nico Reeb, Michael Milford, Alois Knoll, Adam D. Hines
Title: Threshold Adaptation in Spiking Networks Enables Shortest Path Finding and Place Disambiguation
Abstract:
Efficient spatial navigation is a hallmark of the mammalian brain, inspiring the development of neuromorphic systems that mimic biological principles. Despite progress, implementing key operations like back-tracing and handling ambiguity in bio-inspired spiking neural networks remains an open challenge. This work proposes a mechanism for activity back-tracing in arbitrary, uni-directional spiking neuron graphs. We extend the existing replay mechanism of the spiking hierarchical temporal memory (S-HTM) by our spike timing-dependent threshold adaptation (STDTA), which enables us to perform path planning in networks of spiking neurons. We further present an ambiguity dependent threshold adaptation (ADTA) for identifying places in an environment with less ambiguity, enhancing the localization estimate of an agent. Combined, these methods enable efficient identification of the shortest path to an unambiguous target. Our experiments show that a network trained on sequences reliably computes shortest paths with fewer replays than the steps required to reach the target. We further show that we can identify places with reduced ambiguity in multiple, similar environments. These contributions advance the practical application of biologically inspired sequential learning algorithms like the S-HTM towards neuromorphic localization and navigation.
中文摘要:本研究提出了一种在脉冲神经网络中进行活动回溯的新机制,通过结合脉冲时序依赖阈值适应和模糊度依赖阈值适应,实现了神经形态系统中高效的路径规划和改进的定位能力。
English Summary: This study introduces a novel mechanism for activity back-tracing in spiking neural networks, combining spike timing-dependent threshold adaptation and ambiguity-dependent threshold adaptation to enable efficient path planning and improved localization in neuromorphic systems.

Authors:Ryunosuke Takebayashi, Vitor Hideyo Isume, Takuya Kiyokawa, Weiwei Wan, Kensuke Harada
Title: Cooking Task Planning using LLM and Verified by Graph Network
Abstract:
Cooking tasks remain a challenging problem for robotics due to their complexity. Videos of people cooking are a valuable source of information for such task, but introduces a lot of variability in terms of how to translate this data to a robotic environment. This research aims to streamline this process, focusing on the task plan generation step, by using a Large Language Model (LLM)-based Task and Motion Planning (TAMP) framework to autonomously generate cooking task plans from videos with subtitles, and execute them. Conventional LLM-based task planning methods are not well-suited for interpreting the cooking video data due to uncertainty in the videos, and the risk of hallucination in its output. To address both of these problems, we explore using LLMs in combination with Functional Object-Oriented Networks (FOON), to validate the plan and provide feedback in case of failure. This combination can generate task sequences with manipulation motions that are logically correct and executable by a robot. We compare the execution of the generated plans for 5 cooking recipes from our approach against the plans generated by a few-shot LLM-only approach for a dual-arm robot setup. It could successfully execute 4 of the plans generated by our approach, whereas only 1 of the plans generated by solely using the LLM could be executed.
中文: 本研究提出了一种结合大型语言模型与功能对象导向网络的混合方法,可从带字幕的烹饪视频中生成可执行的任务计划,在五个食谱测试中成功执行了四个,而仅使用语言模型的方法仅能执行一个。
English: This research introduces a hybrid approach combining Large Language Models with Functional Object-Oriented Networks to generate executable cooking task plans from subtitled videos, successfully executing 4 out of 5 recipes compared to only 1 with LLM-only methods.

Authors:Mohammad R. Hajidavalloo, Kaixiang Zhang, Vaibhav Srivastava, Zhaojian Li
Title: Model-free Vehicle Rollover Prevention: A Data-driven Predictive Control Approach
Abstract:
Vehicle rollovers pose a significant safety risk and account for a disproportionately high number of fatalities in road accidents. This paper addresses the challenge of rollover prevention using Data-EnablEd Predictive Control (DeePC), a data-driven control strategy that directly leverages raw input-output data to maintain vehicle stability without requiring explicit system modeling. To enhance computational efficiency, we employ a reduced-dimension DeePC that utilizes singular value decomposition-based dimension reduction to significantly lower computation complexity without compromising control performance. This optimization enables real-time application in scenarios with high-dimensional data, making the approach more practical for deployment in real-world vehicles. The proposed approach is validated through high-fidelity CarSim simulations in both sedan and utility truck scenarios, demonstrating its versatility and ability to maintain vehicle stability under challenging driving conditions. Comparative results with Linear Model Predictive Control (LMPC) highlight the superior performance of DeePC in preventing rollovers while preserving maneuverability. The findings suggest that DeePC offers a robust and adaptable solution for rollover prevention, capable of handling varying road and vehicle conditions.
中文: 本文提出了一种基于降维数据驱动预测控制的方法来预防车辆侧翻,通过仿真验证了该方法相比传统控制策略在实时性和稳定性维护方面具有更优性能。
English: This paper presents a data-driven control strategy using dimension-reduced Data-EnablEd Predictive Control (DeePC) to prevent vehicle rollovers, which demonstrates superior real-time performance and stability maintenance over traditional methods in simulations.

Authors:Zehui Liao, Shishuai Hu, Ke Zou, Huazhu Fu, Liangli Zhen, Yong Xia
Title: Vision-Amplified Semantic Entropy for Hallucination Detection in Medical Visual Question Answering
Abstract:
Multimodal large language models (MLLMs) have demonstrated significant potential in medical Visual Question Answering (VQA). Yet, they remain prone to hallucinations-incorrect responses that contradict input images, posing substantial risks in clinical decision-making. Detecting these hallucinations is essential for establishing trust in MLLMs among clinicians and patients, thereby enabling their real-world adoption. Current hallucination detection methods, especially semantic entropy (SE), have demonstrated promising hallucination detection capacity for LLMs. However, adapting SE to medical MLLMs by incorporating visual perturbations presents a dilemma. Weak perturbations preserve image content and ensure clinical validity, but may be overlooked by medical MLLMs, which tend to over rely on language priors. In contrast, strong perturbations can distort essential diagnostic features, compromising clinical interpretation. To address this issue, we propose Vision Amplified Semantic Entropy (VASE), which incorporates weak image transformations and amplifies the impact of visual input, to improve hallucination detection in medical VQA. We first estimate the semantic predictive distribution under weak visual transformations to preserve clinical validity, and then amplify visual influence by contrasting this distribution with that derived from a distorted image. The entropy of the resulting distribution is estimated as VASE. Experiments on two medical open-ended VQA datasets demonstrate that VASE consistently outperforms existing hallucination detection methods.
中文: 多模态大语言模型在医学视觉问答中潜力显著但存在幻觉问题,VASE通过弱图像变换和增强视觉输入来提升检测效果,优于现有方法。
English: Multimodal large language models (MLLMs) show promise in medical visual question answering but suffer from hallucinations, which VASE addresses by using weak image transformations and amplifying visual input to improve detection accuracy.

Authors:Mingze Xu, Mingfei Gao, Shiyu Li, Jiasen Lu, Zhe Gan, Zhengfeng Lai, Meng Cao, Kai Kang, Yinfei Yang, Afshin Dehghan
Title: SlowFast-LLaVA-1.5: A Family of Token-Efficient Video Large Language Models for Long-Form Video Understanding
Abstract:
We introduce SlowFast-LLaVA-1.5 (abbreviated as SF-LLaVA-1.5), a family of video large language models (LLMs) offering a token-efficient solution for long-form video understanding. We incorporate the two-stream SlowFast mechanism into a streamlined training pipeline, and perform joint video-image training on a carefully curated data mixture of only publicly available datasets. Our primary focus is on highly efficient model scales (1B and 3B), demonstrating that even relatively small Video LLMs can achieve state-of-the-art performance on video understanding, meeting the demand for mobile-friendly models. Experimental results demonstrate that SF-LLaVA-1.5 achieves superior performance on a wide range of video and image tasks, with robust results at all model sizes (ranging from 1B to 7B). Notably, SF-LLaVA-1.5 achieves state-of-the-art results in long-form video understanding (e.g., LongVideoBench and MLVU) and excels at small scales across various video benchmarks.
中文:SlowFast-LLaVA-1.5是一个令牌高效的视频大模型系列,通过简化的SlowFast双流训练流程,在长视频理解任务中实现了最先进的性能,即使在适合移动端的1B-3B小规模模型上也能保持卓越表现。
English: SlowFast-LLaVA-1.5 is a token-efficient family of video large language models that achieves state-of-the-art performance in long-form video understanding through a streamlined SlowFast training pipeline, even at small 1B-3B scales suitable for mobile applications.

Authors:Ye Tian, Xin Xia, Yuxi Ren, Shanchuan Lin, Xing Wang, Xuefeng Xiao, Yunhai Tong, Ling Yang, Bin Cui
Title: Training-free Diffusion Acceleration with Bottleneck Sampling
Abstract:
Diffusion models have demonstrated remarkable capabilities in visual content generation but remain challenging to deploy due to their high computational cost during inference. This computational burden primarily arises from the quadratic complexity of self-attention with respect to image or video resolution. While existing acceleration methods often compromise output quality or necessitate costly retraining, we observe that most diffusion models are pre-trained at lower resolutions, presenting an opportunity to exploit these low-resolution priors for more efficient inference without degrading performance. In this work, we introduce Bottleneck Sampling, a training-free framework that leverages low-resolution priors to reduce computational overhead while preserving output fidelity. Bottleneck Sampling follows a high-low-high denoising workflow: it performs high-resolution denoising in the initial and final stages while operating at lower resolutions in intermediate steps. To mitigate aliasing and blurring artifacts, we further refine the resolution transition points and adaptively shift the denoising timesteps at each stage. We evaluate Bottleneck Sampling on both image and video generation tasks, where extensive experiments demonstrate that it accelerates inference by up to 3$\times$ for image generation and 2.5$\times$ for video generation, all while maintaining output quality comparable to the standard full-resolution sampling process across multiple evaluation metrics.
中文摘要:Bottleneck Sampling是一种无需重新训练的高效框架,通过采用“高-低-高”去噪流程利用低分辨率先验,在保持图像和视频生成质量的同时,将推理速度分别提升至3倍和2.5倍。
English Summary: Bottleneck Sampling is a training-free framework that accelerates diffusion model inference by leveraging low-resolution priors in a high-low-high denoising workflow, achieving up to 3× speedup for images and 2.5× for videos without compromising output quality.

Authors:Meng Cao, Pengfei Hu, Yingyao Wang, Jihao Gu, Haoran Tang, Haoze Zhao, Chen Wang, Jiahua Dong, Wangbo Yu, Ge Zhang, Jun Song, Xiang Li, Bo Zheng, Ian Reid, Xiaodan Liang
Title: Video SimpleQA: Towards Factuality Evaluation in Large Video Language Models
Abstract:
Recent advancements in Large Video Language Models (LVLMs) have highlighted their potential for multi-modal understanding, yet evaluating their factual grounding in videos remains a critical unsolved challenge. To address this gap, we introduce Video SimpleQA, the first comprehensive benchmark tailored for factuality evaluation in video contexts. Our work differs from existing video benchmarks through the following key features: 1) Knowledge required: demanding integration of external knowledge beyond the video's explicit narrative; 2) Multi-hop fact-seeking question: Each question involves multiple explicit facts and requires strict factual grounding without hypothetical or subjective inferences. We also include per-hop single-fact-based sub-QAs alongside final QAs to enable fine-grained, stepby-step evaluation; 3) Short-form definitive answer: Answers are crafted as unambiguous and definitively correct in a short format with minimal scoring variance; 4) Temporal grounded required: Requiring answers to rely on one or more temporal segments in videos, rather than single frames. We extensively evaluate 33 state-of-the-art LVLMs and summarize key findings as follows: 1) Current LVLMs exhibit notable deficiencies in factual adherence, with the best-performing model o3 merely achieving an F-score of 66.3%; 2) Most LVLMs are overconfident in what they generate, with self-stated confidence exceeding actual accuracy; 3) Retrieval-augmented generation demonstrates consistent improvements at the cost of additional inference time overhead; 4) Multi-hop QA demonstrates substantially degraded performance compared to single-hop sub-QAs, with first-hop object or event recognition emerging as the primary bottleneck. We position Video SimpleQA as the cornerstone benchmark for video factuality assessment, aiming to steer LVLM development toward verifiable grounding in real-world contexts.
Chinese: Video SimpleQA作为首个针对视频事实性评估的综合基准,揭示了大型视频语言模型在事实依据方面的显著不足,尽管在多模态理解方面取得了进展。
English: Video SimpleQA is introduced as the first comprehensive benchmark to evaluate the factuality of Large Video Language Models, revealing their significant deficiencies in factual grounding despite advancements in multi-modal understanding.

Authors:Guosheng Zhao, Xiaofeng Wang, Chaojun Ni, Zheng Zhu, Wenkang Qin, Guan Huang, Xingang Wang
Title: ReconDreamer++: Harmonizing Generative and Reconstructive Models for Driving Scene Representation
Abstract:
Combining reconstruction models with generative models has emerged as a promising paradigm for closed-loop simulation in autonomous driving. For example, ReconDreamer has demonstrated remarkable success in rendering large-scale maneuvers. However, a significant gap remains between the generated data and real-world sensor observations, particularly in terms of fidelity for structured elements, such as the ground surface. To address these challenges, we propose ReconDreamer++, an enhanced framework that significantly improves the overall rendering quality by mitigating the domain gap and refining the representation of the ground surface. Specifically, ReconDreamer++ introduces the Novel Trajectory Deformable Network (NTDNet), which leverages learnable spatial deformation mechanisms to bridge the domain gap between synthesized novel views and original sensor observations. Moreover, for structured elements such as the ground surface, we preserve geometric prior knowledge in 3D Gaussians, and the optimization process focuses on refining appearance attributes while preserving the underlying geometric structure. Experimental evaluations conducted on multiple datasets (Waymo, nuScenes, PandaSet, and EUVS) confirm the superior performance of ReconDreamer++. Specifically, on Waymo, ReconDreamer++ achieves performance comparable to Street Gaussians for the original trajectory while significantly outperforming ReconDreamer on novel trajectories. In particular, it achieves substantial improvements, including a 6.1% increase in NTA-IoU, a 23. 0% improvement in FID, and a remarkable 4.5% gain in the ground surface metric NTL-IoU, highlighting its effectiveness in accurately reconstructing structured elements such as the road surface.
中文摘要:ReconDreamer++通过引入NTDNet来弥合生成数据与真实传感器观测之间的领域差距,并利用3D高斯中的几何先验知识优化结构化元素(如路面)的渲染,显著提升了自动驾驶闭环模拟的保真度。
English Summary: ReconDreamer++ is an enhanced framework that improves autonomous driving simulation by introducing NTDNet to bridge the domain gap between generated and real sensor data, while preserving geometric priors in 3D Gaussians to better render structured elements like road surfaces.

Authors:Qingshan Hou, Meng Wang, Peng Cao, Zou Ke, Xiaoli Liu, Huazhu Fu, Osmar R. Zaiane
Title: FundusGAN: A Hierarchical Feature-Aware Generative Framework for High-Fidelity Fundus Image Generation
Abstract:
Recent advancements in ophthalmology foundation models such as RetFound have demonstrated remarkable diagnostic capabilities but require massive datasets for effective pre-training, creating significant barriers for development and deployment. To address this critical challenge, we propose FundusGAN, a novel hierarchical feature-aware generative framework specifically designed for high-fidelity fundus image synthesis. Our approach leverages a Feature Pyramid Network within its encoder to comprehensively extract multi-scale information, capturing both large anatomical structures and subtle pathological features. The framework incorporates a modified StyleGAN-based generator with dilated convolutions and strategic upsampling adjustments to preserve critical retinal structures while enhancing pathological detail representation. Comprehensive evaluations on the DDR, DRIVE, and IDRiD datasets demonstrate that FundusGAN consistently outperforms state-of-the-art methods across multiple metrics (SSIM: 0.8863, FID: 54.2, KID: 0.0436 on DDR). Furthermore, disease classification experiments reveal that augmenting training data with FundusGAN-generated images significantly improves diagnostic accuracy across multiple CNN architectures (up to 6.49\% improvement with ResNet50). These results establish FundusGAN as a valuable foundation model component that effectively addresses data scarcity challenges in ophthalmological AI research, enabling more robust and generalizable diagnostic systems while reducing dependency on large-scale clinical data collection.
中文摘要:FundusGAN是一种创新的分层特征感知生成框架,通过合成高保真眼底图像有效解决眼科AI数据稀缺问题,显著提升诊断准确性并减少对大规模临床数据收集的依赖。
English Summary: FundusGAN is a novel generative framework that synthesizes high-fidelity fundus images to overcome data scarcity in ophthalmology AI, significantly enhancing diagnostic accuracy and reducing reliance on large clinical datasets.

Authors:Yanan Ma, Zhengru Fang, Longzhi Yuan, Yiqin Deng, Xianhao Chen, Yuguang Fang
Title: RAISE: Optimizing RIS Placement to Maximize Task Throughput in Multi-Server Vehicular Edge Computing
Abstract:
Given the limited computing capabilities on autonomous vehicles, onboard processing of large volumes of latency-sensitive tasks presents significant challenges. While vehicular edge computing (VEC) has emerged as a solution, offloading data-intensive tasks to roadside servers or other vehicles is hindered by large obstacles like trucks/buses and the surge in service demands during rush hours. To address these challenges, Reconfigurable Intelligent Surface (RIS) can be leveraged to mitigate interference from ground signals and reach more edge servers by elevating RIS adaptively. To this end, we propose RAISE, an optimization framework for RIS placement in multi-server VEC systems. Specifically, RAISE optimizes RIS altitude and tilt angle together with the optimal task assignment to maximize task throughput under deadline constraints. To find a solution, a two-layer optimization approach is proposed, where the inner layer exploits the unimodularity of the task assignment problem to derive the efficient optimal strategy while the outer layer develops a near-optimal hill climbing (HC) algorithm for RIS placement with low complexity. Extensive experiments demonstrate that the proposed RAISE framework consistently outperforms existing benchmarks.
中文摘要:RAISE框架通过优化可重构智能表面的部署高度与倾斜角度,结合任务分配策略,有效提升车载边缘计算系统的任务处理效率,克服信号干扰和服务器接入瓶颈。
English Summary: The RAISE framework optimizes RIS placement and task assignment in vehicular edge computing to enhance task throughput by overcoming signal interference and server accessibility issues.

Authors:Xi Xiao, Yunbei Zhang, Yanshuh Li, Xingjian Li, Tianyang Wang, Jihun Hamm, Xiao Wang, Min Xu
Title: Visual Variational Autoencoder Prompt Tuning
Abstract:
Parameter-efficient fine-tuning (PEFT) has emerged as a crucial approach for adapting large vision transformers to downstream tasks without the prohibitive computational costs of full fine-tuning. While existing visual prompt tuning (VPT) methods have made significant strides, they predominantly rely on static, domain-specific prompts that fail to capture the rich visual diversity within individual instances. This paper introduces V$^2$APT (Visual Variational Autoencoder Prompt Tuning), a novel framework that generates dynamic, input-dependent prompts using a variational autoencoder architecture. By learning a latent representation of image-specific features and decoding them into customized prompts, V$^2$APT adapts to the unique visual characteristics of each input. Extensive experiments on FGVC, HTA, and VTAB-1k benchmarks demonstrate that our approach consistently outperforms state-of-the-art PEFT methods. Notably, V$^2$APT achieves +3.2\% improvement over VPT-Deep on HTA, with an average performance gain of +2.0\% across all three datasets.
Chinese Summary: V²APT通过变分自编码器架构生成动态输入相关提示,在多个基准测试中显著超越了现有参数高效微调方法。
English Summary: V²APT introduces a dynamic prompt tuning framework using a variational autoencoder to generate input-specific prompts, significantly outperforming existing parameter-efficient fine-tuning methods across multiple benchmarks.

Authors:Jonas Wallat, Abdelrahman Abdallah, Adam Jatowt, Avishek Anand
Title: A Study into Investigating Temporal Robustness of LLMs
Abstract:
Large Language Models (LLMs) encapsulate a surprising amount of factual world knowledge. However, their performance on temporal questions and historical knowledge is limited because they often cannot understand temporal scope and orientation or neglect the temporal aspect altogether. In this study, we aim to measure precisely how robust LLMs are for question answering based on their ability to process temporal information and perform tasks requiring temporal reasoning and temporal factual knowledge. Specifically, we design eight time-sensitive robustness tests for factual information to check the sensitivity of six popular LLMs in the zero-shot setting. Overall, we find LLMs lacking temporal robustness, especially to temporal reformulations and the use of different granularities of temporal references. We show how a selection of these eight tests can be used automatically to judge a model's temporal robustness for user questions on the fly. Finally, we apply the findings of this study to improve the temporal QA performance by up to 55 percent.
中文: 大型语言模型在理解时间信息和进行时间推理方面存在不足,但通过本研究设计的测试方法,可将其时间敏感性问答性能提升高达55%。
English: Large language models lack temporal robustness in understanding and reasoning with time-sensitive information, but this study's tests can improve their temporal question-answering performance by up to 55%.

Authors:Huan Yang, Renji Zhang, Mingzhe Huang, Weijun Wang, Yin Tang, Yuanchun Li, Yunxin Liu, Deyu Zhang
Title: KVShare: An LLM Service System with Efficient and Effective Multi-Tenant KV Cache Reuse
Abstract:
Recent advances in long-text understanding have pushed the context length of large language models (LLMs) up to one million tokens. It boosts LLMs's accuracy and reasoning capacity but causes exorbitant computational costs and unsatisfactory Time to First Token (TTFT). KV cache reuse, which reuses the exact same KV cache of prefixes and templates or shares similar ones but with extra selective recomputation, offers a promising way to tackle this issue. However, prior studies overlook the cross-request KV reuse and the attention deviations introduced by new tokens during the decoding stage. In this paper, we present a KV cache management module that shares the KV cache across requests under multi-tenant scenarios without sacrificing model accuracy. Our system, KVShare, enables accurate and efficient LLM serving by 1) a Dual-Stage High Deviation algorithm (DHD) that conditionally selects a small portion of KV cache to be recomputed during both prefill and decode phases, and 2) a cache-aware scheduler that prioritizes requests based on their KV cache hit rates and orchestrates continuous batching to achieve enhanced system efficiency and faster TTFT. Multi-task experiments conducted on models such as Qwen2.5-7B,Llama3.1-8B and Yi1.5-9B demonstrate that KVShare reduces TTFT by up to 9.39x and increases 1.2x of the throughput compared to the full KV recompute. Moreover, KVShare achieves 20.38% boost in terms of accuracy compared to SOTA methods.
中文: 长文本理解的最新进展将大型语言模型的上下文长度提升至百万令牌,但这导致高昂计算成本和缓慢的首令牌时间,KVShare通过跨请求共享KV缓存并采用双阶段高偏差算法选择性重算部分缓存,有效提升了系统效率和准确性。
English: Recent advances in long-text understanding have increased LLMs' context length to one million tokens, but this leads to high computational costs and slow Time to First Token (TTFT), which KVShare addresses by sharing KV cache across requests and using a Dual-Stage High Deviation algorithm to selectively recompute parts of the cache, improving efficiency and accuracy.

Authors:Issatay Tokmurziyev, Miguel Altamirano Cabrera, Muhammad Haris Khan, Yara Mahmoud, Luis Moreno, Dzmitry Tsetserukou
Title: LLM-Glasses: GenAI-driven Glasses with Haptic Feedback for Navigation of Visually Impaired People
Abstract:
We present LLM-Glasses, a wearable navigation system designed to assist visually impaired individuals by combining haptic feedback, YOLO-World object detection, and GPT-4o-driven reasoning. The system delivers real-time tactile guidance via temple-mounted actuators, enabling intuitive and independent navigation. Three user studies were conducted to evaluate its effectiveness: (1) a haptic pattern recognition study achieving an 81.3% average recognition rate across 13 distinct patterns, (2) a VICON-based navigation study in which participants successfully followed predefined paths in open spaces, and (3) an LLM-guided video evaluation demonstrating 91.8% accuracy in open scenarios, 84.6% with static obstacles, and 81.5% with dynamic obstacles. These results demonstrate the system's reliability in controlled environments, with ongoing work focusing on refining its responsiveness and adaptability to diverse real-world scenarios. LLM-Glasses showcases the potential of combining generative AI with haptic interfaces to empower visually impaired individuals with intuitive and effective mobility solutions.
中文: LLM-Glasses是一款结合触觉反馈、YOLO-World物体检测和GPT-4o推理的可穿戴导航系统,通过镜腿上的执行器为视障人士提供实时触觉引导,用户研究验证了其在受控环境中的可靠性,并正针对现实场景优化响应能力。
English: LLM-Glasses is a wearable navigation system for the visually impaired that integrates haptic feedback, YOLO-World object detection, and GPT-4o reasoning to provide real-time tactile guidance, with user studies confirming its reliability in controlled environments and ongoing improvements for real-world adaptability.

Authors:Haolin Yang, Feilong Tang, Ming Hu, Qingyu Yin, Yulong Li, Yexin Liu, Zelin Peng, Peng Gao, Junjun He, Zongyuan Ge, Imran Razzak
Title: ScalingNoise: Scaling Inference-Time Search for Generating Infinite Videos
Abstract:
Video diffusion models (VDMs) facilitate the generation of high-quality videos, with current research predominantly concentrated on scaling efforts during training through improvements in data quality, computational resources, and model complexity. However, inference-time scaling has received less attention, with most approaches restricting models to a single generation attempt. Recent studies have uncovered the existence of "golden noises" that can enhance video quality during generation. Building on this, we find that guiding the scaling inference-time search of VDMs to identify better noise candidates not only evaluates the quality of the frames generated in the current step but also preserves the high-level object features by referencing the anchor frame from previous multi-chunks, thereby delivering long-term value. Our analysis reveals that diffusion models inherently possess flexible adjustments of computation by varying denoising steps, and even a one-step denoising approach, when guided by a reward signal, yields significant long-term benefits. Based on the observation, we proposeScalingNoise, a plug-and-play inference-time search strategy that identifies golden initial noises for the diffusion sampling process to improve global content consistency and visual diversity. Specifically, we perform one-step denoising to convert initial noises into a clip and subsequently evaluate its long-term value, leveraging a reward model anchored by previously generated content. Moreover, to preserve diversity, we sample candidates from a tilted noise distribution that up-weights promising noises. In this way, ScalingNoise significantly reduces noise-induced errors, ensuring more coherent and spatiotemporally consistent video generation. Extensive experiments on benchmark datasets demonstrate that the proposed ScalingNoise effectively improves long video generation.
中文摘要:视频扩散模型通过推理时搜索最佳初始噪声的ScalingNoise策略,在保持多样性的同时提升视频内容一致性和生成质量,显著减少噪声导致的错误。
English Summary: Video diffusion models can enhance video quality by identifying optimal initial noises during inference, using a reward-based search strategy called ScalingNoise that maintains content consistency and diversity while reducing errors.

Authors:Tzu-Yun Tseng, Alexey Nekrasov, Malcolm Burdorf, Bastian Leibe, Julie Stephany Berrio, Mao Shan, Stewart Worrall
Title: Panoptic-CUDAL Technical Report: Rural Australia Point Cloud Dataset in Rainy Conditions
Abstract:
Existing autonomous driving datasets are predominantly oriented towards well-structured urban settings and favorable weather conditions, leaving the complexities of rural environments and adverse weather conditions largely unaddressed. Although some datasets encompass variations in weather and lighting, bad weather scenarios do not appear often. Rainfall can significantly impair sensor functionality, introducing noise and reflections in LiDAR and camera data and reducing the system's capabilities for reliable environmental perception and safe navigation. We introduce the Panoptic-CUDAL dataset, a novel dataset purpose-built for panoptic segmentation in rural areas subject to rain. By recording high-resolution LiDAR, camera, and pose data, Panoptic-CUDAL offers a diverse, information-rich dataset in a challenging scenario. We present analysis of the recorded data and provide baseline results for panoptic and semantic segmentation methods on LiDAR point clouds. The dataset can be found here: https://robotics.sydney.edu.au/our-research/intelligent-transportation-systems/
中文摘要:Panoptic-CUDAL数据集填补了自动驾驶在乡村和雨天环境下的数据空白,通过提供高分辨率激光雷达、摄像头及位姿数据,增强系统在复杂场景中的感知与导航能力。
English Summary: The Panoptic-CUDAL dataset addresses the gap in autonomous driving data for rural areas and rainy conditions by providing high-resolution LiDAR, camera, and pose data to improve environmental perception and navigation.

Authors:Zijian Li, Jingjing Fu, Lei Song, Jiang Bian, Jun Zhang, Rui Wang
Title: Chain of Functions: A Programmatic Pipeline for Fine-Grained Chart Reasoning Data
Abstract:
Visual reasoning is crucial for multimodal large language models (MLLMs) to address complex chart queries, yet high-quality rationale data remains scarce. Existing methods leveraged (M)LLMs for data generation, but direct prompting often yields limited precision and diversity. In this paper, we propose \textit{Chain of Functions (CoF)}, a novel programmatic reasoning data generation pipeline that utilizes freely-explored reasoning paths as supervision to ensure data precision and diversity. Specifically, it starts with human-free exploration among the atomic functions (e.g., maximum data and arithmetic operations) to generate diverse function chains, which are then translated into linguistic rationales and questions with only a moderate open-sourced LLM. \textit{CoF} provides multiple benefits: 1) Precision: function-governed generation reduces hallucinations compared to freeform generation; 2) Diversity: enumerating function chains enables varied question taxonomies; 3) Explainability: function chains serve as built-in rationales, allowing fine-grained evaluation beyond overall accuracy; 4) Practicality: eliminating reliance on extremely large models. Employing \textit{CoF}, we construct the \textit{ChartCoF} dataset, with 1.4k complex reasoning Q\&A for fine-grained analysis and 50k Q\&A for reasoning enhancement. The fine-grained evaluation on \textit{ChartCoF} reveals varying performance across question taxonomies for each MLLM, and the experiments also show that finetuning with \textit{ChartCoF} achieves state-of-the-art performance among same-scale MLLMs on widely used benchmarks. Furthermore, the novel paradigm of function-governed rationale generation in \textit{CoF} could inspire broader applications beyond charts.
中文: 本文提出函数链(CoF)方法,通过探索函数路径生成精准多样的图表推理数据,构建的ChartCoF数据集不仅能提升多模态大语言模型的推理性能,还支持细粒度评估,为超越图表领域的应用提供新范式。
English: The paper introduces Chain of Functions (CoF), a programmatic reasoning pipeline that generates precise and diverse chart reasoning data by exploring function chains, leading to the creation of the ChartCoF dataset which enhances multimodal large language models' performance and enables fine-grained evaluation.

Authors:Hiroki Hanai, Takuya Kiyokawa, Weiwei Wan, Kensuke Harada
Title: Robotic Paper Wrapping by Learning Force Control
Abstract:
Robotic packaging using wrapping paper poses significant challenges due to the material's complex deformation properties. The packaging process itself involves multiple steps, primarily categorized as folding the paper or creating creases. Small deviations in the robot's arm trajectory or force vector can lead to tearing or wrinkling of the paper, exacerbated by the variability in material properties. This study introduces a novel framework that combines imitation learning and reinforcement learning to enable a robot to perform each step of the packaging process efficiently. The framework allows the robot to follow approximate trajectories of the tool-center point (TCP) based on human demonstrations while optimizing force control parameters to prevent tearing or wrinkling, even with variable wrapping paper materials. The proposed method was validated through ablation studies, which demonstrated successful task completion with a significant reduction in tear and wrinkle rates. Furthermore, the force control strategy proved to be adaptable across different wrapping paper materials and robust against variations in the size of the target object.
中文: 本研究提出了一种结合模仿学习与强化学习的混合框架,使机器人能够通过优化力控参数和轨迹跟踪来执行精确包装任务,显著降低了不同包装纸材料的撕裂与褶皱率。
English: This study presents a hybrid imitation-reinforcement learning framework that enables robots to perform precise wrapping tasks by optimizing force control and trajectory following, effectively reducing tearing and wrinkling across various paper materials.

Authors:Taslim Murad, Sarwan Ali, Murray Patterson
Title: Sequence Analysis Using the Bezier Curve
Abstract:
The analysis of sequences (e.g., protein, DNA, and SMILES string) is essential for disease diagnosis, biomaterial engineering, genetic engineering, and drug discovery domains. Conventional analytical methods focus on transforming sequences into numerical representations for applying machine learning/deep learning-based sequence characterization. However, their efficacy is constrained by the intrinsic nature of deep learning (DL) models, which tend to exhibit suboptimal performance when applied to tabular data. An alternative group of methodologies endeavors to convert biological sequences into image forms by applying the concept of Chaos Game Representation (CGR). However, a noteworthy drawback of these methods lies in their tendency to map individual elements of the sequence onto a relatively small subset of designated pixels within the generated image. The resulting sparse image representation may not adequately encapsulate the comprehensive sequence information, potentially resulting in suboptimal predictions. In this study, we introduce a novel approach to transform sequences into images using the Bézier curve concept for element mapping. Mapping the elements onto a curve enhances the sequence information representation in the respective images, hence yielding better DL-based classification performance. We employed different sequence datasets to validate our system by using different classification tasks, and the results illustrate that our Bézier curve method is able to achieve good performance for all the tasks.
中文摘要:本研究提出了一种利用贝塞尔曲线将生物序列转化为图像的新方法,增强了信息表征能力,并在多个分类任务中提升了深度学习性能。
English Summary: This study introduces a novel method using Bézier curves to transform biological sequences into images, enhancing information representation and improving deep learning classification performance across various tasks.

Authors:Seungwon Lim, Sungwoong Kim, Jihwan Yu, Sungjae Lee, Jiwan Chung, Youngjae Yu
Title: VisEscape: A Benchmark for Evaluating Exploration-driven Decision-making in Virtual Escape Rooms
Abstract:
Escape rooms present a unique cognitive challenge that demands exploration-driven planning: with the sole instruction to 'escape the room', players must actively search their environment, collecting information, and finding solutions through repeated trial and error. Motivated by this, we introduce VisEscape, a benchmark of 20 virtual escape rooms specifically designed to evaluate AI models under these challenging conditions, where success depends not only on solving isolated puzzles but also on iteratively constructing and refining spatial-temporal knowledge of a dynamically changing environment. On VisEscape, we observe that even state-of-the-art multi-modal models generally fail to escape the rooms, showing considerable variation in their progress and problem-solving approaches. We find that integrating memory management and reasoning contributes to efficient exploration and enables successive hypothesis formulation and testing, thereby leading to significant improvements in dynamic and exploration-driven environments
中文: VisEscape作为包含20个虚拟密室逃脱场景的基准测试,旨在评估AI模型在动态环境中通过迭代构建知识进行探索和解题的能力,现有模型普遍表现不佳,但结合记忆管理和推理能力可显著提升其表现。
English: VisEscape is a benchmark of 20 virtual escape rooms designed to test AI models' ability to explore and solve puzzles through iterative knowledge building, where current models generally fail but integrating memory and reasoning significantly improves performance.

Authors:Ziwei Wang, Weizhi Chen, Leyang Yang, Sheng Zhou, Shengchu Zhao, Hanbei Zhan, Jiongchao Jin, Liangcheng Li, Zirui Shao, Jiajun Bu
Title: MP-GUI: Modality Perception with MLLMs for GUI Understanding
Abstract:
Graphical user interface (GUI) has become integral to modern society, making it crucial to be understood for human-centric systems. However, unlike natural images or documents, GUIs comprise artificially designed graphical elements arranged to convey specific semantic meanings. Current multi-modal large language models (MLLMs) already proficient in processing graphical and textual components suffer from hurdles in GUI understanding due to the lack of explicit spatial structure modeling. Moreover, obtaining high-quality spatial structure data is challenging due to privacy issues and noisy environments. To address these challenges, we present MP-GUI, a specially designed MLLM for GUI understanding. MP-GUI features three precisely specialized perceivers to extract graphical, textual, and spatial modalities from the screen as GUI-tailored visual clues, with spatial structure refinement strategy and adaptively combined via a fusion gate to meet the specific preferences of different GUI understanding tasks. To cope with the scarcity of training data, we also introduce a pipeline for automatically data collecting. Extensive experiments demonstrate that MP-GUI achieves impressive results on various GUI understanding tasks with limited data.
中文摘要:MP-GUI作为一种专为图形界面理解设计的特殊多模态大语言模型,通过专门设计的感知器提取图形、文本和空间模态信息,并采用自适应融合机制,在有限数据条件下实现了优异的界面理解性能。
English Summary: MP-GUI is a specialized multi-modal language model designed to overcome challenges in GUI understanding by integrating graphical, textual, and spatial information through tailored perceivers and adaptive fusion, achieving strong performance with limited training data.

Authors:Bowen Yuan, Yuxia Fu, Zijian Wang, Yadan Luo, Zi Huang
Title: SCORE: Soft Label Compression-Centric Dataset Condensation via Coding Rate Optimization
Abstract:
Dataset Condensation (DC) aims to obtain a condensed dataset that allows models trained on the condensed dataset to achieve performance comparable to those trained on the full dataset. Recent DC approaches increasingly focus on encoding knowledge into realistic images with soft labeling, for their scalability to ImageNet-scale datasets and strong capability of cross-domain generalization. However, this strong performance comes at a substantial storage cost which could significantly exceed the storage cost of the original dataset. We argue that the three key properties to alleviate this performance-storage dilemma are informativeness, discriminativeness, and compressibility of the condensed data. Towards this end, this paper proposes a \textbf{S}oft label compression-centric dataset condensation framework using \textbf{CO}ding \textbf{R}at\textbf{E} (SCORE). SCORE formulates dataset condensation as a min-max optimization problem, which aims to balance the three key properties from an information-theoretic perspective. In particular, we theoretically demonstrate that our coding rate-inspired objective function is submodular, and its optimization naturally enforces low-rank structure in the soft label set corresponding to each condensed data. Extensive experiments on large-scale datasets, including ImageNet-1K and Tiny-ImageNet, demonstrate that SCORE outperforms existing methods in most cases. Even with 30$\times$ compression of soft labels, performance decreases by only 5.5\% and 2.7\% for ImageNet-1K with IPC 10 and 50, respectively. Code will be released upon paper acceptance.
Chinese: SCORE框架通过将数据集压缩构建为最小-最大优化问题,平衡信息量、判别性和可压缩性,有效解决了存储与性能的权衡问题,在大规模数据集上实现卓越性能,即使软标签被大幅压缩也仅造成轻微性能下降。
English: The SCORE framework addresses the storage-performance trade-off in dataset condensation by formulating it as a min-max optimization problem that balances informativeness, discriminativeness, and compressibility, achieving superior results on large-scale datasets with minimal performance loss even under significant soft label compression.

Authors:Jinge Ma, Jiangpeng He, Fengqing Zhu
Title: Robust3D-CIL: Robust Class-Incremental Learning for 3D Perception
Abstract:
3D perception plays a crucial role in real-world applications such as autonomous driving, robotics, and AR/VR. In practical scenarios, 3D perception models must continuously adapt to new data and emerging object categories, but retraining from scratch incurs prohibitive costs. Therefore, adopting class-incremental learning (CIL) becomes particularly essential. However, real-world 3D point cloud data often include corrupted samples, which poses significant challenges for existing CIL methods and leads to more severe forgetting on corrupted data. To address these challenges, we consider the scenario in which a CIL model can be updated using point clouds with unknown corruption to better simulate real-world conditions. Inspired by Farthest Point Sampling, we propose a novel exemplar selection strategy that effectively preserves intra-class diversity when selecting replay exemplars, mitigating forgetting induced by data corruption. Furthermore, we introduce a point cloud downsampling-based replay method to utilize the limited replay buffer memory more efficiently, thereby further enhancing the model's continual learning ability. Extensive experiments demonstrate that our method improves the performance of replay-based CIL baselines by 2% to 11%, proving its effectiveness and promising potential for real-world 3D applications.
Chinese: 本研究针对三维点云数据中的损坏问题,提出了一种基于最远点采样的类增量学习方法,通过创新的样本选择策略和高效记忆回放机制,在存在未知数据损坏的场景下将基线方法性能提升了2%至11%。
English: This study introduces a novel class-incremental learning approach for 3D point cloud perception that employs a farthest point sampling-based exemplar selection strategy and efficient memory replay to combat performance degradation caused by data corruption, achieving 2%-11% improvement over baseline methods.

Authors:Ye Liu, Kevin Qinghong Lin, Chang Wen Chen, Mike Zheng Shou
Title: VideoMind: A Chain-of-LoRA Agent for Long Video Reasoning
Abstract:
Videos, with their unique temporal dimension, demand precise grounded understanding, where answers are directly linked to visual, interpretable evidence. Despite significant breakthroughs in reasoning capabilities within Large Language Models, multi-modal reasoning - especially for videos - remains unexplored. In this work, we introduce VideoMind, a novel video-language agent designed for temporal-grounded video understanding. VideoMind incorporates two key innovations: (i) We identify essential capabilities for video temporal reasoning and develop a role-based agentic workflow, including a planner for coordinating different roles, a grounder for temporal localization, a verifier to assess temporal interval accuracy, and an answerer for question-answering. (ii) To efficiently integrate these diverse roles, we propose a novel Chain-of-LoRA strategy, enabling seamless role-switching via lightweight LoRA adaptors while avoiding the overhead of multiple models, thus balancing efficiency and flexibility. Extensive experiments on 14 public benchmarks, including 3 on grounded video question-answering (Grounded VideoQA), 6 on video temporal grounding (VTG), and 5 on general video question-answering (VideoQA), verify that our agent achieves state-of-the-art performance on diverse video understanding tasks, underscoring its effectiveness in advancing video agent and long-form temporal reasoning.
中文: 本文提出VideoMind,一种采用角色化工作流程和Chain-of-LoRA策略的新型视频语言智能体,在时序定位视频理解任务中取得多项基准测试的最优性能。
English: This paper introduces VideoMind, a novel video-language agent that employs a role-based workflow and Chain-of-LoRA strategy for temporal-grounded video understanding, achieving state-of-the-art results across multiple benchmarks.

Authors:Robin Strässer, Manuel Schaller, Julian Berberich, Karl Worthmann, Frank Allgöwer
Title: Kernel-based error bounds of bilinear Koopman surrogate models for nonlinear data-driven control
Abstract:
We derive novel deterministic bounds on the approximation error of data-based bilinear surrogate models for unknown nonlinear systems. The surrogate models are constructed using kernel-based extended dynamic mode decomposition to approximate the Koopman operator in a reproducing kernel Hilbert space. Unlike previous methods that require restrictive assumptions on the invariance of the dictionary, our approach leverages kernel-based dictionaries that allow us to control the projection error via pointwise error bounds, overcoming a significant limitation of existing theoretical guarantees. The derived state- and input-dependent error bounds allow for direct integration into Koopman-based robust controller designs with closed-loop guarantees for the unknown nonlinear system. Numerical examples illustrate the effectiveness of the proposed framework.
中文: 本研究提出了针对非线性系统数据驱动双线性代理模型的新型确定性误差界限,利用基于核的方法改进库普曼算子逼近,并支持具有闭环保证的鲁棒控制器设计。
English: This study introduces new deterministic error bounds for data-driven bilinear surrogate models of nonlinear systems, utilizing kernel-based methods to enhance Koopman operator approximations and enable robust controller designs with closed-loop guarantees.

Authors:Yue Su, Xinyu Zhan, Hongjie Fang, Han Xue, Hao-Shu Fang, Yong-Lu Li, Cewu Lu, Lixin Yang
Title: Dense Policy: Bidirectional Autoregressive Learning of Actions
Abstract:
Mainstream visuomotor policies predominantly rely on generative models for holistic action prediction, while current autoregressive policies, predicting the next token or chunk, have shown suboptimal results. This motivates a search for more effective learning methods to unleash the potential of autoregressive policies for robotic manipulation. This paper introduces a bidirectionally expanded learning approach, termed Dense Policy, to establish a new paradigm for autoregressive policies in action prediction. It employs a lightweight encoder-only architecture to iteratively unfold the action sequence from an initial single frame into the target sequence in a coarse-to-fine manner with logarithmic-time inference. Extensive experiments validate that our dense policy has superior autoregressive learning capabilities and can surpass existing holistic generative policies. Our policy, example data, and training code will be publicly available upon publication. Project page: https: //selen-suyue.github.io/DspNet/.
中文: 本文提出稠密策略,通过双向扩展的自回归方法和轻量级编码器架构,从初始帧迭代细化动作序列并实现对数时间推理,在机器人操作任务中展现出优于现有生成策略的性能。
English: This paper introduces Dense Policy, a bidirectionally expanded autoregressive approach using a lightweight encoder-only architecture to iteratively refine action sequences from initial frames with logarithmic-time inference, demonstrating superior performance over existing generative policies in robotic manipulation.

Authors:Junhyeok Kim, Jaewoo Park, Junhee Park, Sangeyl Lee, Jiwan Chung, Jisung Kim, Ji Hoon Joung, Youngjae Yu
Title: GuideDog: A Real-World Egocentric Multimodal Dataset for Blind and Low-Vision Accessibility-Aware Guidance
Abstract:
Mobility remains a significant challenge for the 2.2 billion people worldwide affected by blindness and low vision (BLV), with 7% of visually impaired individuals experiencing falls at least once a month. While recent advances in Multimodal Large Language Models (MLLMs) offer promising opportunities for BLV assistance, their development has been hindered by limited datasets. This limitation stems from the fact that BLV-aware annotation requires specialized domain knowledge and intensive labor. To address this gap, we introduce GuideDog, a novel accessibility-aware guide dataset containing 22K image-description pairs (including 2K human-annotated pairs) that capture diverse real-world scenes from a pedestrian's viewpoint. Our approach shifts the annotation burden from generation to verification through a collaborative human-AI framework grounded in established accessibility standards, significantly improving efficiency while maintaining high-quality annotations. We also develop GuideDogQA, a subset of 818 samples featuring multiple-choice questions designed to evaluate fine-grained visual perception capabilities, specifically object recognition and relative depth perception. Our experimental results highlight the importance of accurate spatial understanding for effective BLV guidance. GuideDog and GuideDogQA will advance research in MLLM-based assistive technologies for BLV individuals while contributing to broader applications in understanding egocentric scenes for robotics and augmented reality. The code and dataset will be publicly available.
中文: 本文提出GuideDog这一包含2.2万图像-描述对的新型无障碍感知数据集,通过人机协作解决制约盲人与低视力辅助技术发展的数据瓶颈,同时开发GuideDogQA评估细粒度视觉感知能力,旨在推动基于多模态大模型的辅助技术进步。
English: This paper introduces GuideDog, a novel accessibility-aware dataset with 22K image-description pairs developed through human-AI collaboration to address the data limitations hindering multimodal large language models for blind and low vision assistance, while also proposing GuideDogQA to evaluate fine-grained visual perception capabilities.

Authors:Zhifeng Wang, Renjiao Yi, Xin Wen, Chenyang Zhu, Kai Xu
Title: VasTSD: Learning 3D Vascular Tree-state Space Diffusion Model for Angiography Synthesis
Abstract:
Angiography imaging is a medical imaging technique that enhances the visibility of blood vessels within the body by using contrast agents. Angiographic images can effectively assist in the diagnosis of vascular diseases. However, contrast agents may bring extra radiation exposure which is harmful to patients with health risks. To mitigate these concerns, in this paper, we aim to automatically generate angiography from non-angiographic inputs, by leveraging and enhancing the inherent physical properties of vascular structures. Previous methods relying on 2D slice-based angiography synthesis struggle with maintaining continuity in 3D vascular structures and exhibit limited effectiveness across different imaging modalities. We propose VasTSD, a 3D vascular tree-state space diffusion model to synthesize angiography from 3D non-angiographic volumes, with a novel state space serialization approach that dynamically constructs vascular tree topologies, integrating these with a diffusion-based generative model to ensure the generation of anatomically continuous vasculature in 3D volumes. A pre-trained vision embedder is employed to construct vascular state space representations, enabling consistent modeling of vascular structures across multiple modalities. Extensive experiments on various angiographic datasets demonstrate the superiority of VasTSD over prior works, achieving enhanced continuity of blood vessels in synthesized angiographic synthesis for multiple modalities and anatomical regions.
中文摘要:本文提出VasTSD模型,通过动态构建血管树拓扑结构的三维状态空间扩散方法,从非血管造影输入中合成血管造影图像,在多种成像模式下确保血管解剖结构的连续性,同时降低辐射暴露风险。
English Summary: This paper introduces VasTSD, a 3D vascular tree-state space diffusion model that synthesizes angiography from non-angiographic inputs by dynamically constructing vascular tree topologies, ensuring anatomical continuity across multiple imaging modalities while reducing radiation exposure risks.

Authors:Hengjia Li, Lifan Jiang, Xi Xiao, Tianyang Wang, Hongwei Yi, Boxi Wu, Deng Cai
Title: MagicID: Hybrid Preference Optimization for ID-Consistent and Dynamic-Preserved Video Customization
Abstract:
Video identity customization seeks to produce high-fidelity videos that maintain consistent identity and exhibit significant dynamics based on users' reference images. However, existing approaches face two key challenges: identity degradation over extended video length and reduced dynamics during training, primarily due to their reliance on traditional self-reconstruction training with static images. To address these issues, we introduce $\textbf{MagicID}$, a novel framework designed to directly promote the generation of identity-consistent and dynamically rich videos tailored to user preferences. Specifically, we propose constructing pairwise preference video data with explicit identity and dynamic rewards for preference learning, instead of sticking to the traditional self-reconstruction. To address the constraints of customized preference data, we introduce a hybrid sampling strategy. This approach first prioritizes identity preservation by leveraging static videos derived from reference images, then enhances dynamic motion quality in the generated videos using a Frontier-based sampling method. By utilizing these hybrid preference pairs, we optimize the model to align with the reward differences between pairs of customized preferences. Extensive experiments show that MagicID successfully achieves consistent identity and natural dynamics, surpassing existing methods across various metrics.
中文摘要:MagicID通过混合偏好学习框架,结合身份保持和动态增强的奖励机制,有效解决了视频定制中身份一致性退化和动态表现不足的问题,在各项指标上均优于现有方法。
English Summary: MagicID is a novel framework that overcomes identity degradation and limited dynamics in video customization by using hybrid preference learning with identity and motion rewards, outperforming existing methods in producing identity-consistent and dynamic videos.

Authors:Tsu-Jui Fu, Yusu Qian, Chen Chen, Wenze Hu, Zhe Gan, Yinfei Yang
Title: UniVG: A Generalist Diffusion Model for Unified Image Generation and Editing
Abstract:
Text-to-Image (T2I) diffusion models have shown impressive results in generating visually compelling images following user prompts. Building on this, various methods further fine-tune the pre-trained T2I model for specific tasks. However, this requires separate model architectures, training designs, and multiple parameter sets to handle different tasks. In this paper, we introduce UniVG, a generalist diffusion model capable of supporting a diverse range of image generation tasks with a single set of weights. UniVG treats multi-modal inputs as unified conditions to enable various downstream applications, ranging from T2I generation, inpainting, instruction-based editing, identity-preserving generation, and layout-guided generation, to depth estimation and referring segmentation. Through comprehensive empirical studies on data mixing and multi-task training, we provide detailed insights into the training processes and decisions that inform our final designs. For example, we show that T2I generation and other tasks, such as instruction-based editing, can coexist without performance trade-offs, while auxiliary tasks like depth estimation and referring segmentation enhance image editing. Notably, our model can even outperform some task-specific models on their respective benchmarks, marking a significant step towards a unified image generation model.
中文摘要:UniVG是一种通用扩散模型,通过单一权重架构支持多种图像生成任务,在部分基准测试中甚至超越专用模型,且性能无损失。
English Summary: UniVG is a unified diffusion model that supports multiple image generation tasks with a single architecture, outperforming specialized models in some benchmarks without performance compromises.

Authors:Xun Jiang, Haoran Lu, Yuxuan Zhao, Jiarui Wang, Zizheng Guo, Heng Wu, Bei Yu, Sung Kyu Lim, Runsheng Wang, Ru Huang, Yibo Lin
Title: A Systematic Approach for Multi-objective Double-side Clock Tree Synthesis
Abstract:
As the scaling of semiconductor devices nears its limits, utilizing the back-side space of silicon has emerged as a new trend for future integrated circuits. With intense interest, several works have hacked existing backend tools to explore the potential of synthesizing double-side clock trees via nano Through-Silicon-Vias (nTSVs). However, these works lack a systematic perspective on design resource allocation and multi-objective optimization. We propose a systematic approach to design clock trees with double-side metal layers, including hierarchical clock routing, concurrent buffers and nTSVs insertion, and skew refinement. Compared with the state-of-the-art (SOTA) methods, the widely-used open-source tool, our algorithm outperforms them in latency, skew, wirelength, and the number of buffers and nTSVs.
中文摘要:本文提出了一种系统性的双面金属层时钟树设计方法,通过分层布线、同步插入缓冲器和纳米硅通孔以及偏移优化,在延迟、偏移、线长及元件数量方面均优于现有先进方法。
English Summary: This paper introduces a systematic method for designing double-side clock trees using hierarchical routing, simultaneous buffer and nano Through-Silicon-Via insertion, and skew optimization, demonstrating superior performance over existing methods in latency, skew, wirelength, and component count.

Authors:Qi Lv, Hao Li, Xiang Deng, Rui Shao, Yinchuan Li, Jianye Hao, Longxiang Gao, Michael Yu Wang, Liqiang Nie
Title: Spatial-Temporal Graph Diffusion Policy with Kinematic Modeling for Bimanual Robotic Manipulation
Abstract:
Despite the significant success of imitation learning in robotic manipulation, its application to bimanual tasks remains highly challenging. Existing approaches mainly learn a policy to predict a distant next-best end-effector pose (NBP) and then compute the corresponding joint rotation angles for motion using inverse kinematics. However, they suffer from two important issues: (1) rarely considering the physical robotic structure, which may cause self-collisions or interferences, and (2) overlooking the kinematics constraint, which may result in the predicted poses not conforming to the actual limitations of the robot joints. In this paper, we propose Kinematics enhanced Spatial-TemporAl gRaph Diffuser (KStar Diffuser). Specifically, (1) to incorporate the physical robot structure information into action prediction, KStar Diffuser maintains a dynamic spatial-temporal graph according to the physical bimanual joint motions at continuous timesteps. This dynamic graph serves as the robot-structure condition for denoising the actions; (2) to make the NBP learning objective consistent with kinematics, we introduce the differentiable kinematics to provide the reference for optimizing KStar Diffuser. This module regularizes the policy to predict more reliable and kinematics-aware next end-effector poses. Experimental results show that our method effectively leverages the physical structural information and generates kinematics-aware actions in both simulation and real-world
中文摘要:提出的KStar Diffuser通过动态时空图整合物理结构信息和可微分运动学约束,有效解决了双手机器人操作中的自碰撞与关节限制问题,在仿真和实际场景中均能生成符合运动学要求的可靠动作。
English Summary: The proposed KStar Diffuser enhances bimanual robotic manipulation by integrating dynamic spatial-temporal graphs for physical structure awareness and differentiable kinematics for pose reliability, effectively addressing self-collisions and joint constraints in both simulations and real-world applications.

Authors:Teng Xu, Taotao Zhou, Youjia Wang, Peng Yang, Simin Tang, Kuixiang Shao, Zifeng Tang, Yifei Liu, Xinyuan Chen, Hongshuang Wang, Xiaohui Wang, Huoqing Luo, Jingya Wang, Ji Hu, Jingyi Yu
Title: MouseGPT: A Large-scale Vision-Language Model for Mouse Behavior Analysis
Abstract:
Analyzing animal behavior is crucial in advancing neuroscience, yet quantifying and deciphering its intricate dynamics remains a significant challenge. Traditional machine vision approaches, despite their ability to detect spontaneous behaviors, fall short due to limited interpretability and reliance on manual labeling, which restricts the exploration of the full behavioral spectrum. Here, we introduce MouseGPT, a Vision-Language Model (VLM) that integrates visual cues with natural language to revolutionize mouse behavior analysis. Built upon our first-of-its-kind dataset - incorporating pose dynamics and open-vocabulary behavioral annotations across over 42 million frames of diverse psychiatric conditions - MouseGPT provides a novel, context-rich method for comprehensive behavior interpretation. Our holistic analysis framework enables detailed behavior profiling, clustering, and novel behavior discovery, offering deep insights without the need for labor - intensive manual annotation. Evaluations reveal that MouseGPT surpasses existing models in precision, adaptability, and descriptive richness, positioning it as a transformative tool for ethology and for unraveling complex behavioral dynamics in animal models.
中文: MouseGPT是一种视觉语言模型,通过整合视觉信息与自然语言,无需人工标注即可全面解析小鼠行为,在精度和适应性上均超越现有方法。
English: MouseGPT is a Vision-Language Model that integrates visual and natural language data to enable comprehensive, interpretable mouse behavior analysis without manual annotation, outperforming existing methods in precision and adaptability.

Authors:Jiajun Deng, Yaolong Ju, Jing Yang, Simon Lui, Xunying Liu
Title: Efficient Adapter Tuning for Joint Singing Voice Beat and Downbeat Tracking with Self-supervised Learning Features
Abstract:
Singing voice beat tracking is a challenging task, due to the lack of musical accompaniment that often contains robust rhythmic and harmonic patterns, something most existing beat tracking systems utilize and can be essential for estimating beats. In this paper, a novel temporal convolutional network-based beat-tracking approach featuring self-supervised learning (SSL) representations and adapter tuning is proposed to track the beat and downbeat of singing voices jointly. The SSL DistilHuBERT representations are utilized to capture the semantic information of singing voices and are further fused with the generic spectral features to facilitate beat estimation. Sources of variabilities that are particularly prominent with the non-homogeneous singing voice data are reduced by the efficient adapter tuning. Extensive experiments show that feature fusion and adapter tuning improve the performance individually, and the combination of both leads to significantly better performances than the un-adapted baseline system, with up to 31.6% and 42.4% absolute F1-score improvements on beat and downbeat tracking, respectively.
中文: 本文提出了一种基于时序卷积网络的自监督学习表征与适配器调优的歌声节拍追踪新方法,通过融合语义和频谱特征并减少数据变异性,显著提升了节拍与强拍追踪的性能。
English: This paper introduces a novel beat-tracking method for singing voices using a temporal convolutional network with self-supervised learning representations and adapter tuning, which significantly improves performance by fusing semantic and spectral features while reducing data variability.

Authors:Yu Qiao, Phuong-Nam Tran, Ji Su Yoon, Loc X. Nguyen, Eui-Nam Huh, Dusit Niyato, Choong Seon Hong
Title: DeepSeek-Inspired Exploration of RL-based LLMs and Synergy with Wireless Networks: A Survey
Abstract:
Reinforcement learning (RL)-based large language models (LLMs), such as ChatGPT, DeepSeek, and Grok-3, have gained significant attention for their exceptional capabilities in natural language processing and multimodal data understanding. Meanwhile, the rapid expansion of information services has driven the growing need for intelligence, efficient, and adaptable wireless networks. Wireless networks require the empowerment of RL-based LLMs while these models also benefit from wireless networks to broaden their application scenarios. Specifically, RL-based LLMs can enhance wireless communication systems through intelligent resource allocation, adaptive network optimization, and real-time decision-making. Conversely, wireless networks provide a vital infrastructure for the efficient training, deployment, and distributed inference of RL-based LLMs, especially in decentralized and edge computing environments. This mutual empowerment highlights the need for a deeper exploration of the interplay between these two domains. We first review recent advancements in wireless communications, highlighting the associated challenges and potential solutions. We then discuss the progress of RL-based LLMs, focusing on key technologies for LLM training, challenges, and potential solutions. Subsequently, we explore the mutual empowerment between these two fields, highlighting key motivations, open challenges, and potential solutions. Finally, we provide insights into future directions, applications, and their societal impact to further explore this intersection, paving the way for next-generation intelligent communication systems. Overall, this survey provides a comprehensive overview of the relationship between RL-based LLMs and wireless networks, offering a vision where these domains empower each other to drive innovations.
中文: 基于强化学习的大语言模型与无线网络相互赋能,前者提升网络智能化管理,后者支撑模型分布式部署,共同推动新一代智能通信系统发展。
English: Reinforcement learning-based large language models and wireless networks mutually empower each other, with LLMs enhancing network optimization and wireless infrastructure supporting their distributed deployment, paving the way for intelligent communication systems.

Authors:Junsong Chen, Shuchen Xue, Yuyang Zhao, Jincheng Yu, Sayak Paul, Junyu Chen, Han Cai, Song Han, Enze Xie
Title: SANA-Sprint: One-Step Diffusion with Continuous-Time Consistency Distillation
Abstract:
This paper presents SANA-Sprint, an efficient diffusion model for ultra-fast text-to-image (T2I) generation. SANA-Sprint is built on a pre-trained foundation model and augmented with hybrid distillation, dramatically reducing inference steps from 20 to 1-4. We introduce three key innovations: (1) We propose a training-free approach that transforms a pre-trained flow-matching model for continuous-time consistency distillation (sCM), eliminating costly training from scratch and achieving high training efficiency. Our hybrid distillation strategy combines sCM with latent adversarial distillation (LADD): sCM ensures alignment with the teacher model, while LADD enhances single-step generation fidelity. (2) SANA-Sprint is a unified step-adaptive model that achieves high-quality generation in 1-4 steps, eliminating step-specific training and improving efficiency. (3) We integrate ControlNet with SANA-Sprint for real-time interactive image generation, enabling instant visual feedback for user interaction. SANA-Sprint establishes a new Pareto frontier in speed-quality tradeoffs, achieving state-of-the-art performance with 7.59 FID and 0.74 GenEval in only 1 step - outperforming FLUX-schnell (7.94 FID / 0.71 GenEval) while being 10x faster (0.1s vs 1.1s on H100). It also achieves 0.1s (T2I) and 0.25s (ControlNet) latency for 1024 x 1024 images on H100, and 0.31s (T2I) on an RTX 4090, showcasing its exceptional efficiency and potential for AI-powered consumer applications (AIPC). Code and pre-trained models will be open-sourced.
中文: SANA-Sprint是一种超快速文生图扩散模型,通过混合蒸馏技术将推理步骤缩减至1-4步,在保持高质量生成的同时实现了突破性的速度提升和实时交互能力。
English: SANA-Sprint is an ultra-fast text-to-image diffusion model that reduces inference steps to 1-4 through hybrid distillation, achieving state-of-the-art speed-quality tradeoffs and real-time generation capabilities.

Authors:Junsong Chen, Shuchen Xue, Yuyang Zhao, Jincheng Yu, Sayak Paul, Junyu Chen, Han Cai, Song Han, Enze Xie
Title: SANA-Sprint: One-Step Diffusion with Continuous-Time Consistency Distillation
Abstract:
This paper presents SANA-Sprint, an efficient diffusion model for ultra-fast text-to-image (T2I) generation. SANA-Sprint is built on a pre-trained foundation model and augmented with hybrid distillation, dramatically reducing inference steps from 20 to 1-4. We introduce three key innovations: (1) We propose a training-free approach that transforms a pre-trained flow-matching model for continuous-time consistency distillation (sCM), eliminating costly training from scratch and achieving high training efficiency. Our hybrid distillation strategy combines sCM with latent adversarial distillation (LADD): sCM ensures alignment with the teacher model, while LADD enhances single-step generation fidelity. (2) SANA-Sprint is a unified step-adaptive model that achieves high-quality generation in 1-4 steps, eliminating step-specific training and improving efficiency. (3) We integrate ControlNet with SANA-Sprint for real-time interactive image generation, enabling instant visual feedback for user interaction. SANA-Sprint establishes a new Pareto frontier in speed-quality tradeoffs, achieving state-of-the-art performance with 7.59 FID and 0.74 GenEval in only 1 step - outperforming FLUX-schnell (7.94 FID / 0.71 GenEval) while being 10x faster (0.1s vs 1.1s on H100). It also achieves 0.1s (T2I) and 0.25s (ControlNet) latency for 1024 x 1024 images on H100, and 0.31s (T2I) on an RTX 4090, showcasing its exceptional efficiency and potential for AI-powered consumer applications (AIPC). Code and pre-trained models will be open-sourced.
中文: SANA-Sprint是一种超快速文生图扩散模型,通过混合蒸馏技术将推理步骤缩减至1-4步,在保持高质量生成的同时实现了突破性的速度提升和实时交互能力。
English: SANA-Sprint is an ultra-fast text-to-image diffusion model that reduces inference steps to 1-4 through hybrid distillation, achieving state-of-the-art speed-quality tradeoffs and real-time generation capabilities.

Authors:Peng Chen, Pi Bu, Yingyao Wang, Xinyi Wang, Ziming Wang, Jie Guo, Yingxiu Zhao, Qi Zhu, Jun Song, Siran Yang, Jiamang Wang, Bo Zheng
Title: CombatVLA: An Efficient Vision-Language-Action Model for Combat Tasks in 3D Action Role-Playing Games
Abstract:
Recent advances in Vision-Language-Action models (VLAs) have expanded the capabilities of embodied intelligence. However, significant challenges remain in real-time decision-making in complex 3D environments, which demand second-level responses, high-resolution perception, and tactical reasoning under dynamic conditions. To advance the field, we introduce CombatVLA, an efficient VLA model optimized for combat tasks in 3D action role-playing games(ARPGs). Specifically, our CombatVLA is a 3B model trained on video-action pairs collected by an action tracker, where the data is formatted as action-of-thought (AoT) sequences. Thereafter, CombatVLA seamlessly integrates into an action execution framework, allowing efficient inference through our truncated AoT strategy. Experimental results demonstrate that CombatVLA not only outperforms all existing models on the combat understanding benchmark but also achieves a 50-fold acceleration in game combat. Moreover, it has a higher task success rate than human players. We will open-source all resources, including the action tracker, dataset, benchmark, model weights, training code, and the implementation of the framework at https://combatvla.github.io/.
中文:CombatVLA是一种专为3D游戏战斗任务优化的高效视觉语言动作模型,在基准测试中不仅性能超越现有模型,还实现了50倍加速和高于人类玩家的任务成功率。
English: CombatVLA is an efficient 3B Vision-Language-Action model optimized for combat tasks in 3D games, achieving superior performance on benchmarks with 50x faster combat speed and higher success rates than humans.

Authors:Yang Nan, Huichi Zhou, Xiaodan Xing, Giorgos Papanastasiou, Lei Zhu, Zhifan Gao, Alejandro F Fangi, Guang Yang
Title: Revisiting Medical Image Retrieval via Knowledge Consolidation
Abstract:
As artificial intelligence and digital medicine increasingly permeate healthcare systems, robust governance frameworks are essential to ensure ethical, secure, and effective implementation. In this context, medical image retrieval becomes a critical component of clinical data management, playing a vital role in decision-making and safeguarding patient information. Existing methods usually learn hash functions using bottleneck features, which fail to produce representative hash codes from blended embeddings. Although contrastive hashing has shown superior performance, current approaches often treat image retrieval as a classification task, using category labels to create positive/negative pairs. Moreover, many methods fail to address the out-of-distribution (OOD) issue when models encounter external OOD queries or adversarial attacks. In this work, we propose a novel method to consolidate knowledge of hierarchical features and optimisation functions. We formulate the knowledge consolidation by introducing Depth-aware Representation Fusion (DaRF) and Structure-aware Contrastive Hashing (SCH). DaRF adaptively integrates shallow and deep representations into blended features, and SCH incorporates image fingerprints to enhance the adaptability of positive/negative pairings. These blended features further facilitate OOD detection and content-based recommendation, contributing to a secure AI-driven healthcare environment. Moreover, we present a content-guided ranking to improve the robustness and reproducibility of retrieval results. Our comprehensive assessments demonstrate that the proposed method could effectively recognise OOD samples and significantly outperform existing approaches in medical image retrieval (p<0.05). In particular, our method achieves a 5.6-38.9% improvement in mean Average Precision on the anatomical radiology dataset.
中文: 本研究提出了一种结合深度感知表征融合和结构感知对比哈希的新型医学图像检索方法,通过增强特征表示和分布外检测能力,在检索性能上显著超越现有方法并实现可量化的提升。
English: This study introduces a novel medical image retrieval method combining Depth-aware Representation Fusion and Structure-aware Contrastive Hashing to enhance feature representation and out-of-distribution detection, significantly outperforming existing approaches with measurable performance improvements.

Authors:Chiara Cappellino, Gianluca Mancusi, Matteo Mosconi, Angelo Porrello, Simone Calderara, Rita Cucchiara
Title: DitHub: A Modular Framework for Incremental Open-Vocabulary Object Detection
Abstract:
Open-Vocabulary object detectors can generalize to an unrestricted set of categories through simple textual prompting. However, adapting these models to rare classes or reinforcing their abilities on multiple specialized domains remains essential. While recent methods rely on monolithic adaptation strategies with a single set of weights, we embrace modular deep learning. We introduce DitHub, a framework designed to build and maintain a library of efficient adaptation modules. Inspired by Version Control Systems, DitHub manages expert modules as branches that can be fetched and merged as needed. This modular approach allows us to conduct an in-depth exploration of the compositional properties of adaptation modules, marking the first such study in Object Detection. Our method achieves state-of-the-art performance on the ODinW-13 benchmark and ODinW-O, a newly introduced benchmark designed to assess class reappearance. For more details, visit our project page: https://aimagelab.github.io/DitHub/
Chinese: DitHub采用模块化框架,将专家模块作为分支进行灵活适配,在开放词汇目标检测中实现了ODinW-13和ODinW-O等基准测试的最先进性能。
English: DitHub introduces a modular framework using expert modules as branches for flexible adaptation in open-vocabulary object detection, achieving state-of-the-art results on benchmarks like ODinW-13 and ODinW-O.

Authors:Minsu Kim, Rodrigo Mira, Honglie Chen, Stavros Petridis, Maja Pantic
Title: Contextual Speech Extraction: Leveraging Textual History as an Implicit Cue for Target Speech Extraction
Abstract:
In this paper, we investigate a novel approach for Target Speech Extraction (TSE), which relies solely on textual context to extract the target speech. We refer to this task as Contextual Speech Extraction (CSE). Unlike traditional TSE methods that rely on pre-recorded enrollment utterances, video of the target speaker's face, spatial information, or other explicit cues to identify the target stream, our proposed method requires only a few turns of previous dialogue (or monologue) history. This approach is naturally feasible in mobile messaging environments where voice recordings are typically preceded by textual dialogue that can be leveraged implicitly. We present three CSE models and analyze their performances on three datasets. Through our experiments, we demonstrate that even when the model relies purely on dialogue history, it can achieve over 90 % accuracy in identifying the correct target stream with only two previous dialogue turns. Furthermore, we show that by leveraging both textual context and enrollment utterances as cues during training, we further enhance our model's flexibility and effectiveness, allowing us to use either cue during inference, or combine both for improved performance. Samples and code available on https://miraodasilva.github.io/cse-project-page .
本文提出了一种新颖的上下文语音提取方法,仅利用文本对话历史即可提取目标语音,仅需两个对话轮次就能达到90%以上的准确率,同时保持结合注册语音线索的灵活性。
This paper introduces a novel Contextual Speech Extraction method that uses only textual dialogue history to extract target speech, achieving over 90% accuracy with just two dialogue turns while maintaining flexibility to incorporate enrollment cues.

Authors:Chenrui Ma, Rongchang Zhao, Xi Xiao, Hongyang Xie, Tianyang Wang, Xiao Wang, Hao Zhang, Yanning Shen
Title: CAD-VAE: Leveraging Correlation-Aware Latents for Comprehensive Fair Disentanglement
Abstract:
While deep generative models have significantly advanced representation learning, they may inherit or amplify biases and fairness issues by encoding sensitive attributes alongside predictive features. Enforcing strict independence in disentanglement is often unrealistic when target and sensitive factors are naturally correlated. To address this challenge, we propose CAD-VAE (Correlation-Aware Disentangled VAE), which introduces a correlated latent code to capture the shared information between target and sensitive attributes. Given this correlated latent, our method effectively separates overlapping factors without extra domain knowledge by directly minimizing the conditional mutual information between target and sensitive codes. A relevance-driven optimization strategy refines the correlated code by efficiently capturing essential correlated features and eliminating redundancy. Extensive experiments on benchmark datasets demonstrate that CAD-VAE produces fairer representations, realistic counterfactuals, and improved fairness-aware image editing.
中文:CAD-VAE通过引入相关潜在编码来分离目标属性和敏感属性之间的重叠信息,无需额外领域知识即可生成更公平的表征并提升图像编辑效果。
English: CAD-VAE addresses fairness in generative models by introducing a correlated latent code to separate overlapping target and sensitive attributes without requiring domain knowledge, resulting in fairer representations and improved image editing.

Authors:Xinxin Zhao, Haoyang Li, Jing Zhang, Xinmei Huang, Tieying Zhang, Jianjun Chen, Rui Shi, Cuiping Li, Hong Chen
Title: LLMIdxAdvis: Resource-Efficient Index Advisor Utilizing Large Language Model
Abstract:
Index recommendation is essential for improving query performance in database management systems (DBMSs) through creating an optimal set of indexes under specific constraints. Traditional methods, such as heuristic and learning-based approaches, are effective but face challenges like lengthy recommendation time, resource-intensive training, and poor generalization across different workloads and database schemas. To address these issues, we propose LLMIdxAdvis, a resource-efficient index advisor that uses large language models (LLMs) without extensive fine-tuning. LLMIdxAdvis frames index recommendation as a sequence-to-sequence task, taking target workload, storage constraint, and corresponding database environment as input, and directly outputting recommended indexes. It constructs a high-quality demonstration pool offline, using GPT-4-Turbo to synthesize diverse SQL queries and applying integrated heuristic methods to collect both default and refined labels. During recommendation, these demonstrations are ranked to inject database expertise via in-context learning. Additionally, LLMIdxAdvis extracts workload features involving specific column statistical information to strengthen LLM's understanding, and introduces a novel inference scaling strategy combining vertical scaling (via ''Index-Guided Major Voting'' and Best-of-N) and horizontal scaling (through iterative ''self-optimization'' with database feedback) to enhance reliability. Experiments on 3 OLAP and 2 real-world benchmarks reveal that LLMIdxAdvis delivers competitive index recommendation with reduced runtime, and generalizes effectively across different workloads and database schemas.
中文: LLMIdxAdvis是一种资源高效的索引顾问,利用大型语言模型无需大量微调,将索引推荐构建为序列到序列任务直接输出最优索引,并通过实验证明其在减少运行时间的同时,在不同负载和数据库模式中具有出色的泛化能力。
English: LLMIdxAdvis is a resource-efficient index advisor that leverages large language models without extensive fine-tuning, framing index recommendation as a sequence-to-task to directly output optimal indexes, and it demonstrates competitive performance with reduced runtime and strong generalization across workloads and database schemas.

Authors:Zhiming Yao, Haoyang Li, Jing Zhang, Cuiping Li, Hong Chen
Title: A Query Optimization Method Utilizing Large Language Models
Abstract:
Query optimization is a critical task in database systems, focused on determining the most efficient way to execute a query from an enormous set of possible strategies. Traditional approaches rely on heuristic search methods and cost predictions, but these often struggle with the complexity of the search space and inaccuracies in performance estimation, leading to suboptimal plan choices. This paper presents LLMOpt, a novel framework that leverages Large Language Models (LLMs) to address these challenges through two innovative components: (1) LLM for Plan Candidate Generation (LLMOpt(G)), which eliminates heuristic search by utilizing the reasoning abilities of LLMs to directly generate high-quality query plans, and (2) LLM for Plan Candidate Selection (LLMOpt(S)), a list-wise cost model that compares candidates globally to enhance selection accuracy. To adapt LLMs for query optimization, we propose fine-tuning pre-trained models using optimization data collected offline. Experimental results on the JOB, JOB-EXT, and Stack benchmarks show that LLMOpt(G) and LLMOpt(S) outperform state-of-the-art methods, including PostgreSQL, BAO, and HybridQO. Notably, LLMOpt(S) achieves the best practical performance, striking a balance between plan quality and inference efficiency.
中文: 本文提出LLMOpt框架,利用大型语言模型生成高质量查询计划并提升选择精度,在查询优化任务中超越了现有最优方法。
English: This paper introduces LLMOpt, a novel framework that leverages Large Language Models to generate high-quality query plans and enhance selection accuracy, outperforming state-of-the-art methods in query optimization.

Authors:Somayeh Hussaini, Tobias Fischer, Michael Milford
Title: Improving Visual Place Recognition with Sequence-Matching Receptiveness Prediction
Abstract:
In visual place recognition (VPR), filtering and sequence-based matching approaches can improve performance by integrating temporal information across image sequences, especially in challenging conditions. While these methods are commonly applied, their effects on system behavior can be unpredictable and can actually make performance worse in certain situations. In this work, we present a new supervised learning approach that learns to predict the per-frame sequence matching receptiveness (SMR) of VPR techniques, enabling the system to selectively decide when to trust the output of a sequence matching system. Our approach is agnostic to the underlying VPR technique and effectively predicts SMR, and hence significantly improves VPR performance across a large range of state-of-the-art and classical VPR techniques (namely CosPlace, MixVPR, EigenPlaces, SALAD, AP-GeM, NetVLAD and SAD), and across three benchmark VPR datasets (Nordland, Oxford RobotCar, and SFU-Mountain). We also provide insights into a complementary approach that uses the predictor to replace discarded matches, and present ablation studies including an analysis of the interactions between our SMR predictor and the selected sequence length.
Chinese: 本研究提出一种监督学习方法,通过预测序列匹配接受度来选择性地信任基于序列的视觉位置识别输出,显著提升了多种技术和数据集上的性能。
English: This study introduces a supervised learning method that predicts sequence matching receptiveness to selectively trust sequence-based VPR outputs, significantly enhancing performance across various techniques and datasets.

Authors:Tianyi Zhang, Weiming Zhi, Joshua Mangelson, Matthew Johnson-Roberson
Title: Infinite Leagues Under the Sea: Photorealistic 3D Underwater Terrain Generation by Latent Fractal Diffusion Models
Abstract:
This paper tackles the problem of generating representations of underwater 3D terrain. Off-the-shelf generative models, trained on Internet-scale data but not on specialized underwater images, exhibit downgraded realism, as images of the seafloor are relatively uncommon. To this end, we introduce DreamSea, a generative model to generate hyper-realistic underwater scenes. DreamSea is trained on real-world image databases collected from underwater robot surveys. Images from these surveys contain massive real seafloor observations and covering large areas, but are prone to noise and artifacts from the real world. We extract 3D geometry and semantics from the data with visual foundation models, and train a diffusion model that generates realistic seafloor images in RGBD channels, conditioned on novel fractal distribution-based latent embeddings. We then fuse the generated images into a 3D map, building a 3DGS model supervised by 2D diffusion priors which allows photorealistic novel view rendering. DreamSea is rigorously evaluated, demonstrating the ability to robustly generate large-scale underwater scenes that are consistent, diverse, and photorealistic. Our work drives impact in multiple domains, spanning filming, gaming, and robot simulation.
中文摘要:本文提出DreamSea生成模型,通过基于真实水下勘测数据的扩散模型和分形嵌入技术,生成超逼真的海底三维场景,实现电影、游戏和机器人仿真等领域的照片级真实感渲染。
English Summary: This paper introduces DreamSea, a generative model trained on real-world underwater survey data to create hyper-realistic 3D seafloor scenes using diffusion techniques and fractal embeddings, enabling photorealistic rendering for applications in film, gaming, and robotics.

Authors:Xinkun Wang, Yifang Wang, Senwei Liang, Feilong Tang, Chengzhi Liu, Ming Hu, Chao Hu, Junjun He, Zongyuan Ge, Imran Razzak
Title: Robust Multimodal Learning for Ophthalmic Disease Grading via Disentangled Representation
Abstract:
This paper discusses how ophthalmologists often rely on multimodal data to improve diagnostic accuracy. However, complete multimodal data is rare in real-world applications due to a lack of medical equipment and concerns about data privacy. Traditional deep learning methods typically address these issues by learning representations in latent space. However, the paper highlights two key limitations of these approaches: (i) Task-irrelevant redundant information (e.g., numerous slices) in complex modalities leads to significant redundancy in latent space representations. (ii) Overlapping multimodal representations make it difficult to extract unique features for each modality. To overcome these challenges, the authors propose the Essence-Point and Disentangle Representation Learning (EDRL) strategy, which integrates a self-distillation mechanism into an end-to-end framework to enhance feature selection and disentanglement for more robust multimodal learning. Specifically, the Essence-Point Representation Learning module selects discriminative features that improve disease grading performance. The Disentangled Representation Learning module separates multimodal data into modality-common and modality-unique representations, reducing feature entanglement and enhancing both robustness and interpretability in ophthalmic disease diagnosis. Experiments on multimodal ophthalmology datasets show that the proposed EDRL strategy significantly outperforms current state-of-the-art methods.
中文: 本文提出的EDRL策略通过自蒸馏机制选择关键特征并解耦模态共性与独特性表示,有效解决了眼科多模态数据中的冗余和特征纠缠问题,显著提升了疾病诊断性能。
English: This paper introduces the EDRL strategy to address redundancy and feature entanglement in multimodal ophthalmic data by using self-distillation for discriminative feature selection and disentangling modality-common and unique representations, achieving superior performance in disease diagnosis.

Authors:Ruizhe Chen, Wenhao Chai, Zhifei Yang, Xiaotian Zhang, Joey Tianyi Zhou, Tony Quek, Soujanya Poria, Zuozhu Liu
Title: DiffPO: Diffusion-styled Preference Optimization for Efficient Inference-Time Alignment of Large Language Models
Abstract:
Inference-time alignment provides an efficient alternative for aligning LLMs with humans. However, these approaches still face challenges, such as limited scalability due to policy-specific value functions and latency during the inference phase. In this paper, we propose a novel approach, Diffusion-styled Preference Optimization (\model), which provides an efficient and policy-agnostic solution for aligning LLMs with humans. By directly performing alignment at sentence level, \model~avoids the time latency associated with token-level generation. Designed as a plug-and-play module, \model~can be seamlessly integrated with various base models to enhance their alignment. Extensive experiments on AlpacaEval 2, MT-bench, and HH-RLHF demonstrate that \model~achieves superior alignment performance across various settings, achieving a favorable trade-off between alignment quality and inference-time latency. Furthermore, \model~demonstrates model-agnostic scalability, significantly improving the performance of large models such as Llama-3-70B.
Chinese: 提出的扩散式偏好优化方法通过句子级对齐实现了高效且与策略无关的大型语言模型人本对齐,在多种基准测试中展现出优越性能并显著降低推理延迟。
English: The proposed Diffusion-styled Preference Optimization (DPO) method enables efficient and policy-agnostic alignment of LLMs with humans by performing sentence-level optimization, achieving superior performance with reduced inference latency across various benchmarks.

Authors:Valerii Serpiva, Artem Lykov, Artyom Myshlyaev, Muhammad Haris Khan, Ali Alridha Abdulkarim, Oleg Sautenkov, Dzmitry Tsetserukou
Title: RaceVLA: VLA-based Racing Drone Navigation with Human-like Behaviour
Abstract:
RaceVLA presents an innovative approach for autonomous racing drone navigation by leveraging Visual-Language-Action (VLA) to emulate human-like behavior. This research explores the integration of advanced algorithms that enable drones to adapt their navigation strategies based on real-time environmental feedback, mimicking the decision-making processes of human pilots. The model, fine-tuned on a collected racing drone dataset, demonstrates strong generalization despite the complexity of drone racing environments. RaceVLA outperforms OpenVLA in motion (75.0 vs 60.0) and semantic generalization (45.5 vs 36.3), benefiting from the dynamic camera and simplified motion tasks. However, visual (79.6 vs 87.0) and physical (50.0 vs 76.7) generalization were slightly reduced due to the challenges of maneuvering in dynamic environments with varying object sizes. RaceVLA also outperforms RT-2 across all axes - visual (79.6 vs 52.0), motion (75.0 vs 55.0), physical (50.0 vs 26.7), and semantic (45.5 vs 38.8), demonstrating its robustness for real-time adjustments in complex environments. Experiments revealed an average velocity of 1.04 m/s, with a maximum speed of 2.02 m/s, and consistent maneuverability, demonstrating RaceVLA's ability to handle high-speed scenarios effectively. These findings highlight the potential of RaceVLA for high-performance navigation in competitive racing contexts. The RaceVLA codebase, pretrained weights, and dataset are available at this http URL: https://racevla.github.io/
中文: RaceVLA通过视觉-语言-动作模型实现了自主竞速无人机的人类化导航决策,在运动和语义泛化方面表现优异,并能有效应对复杂环境中的高速飞行挑战。
English: RaceVLA introduces a Visual-Language-Action model for autonomous racing drones that mimics human pilot decision-making, achieving superior performance in motion and semantic generalization while maintaining effective high-speed navigation in complex environments.

Authors:Yasheerah Yaqoot, Muhammad Ahsan Mustafa, Oleg Sautenkov, Artem Lykov, Valerii Serpiva, Dzmitry Tsetserukou
Title: UAV-VLRR: Vision-Language Informed NMPC for Rapid Response in UAV Search and Rescue
Abstract:
Emergency search and rescue (SAR) operations often require rapid and precise target identification in complex environments where traditional manual drone control is inefficient. In order to address these scenarios, a rapid SAR system, UAV-VLRR (Vision-Language-Rapid-Response), is developed in this research. This system consists of two aspects: 1) A multimodal system which harnesses the power of Visual Language Model (VLM) and the natural language processing capabilities of ChatGPT-4o (LLM) for scene interpretation. 2) A non-linearmodel predictive control (NMPC) with built-in obstacle avoidance for rapid response by a drone to fly according to the output of the multimodal system. This work aims at improving response times in emergency SAR operations by providing a more intuitive and natural approach to the operator to plan the SAR mission while allowing the drone to carry out that mission in a rapid and safe manner. When tested, our approach was faster on an average by 33.75% when compared with an off-the-shelf autopilot and 54.6% when compared with a human pilot. Video of UAV-VLRR: https://youtu.be/KJqQGKKt1xY
中文:UAV-VLRR系统融合了多模态视觉语言模型和非线性模型预测控制技术,使无人机能够执行快速避障的紧急搜救任务,响应时间比自动导航快33.75%,比人工操作快54.6%。
English: The UAV-VLRR system integrates a multimodal vision-language model and nonlinear model predictive control to enable drones to perform rapid, obstacle-avoiding emergency search and rescue, achieving response times 33.75% faster than autopilot and 54.6% faster than human pilots.

Authors:Oleg Sautenkov, Aibek Akhmetkazy, Yasheerah Yaqoot, Muhammad Ahsan Mustafa, Grik Tadevosyan, Artem Lykov, Dzmitry Tsetserukou
Title: UAV-VLPA*: A Vision-Language-Path-Action System for Optimal Route Generation on a Large Scales
Abstract:
The UAV-VLPA* (Visual-Language-Planning-and-Action) system represents a cutting-edge advancement in aerial robotics, designed to enhance communication and operational efficiency for unmanned aerial vehicles (UAVs). By integrating advanced planning capabilities, the system addresses the Traveling Salesman Problem (TSP) to optimize flight paths, reducing the total trajectory length by 18.5\% compared to traditional methods. Additionally, the incorporation of the A* algorithm enables robust obstacle avoidance, ensuring safe and efficient navigation in complex environments. The system leverages satellite imagery processing combined with the Visual Language Model (VLM) and GPT's natural language processing capabilities, allowing users to generate detailed flight plans through simple text commands. This seamless fusion of visual and linguistic analysis empowers precise decision-making and mission planning, making UAV-VLPA* a transformative tool for modern aerial operations. With its unmatched operational efficiency, navigational safety, and user-friendly functionality, UAV-VLPA* sets a new standard in autonomous aerial robotics, paving the way for future innovations in the field.
中文:UAV-VLPA*系统融合视觉语言模型与先进规划算法,通过A*算法实现强效避障并将飞行轨迹总长度优化18.5%,其基于文本指令的智能规划功能为自主空中作业树立了新标杆。
English: The UAV-VLPA* system integrates visual-language models and advanced planning algorithms to optimize UAV flight paths by 18.5% and ensure safe navigation through robust obstacle avoidance, revolutionizing autonomous aerial operations with user-friendly text-based commands.

Authors:Lianyu Wang, Meng Wang, Huazhu Fu, Daoqiang Zhang
Title: Vision-Language Model IP Protection via Prompt-based Learning
Abstract:
Vision-language models (VLMs) like CLIP (Contrastive Language-Image Pre-Training) have seen remarkable success in visual recognition, highlighting the increasing need to safeguard the intellectual property (IP) of well-trained models. Effective IP protection extends beyond ensuring authorized usage; it also necessitates restricting model deployment to authorized data domains, particularly when the model is fine-tuned for specific target domains. However, current IP protection methods often rely solely on the visual backbone, which may lack sufficient semantic richness. To bridge this gap, we introduce IP-CLIP, a lightweight IP protection strategy tailored to CLIP, employing a prompt-based learning approach. By leveraging the frozen visual backbone of CLIP, we extract both image style and content information, incorporating them into the learning of IP prompt. This strategy acts as a robust barrier, effectively preventing the unauthorized transfer of features from authorized domains to unauthorized ones. Additionally, we propose a style-enhancement branch that constructs feature banks for both authorized and unauthorized domains. This branch integrates self-enhanced and cross-domain features, further strengthening IP-CLIP's capability to block features from unauthorized domains. Finally, we present new three metrics designed to better balance the performance degradation of authorized and unauthorized domains. Comprehensive experiments in various scenarios demonstrate its promising potential for application in IP protection tasks for VLMs.
中文: IP-CLIP提出了一种轻量级的基于提示学习的策略,通过利用CLIP视觉骨干中的风格和内容信息,结合风格增强分支和新评估指标,有效阻止模型在未授权领域的特征迁移。
English: IP-CLIP introduces a lightweight, prompt-based strategy to protect CLIP model IP by leveraging style and content information from its visual backbone, effectively blocking unauthorized domain transfers through a style-enhancement branch and new evaluation metrics.

Authors:Jiwan Chung, Saejin Kim, Yongrae Jo, Jaewoo Park, Dongjun Min, Youngjae Yu
Title: Teaching Metric Distance to Autoregressive Multimodal Foundational Models
Abstract:
As large language models expand beyond natural language to domains such as mathematics, multimodal understanding, and embodied agents, tokens increasingly reflect metric relationships rather than purely linguistic meaning. We introduce DIST2Loss, a distance-aware framework designed to train autoregressive discrete models by leveraging predefined distance relationships among output tokens. At its core, DIST2Loss transforms continuous exponential family distributions derived from inherent distance metrics into discrete, categorical optimization targets compatible with the models' architectures. This approach enables the models to learn and preserve meaningful distance relationships during token generation while maintaining compatibility with existing architectures. Empirical evaluations show consistent performance gains in diverse multimodal applications, including visual grounding, robotic manipulation, generative reward modeling, and image generation using vector-quantized features. These improvements are most notable in low-data regimes, demonstrating DIST2Loss's strength under resource constraints.
中文摘要:DIST2Loss是一种距离感知训练框架,通过利用输出标记间的预设距离关系,使自回归离散模型能够在保持架构兼容性的同时学习并维持有意义的距离关系,在多种多模态应用中实现稳定性能提升,尤其在低数据条件下表现突出。
English Summary: DIST2Loss is a distance-aware training framework that enables autoregressive discrete models to learn and preserve meaningful metric relationships between tokens, achieving consistent performance gains across various multimodal applications, especially in low-data scenarios.

Authors:Jiwan Chung, Saejin Kim, Yongrae Jo, Jaewoo Park, Dongjun Min, Youngjae Yu
Title: Teaching Metric Distance to Discrete Autoregressive Language Models
Abstract:
As large language models expand beyond natural language to domains such as mathematics, multimodal understanding, and embodied agents, tokens increasingly reflect metric relationships rather than purely linguistic meaning. We introduce DIST2Loss, a distance-aware framework designed to train autoregressive discrete models by leveraging predefined distance relationships among output tokens. At its core, DIST2Loss transforms continuous exponential family distributions derived from inherent distance metrics into discrete, categorical optimization targets compatible with the models' architectures. This approach enables the models to learn and preserve meaningful distance relationships during token generation while maintaining compatibility with existing architectures. Empirical evaluations show consistent performance gains in diverse multimodal applications, including visual grounding, robotic manipulation, generative reward modeling, and image generation using vector-quantized features. These improvements are most notable in low-data regimes, demonstrating DIST2Loss's strength under resource constraints.
中文摘要:DIST2Loss是一种距离感知训练框架,通过利用输出标记间的预设距离关系,使自回归离散模型能够在保持架构兼容性的同时学习并维持有意义的距离关系,在多种多模态应用中实现稳定性能提升,尤其在低数据条件下表现突出。
English Summary: DIST2Loss is a distance-aware training framework that enables autoregressive discrete models to learn and preserve meaningful metric relationships between tokens, achieving consistent performance gains across various multimodal applications, especially in low-data scenarios.

Authors:Ahmet Selim Çanakçı, Niclas Vödisch, Kürsat Petek, Wolfram Burgard, Abhinav Valada
Title: Label-Efficient LiDAR Panoptic Segmentation
Abstract:
A main bottleneck of learning-based robotic scene understanding methods is the heavy reliance on extensive annotated training data, which often limits their generalization ability. In LiDAR panoptic segmentation, this challenge becomes even more pronounced due to the need to simultaneously address both semantic and instance segmentation from complex, high-dimensional point cloud data. In this work, we address the challenge of LiDAR panoptic segmentation with very few labeled samples by leveraging recent advances in label-efficient vision panoptic segmentation. To this end, we propose a novel method, Limited-Label LiDAR Panoptic Segmentation (L3PS), which requires only a minimal amount of labeled data. Our approach first utilizes a label-efficient 2D network to generate panoptic pseudo-labels from a small set of annotated images, which are subsequently projected onto point clouds. We then introduce a novel 3D refinement module that capitalizes on the geometric properties of point clouds. By incorporating clustering techniques, sequential scan accumulation, and ground point separation, this module significantly enhances the accuracy of the pseudo-labels, improving segmentation quality by up to +10.6 PQ and +7.9 mIoU. We demonstrate that these refined pseudo-labels can be used to effectively train off-the-shelf LiDAR segmentation networks. Through extensive experiments, we show that L3PS not only outperforms existing methods but also substantially reduces the annotation burden. We release the code of our work at https://l3ps.cs.uni-freiburg.de.
中文: 本研究提出L3PS方法,通过从少量标注图像生成全景伪标签并利用三维模块优化,显著提升了激光雷达全景分割的性能(最高达+10.6 PQ和+7.9 mIoU),同时大幅降低了标注需求。
English: This study introduces L3PS, a novel method for LiDAR panoptic segmentation that minimizes the need for labeled data by generating pseudo-labels from 2D images and refining them with a 3D module, achieving significant performance improvements of up to +10.6 PQ and +7.9 mIoU while reducing annotation efforts.

Authors:Haoyang Li, Shang Wu, Xiaokang Zhang, Xinmei Huang, Jing Zhang, Fuxin Jiang, Shuai Wang, Tieying Zhang, Jianjun Chen, Rui Shi, Hong Chen, Cuiping Li
Title: OmniSQL: Synthesizing High-quality Text-to-SQL Data at Scale
Abstract:
Text-to-SQL, the task of translating natural language questions into SQL queries, plays a crucial role in enabling non-experts to interact with databases. While recent advancements in large language models (LLMs) have significantly enhanced text-to-SQL performance, existing approaches face notable limitations in real-world text-to-SQL applications. Prompting-based methods often depend on closed-source LLMs, which are expensive, raise privacy concerns, and lack customization. Fine-tuning-based methods, on the other hand, suffer from poor generalizability due to the limited coverage of publicly available training data. To overcome these challenges, we propose a novel and scalable text-to-SQL data synthesis framework for automatically synthesizing large-scale, high-quality, and diverse datasets without extensive human intervention. Using this framework, we introduce SynSQL-2.5M, the first million-scale text-to-SQL dataset, containing 2.5 million samples spanning over 16,000 synthetic databases. Each sample includes a database, SQL query, natural language question, and chain-of-thought (CoT) solution. Leveraging SynSQL-2.5M, we develop OmniSQL, a powerful open-source text-to-SQL model available in three sizes: 7B, 14B, and 32B. Extensive evaluations across nine datasets demonstrate that OmniSQL achieves state-of-the-art performance, matching or surpassing leading closed-source and open-source LLMs, including GPT-4o and DeepSeek-V3, despite its smaller size. We release all code, datasets, and models to support further research.
中文: 本文提出了一种可扩展的文本到SQL数据合成框架,基于此构建的OmniSQL开源模型在多个基准测试中达到了最先进的性能水平。
English: This paper introduces a scalable framework for synthesizing large-scale text-to-SQL datasets, which enables the development of OmniSQL, an open-source model that achieves state-of-the-art performance across multiple benchmarks.

Authors:Kunjun Li, Cheng-Yen Yang, Hsiang-Wei Huang, Jenq-Neng Hwang
Title: Technical Report for ReID-SAM on SkiTB Visual Tracking Challenge 2025
Abstract:
This report introduces ReID-SAM, a novel model developed for the SkiTB Challenge that addresses the complexities of tracking skier appearance. Our approach integrates the SAMURAI tracker with a person re-identification (Re-ID) module and advanced post-processing techniques to enhance accuracy in challenging skiing scenarios. We employ an OSNet-based Re-ID model to minimize identity switches and utilize YOLOv11 with Kalman filtering or STARK-based object detection for precise equipment tracking. When evaluated on the SkiTB dataset, ReID-SAM achieved a state-of-the-art F1-score of 0.870, surpassing existing methods across alpine, ski jumping, and freestyle skiing disciplines. These results demonstrate significant advancements in skier tracking accuracy and provide valuable insights for computer vision applications in winter sports.
Chinese: ReID-SAM是一种创新模型,将SAMURAI追踪器与行人重识别模块及先进后处理技术相结合,在SkiTB数据集上实现了0.870的最优F1分数,显著提升了冬季运动中滑雪者追踪的准确性。
English: ReID-SAM is a novel model combining SAMURAI tracking with a Re-ID module and advanced post-processing, achieving a state-of-the-art F1-score of 0.870 on the SkiTB dataset for enhanced skier tracking accuracy in winter sports.

Authors:Antoni Bigata, Michał Stypułkowski, Rodrigo Mira, Stella Bounareli, Konstantinos Vougioukas, Zoe Landgraf, Nikita Drobyshev, Maciej Zieba, Stavros Petridis, Maja Pantic
Title: KeyFace: Expressive Audio-Driven Facial Animation for Long Sequences via KeyFrame Interpolation
Abstract:
Current audio-driven facial animation methods achieve impressive results for short videos but suffer from error accumulation and identity drift when extended to longer durations. Existing methods attempt to mitigate this through external spatial control, increasing long-term consistency but compromising the naturalness of motion. We propose KeyFace, a novel two-stage diffusion-based framework, to address these issues. In the first stage, keyframes are generated at a low frame rate, conditioned on audio input and an identity frame, to capture essential facial expressions and movements over extended periods of time. In the second stage, an interpolation model fills in the gaps between keyframes, ensuring smooth transitions and temporal coherence. To further enhance realism, we incorporate continuous emotion representations and handle a wide range of non-speech vocalizations (NSVs), such as laughter and sighs. We also introduce two new evaluation metrics for assessing lip synchronization and NSV generation. Experimental results show that KeyFace outperforms state-of-the-art methods in generating natural, coherent facial animations over extended durations, successfully encompassing NSVs and continuous emotions.
中文摘要:KeyFace是一种基于扩散模型的双阶段框架,首先生成音频和身份信息驱动的关键帧,再通过插值实现平滑过渡,从而在长时视频中生成自然连贯的面部动画,有效处理非语音发声和连续情感表达。
English Summary: KeyFace is a two-stage diffusion-based framework that generates keyframes from audio and identity inputs, then interpolates them to produce natural, coherent facial animations over long durations, effectively handling non-speech vocalizations and continuous emotions.

Authors:Francesco Bacchiocchi, Matteo Castiglioni, Alberto Marchesi, Nicola Gatti
Title: Regret Minimization for Piecewise Linear Rewards: Contracts, Auctions, and Beyond
Abstract:
Most microeconomic models of interest involve optimizing a piecewise linear function. These include contract design in hidden-action principal-agent problems, selling an item in posted-price auctions, and bidding in first-price auctions. When the relevant model parameters are unknown and determined by some (unknown) probability distributions, the problem becomes learning how to optimize an unknown and stochastic piecewise linear reward function. Such a problem is usually framed within an online learning framework, where the decision-maker (learner) seeks to minimize the regret of not knowing an optimal decision in hindsight. This paper introduces a general online learning framework that offers a unified approach to tackle regret minimization for piecewise linear rewards, under a suitable monotonicity assumption commonly satisfied by microeconomic models. We design a learning algorithm that attains a regret of $\widetilde{O}(\sqrt{nT})$, where $n$ is the number of ``pieces'' of the reward function and $T$ is the number of rounds. This result is tight when $n$ is \emph{small} relative to $T$, specifically when $n \leq T^{1/3}$. Our algorithm solves two open problems in the literature on learning in microeconomic settings. First, it shows that the $\widetilde{O}(T^{2/3})$ regret bound obtained by Zhu et al. [Zhu+23] for learning optimal linear contracts in hidden-action principal-agent problems is not tight when the number of agent's actions is small relative to $T$. Second, our algorithm demonstrates that, in the problem of learning to set prices in posted-price auctions, it is possible to attain suitable (and desirable) instance-independent regret bounds, addressing an open problem posed by Cesa-Bianchi et al. [CBCP19].
中文: 本文针对微观经济模型中的随机分段线性奖励函数优化,提出了一个统一的在线学习框架,在单调性假设下实现了$\widetilde{O}(\sqrt{nT})$的紧致遗憾界,并解决了合约设计和拍卖定价中的两个开放性问题。
English: This paper presents a unified online learning framework for optimizing stochastic piecewise linear reward functions in microeconomic models, achieving a tight regret bound of $\widetilde{O}(\sqrt{nT})$ under a monotonicity assumption and resolving two open problems in contract design and auction pricing.

Authors:Anna Lunghi, Matteo Castiglioni, Alberto Marchesi
Title: Online Two-Sided Markets: Many Buyers Enhance Learning
Abstract:
We study a repeated trading problem in which a mechanism designer facilitates trade between a single seller and multiple buyers. Our model generalizes the classic bilateral trade setting to a multi-buyer environment. Specifically, the mechanism designer runs a second-price auction among the buyers -- extending the fixed-price mechanism used in bilateral trade -- before proposing a price to the seller. While this setting introduces new challenges compared to bilateral trade, it also provides an informational advantage. Indeed, the presence of multiple buyers enhances competition, inducing them to reveal their valuations in order to win the auction. However, as in bilateral trade, the seller faces a binary decision: whether to accept the proposed price or not. We show that this asymmetric feedback, which is more informative than in bilateral trade, allows us to break some lower bounds on regret minimization with a single buyer. In particular, we provide a $\tilde O(T^{2/3})$ regret upper bound with respect to an optimal strong budget-balanced mechanism, without any assumptions on the distribution of valuations. Our main tool for achieving this result is the design of an adaptive grid that approximates the optimal gain from trade across the continuum of possible mechanisms. Furthermore, we attain the same regret bound with respect to an optimal global budget-balanced mechanism, under two possible conditions: (i) buyers' and seller's valuations are independent, or (ii) valuations are drawn from a distribution with bounded density. In doing so, we provide some novel technical results on constrained MABs with feedback graphs, which may be of independent interest.
中文摘要:本研究将双边交易扩展至多买家环境,通过二级价格拍卖和自适应机制设计,在不依赖估值分布假设的情况下实现了更优的遗憾上界。
English summary: This research extends bilateral trade to a multi-buyer setting using second-price auctions and achieves improved regret bounds without distributional assumptions through adaptive mechanism design.

Authors:Artem Lykov, Valerii Serpiva, Muhammad Haris Khan, Oleg Sautenkov, Artyom Myshlyaev, Grik Tadevosyan, Yasheerah Yaqoot, Dzmitry Tsetserukou
Title: CognitiveDrone: A VLA Model and Evaluation Benchmark for Real-Time Cognitive Task Solving and Reasoning in UAVs
Abstract:
This paper introduces CognitiveDrone, a novel Vision-Language-Action (VLA) model tailored for complex Unmanned Aerial Vehicles (UAVs) tasks that demand advanced cognitive abilities. Trained on a dataset comprising over 8,000 simulated flight trajectories across three key categories-Human Recognition, Symbol Understanding, and Reasoning-the model generates real-time 4D action commands based on first-person visual inputs and textual instructions. To further enhance performance in intricate scenarios, we propose CognitiveDrone-R1, which integrates an additional Vision-Language Model (VLM) reasoning module to simplify task directives prior to high-frequency control. Experimental evaluations using our open-source benchmark, CognitiveDroneBench, reveal that while a racing-oriented model (RaceVLA) achieves an overall success rate of 31.3%, the base CognitiveDrone model reaches 59.6%, and CognitiveDrone-R1 attains a success rate of 77.2%. These results demonstrate improvements of up to 30% in critical cognitive tasks, underscoring the effectiveness of incorporating advanced reasoning capabilities into UAV control systems. Our contributions include the development of a state-of-the-art VLA model for UAV control and the introduction of the first dedicated benchmark for assessing cognitive tasks in drone operations. The complete repository is available at cognitivedrone.github.io
Chinese: 本文介绍了CognitiveDrone这一面向无人机的视觉-语言-动作模型,通过集成推理模块在认知任务中取得77.2%的成功率,同时推出了首个无人机认知任务专用基准测试。
English: This paper presents CognitiveDrone, a Vision-Language-Action model for UAVs that achieves a 77.2% success rate in cognitive tasks through integrated reasoning modules, alongside the introduction of the first dedicated benchmark for drone cognitive operations.

Authors:Chuang Cheng, Xinglong Zhang, Xieyuanli Chen, Wei Dai, Longwen Chen, Daoxun Zhang, Hui Zhang, Jie Jiang, Huimin Lu
Title: Flexible Exoskeleton Control Based on Binding Alignment Strategy and Full-arm Coordination Mechanism
Abstract:
In rehabilitation, powered, and teleoperation exoskeletons, connecting the human body to the exoskeleton through binding attachments is a common configuration. However, the uncertainty of the tightness and the donning deviation of the binding attachments will affect the flexibility and comfort of the exoskeletons, especially during high-speed movement. To address this challenge, this paper presents a flexible exoskeleton control approach with binding alignment and full-arm coordination. Firstly, the sources of the force interaction caused by donning offsets are analyzed, based on which the interactive force data is classified into the major, assistant, coordination, and redundant component categories. Then, a binding alignment strategy (BAS) is proposed to reduce the donning disturbances by combining different force data. Furthermore, we propose a full-arm coordination mechanism (FCM) that focuses on two modes of arm movement intent, joint-oriented and target-oriented, to improve the flexible performance of the whole exoskeleton control during high-speed motion. In this method, we propose an algorithm to distinguish the two intentions to resolve the conflict issue of the force component. Finally, a series of experiments covering various aspects of exoskeleton performance (flexibility, adaptability, accuracy, speed, and fatigue) were conducted to demonstrate the benefits of our control framework in our full-arm exoskeleton.
Chinese: 本文提出了一种结合绑定对齐和全臂协调的柔性外骨骼控制方法,旨在减少穿戴偏移带来的干扰,并提升高速运动时的整体性能。
English: This paper introduces a flexible exoskeleton control method that uses binding alignment and full-arm coordination to reduce disturbances from donning deviations and enhance performance during high-speed movements.

Authors:Yijie Tang, Jiazhao Zhang, Yuqing Lan, Yulan Guo, Dezun Dong, Chenyang Zhu, Kai Xu
Title: OnlineAnySeg: Online Zero-Shot 3D Segmentation by Visual Foundation Model Guided 2D Mask Merging
Abstract:
Online zero-shot 3D instance segmentation of a progressively reconstructed scene is both a critical and challenging task for embodied applications. With the success of visual foundation models (VFMs) in the image domain, leveraging 2D priors to address 3D online segmentation has become a prominent research focus. Since segmentation results provided by 2D priors often require spatial consistency to be lifted into final 3D segmentation, an efficient method for identifying spatial overlap among 2D masks is essential - yet existing methods rarely achieve this in real time, mainly limiting its use to offline approaches. To address this, we propose an efficient method that lifts 2D masks generated by VFMs into a unified 3D instance using a hashing technique. By employing voxel hashing for efficient 3D scene querying, our approach reduces the time complexity of costly spatial overlap queries from $O(n^2)$ to $O(n)$. Accurate spatial associations further enable 3D merging of 2D masks through simple similarity-based filtering in a zero-shot manner, making our approach more robust to incomplete and noisy data. Evaluated on the ScanNet and SceneNN benchmarks, our approach achieves state-of-the-art performance in online, zero-shot 3D instance segmentation with leading efficiency.
中文: 本文提出一种高效的在线零样本三维实例分割方法,通过结合二维视觉基础模型生成的掩码与体素哈希技术,将空间重叠查询的时间复杂度从O(n²)降至O(n),在基准测试中以领先效率实现了最优性能。
English: This paper introduces an efficient online method for zero-shot 3D instance segmentation by leveraging 2D visual foundation model masks and voxel hashing to reduce spatial overlap queries from O(n²) to O(n), achieving state-of-the-art performance on benchmarks with superior efficiency.

Authors:Junjie Sheng, Jiehao Wu, Haochuan Cui, Yiqiu Hu, Wenli Zhou, Lei Zhu, Qian Peng, Wenhao Li, Xiangfeng Wang
Title: Scalable Reinforcement Learning for Virtual Machine Scheduling
Abstract:
Recent advancements in reinforcement learning (RL) have shown promise for optimizing virtual machine scheduling (VMS) in small-scale clusters. The utilization of RL to large-scale cloud computing scenarios remains notably constrained. This paper introduces a scalable RL framework, called Cluster Value Decomposition Reinforcement Learning (CVD-RL), to surmount the scalability hurdles inherent in large-scale VMS. The CVD-RL framework innovatively combines a decomposition operator with a look-ahead operator to adeptly manage representation complexities, while complemented by a Top-$k$ filter operator that refines exploration efficiency. Different from existing approaches limited to clusters of $10$ or fewer physical machines (PMs), CVD-RL extends its applicability to environments encompassing up to $50$ PMs. Furthermore, the CVD-RL framework demonstrates generalization capabilities that surpass contemporary SOTA methodologies across a variety of scenarios in empirical studies. This breakthrough not only showcases the framework's exceptional scalability and performance but also represents a significant leap in the application of RL for VMS within complex, large-scale cloud infrastructures. The code is available at https://anonymous.4open.science/r/marl4sche-D0FE.
中文摘要:本文提出的可扩展CVD-RL框架突破了大规模虚拟机调度中的扩展性限制,将适用范围扩展至50台物理机,并在多种场景中展现出卓越的泛化性能。
English Summary: This paper introduces the scalable CVD-RL framework that overcomes scalability limitations in large-scale virtual machine scheduling, extending applicability to 50 physical machines while demonstrating superior generalization capabilities across diverse scenarios.

Authors:Yunian Pan, Tao Li, Quanyan Zhu
Title: Model-Agnostic Meta-Policy Optimization via Zeroth-Order Estimation: A Linear Quadratic Regulator Perspective
Abstract:
Meta-learning has been proposed as a promising machine learning topic in recent years, with important applications to image classification, robotics, computer games, and control systems. In this paper, we study the problem of using meta-learning to deal with uncertainty and heterogeneity in ergodic linear quadratic regulators. We integrate the zeroth-order optimization technique with a typical meta-learning method, proposing an algorithm that omits the estimation of policy Hessian, which applies to tasks of learning a set of heterogeneous but similar linear dynamic systems. The induced meta-objective function inherits important properties of the original cost function when the set of linear dynamic systems are meta-learnable, allowing the algorithm to optimize over a learnable landscape without projection onto the feasible set. We provide stability and convergence guarantees for the exact gradient descent process by analyzing the boundedness and local smoothness of the gradient for the meta-objective, which justify the proposed algorithm with gradient estimation error being small. We provide the sample complexity conditions for these theoretical guarantees, as well as a numerical example at the end to corroborate this perspective.
中文: 本文提出一种结合零阶优化的元学习算法,用于处理遍历线性二次调节器中的不确定性,在无需策略海森矩阵估计的情况下提供了稳定性与收敛性的理论保证。
English: This paper introduces a meta-learning algorithm combining zeroth-order optimization to handle uncertainty in ergodic linear quadratic regulators, providing theoretical guarantees for stability and convergence without policy Hessian estimation.

Authors:Bin Xie, Yingfei Liu, Tiancai Wang, Jiale Cao, Xiangyu Zhang
Title: Glad: A Streaming Scene Generator for Autonomous Driving
Abstract:
The generation and simulation of diverse real-world scenes have significant application value in the field of autonomous driving, especially for the corner cases. Recently, researchers have explored employing neural radiance fields or diffusion models to generate novel views or synthetic data under driving scenes. However, these approaches suffer from unseen scenes or restricted video length, thus lacking sufficient adaptability for data generation and simulation. To address these issues, we propose a simple yet effective framework, named Glad, to generate video data in a frame-by-frame style. To ensure the temporal consistency of synthetic video, we introduce a latent variable propagation module, which views the latent features of previous frame as noise prior and injects it into the latent features of current frame. In addition, we design a streaming data sampler to orderly sample the original image in a video clip at continuous iterations. Given the reference frame, our Glad can be viewed as a streaming simulator by generating the videos for specific scenes. Extensive experiments are performed on the widely-used nuScenes dataset. Experimental results demonstrate that our proposed Glad achieves promising performance, serving as a strong baseline for online video generation. We will release the source code and models publicly.
中文摘要:提出的Glad框架通过潜在变量传播模块和流式数据采样器,有效生成了自动驾驶场景中时序一致的视频数据,解决了现有方法的局限性,并在nuScenes数据集上展现出优越性能。
English Summary: The proposed Glad framework effectively generates temporally consistent video data for autonomous driving scenes by using a latent variable propagation module and streaming data sampler, addressing limitations in existing methods and demonstrating strong performance on the nuScenes dataset.

Authors:Fang Yan, Jianfeng Wu, Jiawen Li, Wei Wang, Jiaxuan Lu, Wen Chen, Zizhao Gao, Jianan Li, Hong Yan, Jiabo Ma, Minda Chen, Yang Lu, Qing Chen, Yizhi Wang, Xitong Ling, Xuenian Wang, Zihan Wang, Qiang Huang, Shengyi Hua, Mianxin Liu, Lei Ma, Tian Shen, Xiaofan Zhang, Yonghong He, Hao Chen, Shaoting Zhang, Zhe Wang
Title: PathOrchestra: A Comprehensive Foundation Model for Computational Pathology with Over 100 Diverse Clinical-Grade Tasks
Abstract:
The complexity and variability inherent in high-resolution pathological images present significant challenges in computational pathology. While pathology foundation models leveraging AI have catalyzed transformative advancements, their development demands large-scale datasets, considerable storage capacity, and substantial computational resources. Furthermore, ensuring their clinical applicability and generalizability requires rigorous validation across a broad spectrum of clinical tasks. Here, we present PathOrchestra, a versatile pathology foundation model trained via self-supervised learning on a dataset comprising 300K pathological slides from 20 tissue and organ types across multiple centers. The model was rigorously evaluated on 112 clinical tasks using a combination of 61 private and 51 public datasets. These tasks encompass digital slide preprocessing, pan-cancer classification, lesion identification, multi-cancer subtype classification, biomarker assessment, gene expression prediction, and the generation of structured reports. PathOrchestra demonstrated exceptional performance across 27,755 WSIs and 9,415,729 ROIs, achieving over 0.950 accuracy in 47 tasks, including pan-cancer classification across various organs, lymphoma subtype diagnosis, and bladder cancer screening. Notably, it is the first model to generate structured reports for high-incidence colorectal cancer and diagnostically complex lymphoma-areas that are infrequently addressed by foundational models but hold immense clinical potential. Overall, PathOrchestra exemplifies the feasibility and efficacy of a large-scale, self-supervised pathology foundation model, validated across a broad range of clinical-grade tasks. Its high accuracy and reduced reliance on extensive data annotation underline its potential for clinical integration, offering a pathway toward more efficient and high-quality medical services.
中文: PathOrchestra是一种大规模、自监督的病理学基础模型,在112项临床任务中表现出卓越的准确性,涵盖癌症分类和结构化报告生成,减少了对数据标注的依赖,展现了临床应用的巨大潜力。
English: PathOrchestra is a large-scale, self-supervised pathology foundation model that demonstrates exceptional accuracy across 112 clinical tasks, including cancer classification and structured report generation, with reduced reliance on data annotation, showcasing strong potential for clinical integration.

Authors:Guanhua Chen, Yutong Yao, Ci-Jun Gao, Lidia S. Chao, Feng Wan, Derek F. Wong
Title: Not All LoRA Parameters Are Essential: Insights on Inference Necessity
Abstract:
Current research on LoRA primarily focuses on minimizing the number of fine-tuned parameters or optimizing its architecture. However, the necessity of all fine-tuned LoRA layers during inference remains underexplored. In this paper, we investigate the contribution of each LoRA layer to the model's ability to predict the ground truth and hypothesize that lower-layer LoRA modules play a more critical role in model reasoning and understanding. To address this, we propose a simple yet effective method to enhance the performance of large language models (LLMs) fine-tuned with LoRA. Specifically, we identify a ``boundary layer'' that distinguishes essential LoRA layers by analyzing a small set of validation samples. During inference, we drop all LoRA layers beyond this boundary. We evaluate our approach on three strong baselines across four widely-used text generation datasets. Our results demonstrate consistent and significant improvements, underscoring the effectiveness of selectively retaining critical LoRA layers during inference.
中文摘要:本研究提出一种方法,通过在推理过程中仅保留关键的LoRA层来增强经LoRA微调的大语言模型,该方法在多个数据集上显著提升了模型性能。
English Summary: This study proposes a method to enhance LoRA-fine-tuned LLMs by identifying and retaining only essential LoRA layers during inference, which significantly improves performance across multiple datasets.

Authors:Haibo Hu, Jiacheng Zuo, Yang Lou, Yufei Cui, Jianping Wang, Nan Guan, Jin Wang, Yung-Hui Li, Chun Jason Xue
Title: VLM-C4L: Continual Core Dataset Learning with Corner Case Optimization via Vision-Language Models for Autonomous Driving
Abstract:
With the widespread adoption and deployment of autonomous driving, handling complex environments has become an unavoidable challenge. Due to the scarcity and diversity of extreme scenario datasets, current autonomous driving models struggle to effectively manage corner cases. This limitation poses a significant safety risk, according to the National Highway Traffic Safety Administration (NHTSA), autonomous vehicle systems have been involved in hundreds of reported crashes annually in the United States, occurred in corner cases like sun glare and fog, which caused a few fatal accident. Furthermore, in order to consistently maintain a robust and reliable autonomous driving system, it is essential for models not only to perform well on routine scenarios but also to adapt to newly emerging scenarios, especially those corner cases that deviate from the norm. This requires a learning mechanism that incrementally integrates new knowledge without degrading previously acquired capabilities. However, to the best of our knowledge, no existing continual learning methods have been proposed to ensure consistent and scalable corner case learning in autonomous driving. To address these limitations, we propose VLM-C4L, a continual learning framework that introduces Vision-Language Models (VLMs) to dynamically optimize and enhance corner case datasets, and VLM-C4L combines VLM-guided high-quality data extraction with a core data replay strategy, enabling the model to incrementally learn from diverse corner cases while preserving performance on previously routine scenarios, thus ensuring long-term stability and adaptability in real-world autonomous driving. We evaluate VLM-C4L on large-scale real-world autonomous driving datasets, including Waymo and the corner case dataset CODA.
自动驾驶因极端场景数据稀缺而难以有效处理复杂环境,存在安全隐患,而提出的VLM-C4L框架结合视觉语言模型与持续学习机制,通过动态优化数据集来提升系统在真实驾驶中的长期适应性和稳定性。
Autonomous driving faces challenges in handling extreme scenarios due to limited corner case data, leading to safety risks, and the proposed VLM-C4L framework uses vision-language models and continual learning to enhance adaptability and stability in real-world conditions.

Authors:Zhiyu Yang, Shuo Wang, Yukun Yan, Yang Deng
Title: Why Stop at One Error? Benchmarking LLMs as Data Science Code Debuggers for Multi-Hop and Multi-Bug Errors
Abstract:
LLMs are transforming software development, yet current code generation and code repair benchmarks mainly assess syntactic and functional correctness in simple, single-error cases. LLMs' capabilities to autonomously find and fix runtime logical errors in complex data science code remain largely unexplored. To address this gap, we introduce DSDBench: the Data Science Debugging Benchmark, the first benchmark for systematic evaluation of LLMs on multi-hop error tracing and multi-bug detection in data science code debugging. DSDBench adapts datasets from existing data science task benchmarks, such as DABench and MatPlotBench, featuring realistic data science debugging tasks with automatically synthesized multi-hop, multi-bug code snippets. DSDBench includes 1,117 annotated samples with 741 cause-effect error pairs and runtime error messages. Evaluations of state-of-the-art LLMs on DSDBench show significant performance gaps, highlighting challenges in debugging logical runtime errors in data science code. DSDBench offers a crucial resource to evaluate and improve LLMs' debugging and reasoning capabilities, enabling more reliable AI-assisted data science in the future. DSDBench is publicly available at github.com/KevinCL16/DSDBench.
中文摘要:DSDBench是首个系统性评估大语言模型在数据科学代码调试中多步错误追踪和多错误检测能力的基准测试,揭示了当前模型在修复复杂逻辑运行时错误方面存在显著不足。
English Summary: DSDBench is the first benchmark designed to systematically evaluate LLMs' ability to debug complex data science code by detecting multi-hop errors and multiple bugs, revealing significant performance gaps in current models.

Authors:Nan Huang, Wenzhao Zheng, Chenfeng Xu, Kurt Keutzer, Shanghang Zhang, Angjoo Kanazawa, Qianqian Wang
Title: Segment Any Motion in Videos
Abstract:
Moving object segmentation is a crucial task for achieving a high-level understanding of visual scenes and has numerous downstream applications. Humans can effortlessly segment moving objects in videos. Previous work has largely relied on optical flow to provide motion cues; however, this approach often results in imperfect predictions due to challenges such as partial motion, complex deformations, motion blur and background distractions. We propose a novel approach for moving object segmentation that combines long-range trajectory motion cues with DINO-based semantic features and leverages SAM2 for pixel-level mask densification through an iterative prompting strategy. Our model employs Spatio-Temporal Trajectory Attention and Motion-Semantic Decoupled Embedding to prioritize motion while integrating semantic support. Extensive testing on diverse datasets demonstrates state-of-the-art performance, excelling in challenging scenarios and fine-grained segmentation of multiple objects. Our code is available at https://motion-seg.github.io/.
Chinese: 该模型结合长程轨迹运动线索与基于DINO的语义特征,并利用SAM2通过迭代提示策略实现像素级掩码优化,在多种复杂场景下的运动物体分割任务中取得了领先性能。
English: The proposed model integrates long-range trajectory motion cues with DINO-based semantic features and utilizes SAM2 for iterative mask refinement, achieving state-of-the-art performance in moving object segmentation across diverse challenging scenarios.

Authors:Zongyuan Zhang, Tianyang Duan, Zheng Lin, Dong Huang, Zihan Fang, Zekai Sun, Ling Xiong, Hongbin Liang, Heming Cui, Yong Cui, Yue Gao
Title: Robust Deep Reinforcement Learning in Robotics via Adaptive Gradient-Masked Adversarial Attacks
Abstract:
Deep reinforcement learning (DRL) has emerged as a promising approach for robotic control, but its realworld deployment remains challenging due to its vulnerability to environmental perturbations. Existing white-box adversarial attack methods, adapted from supervised learning, fail to effectively target DRL agents as they overlook temporal dynamics and indiscriminately perturb all state dimensions, limiting their impact on long-term rewards. To address these challenges, we propose the Adaptive Gradient-Masked Reinforcement (AGMR) Attack, a white-box attack method that combines DRL with a gradient-based soft masking mechanism to dynamically identify critical state dimensions and optimize adversarial policies. AGMR selectively allocates perturbations to the most impactful state features and incorporates a dynamic adjustment mechanism to balance exploration and exploitation during training. Extensive experiments demonstrate that AGMR outperforms state-of-the-art adversarial attack methods in degrading the performance of the victim agent and enhances the victim agent's robustness through adversarial defense mechanisms.
中文: AGMR攻击方法通过动态识别关键状态维度和优化对抗策略,有效针对深度强化学习智能体,在降低其性能并通过对抗防御增强鲁棒性方面优于现有方法。
English: The AGMR attack method effectively targets deep reinforcement learning agents by dynamically identifying critical state dimensions and optimizing adversarial policies, outperforming existing methods in degrading performance and enhancing robustness through adversarial defense.

Authors:Zongyuan Zhang, Tianyang Duan, Zheng Lin, Dong Huang, Zihan Fang, Zekai Sun, Ling Xiong, Hongbin Liang, Heming Cui, Yong Cui
Title: State-Aware Perturbation Optimization for Robust Deep Reinforcement Learning
Abstract:
Recently, deep reinforcement learning (DRL) has emerged as a promising approach for robotic control. However, the deployment of DRL in real-world robots is hindered by its sensitivity to environmental perturbations. While existing whitebox adversarial attacks rely on local gradient information and apply uniform perturbations across all states to evaluate DRL robustness, they fail to account for temporal dynamics and state-specific vulnerabilities. To combat the above challenge, we first conduct a theoretical analysis of white-box attacks in DRL by establishing the adversarial victim-dynamics Markov decision process (AVD-MDP), to derive the necessary and sufficient conditions for a successful attack. Based on this, we propose a selective state-aware reinforcement adversarial attack method, named STAR, to optimize perturbation stealthiness and state visitation dispersion. STAR first employs a soft mask-based state-targeting mechanism to minimize redundant perturbations, enhancing stealthiness and attack effectiveness. Then, it incorporates an information-theoretic optimization objective to maximize mutual information between perturbations, environmental states, and victim actions, ensuring a dispersed state-visitation distribution that steers the victim agent into vulnerable states for maximum return reduction. Extensive experiments demonstrate that STAR outperforms state-of-the-art benchmarks.
中文: 深度强化学习在现实机器人部署中因环境敏感性受阻,为此提出了STAR方法,这是一种选择性状态感知攻击技术,通过优化扰动隐蔽性和状态访问分散性,有效降低智能体性能。
English: Deep reinforcement learning faces challenges in real-world robotic deployment due to environmental sensitivity, prompting the development of STAR, a selective state-aware attack method that optimizes perturbation stealthiness and state visitation dispersion to effectively reduce agent performance.

Authors:Francesco Micheli, Efe C. Balta, Anastasios Tsiamis, John Lygeros
Title: Wasserstein Distributionally Robust Bayesian Optimization with Continuous Context
Abstract:
We address the challenge of sequential data-driven decision-making under context distributional uncertainty. This problem arises in numerous real-world scenarios where the learner optimizes black-box objective functions in the presence of uncontrollable contextual variables. We consider the setting where the context distribution is uncertain but known to lie within an ambiguity set defined as a ball in the Wasserstein distance. We propose a novel algorithm for Wasserstein Distributionally Robust Bayesian Optimization that can handle continuous context distributions while maintaining computational tractability. Our theoretical analysis combines recent results in self-normalized concentration in Hilbert spaces and finite-sample bounds for distributionally robust optimization to establish sublinear regret bounds that match state-of-the-art results. Through extensive comparisons with existing approaches on both synthetic and real-world problems, we demonstrate the simplicity, effectiveness, and practical applicability of our proposed method.
中文: 本文提出了一种计算高效的Wasserstein分布鲁棒贝叶斯优化算法,解决了上下文分布不确定下的序列决策问题,并通过理论分析和实验验证了其优越性能。
English: This paper introduces a computationally efficient algorithm for Wasserstein distributionally robust Bayesian optimization, which addresses sequential decision-making under uncertain context distributions and achieves sublinear regret with theoretical guarantees.

Authors:Bohan Zhai, Canwen Xu, Yuxiong He, Zhewei Yao
Title: ExCoT: Optimizing Reasoning for Text-to-SQL with Execution Feedback
Abstract:
Text-to-SQL demands precise reasoning to convert natural language questions into structured queries. While large language models (LLMs) excel in many reasoning tasks, their ability to leverage Chain-of-Thought (CoT) reasoning for text-to-SQL remains underexplored. We identify critical limitations: zero-shot CoT offers minimal gains, and Direct Preference Optimization (DPO) applied without CoT yields marginal improvements. We propose ExCoT, a novel framework that iteratively optimizes open-source LLMs by combining CoT reasoning with off-policy and on-policy DPO, relying solely on execution accuracy as feedback. This approach eliminates the need for reward models or human-annotated preferences. Our experimental results demonstrate significant performance gains: ExCoT improves execution accuracy on BIRD dev set from 57.37% to 68.51% and on Spider test set from 78.81% to 86.59% for LLaMA-3 70B, with Qwen-2.5-Coder demonstrating similar improvements. Our best model achieves state-of-the-art performance in the single-model setting on both BIRD and Spider datasets, notably achieving 68.53% on the BIRD test set.
中文摘要:提出的ExCoT框架通过将思维链推理与直接偏好优化相结合,无需奖励模型或人工标注即可显著提升文本转SQL性能,在BIRD和Spider数据集上实现了最先进的成果。
English summary: The proposed ExCoT framework significantly enhances text-to-SQL performance by combining Chain-of-Thought reasoning with Direct Preference Optimization, achieving state-of-the-art results on BIRD and Spider datasets without requiring reward models or human annotations.

Authors:Riccardo Zuliani, Efe C. Balta, Alisa Rupenyan, John Lygeros
Title: Iterative Learning Predictive Control for Constrained Uncertain Systems
Abstract:
Iterative learning control (ILC) improves the performance of a repetitive system by learning from previous trials. ILC can be combined with Model Predictive Control (MPC) to mitigate non-repetitive disturbances, thus improving overall system performance. However, existing approaches either assume perfect model knowledge or fail to actively learn system uncertainties, leading to conservativeness. To address these limitations we propose a binary mixed-integer ILC scheme, combined with a convex MPC scheme, that ensures robust constraint satisfaction, non-increasing nominal cost, and convergence to optimal performance. Our scheme is designed for uncertain nonlinear systems subject to both bounded additive stochastic noise and additive uncertain components. We showcase the benefits of our scheme in simulation.
中文: 提出的二进制混合整数迭代学习控制与凸模型预测控制相结合,确保了对带有干扰的不确定非线性系统的鲁棒约束满足和最优性能收敛。
English: The proposed binary mixed-integer iterative learning control combined with convex model predictive control ensures robust constraint satisfaction and convergence to optimal performance for uncertain nonlinear systems with disturbances.

Authors:Runze Cheng, Yao Sun, Lan Zhang, Lei Feng, Lei Zhang, Muhammad Ali Imran
Title: A semantic communication-based workload-adjustable transceiver for wireless AI-generated content (AIGC) delivery
Abstract:
With the significant advances in generative AI (GAI) and the proliferation of mobile devices, providing high-quality AI-generated content (AIGC) services via wireless networks is becoming the future direction. However, the primary challenges of AIGC service delivery in wireless networks lie in unstable channels, limited bandwidth resources, and unevenly distributed computational resources. In this paper, we employ semantic communication (SemCom) in diffusion-based GAI models to propose a Resource-aware wOrkload-adjUstable TransceivEr (ROUTE) for AIGC delivery in dynamic wireless networks. Specifically, to relieve the communication resource bottleneck, SemCom is utilized to prioritize semantic information of the generated content. Then, to improve computational resource utilization in both edge and local and reduce AIGC semantic distortion in transmission, modified diffusion-based models are applied to adjust the computing workload and semantic density in cooperative content generation. Simulations verify the superiority of our proposed ROUTE in terms of latency and content quality compared to conventional AIGC approaches.
中文摘要:本文提出ROUTE方案,通过语义通信与扩散模型结合,在无线网络中动态优化通信和计算资源,以降低AIGC传输延迟并提升内容质量。
English Summary: This paper introduces ROUTE, a resource-aware transceiver using semantic communication in diffusion models to enhance AIGC delivery in wireless networks by optimizing communication and computational resources for reduced latency and improved content quality.

Authors:Christopher Ummerle, Antonio Giganti, Sara Mandelli, Paolo Bestagini, Stefano Tubaro
Title: Leveraging Land Cover Priors for Isoprene Emission Super-Resolution
Abstract:
Remote sensing plays a crucial role in monitoring Earth's ecosystems, yet satellite-derived data often suffer from limited spatial resolution, restricting their applicability in atmospheric modeling and climate research. In this work, we propose a deep learning-based Super-Resolution (SR) framework that leverages land cover information to enhance the spatial accuracy of Biogenic Volatile Organic Compounds (BVOCs) emissions, with a particular focus on isoprene. Our approach integrates land cover priors as emission drivers, capturing spatial patterns more effectively than traditional methods. We evaluate the model's performance across various climate conditions and analyze statistical correlations between isoprene emissions and key environmental information such as cropland and tree cover data. Additionally, we assess the generalization capabilities of our SR model by applying it to unseen climate zones and geographical regions. Experimental results demonstrate that incorporating land cover data significantly improves emission SR accuracy, particularly in heterogeneous landscapes. This study contributes to atmospheric chemistry and climate modeling by providing a cost-effective, data-driven approach to refining BVOC emission maps. The proposed method enhances the usability of satellite-based emissions data, supporting applications in air quality forecasting, climate impact assessments, and environmental studies.
中文: 本研究提出了一种基于深度学习的超分辨率框架,通过整合土地覆盖数据显著提高了生物挥发性有机化合物(尤其是异戊二烯)排放的空间精度,增强了其在大气和气候研究中的实用性。
English: This study introduces a deep learning-based super-resolution framework that integrates land cover data to significantly enhance the spatial accuracy of biogenic volatile organic compound emissions, particularly isoprene, improving their applicability in atmospheric and climate research.

Authors:Handong Li, Yiyuan Zhang, Longteng Guo, Xiangyu Yue, Jing Liu
Title: Breaking the Encoder Barrier for Seamless Video-Language Understanding
Abstract:
Most Video-Large Language Models (Video-LLMs) adopt an encoder-decoder framework, where a vision encoder extracts frame-wise features for processing by a language model. However, this approach incurs high computational costs, introduces resolution biases, and struggles to capture fine-grained multimodal interactions. To overcome these limitations, we propose ELVA, an encoder-free Video-LLM that directly models nuanced video-language interactions without relying on a vision encoder. ELVA employs token merging to construct a bottom-up hierarchical representation and incorporates a video guidance supervisor for direct spatiotemporal representation learning. Additionally, a hybrid-resolution mechanism strategically integrates high- and low-resolution frames as inputs to achieve an optimal balance between performance and efficiency. With only 7M publicly available video-text pairs, ELVA achieves performance on par with encoder-based Video-LLMs while reducing FLOPs by up to 95\% and inference latency by 92\%, offering a scalable and efficient solution for real-time video understanding.
Chinese: ELVA是一种无需编码器的视频大语言模型,通过令牌合并和混合分辨率输入直接建模视频-语言交互,仅用700万训练数据即可达到基于编码器模型的性能,同时将计算成本降低高达95%。
English: ELVA is an encoder-free Video-LLM that directly models video-language interactions using token merging and hybrid-resolution inputs, achieving performance comparable to encoder-based models while reducing computational costs by up to 95% with only 7M training pairs.

Authors:Yu Mao, Jun Wang, Nan Guan, Chun Jason Xue
Title: WISE: A Framework for Gigapixel Whole-Slide-Image Lossless Compression
Abstract:
Whole-Slide Images (WSIs) have revolutionized medical analysis by presenting high-resolution images of the whole tissue slide. Despite avoiding the physical storage of the slides, WSIs require considerable data volume, which makes the storage and maintenance of WSI records costly and unsustainable. To this end, this work presents the first investigation of lossless compression of WSI images. Interestingly, we find that most existing compression methods fail to compress the WSI images effectively. Furthermore, our analysis reveals that the failure of existing compressors is mainly due to information irregularity in WSI images. To resolve this issue, we developed a simple yet effective lossless compressor called WISE, specifically designed for WSI images. WISE employs a hierarchical encoding strategy to extract effective bits, reducing the entropy of the image and then adopting a dictionary-based method to handle the irregular frequency patterns. Through extensive experiments, we show that WISE can effectively compress the gigapixel WSI images to 36 times on average and up to 136 times.
中文: 全玻片图像(WSI)因数据量大导致存储成本高昂,为此开发了无损压缩器WISE,它采用分层编码和字典方法,能将图像平均压缩36倍,最高达136倍。
English: Whole-Slide Images (WSIs) present significant storage challenges due to their large data volume, leading to the development of WISE, a lossless compressor that effectively reduces WSI size by up to 136 times through hierarchical encoding and dictionary-based methods.

Authors:Yan Kyaw Tun, Nway Nway Ei, Sheikh Salman Hassan, Cedomir Stefanovic, Nguyen Van Huynh, Madyan Alsenwi, Choong Seon Hong
Title: Joint Beamforming and Trajectory Optimization for Multi-UAV-Assisted Integrated Sensing and Communication Systems
Abstract:
In this paper, we investigate beamforming design and trajectory optimization for a multi-unmanned aerial vehicle (UAV)-assisted integrated sensing and communication (ISAC) system. The proposed system employs multiple UAVs equipped with dual-functional radar-communication capabilities to simultaneously perform target sensing and provide communication services to users. We formulate a joint optimization problem that aims to maximize the sum rate of users while maintaining target sensing performance through coordinated beamforming and UAV trajectory design. To address this challenging non-convex problem, we develop a block coordinated descent (BCD)-based iterative algorithm that decomposes the original problem into tractable subproblems. Then, the beamforming design problem is addressed using fractional programming, while the UAV trajectory is refined through the deep deterministic policy gradient (DDPG) algorithm. The simulation results demonstrate that the proposed joint optimization approach achieves significant performance improvements in both communication throughput and sensing accuracy compared to conventional, separated designs. We also show that proper coordination of multiple UAVs through optimized trajectories and beamforming patterns can effectively balance the tradeoff between sensing and communication objectives.
中文: 本文针对多无人机综合感知与通信系统,提出了一种联合波束成形与轨迹优化框架,通过基于块坐标下降的算法有效提升了通信吞吐量和感知精度,并平衡了二者之间的性能权衡。
English: This paper develops a joint beamforming and trajectory optimization framework for multi-UAV ISAC systems, employing BCD-based algorithms to enhance both communication throughput and sensing accuracy while balancing their tradeoffs.

Authors:Long Yuan, Fengran Mo, Kaiyu Huang, Wenjie Wang, Wangyuxuan Zhai, Xiaoyu Zhu, You Li, Jinan Xu, Jian-Yun Nie
Title: OmniGeo: Towards a Multimodal Large Language Models for Geospatial Artificial Intelligence
Abstract:
The rapid advancement of multimodal large language models (LLMs) has opened new frontiers in artificial intelligence, enabling the integration of diverse large-scale data types such as text, images, and spatial information. In this paper, we explore the potential of multimodal LLMs (MLLM) for geospatial artificial intelligence (GeoAI), a field that leverages spatial data to address challenges in domains including Geospatial Semantics, Health Geography, Urban Geography, Urban Perception, and Remote Sensing. We propose a MLLM (OmniGeo) tailored to geospatial applications, capable of processing and analyzing heterogeneous data sources, including satellite imagery, geospatial metadata, and textual descriptions. By combining the strengths of natural language understanding and spatial reasoning, our model enhances the ability of instruction following and the accuracy of GeoAI systems. Results demonstrate that our model outperforms task-specific models and existing LLMs on diverse geospatial tasks, effectively addressing the multimodality nature while achieving competitive results on the zero-shot geospatial tasks. Our code will be released after publication.
中文: 本文提出了OmniGeo,一种专为地理空间人工智能设计的多模态大语言模型,它融合文本、图像和空间数据,提升了准确性,并在多种地理空间任务中超越了现有模型。
English: This paper introduces OmniGeo, a multimodal large language model designed for geospatial AI that integrates text, images, and spatial data to improve accuracy and outperform existing models in various geospatial tasks.

Authors:Matteo Cinelli, Stefano Cresci, Walter Quattrociocchi, Maurizio Tesconi, Paola Zola
Title: Coordinated Inauthentic Behavior and Information Spreading on Twitter
Abstract:
We explore the effects of coordinated users (i.e., users characterized by an unexpected, suspicious, or exceptional similarity) in information spreading on Twitter by quantifying the efficacy of their tactics in deceiving feed algorithms to maximize information outreach. In particular, we investigate the behavior of coordinated accounts within a large set of retweet-based information cascades identifying key differences between coordinated and non-coordinated accounts in terms of position within the cascade, action delay and outreach. On average, coordinated accounts occupy higher positions of the information cascade (i.e., closer to the root), spread messages faster and involve a slightly higher number of users. When considering cascade metrics such as size, number of edges and height, we observe clear differences among information cascades that are associated to a systematically larger proportion of coordinated accounts, as confirmed by comparisons with statistical null models. To further characterize the activity of coordinated accounts we introduce two new measures capturing their infectivity within the information cascade (i.e., their ability to involve other users) and their interaction with non-coordinated accounts. Finally, we find that the interaction pattern between the two classes of users follows a saturation-like process. A larger-scale targeting of non-coordinated users does not require a larger amount of coordinated accounts after a threshold value approximately 50%, after which involving more coordinated accounts within a cascade yields a null marginal effect. Our results contribute to shed light on the role of coordinated accounts and their effect on information diffusion.
中文: 协调用户在推特转发信息流中占据更高位置,传播速度更快且覆盖面更广,当其比例超过约50%时对非协调用户的影响力达到饱和,揭示了算法操纵对信息扩散的显著影响。
English: Coordinated Twitter users strategically position themselves higher in retweet cascades, spreading information faster and with greater outreach by exploiting algorithmic vulnerabilities, with their influence saturating once they exceed about 50% of a cascade's participants.

Authors:Sarah Seifi, Tobias Sukianto, Cecilia Carbonelli, Lorenzo Servadei, Robert Wille
Title: Complying with the EU AI Act: Innovations in Explainable and User-Centric Hand Gesture Recognition
Abstract:
The EU AI Act underscores the importance of transparency, user-centricity, and robustness in AI systems, particularly for high-risk systems. In response, we present advancements in XentricAI, an explainable hand gesture recognition (HGR) system designed to meet these regulatory requirements. XentricAI adresses fundamental challenges in HGR, such as the opacity of black-box models using explainable AI methods and the handling of distributional shifts in real-world data through transfer learning techniques. We extend an existing radar-based HGR dataset by adding 28,000 new gestures, with contributions from multiple users across varied locations, including 24,000 out-of-distribution gestures. Leveraging this real-world dataset, we enhance XentricAI's capabilities by integrating a variational autoencoder module for improved gesture anomaly detection, incorporating user-specific thresholding. This integration enables the identification of 11.50% more anomalous gestures. Our extensive evaluations demonstrate a 97.5% sucess rate in characterizing these anomalies, significantly improving system explainability. Furthermore, the implementation of transfer learning techniques has shown a substantial increase in user adaptability, with an average improvement of at least 15.17%. This work contributes to the development of trustworthy AI systems by providing both technical advancements and regulatory compliance, offering a commercially viable solution that aligns with the EU AI Act requirements.
中文: 欧盟AI法案强调AI系统的透明度和稳健性,XentricAI作为可解释的手势识别系统,通过改进异常检测和用户适应性,满足了法规要求并提供了商业可行的解决方案。
English: The EU AI Act promotes transparency and robustness in AI, leading to the development of XentricAI, an explainable hand gesture recognition system that enhances anomaly detection and user adaptability through advanced methods, ensuring regulatory compliance and commercial viability.

Authors:Chengran Yang, Zhensu Sun, Hong Jin Kang, Jieke Shi, David Lo
Title: Think Like Human Developers: Harnessing Community Knowledge for Structured Code Reasoning
Abstract:
Large Language Models (LLMs) have significantly advanced automated code generation, yet they struggle with complex coding tasks requiring multi-step logical reasoning. High-quality reasoning data is crucial for improving LLMs' reasoning capabilities, but such datasets remain scarce. Existing approaches either rely on computationally expensive reinforcement learning (RL) or error-prone reasoning chains synthesized by LLMs, posing challenges in scalability and accuracy. To address this challenge, we propose SVRC (Structured and Validated Reasoning Chains for Code Generation), a novel framework that mines, restructures, and enriches reasoning chains from community-driven discussions on software engineering platforms. SVRC refines unstructured and incomplete discussions of coding problems by aligning them with Software Development Life Cycle (SDLC) principles, ensuring that reasoning chains capture real-world problem-solving strategies and support iterative refinement. To evaluate the effectiveness of SVRC, we introduce CodeThinker, an LLM fine-tuned on 12,444 reasoning-augmented samples generated by SVRC. Experiments on LiveCodeBench show that CodeThinker surpasses its base model by 42.86\% on medium-level code problems in terms of pass@1 and outperforms GPT-4o-mini and GPT-4o by 73.14\% and 115.86\%, respectively. Our ablation study further highlights that each component of SVRC contributes to the reasoning capabilities of CodeThinker.
中文: SVRC框架通过将社区讨论重构为经过验证的推理链来增强大语言模型的代码生成能力,其微调模型CodeThinker在复杂编程任务中显著优于基础模型及GPT系列模型。
English: The SVRC framework enhances LLMs' code generation by structuring community-driven discussions into validated reasoning chains, leading to CodeThinker—a fine-tuned model that significantly outperforms base models and GPT variants on complex coding tasks.

Authors:Subhadeep Koley, Tapas Kumar Dutta, Aneeshan Sain, Pinaki Nath Chowdhury, Ayan Kumar Bhunia, Yi-Zhe Song
Title: SketchFusion: Learning Universal Sketch Features through Fusing Foundation Models
Abstract:
While foundation models have revolutionised computer vision, their effectiveness for sketch understanding remains limited by the unique challenges of abstract, sparse visual inputs. Through systematic analysis, we uncover two fundamental limitations: Stable Diffusion (SD) struggles to extract meaningful features from abstract sketches (unlike its success with photos), and exhibits a pronounced frequency-domain bias that suppresses essential low-frequency components needed for sketch understanding. Rather than costly retraining, we address these limitations by strategically combining SD with CLIP, whose strong semantic understanding naturally compensates for SD's spatial-frequency biases. By dynamically injecting CLIP features into SD's denoising process and adaptively aggregating features across semantic levels, our method achieves state-of-the-art performance in sketch retrieval (+3.35%), recognition (+1.06%), segmentation (+29.42%), and correspondence learning (+21.22%), demonstrating the first truly universal sketch feature representation in the era of foundation models.
中文: 本研究通过将稳定扩散模型与CLIP进行协同整合,无需重新训练即可在多项草图任务中实现最优性能,从而克服了基础模型在草图理解方面的局限性。
English: This study overcomes the limitations of foundation models in sketch understanding by synergistically integrating Stable Diffusion with CLIP, achieving state-of-the-art performance across multiple sketch tasks without retraining.

Authors:Siqi Zhang, Yanyuan Qiao, Qunbo Wang, Longteng Guo, Zhihua Wei, Jing Liu
Title: FlexVLN: Flexible Adaptation for Diverse Vision-and-Language Navigation Tasks
Abstract:
The aspiration of the Vision-and-Language Navigation (VLN) task has long been to develop an embodied agent with robust adaptability, capable of seamlessly transferring its navigation capabilities across various tasks. Despite remarkable advancements in recent years, most methods necessitate dataset-specific training, thereby lacking the capability to generalize across diverse datasets encompassing distinct types of instructions. Large language models (LLMs) have demonstrated exceptional reasoning and generalization abilities, exhibiting immense potential in robot action planning. In this paper, we propose FlexVLN, an innovative hierarchical approach to VLN that integrates the fundamental navigation ability of a supervised-learning-based Instruction Follower with the robust generalization ability of the LLM Planner, enabling effective generalization across diverse VLN datasets. Moreover, a verification mechanism and a multi-model integration mechanism are proposed to mitigate potential hallucinations by the LLM Planner and enhance execution accuracy of the Instruction Follower. We take REVERIE, SOON, and CVDN-target as out-of-domain datasets for assessing generalization ability. The generalization performance of FlexVLN surpasses that of all the previous methods to a large extent.
中文摘要:FlexVLN提出了一种分层方法,将基于监督学习的导航能力与大型语言模型的规划能力相结合,在视觉与语言导航任务中实现了卓越的跨数据集泛化性能。
English Summary: FlexVLN introduces a hierarchical approach combining supervised learning for navigation with large language models for planning, achieving superior cross-dataset generalization in vision-and-language navigation tasks.

Authors:Mufan Liu, Qi Yang, He Huang, Wenjie Huang, Zhenlong Yuan, Zhu Li, Yiling Xu
Title: Light4GS: Lightweight Compact 4D Gaussian Splatting Generation via Context Model
Abstract:
3D Gaussian Splatting (3DGS) has emerged as an efficient and high-fidelity paradigm for novel view synthesis. To adapt 3DGS for dynamic content, deformable 3DGS incorporates temporally deformable primitives with learnable latent embeddings to capture complex motions. Despite its impressive performance, the high-dimensional embeddings and vast number of primitives lead to substantial storage requirements. In this paper, we introduce a \textbf{Light}weight \textbf{4}D\textbf{GS} framework, called Light4GS, that employs significance pruning with a deep context model to provide a lightweight storage-efficient dynamic 3DGS representation. The proposed Light4GS is based on 4DGS that is a typical representation of deformable 3DGS. Specifically, our framework is built upon two core components: (1) a spatio-temporal significance pruning strategy that eliminates over 64\% of the deformable primitives, followed by an entropy-constrained spherical harmonics compression applied to the remainder; and (2) a deep context model that integrates intra- and inter-prediction with hyperprior into a coarse-to-fine context structure to enable efficient multiscale latent embedding compression. Our approach achieves over 120x compression and increases rendering FPS up to 20\% compared to the baseline 4DGS, and also superior to frame-wise state-of-the-art 3DGS compression methods, revealing the effectiveness of our Light4GS in terms of both intra- and inter-prediction methods without sacrificing rendering quality.
中文:Light4GS是一种轻量级4D高斯泼溅框架,通过剪枝基元与压缩潜在嵌入,实现了超过120倍的压缩和20%的渲染加速,且不损失画质。
English: Light4GS is a lightweight 4D Gaussian Splatting framework that achieves over 120x compression and 20% faster rendering by pruning primitives and compressing latent embeddings, without sacrificing quality.

Authors:Zijia Zhao, Yuqi Huo, Tongtian Yue, Longteng Guo, Haoyu Lu, Bingning Wang, Weipeng Chen, Jing Liu
Title: Efficient Motion-Aware Video MLLM
Abstract:
Most current video MLLMs rely on uniform frame sampling and image-level encoders, resulting in inefficient data processing and limited motion awareness. To address these challenges, we introduce EMA, an Efficient Motion-Aware video MLLM that utilizes compressed video structures as inputs. We propose a motion-aware GOP (Group of Pictures) encoder that fuses spatial and motion information within a GOP unit in the compressed video stream, generating compact, informative visual tokens. By integrating fewer but denser RGB frames with more but sparser motion vectors in this native slow-fast input architecture, our approach reduces redundancy and enhances motion representation. Additionally, we introduce MotionBench, a benchmark for evaluating motion understanding across four motion types: linear, curved, rotational, and contact-based. Experimental results show that EMA achieves state-of-the-art performance on both MotionBench and popular video question answering benchmarks, while reducing inference costs. Moreover, EMA demonstrates strong scalability, as evidenced by its competitive performance on long video understanding benchmarks.
中文: 提出的EMA模型通过采用运动感知GOP编码器融合压缩视频流中的空间与运动信息,在降低推理成本的同时,在多项基准测试中实现了最优性能,显著提升了视频处理效率与运动感知能力。
English: The proposed EMA model enhances video processing efficiency and motion awareness by using a motion-aware GOP encoder that integrates spatial and motion information from compressed video streams, achieving top performance on benchmarks while reducing inference costs.

Authors:Guangqian Guo, Yong Guo, Xuehui Yu, Wenbo Li, Yaoxing Wang, Shan Gao
Title: Segment Any-Quality Images with Generative Latent Space Enhancement
Abstract:
Despite their success, Segment Anything Models (SAMs) experience significant performance drops on severely degraded, low-quality images, limiting their effectiveness in real-world scenarios. To address this, we propose GleSAM, which utilizes Generative Latent space Enhancement to boost robustness on low-quality images, thus enabling generalization across various image qualities. Specifically, we adapt the concept of latent diffusion to SAM-based segmentation frameworks and perform the generative diffusion process in the latent space of SAM to reconstruct high-quality representation, thereby improving segmentation. Additionally, we introduce two techniques to improve compatibility between the pre-trained diffusion model and the segmentation framework. Our method can be applied to pre-trained SAM and SAM2 with only minimal additional learnable parameters, allowing for efficient optimization. We also construct the LQSeg dataset with a greater diversity of degradation types and levels for training and evaluating the model. Extensive experiments demonstrate that GleSAM significantly improves segmentation robustness on complex degradations while maintaining generalization to clear images. Furthermore, GleSAM also performs well on unseen degradations, underscoring the versatility of our approach and dataset.
中文摘要:GleSAM通过生成式潜在空间增强技术提升分割一切模型在低质量图像上的鲁棒性,仅需少量额外参数即可适应多种图像退化情况并保持泛化能力。
English Summary: GleSAM enhances Segment Anything Models' robustness on low-quality images through generative latent space enhancement, maintaining performance across diverse degradations with minimal added parameters.

Authors:Mayank Kumar, Jiaqi Xue, Mengxin Zheng, Qian Lou
Title: TFHE-Coder: Evaluating LLM-agentic Fully Homomorphic Encryption Code Generation
Abstract:
Fully Homomorphic Encryption over the torus (TFHE) enables computation on encrypted data without decryption, making it a cornerstone of secure and confidential computing. Despite its potential in privacy preserving machine learning, secure multi party computation, private blockchain transactions, and secure medical diagnostics, its adoption remains limited due to cryptographic complexity and usability challenges. While various TFHE libraries and compilers exist, practical code generation remains a hurdle. We propose a compiler integrated framework to evaluate LLM inference and agentic optimization for TFHE code generation, focusing on logic gates and ReLU activation. Our methodology assesses error rates, compilability, and structural similarity across open and closedsource LLMs. Results highlight significant limitations in off-the-shelf models, while agentic optimizations such as retrieval augmented generation (RAG) and few-shot prompting reduce errors and enhance code fidelity. This work establishes the first benchmark for TFHE code generation, demonstrating how LLMs, when augmented with domain-specific feedback, can bridge the expertise gap in FHE code generation.
中文:该研究提出了一种编译器集成框架,用于评估TFHE代码生成的LLM推理与智能体优化,结果表明尽管现成模型存在局限,但通过检索增强生成和小样本提示等优化技术能显著降低错误率并提升代码保真度,首次建立了弥合全同态加密代码生成专业差距的基准。
English: The proposed compiler-integrated framework evaluates LLM inference and agentic optimization for TFHE code generation, showing that while standard models struggle, techniques like RAG and few-shot prompting significantly improve error rates and code fidelity, establishing the first benchmark for bridging the FHE expertise gap.

Authors:Jun-Gi Jang, Jingrui He, Andrew Margenot, Hanghang Tong
Title: Tensor Convolutional Network for Higher-Order Interaction Prediction in Sparse Tensors
Abstract:
Many real-world data, such as recommendation data and temporal graphs, can be represented as incomplete sparse tensors where most entries are unobserved. For such sparse tensors, identifying the top-k higher-order interactions that are most likely to occur among unobserved ones is crucial. Tensor factorization (TF) has gained significant attention in various tensor-based applications, serving as an effective method for finding these top-k potential interactions. However, existing TF methods primarily focus on effectively fusing latent vectors of entities, which limits their expressiveness. Since most entities in sparse tensors have only a few interactions, their latent representations are often insufficiently trained. In this paper, we propose TCN, an accurate and compatible tensor convolutional network that integrates seamlessly with existing TF methods for predicting higher-order interactions. We design a highly effective encoder to generate expressive latent vectors of entities. To achieve this, we propose to (1) construct a graph structure derived from a sparse tensor and (2) develop a relation-aware encoder, TCN, that learns latent representations of entities by leveraging the graph structure. Since TCN complements traditional TF methods, we seamlessly integrate TCN with existing TF methods, enhancing the performance of predicting top-k interactions. Extensive experiments show that TCN integrated with a TF method outperforms competitors, including TF methods and a hyperedge prediction method. Moreover, TCN is broadly compatible with various TF methods and GNNs (Graph Neural Networks), making it a versatile solution.
Chinese: 本文提出TCN,一种张量卷积网络,通过从稀疏数据构建图结构并利用关系感知编码器,增强现有张量分解方法,以更准确地预测高阶交互中的top-k结果。
English: The paper introduces TCN, a tensor convolutional network that enhances existing tensor factorization methods by constructing a graph from sparse data and using a relation-aware encoder to improve the prediction of top-k higher-order interactions.

Authors:Ehsan Latif, Xiaoming Zhai
Title: Privacy-Preserved Automated Scoring using Federated Learning for Educational Research
Abstract:
Data privacy remains a critical concern in educational research, requiring strict adherence to ethical standards and regulatory protocols. While traditional approaches rely on anonymization and centralized data collection, they often expose raw student data to security vulnerabilities and impose substantial logistical overhead. In this study, we propose a federated learning (FL) framework for automated scoring of educational assessments that eliminates the need to share sensitive data across institutions. Our approach leverages parameter-efficient fine-tuning of large language models (LLMs) with Low-Rank Adaptation (LoRA), enabling each client (school) to train locally while sharing only optimized model updates. To address data heterogeneity, we implement an adaptive weighted aggregation strategy that considers both client performance and data volume. We benchmark our model against two state-of-the-art FL methods and a centralized learning baseline using NGSS-aligned multi-label science assessment data from nine middle schools. Results show that our model achieves the highest accuracy (94.5%) among FL approaches, and performs within 0.5-1.0 percentage points of the centralized model on these metrics. Additionally, it achieves comparable rubric-level scoring accuracy, with only a 1.3% difference in rubric match and a lower score deviation (MAE), highlighting its effectiveness in preserving both prediction quality and interpretability.
中文: 本研究提出采用LoRA适配大语言模型的联邦学习框架进行教育评估自动评分,通过仅共享模型更新而非原始学生数据,在保护数据隐私的同时实现了接近集中式模型的准确率。
English: This study introduces a federated learning framework using LoRA-adapted LLMs for automated educational assessment scoring, which achieves near-centralized model accuracy while preserving data privacy by sharing only model updates instead of raw student data.

Authors:Shunqi Mao, Chaoyi Zhang, Weidong Cai
Title: Through the Magnifying Glass: Adaptive Perception Magnification for Hallucination-Free VLM Decoding
Abstract:
Existing vision-language models (VLMs) often suffer from visual hallucination, where the generated responses contain inaccuracies that are not grounded in the visual input. Efforts to address this issue without model finetuning primarily mitigate hallucination by contrastively reducing language biases or amplifying the weights of visual embedding during decoding. However, these approaches remain limited in their ability to capture fine-grained visual details. In this work, we propose the Perception Magnifier (PM), a novel visual decoding method that iteratively isolates relevant visual tokens based on attention and magnifies the corresponding regions, spurring the model to concentrate on fine-grained visual details during decoding. By magnifying critical regions while preserving the structural and contextual information at each decoding step, PM allows the VLM to enhance its scrutiny of the visual input, hence producing more accurate and faithful responses. Extensive experimental results demonstrate that PM not only achieves superior hallucination mitigation but also enhances language generation while preserving strong reasoning capabilities.
Chinese: 提出的感知放大器(PM)方法通过迭代隔离并放大相关视觉标记,在解码过程中增强对细粒度视觉细节的审视,有效缓解视觉语言模型的幻觉问题并提升生成响应的准确性。
English: The proposed Perception Magnifier (PM) method iteratively isolates and magnifies relevant visual tokens to enhance fine-grained visual scrutiny during decoding, effectively mitigating visual hallucination and improving response accuracy in vision-language models.

Authors:Yasheng Sun, Zhiliang Xu, Hang Zhou, Jiazhi Guan, Quanwei Yang, Kaisiyuan Wang, Borong Liang, Yingying Li, Haocheng Feng, Jingdong Wang, Ziwei Liu, Koike Hideki
Title: Cosh-DiT: Co-Speech Gesture Video Synthesis via Hybrid Audio-Visual Diffusion Transformers
Abstract:
Co-speech gesture video synthesis is a challenging task that requires both probabilistic modeling of human gestures and the synthesis of realistic images that align with the rhythmic nuances of speech. To address these challenges, we propose Cosh-DiT, a Co-speech gesture video system with hybrid Diffusion Transformers that perform audio-to-motion and motion-to-video synthesis using discrete and continuous diffusion modeling, respectively. First, we introduce an audio Diffusion Transformer (Cosh-DiT-A) to synthesize expressive gesture dynamics synchronized with speech rhythms. To capture upper body, facial, and hand movement priors, we employ vector-quantized variational autoencoders (VQ-VAEs) to jointly learn their dependencies within a discrete latent space. Then, for realistic video synthesis conditioned on the generated speech-driven motion, we design a visual Diffusion Transformer (Cosh-DiT-V) that effectively integrates spatial and temporal contexts. Extensive experiments demonstrate that our framework consistently generates lifelike videos with expressive facial expressions and natural, smooth gestures that align seamlessly with speech.
中文:Cosh-DiT系统采用混合扩散变换器,通过音频生成同步动作再转化为逼真视频,实现了语音与生动手势的无缝对齐,能合成具有自然流畅动作和丰富表情的讲话手势视频。
English: The proposed Cosh-DiT system employs hybrid Diffusion Transformers to synthesize lifelike co-speech gesture videos by first generating synchronized motion from audio and then converting it into realistic video, achieving seamless alignment between speech and expressive gestures.

Authors:Luyang Fang, Ehsan Latif, Haoran Lu, Yifan Zhou, Ping Ma, Xiaoming Zhai
Title: Efficient Multi-Task Inferencing: Model Merging with Gromov-Wasserstein Feature Alignment
Abstract:
Automatic scoring of student responses enhances efficiency in education, but deploying a separate neural network for each task increases storage demands, maintenance efforts, and redundant computations. To address these challenges, this paper introduces the Gromov-Wasserstein Scoring Model Merging (GW-SMM) method, which merges models based on feature distribution similarities measured via the Gromov-Wasserstein distance. Our approach begins by extracting features from student responses using individual models, capturing both item-specific context and unique learned representations. The Gromov-Wasserstein distance then quantifies the similarity between these feature distributions, identifying the most compatible models for merging. Models exhibiting the smallest pairwise distances, typically in pairs or trios, are merged by combining only the shared layers preceding the classification head. This strategy results in a unified feature extractor while preserving separate classification heads for item-specific scoring. We validated our approach against human expert knowledge and a GPT-o1-based merging method. GW-SMM consistently outperformed both, achieving a higher micro F1 score, macro F1 score, exact match accuracy, and per-label accuracy. The improvements in micro F1 and per-label accuracy were statistically significant compared to GPT-o1-based merging (p=0.04, p=0.01). Additionally, GW-SMM reduced storage requirements by half without compromising much accuracy, demonstrating its computational efficiency alongside reliable scoring performance.
Chinese Summary: 本文提出的GW-SMM方法通过Gromov-Wasserstein距离度量特征分布相似性来合并学生自动评分模型,在保持精度的同时显著提升了各项性能指标并将存储需求减半。
English Summary: This paper introduces the GW-SMM method that merges neural models for automated student response scoring by measuring feature distribution similarities through Gromov-Wasserstein distance, achieving superior performance metrics and halving storage needs while maintaining accuracy.

Authors:Vincent Liu, Ehsan Latif, Xiaoming Zhai
Title: Advancing Education through Tutoring Systems: A Systematic Literature Review
Abstract:
This study systematically reviews the transformative role of Tutoring Systems, encompassing Intelligent Tutoring Systems (ITS) and Robot Tutoring Systems (RTS), in addressing global educational challenges through advanced technologies. As many students struggle with proficiency in core academic areas, Tutoring Systems emerge as promising solutions to bridge learning gaps by delivering personalized and adaptive instruction. ITS leverages artificial intelligence (AI) models, such as Bayesian Knowledge Tracing and Large Language Models, to provide precise cognitive support, while RTS enhances social and emotional engagement through human-like interactions. This systematic review, adhering to the PRISMA framework, analyzed 86 representative studies. We evaluated the pedagogical and technological advancements, engagement strategies, and ethical considerations surrounding these systems. Based on these parameters, Latent Class Analysis was conducted and identified three distinct categories: computer-based ITS, robot-based RTS, and multimodal systems integrating various interaction modes. The findings reveal significant advancements in AI techniques that enhance adaptability, engagement, and learning outcomes. However, challenges such as ethical concerns, scalability issues, and gaps in cognitive adaptability persist. The study highlights the complementary strengths of ITS and RTS, proposing integrated hybrid solutions to maximize educational benefits. Future research should focus on bridging gaps in scalability, addressing ethical considerations comprehensively, and advancing AI models to support diverse educational needs.
中文: 本系统综述指出智能与机器人辅导系统通过人工智能个性化和社交互动弥补教育差距,但面临可扩展性与伦理挑战,未来需通过混合解决方案加以解决。
English: This systematic review highlights how Intelligent and Robot Tutoring Systems address educational gaps through AI-driven personalization and social engagement, yet face challenges in scalability and ethics that future hybrid solutions must resolve.

Authors:Xinyi Yang, Runzhe Zhan, Derek F. Wong, Shu Yang, Junchao Wu, Lidia S. Chao
Title: Rethinking Prompt-based Debiasing in Large Language Models
Abstract:
Investigating bias in large language models (LLMs) is crucial for developing trustworthy AI. While prompt-based through prompt engineering is common, its effectiveness relies on the assumption that models inherently understand biases. Our study systematically analyzed this assumption using the BBQ and StereoSet benchmarks on both open-source models as well as commercial GPT model. Experimental results indicate that prompt-based is often superficial; for instance, the Llama2-7B-Chat model misclassified over 90% of unbiased content as biased, despite achieving high accuracy in identifying bias issues on the BBQ dataset. Additionally, specific evaluation and question settings in bias benchmarks often lead LLMs to choose "evasive answers", disregarding the core of the question and the relevance of the response to the context. Moreover, the apparent success of previous methods may stem from flawed evaluation metrics. Our research highlights a potential "false prosperity" in prompt-base efforts and emphasizes the need to rethink bias metrics to ensure truly trustworthy AI.
中文摘要:研究表明,基于提示的大语言模型偏见检测常产生表面结果,如Llama2-7B-Chat模型错误分类超90%无偏见内容并依赖回避性回答,揭示了当前方法的"虚假繁荣"及改进偏见评估指标的必要性。
English Summary: The study reveals that prompt-based bias detection in LLMs often yields superficial results, with models like Llama2-7B-Chat misclassifying unbiased content and relying on evasive answers, exposing a "false prosperity" in current methods and the need for improved bias metrics.

Authors:Zheng Qin, Ruobing Zheng, Yabing Wang, Tianqi Li, Zixin Zhu, Sanping Zhou, Ming Yang, Le Wang
Title: Versatile Multimodal Controls for Expressive Talking Human Animation
Abstract:
In filmmaking, directors typically allow actors to perform freely based on the script before providing specific guidance on how to present key actions. AI-generated content faces similar requirements, where users not only need automatic generation of lip synchronization and basic gestures from audio input but also desire semantically accurate and expressive body movement that can be ``directly guided'' through text descriptions. Therefore, we present VersaAnimator, a versatile framework that synthesizes expressive talking human videos from arbitrary portrait images. Specifically, we design a motion generator that produces basic rhythmic movements from audio input and supports text-prompt control for specific actions. The generated whole-body 3D motion tokens can animate portraits of various scales, producing talking heads, half-body gestures and even leg movements for whole-body images. Besides, we introduce a multi-modal controlled video diffusion that generates photorealistic videos, where speech signals govern lip synchronization, facial expressions, and head motions while body movements are guided by the 2D poses. Furthermore, we introduce a token2pose translator to smoothly map 3D motion tokens to 2D pose sequences. This design mitigates the stiffness resulting from direct 3D to 2D conversion and enhances the details of the generated body movements. Extensive experiments shows that VersaAnimator synthesizes lip-synced and identity-preserving videos while generating expressive and semantically meaningful whole-body motions.
中文: VersaAnimator是一种多功能框架,可从任意肖像图像生成富有表现力的讲话人体视频,通过音频驱动和文本提示控制实现逼真、同步的全身动作。
English: VersaAnimator is a versatile framework that synthesizes expressive talking human videos from portrait images, using audio-driven motion generation and text-guided control to produce realistic, synchronized movements.

Authors:Zhangming Chan, Xiuying Chen, Yongliang Wang, Juntao Li, Zhiqiang Zhang, Kun Gai, Dongyan Zhao, Rui Yan
Title: Stick to Facts: Towards Fidelity-oriented Product Description Generation
Abstract:
Different from other text generation tasks, in product description generation, it is of vital importance to generate faithful descriptions that stick to the product attribute information. However, little attention has been paid to this problem. To bridge this gap, we propose a model named Fidelity-oriented Product Description Generator (FPDG). FPDG takes the entity label of each word into account, since the product attribute information is always conveyed by entity words. Specifically, we first propose a Recurrent Neural Network (RNN) decoder based on the Entity-label-guided Long Short-Term Memory (ELSTM) cell, taking both the embedding and the entity label of each word as input. Second, we establish a keyword memory that stores the entity labels as keys and keywords as values, allowing FPDG to attend to keywords by attending to their entity labels. Experiments conducted on a large-scale real-world product description dataset show that our model achieves state-of-the-art performance in terms of both traditional generation metrics and human evaluations. Specifically, FPDG increases the fidelity of the generated descriptions by 25%.
中文: 本文提出FPDG模型,通过引入实体标签和关键词记忆机制来提升产品描述生成的忠实度,实验表明该模型将生成描述的准确性提高了25%,并取得了最优性能。
English: This paper introduces FPDG, a model that enhances fidelity in product description generation by incorporating entity labels and a keyword memory, achieving a 25% improvement in faithfulness and state-of-the-art results.

Authors:Xiaoxiao Liu, Qingying Xiao, Junying Chen, Xiangyi Feng, Xiangbo Wu, Bairui Zhang, Xiang Wan, Jian Chang, Guangjun Yu, Yan Hu, Benyou Wang
Title: Large Language Models for Outpatient Referral: Problem Definition, Benchmarking and Challenges
Abstract:
Large language models (LLMs) are increasingly applied to outpatient referral tasks across healthcare systems. However, there is a lack of standardized evaluation criteria to assess their effectiveness, particularly in dynamic, interactive scenarios. In this study, we systematically examine the capabilities and limitations of LLMs in managing tasks within Intelligent Outpatient Referral (IOR) systems and propose a comprehensive evaluation framework specifically designed for such systems. This framework comprises two core tasks: static evaluation, which focuses on evaluating the ability of predefined outpatient referrals, and dynamic evaluation, which evaluates capabilities of refining outpatient referral recommendations through iterative dialogues. Our findings suggest that LLMs offer limited advantages over BERT-like models, but show promise in asking effective questions during interactive dialogues.
中文: 大型语言模型在门诊转诊任务中应用增多但缺乏标准化评估,本研究提出静态与动态评估框架,发现其相比BERT类模型优势有限,但在交互式提问中展现出潜力。
English: Large language models are being used in outpatient referral tasks but lack standardized evaluation, so this study proposes a static and dynamic framework to assess their capabilities, finding limited advantages over BERT-like models but potential in interactive questioning.

Authors:Yunhao Li, Yifan Jiao, Dan Meng, Heng Fan, Libo Zhang
Title: Attention to Trajectory: Trajectory-Aware Open-Vocabulary Tracking
Abstract:
Open-Vocabulary Multi-Object Tracking (OV-MOT) aims to enable approaches to track objects without being limited to a predefined set of categories. Current OV-MOT methods typically rely primarily on instance-level detection and association, often overlooking trajectory information that is unique and essential for object tracking tasks. Utilizing trajectory information can enhance association stability and classification accuracy, especially in cases of occlusion and category ambiguity, thereby improving adaptability to novel classes. Thus motivated, in this paper we propose \textbf{TRACT}, an open-vocabulary tracker that leverages trajectory information to improve both object association and classification in OV-MOT. Specifically, we introduce a \textit{Trajectory Consistency Reinforcement} (\textbf{TCR}) strategy, that benefits tracking performance by improving target identity and category consistency. In addition, we present \textbf{TraCLIP}, a plug-and-play trajectory classification module. It integrates \textit{Trajectory Feature Aggregation} (\textbf{TFA}) and \textit{Trajectory Semantic Enrichment} (\textbf{TSE}) strategies to fully leverage trajectory information from visual and language perspectives for enhancing the classification results. Extensive experiments on OV-TAO show that our TRACT significantly improves tracking performance, highlighting trajectory information as a valuable asset for OV-MOT. Code will be released.
中文: TRACT通过轨迹一致性强化和轨迹特征聚合策略,在开放词汇多目标跟踪中利用轨迹信息提升目标关联与分类能力,显著优化了跟踪性能。
English: TRACT introduces trajectory-based strategies to enhance object association and classification in open-vocabulary multi-object tracking, demonstrating significant performance improvements through trajectory consistency reinforcement and integrated feature aggregation.

Authors:Lixue Gong, Xiaoxia Hou, Fanshi Li, Liang Li, Xiaochen Lian, Fei Liu, Liyang Liu, Wei Liu, Wei Lu, Yichun Shi, Shiqi Sun, Yu Tian, Zhi Tian, Peng Wang, Xun Wang, Ye Wang, Guofeng Wu, Jie Wu, Xin Xia, Xuefeng Xiao, Linjie Yang, Zhonghua Zhai, Xinyu Zhang, Qi Zhang, Yuwei Zhang, Shijia Zhao, Jianchao Yang, Weilin Huang
Title: Seedream 2.0: A Native Chinese-English Bilingual Image Generation Foundation Model
Abstract:
Rapid advancement of diffusion models has catalyzed remarkable progress in the field of image generation. However, prevalent models such as Flux, SD3.5 and Midjourney, still grapple with issues like model bias, limited text rendering capabilities, and insufficient understanding of Chinese cultural nuances. To address these limitations, we present Seedream 2.0, a native Chinese-English bilingual image generation foundation model that excels across diverse dimensions, which adeptly manages text prompt in both Chinese and English, supporting bilingual image generation and text rendering. We develop a powerful data system that facilitates knowledge integration, and a caption system that balances the accuracy and richness for image description. Particularly, Seedream is integrated with a self-developed bilingual large language model as a text encoder, allowing it to learn native knowledge directly from massive data. This enable it to generate high-fidelity images with accurate cultural nuances and aesthetic expressions described in either Chinese or English. Beside, Glyph-Aligned ByT5 is applied for flexible character-level text rendering, while a Scaled ROPE generalizes well to untrained resolutions. Multi-phase post-training optimizations, including SFT and RLHF iterations, further improve the overall capability. Through extensive experimentation, we demonstrate that Seedream 2.0 achieves state-of-the-art performance across multiple aspects, including prompt-following, aesthetics, text rendering, and structural correctness. Furthermore, Seedream 2.0 has been optimized through multiple RLHF iterations to closely align its output with human preferences, as revealed by its outstanding ELO score. In addition, it can be readily adapted to an instruction-based image editing model, such as SeedEdit, with strong editing capability that balances instruction-following and image consistency.
中文: Seedream 2.0作为原生中英双语图像生成基础模型,通过融合自研语言模型和优化训练方法,有效解决了文化细节理解与文字渲染的不足,在多维度评估中展现出顶尖性能。
English: Seedream 2.0 is a bilingual image generation model that overcomes limitations in cultural understanding and text rendering by integrating native language processing and advanced training techniques, achieving state-of-the-art performance across multiple metrics.

Authors:Tianyu Chen, Yasi Zhang, Zhendong Wang, Ying Nian Wu, Oscar Leong, Mingyuan Zhou
Title: Denoising Score Distillation: From Noisy Diffusion Pretraining to One-Step High-Quality Generation
Abstract:
Diffusion models have achieved remarkable success in generating high-resolution, realistic images across diverse natural distributions. However, their performance heavily relies on high-quality training data, making it challenging to learn meaningful distributions from corrupted samples. This limitation restricts their applicability in scientific domains where clean data is scarce or costly to obtain. In this work, we introduce denoising score distillation (DSD), a surprisingly effective and novel approach for training high-quality generative models from low-quality data. DSD first pretrains a diffusion model exclusively on noisy, corrupted samples and then distills it into a one-step generator capable of producing refined, clean outputs. While score distillation is traditionally viewed as a method to accelerate diffusion models, we show that it can also significantly enhance sample quality, particularly when starting from a degraded teacher model. Across varying noise levels and datasets, DSD consistently improves generative performancewe summarize our empirical evidence in Fig. 1. Furthermore, we provide theoretical insights showing that, in a linear model setting, DSD identifies the eigenspace of the clean data distributions covariance matrix, implicitly regularizing the generator. This perspective reframes score distillation as not only a tool for efficiency but also a mechanism for improving generative models, particularly in low-quality data settings.
中文: 去噪分数蒸馏(DSD)是一种创新方法,它先在噪声数据上训练生成模型,再将其提炼为能生成高质量输出的高效单步生成器,即使在低质量数据条件下也能显著提升性能。
English: Denoising Score Distillation (DSD) is a novel method that trains generative models on corrupted data, then distills them into efficient one-step generators that produce high-quality outputs, enhancing performance even with low-quality inputs.

Authors:Weize Li, Yunhao Du, Qixiang Yin, Zhicheng Zhao, Fei Su, Daqi Liu
Title: Just Functioning as a Hook for Two-Stage Referring Multi-Object Tracking
Abstract:
Referring Multi-Object Tracking (RMOT) aims to localize target trajectories in videos specified by natural language expressions. Despite recent progress, the intrinsic relationship between the two subtasks of tracking and referring in RMOT has not been fully studied. In this paper, we present a systematic analysis of their interdependence, revealing that current two-stage Referring-by-Tracking (RBT) frameworks remain fundamentally limited by insufficient modeling of subtask interactions and inflexible reliance on semantic alignment modules like CLIP. To this end, we propose JustHook, a novel two-stage RBT framework where a Hook module is firstly designed to redefine the linkage between subtasks. The Hook is built centered on grid sampling at the feature-level and is used for context-aware target feature extraction. Moreover, we propose a Parallel Combined Decoder (PCD) that learns in a unified joint feature space rather than relying on pre-defined cross-modal embeddings. Our design not only enhances the interpretability and modularity but also significantly improves the generalization. Extensive experiments on Refer-KITTI, Refer-KITTI-V2, and Refer-Dance demonstrate that JustHook achieves state-of-the-art performance, improving the HOTA by +6.9\% on Refer-KITTI-V2 with superior efficiency. Code will be available soon.
中文: 本文提出JustHook框架,通过设计Hook模块和并行组合解码器重构子任务关联,在统一特征空间中学习,显著提升了多目标跟踪与语言指称的交互性能,在多个基准测试中达到最优效果。
English: This paper introduces JustHook, a novel two-stage Referring-by-Tracking framework that enhances subtask interaction through a Hook module and a Parallel Combined Decoder, achieving state-of-the-art performance on multiple benchmarks with improved generalization and efficiency.

Authors:Zhihao Huang, Xi Qiu, Yukuo Ma, Yifu Zhou, Junjie Chen, Hongyuan Zhang, Chi Zhang, Xuelong Li
Title: NFIG: Autoregressive Image Generation with Next-Frequency Prediction
Abstract:
Autoregressive models have achieved promising results in natural language processing. However, for image generation tasks, they encounter substantial challenges in effectively capturing long-range dependencies, managing computational costs, and most crucially, defining meaningful autoregressive sequences that reflect natural image hierarchies. To address these issues, we present \textbf{N}ext-\textbf{F}requency \textbf{I}mage \textbf{G}eneration (\textbf{NFIG}), a novel framework that decomposes the image generation process into multiple frequency-guided stages. Our approach first generates low-frequency components to establish global structure with fewer tokens, then progressively adds higher-frequency details, following the natural spectral hierarchy of images. This principled autoregressive sequence not only improves the quality of generated images by better capturing true causal relationships between image components, but also significantly reduces computational overhead during inference. Extensive experiments demonstrate that NFIG achieves state-of-the-art performance with fewer steps, offering a more efficient solution for image generation, with 1.25$\times$ speedup compared to VAR-d20 while achieving better performance (FID: 2.81) on the ImageNet-256 benchmark. We hope that our insight of incorporating frequency-domain knowledge to guide autoregressive sequence design will shed light on future research. We will make our code publicly available upon acceptance of the paper.
中文摘要:NFIG框架通过将图像生成分解为多个频率引导的阶段,先以低频分量构建全局结构再逐步添加细节,从而提升生成质量并显著提高计算效率。
English Summary: The NFIG framework improves image generation by decomposing the process into frequency-guided stages, starting with low-frequency components for global structure and progressively adding details, which enhances image quality and computational efficiency.

Authors:Yiwei Li, Jiayi Shi, Shaoxiong Feng, Peiwen Yuan, Xinglin Wang, Yueqi Zhang, Ji Zhang, Chuyi Tan, Boyuan Pan, Yao Hu, Kan Li
Title: Speculative Decoding for Multi-Sample Inference
Abstract:
We propose a novel speculative decoding method tailored for multi-sample reasoning scenarios, such as self-consistency and Best-of-N sampling. Our method exploits the intrinsic consensus of parallel generation paths to synthesize high-quality draft tokens without requiring auxiliary models or external databases. By dynamically analyzing structural patterns across parallel reasoning paths through a probabilistic aggregation mechanism, it identifies consensus token sequences that align with the decoding distribution. Evaluations on mathematical reasoning benchmarks demonstrate a substantial improvement in draft acceptance rates over baselines, while reducing the latency in draft token construction. This work establishes a paradigm shift for efficient multi-sample inference, enabling seamless integration of speculative decoding with sampling-based reasoning techniques.
中文摘要:本文提出一种面向多样本推理场景的推测解码方法,通过概率聚合机制分析并行推理路径的结构模式来合成高质量草稿标记,在提升采纳率的同时显著降低了延迟且无需外部依赖。
English Summary: This paper introduces a speculative decoding method for multi-sample reasoning that leverages consensus across parallel generation paths to produce high-quality draft tokens, significantly improving acceptance rates and reducing latency without external resources.

Authors:Zining Chen, Zhicheng Zhao, Fei Su, Xiaoqin Zhang, Shijian Lu
Title: Data-Efficient Generalization for Zero-shot Composed Image Retrieval
Abstract:
Zero-shot Composed Image Retrieval (ZS-CIR) aims to retrieve the target image based on a reference image and a text description without requiring in-distribution triplets for training. One prevalent approach follows the vision-language pretraining paradigm that employs a mapping network to transfer the image embedding to a pseudo-word token in the text embedding space. However, this approach tends to impede network generalization due to modality discrepancy and distribution shift between training and inference. To this end, we propose a Data-efficient Generalization (DeG) framework, including two novel designs, namely, Textual Supplement (TS) module and Semantic-Set (S-Set). The TS module exploits compositional textual semantics during training, enhancing the pseudo-word token with more linguistic semantics and thus mitigating the modality discrepancy effectively. The S-Set exploits the zero-shot capability of pretrained Vision-Language Models (VLMs), alleviating the distribution shift and mitigating the overfitting issue from the redundancy of the large-scale image-text data. Extensive experiments over four ZS-CIR benchmarks show that DeG outperforms the state-of-the-art (SOTA) methods with much less training data, and saves substantial training and inference time for practical usage.
Chinese: 提出的数据高效泛化(DeG)框架通过引入文本补充模块和语义集,有效缓解模态差异和分布偏移,在零样本组合图像检索中仅需少量训练数据即可超越现有最优方法,并显著提升训练和推理效率。
English: The proposed Data-efficient Generalization (DeG) framework enhances zero-shot composed image retrieval by incorporating a Textual Supplement module and Semantic-Set to mitigate modality discrepancy and distribution shift, achieving state-of-the-art performance with less training data and improved efficiency.

Authors:Simon A. Aytes, Jinheon Baek, Sung Ju Hwang
Title: Sketch-of-Thought: Efficient LLM Reasoning with Adaptive Cognitive-Inspired Sketching
Abstract:
Recent advances in large language models (LLMs) have enabled strong reasoning capabilities through Chain-of-Thought (CoT) prompting, which elicits step-by-step problem solving, but often at the cost of excessive verbosity in intermediate outputs, leading to increased computational overhead. We propose Sketch-of-Thought (SoT), a prompting framework that integrates cognitively inspired reasoning paradigms with linguistic constraints to reduce token usage while preserving reasoning accuracy. SoT is designed as a flexible, modular approach and is instantiated with three paradigms--Conceptual Chaining, Chunked Symbolism, and Expert Lexicons--each tailored to distinct reasoning tasks and selected dynamically at test-time by a lightweight routing model. Across 18 reasoning datasets spanning multiple domains, languages, and modalities, SoT achieves token reductions of up to 84% with minimal accuracy loss. In tasks such as mathematical and multi-hop reasoning, it even improves accuracy while shortening outputs.
中文:提出的思维草图(SoT)框架通过认知启发的推理范式,在多种推理任务中保持甚至提升准确率的同时,将计算开销降低了高达84%。
English: The proposed Sketch-of-Thought (SoT) framework reduces computational overhead by up to 84% through cognitively inspired reasoning paradigms while maintaining or even improving accuracy across diverse reasoning tasks.

Authors:Siyu Ma, Wenxin Du, Chang Yu, Ying Jiang, Zeshun Zong, Tianyi Xie, Yunuo Chen, Yin Yang, Xuchen Han, Chenfanfu Jiang
Title: GRIP: A General Robotic Incremental Potential Contact Simulation Dataset for Unified Deformable-Rigid Coupled Grasping
Abstract:
Grasping is fundamental to robotic manipulation, and recent advances in large-scale grasping datasets have provided essential training data and evaluation benchmarks, accelerating the development of learning-based methods for robust object grasping. However, most existing datasets exclude deformable bodies due to the lack of scalable, robust simulation pipelines, limiting the development of generalizable models for compliant grippers and soft manipulands. To address these challenges, we present GRIP, a General Robotic Incremental Potential contact simulation dataset for universal grasping. GRIP leverages an optimized Incremental Potential Contact (IPC)-based simulator for multi-environment data generation, achieving up to 48x speedup while ensuring efficient, intersection- and inversion-free simulations for compliant grippers and deformable objects. Our fully automated pipeline generates and evaluates diverse grasp interactions across 1,200 objects and 100,000 grasp poses, incorporating both soft and rigid grippers. The GRIP dataset enables applications such as neural grasp generation and stress field prediction.
中文: GRIP数据集通过优化的基于IPC的模拟器解决了柔性物体抓取数据不足的问题,提供了高效、无干涉的仿真环境,支持兼容性抓取器和柔性物体的多样化交互训练与应用。
English: The GRIP dataset addresses the gap in robotic grasping data for deformable objects by providing a scalable simulation pipeline with optimized IPC-based simulations, enabling efficient and robust training for compliant grippers and soft manipulands across diverse interactions.

Authors:Hongchao Du, Shangyu Wu, Arina Kharlamova, Nan Guan, Chun Jason Xue
Title: FlexInfer: Breaking Memory Constraint via Flexible and Efficient Offloading for On-Device LLM Inference
Abstract:
Large Language Models (LLMs) face challenges for on-device inference due to high memory demands. Traditional methods to reduce memory usage often compromise performance and lack adaptability. We propose FlexInfer, an optimized offloading framework for on-device inference, addressing these issues with techniques like asynchronous prefetching, balanced memory locking, and flexible tensor preservation. These strategies enhance memory efficiency and mitigate I/O bottlenecks, ensuring high performance within user-specified resource constraints. Experiments demonstrate that FlexInfer significantly improves throughput under limited resources, achieving up to 12.5 times better performance than existing methods and facilitating the deployment of large models on resource-constrained devices.
Chinese: FlexInfer 是一种优化的卸载框架,通过异步预取和平衡内存锁定等技术,在资源受限条件下显著提升设备端大语言模型推理效率,实现高达12.5倍的吞吐量提升,同时保持高性能。
English: FlexInfer is an optimized offloading framework that enhances on-device LLM inference by employing techniques like asynchronous prefetching and balanced memory locking, achieving up to 12.5 times higher throughput under resource constraints while maintaining performance.

Authors:Maresa Schröder, Valentyn Melnychuk, Stefan Feuerriegel
Title: Differentially Private Learners for Heterogeneous Treatment Effects
Abstract:
Patient data is widely used to estimate heterogeneous treatment effects and thus understand the effectiveness and safety of drugs. Yet, patient data includes highly sensitive information that must be kept private. In this work, we aim to estimate the conditional average treatment effect (CATE) from observational data under differential privacy. Specifically, we present DP-CATE, a novel framework for CATE estimation that is Neyman-orthogonal and further ensures differential privacy of the estimates. Our framework is highly general: it applies to any two-stage CATE meta-learner with a Neyman-orthogonal loss function, and any machine learning model can be used for nuisance estimation. We further provide an extension of our DP-CATE, where we employ RKHS regression to release the complete CATE function while ensuring differential privacy. We demonstrate our DP-CATE across various experiments using synthetic and real-world datasets. To the best of our knowledge, we are the first to provide a framework for CATE estimation that is Neyman-orthogonal and differentially private.
中文摘要:本文提出DP-CATE框架,在保证差分隐私的前提下从观察数据中估计条件平均处理效应,适用于任何具有尼曼正交损失函数的两阶段元学习器。
English Summary: This paper introduces DP-CATE, a novel framework for estimating conditional average treatment effects from observational data while ensuring differential privacy, applicable to any two-stage meta-learner with Neyman-orthogonal loss functions.

Authors:Yannian Gu, Wenhui Lei, Hanyu Chen, Xiaofan Zhang, Shaoting Zhang
Title: Interactive Segmentation and Report Generation for CT Images
Abstract:
Automated CT report generation plays a crucial role in improving diagnostic accuracy and clinical workflow efficiency. However, existing methods lack interpretability and impede patient-clinician understanding, while their static nature restricts radiologists from dynamically adjusting assessments during image review. Inspired by interactive segmentation techniques, we propose a novel interactive framework for 3D lesion morphology reporting that seamlessly generates segmentation masks with comprehensive attribute descriptions, enabling clinicians to generate detailed lesion profiles for enhanced diagnostic assessment. To our best knowledge, we are the first to integrate the interactive segmentation and structured reports in 3D CT medical images. Experimental results across 15 lesion types demonstrate the effectiveness of our approach in providing a more comprehensive and reliable reporting system for lesion segmentation and capturing. The source code will be made publicly available following paper acceptance.
中文: 本文提出了一种交互式三维CT病灶报告框架,将分割掩码与详细描述相结合,提高了诊断准确性和临床医生的操作灵活性。
English: This paper introduces an interactive framework for 3D CT lesion reporting that combines segmentation masks with detailed descriptions, enhancing diagnostic accuracy and clinician control.

Authors:Junda He, Jieke Shi, Terry Yue Zhuo, Christoph Treude, Jiamou Sun, Zhenchang Xing, Xiaoning Du, David Lo
Title: From Code to Courtroom: LLMs as the New Software Judges
Abstract:
Recently, Large Language Models (LLMs) have been increasingly used to automate SE tasks such as code generation and summarization. However, evaluating the quality of LLM-generated software artifacts remains challenging. Human evaluation, while effective, is very costly and time-consuming. Traditional automated metrics like BLEU rely on high-quality references and struggle to capture nuanced aspects of software quality, such as readability and usefulness. In response, the LLM-as-a-Judge paradigm, which employs LLMs for automated evaluation, has emerged. Given that LLMs are typically trained to align with human judgment and possess strong coding abilities and reasoning skills, they hold promise as cost-effective and scalable surrogates for human evaluators. Nevertheless, LLM-as-a-Judge research in the SE community is still in its early stages, with many breakthroughs needed. This forward-looking SE 2030 paper aims to steer the research community toward advancing LLM-as-a-Judge for evaluating LLMgenerated software artifacts, while also sharing potential research paths to achieve this goal. We provide a literature review of existing SE studies on LLM-as-a-Judge and envision these frameworks as reliable, robust, and scalable human surrogates capable of evaluating software artifacts with consistent, multi-faceted assessments by 2030 and beyond. To validate this vision, we analyze the limitations of current studies, identify key research gaps, and outline a detailed roadmap to guide future developments of LLM-as-a-Judge in software engineering. While not intended to be a definitive guide, our work aims to foster further research and adoption of LLM-as-a-Judge frameworks within the SE community, ultimately improving the effectiveness and scalability of software artifact evaluation methods.
中文: 该论文主张推进LLM作为评判者的框架,以经济高效地评估LLM生成的软件制品,通过分析当前局限并规划研究路线,旨在到2030年实现可靠、多维度的人工替代评估方案。
English: The paper advocates for advancing LLM-as-a-Judge frameworks to cost-effectively evaluate LLM-generated software artifacts, addressing current limitations and outlining a research roadmap to achieve reliable, multi-faceted assessments by 2030.

Authors:Zhixun Chen, Ming Li, Yuxuan Huang, Yali Du, Meng Fang, Tianyi Zhou
Title: ATLaS: Agent Tuning via Learning Critical Steps
Abstract:
Large Language Model (LLM) agents have demonstrated remarkable generalization capabilities across multi-domain tasks. Existing agent tuning approaches typically employ supervised finetuning on entire expert trajectories. However, behavior-cloning of full trajectories can introduce expert bias and weaken generalization to states not covered by the expert data. Additionally, critical steps, such as planning, complex reasoning for intermediate subtasks, and strategic decision-making, are essential to success in agent tasks, so learning these steps is the key to improving LLM agents. For more effective and efficient agent tuning, we propose ATLaS that identifies the critical steps in expert trajectories and finetunes LLMs solely on these steps with reduced costs. By steering the training's focus to a few critical steps, our method mitigates the risk of overfitting entire trajectories and promotes generalization across different environments and tasks. In extensive experiments, an LLM finetuned on only 30% critical steps selected by ATLaS outperforms the LLM finetuned on all steps and recent open-source LLM agents. ATLaS maintains and improves base LLM skills as generalist agents interacting with diverse environments.
Chinese: ATLaS方法通过仅对专家轨迹中的关键步骤进行微调,以更低成本提升LLM代理的泛化能力,避免过拟合并保持其多环境交互的通用技能。
English: The ATLaS method enhances LLM agent tuning by selectively finetuning on critical steps from expert trajectories, reducing costs and improving generalization across tasks without overfitting.

Authors:Tianchi Ren, Haibo Hu, Jiacheng Zuo, Xinhong Chen, Jianping Wang, Chun Jason Xue, Jen-Ming Wu, Nan Guan
Title: CoT-VLM4Tar: Chain-of-Thought Guided Vision-Language Models for Traffic Anomaly Resolution
Abstract:
With the acceleration of urbanization, modern urban traffic systems are becoming increasingly complex, leading to frequent traffic anomalies. These anomalies encompass not only common traffic jams but also more challenging issues such as phantom traffic jams, intersection deadlocks, and accident liability analysis, which severely impact traffic flow, vehicular safety, and overall transportation efficiency. Currently, existing solutions primarily rely on manual intervention by traffic police or artificial intelligence-based detection systems. However, these methods often suffer from response delays and inconsistent management due to inadequate resources, while AI detection systems, despite enhancing efficiency to some extent, still struggle to handle complex traffic anomalies in a real-time and precise manner. To address these issues, we propose CoT-VLM4Tar: (Chain of Thought Visual-Language Model for Traffic Anomaly Resolution), this innovative approach introduces a new chain-of-thought to guide the VLM in analyzing, reasoning, and generating solutions for traffic anomalies with greater reasonable and effective solution, and to evaluate the performance and effectiveness of our method, we developed a closed-loop testing framework based on the CARLA simulator. Furthermore, to ensure seamless integration of the solutions generated by the VLM with the CARLA simulator, we implement an itegration module that converts these solutions into executable commands. Our results demonstrate the effectiveness of VLM in the resolution of real-time traffic anomalies, providing a proof-of-concept for its integration into autonomous traffic management systems.
Chinese: 摘要介绍了CoT-VLM4Tar,一种思维链视觉语言模型,通过CARLA模拟器中的闭环测试框架实时分析和解决复杂交通异常,证明了其在自主交通管理中的有效性。
English: The abstract introduces CoT-VLM4Tar, a chain-of-thought visual-language model designed to analyze and resolve complex traffic anomalies in real-time through a closed-loop testing framework in the CARLA simulator, demonstrating its effectiveness for autonomous traffic management.

Authors:Haoyang Liu, Jie Wang, Zijie Geng, Xijun Li, Yuxuan Zong, Fangzhou Zhu, Jianye Hao, Feng Wu
Title: Apollo-MILP: An Alternating Prediction-Correction Neural Solving Framework for Mixed-Integer Linear Programming
Abstract:
Leveraging machine learning (ML) to predict an initial solution for mixed-integer linear programming (MILP) has gained considerable popularity in recent years. These methods predict a solution and fix a subset of variables to reduce the problem dimension. Then, they solve the reduced problem to obtain the final solutions. However, directly fixing variable values can lead to low-quality solutions or even infeasible reduced problems if the predicted solution is not accurate enough. To address this challenge, we propose an Alternating prediction-correction neural solving framework (Apollo-MILP) that can identify and select accurate and reliable predicted values to fix. In each iteration, Apollo-MILP conducts a prediction step for the unfixed variables, followed by a correction step to obtain an improved solution (called reference solution) through a trust-region search. By incorporating the predicted and reference solutions, we introduce a novel Uncertainty-based Error upper BOund (UEBO) to evaluate the uncertainty of the predicted values and fix those with high confidence. A notable feature of Apollo-MILP is the superior ability for problem reduction while preserving optimality, leading to high-quality final solutions. Experiments on commonly used benchmarks demonstrate that our proposed Apollo-MILP significantly outperforms other ML-based approaches in terms of solution quality, achieving over a 50% reduction in the solution gap.
中文:Apollo-MILP框架采用交替预测-校正的神经求解方法,通过基于不确定性的误差上界(UEBO)选择性固定高置信度变量,在保持最优性的同时显著优于其他基于机器学习的方法,将求解间隙降低了50%以上。
English: The Apollo-MILP framework introduces an alternating prediction-correction neural solving approach that selectively fixes high-confidence variables using an Uncertainty-based Error upper BOund (UEBO), significantly outperforming other ML-based methods by reducing the solution gap by over 50% while preserving optimality.

Authors:Hyeon Jeon, Michaël Aupetit, DongHwa Shin, Aeri Cho, Seokhyeon Park, Jinwook Seo
Title: Measuring the Validity of Clustering Validation Datasets
Abstract:
Clustering techniques are often validated using benchmark datasets where class labels are used as ground-truth clusters. However, depending on the datasets, class labels may not align with the actual data clusters, and such misalignment hampers accurate validation. Therefore, it is essential to evaluate and compare datasets regarding their cluster-label matching (CLM), i.e., how well their class labels match actual clusters. Internal validation measures (IVMs), like Silhouette, can compare CLM over different labeling of the same dataset, but are not designed to do so across different datasets. We thus introduce Adjusted IVMs as fast and reliable methods to evaluate and compare CLM across datasets. We establish four axioms that require validation measures to be independent of data properties not related to cluster structure (e.g., dimensionality, dataset size). Then, we develop standardized protocols to convert any IVM to satisfy these axioms, and use these protocols to adjust six widely used IVMs. Quantitative experiments (1) verify the necessity and effectiveness of our protocols and (2) show that adjusted IVMs outperform the competitors, including standard IVMs, in accurately evaluating CLM both within and across datasets. We also show that the datasets can be filtered or improved using our method to form more reliable benchmarks for clustering validation.
中文: 本文提出了调整后的内部验证指标,以可靠地评估和比较不同数据集间的聚类标签匹配度,解决了传统指标因未排除与聚类结构无关的数据属性而存在的局限性。
English: Adjusted internal validation measures are introduced to reliably evaluate and compare cluster-label matching across datasets, addressing the limitations of standard measures that fail to account for data properties unrelated to cluster structure.

Authors:Yiyang Liu, James Chenhao Liang, Ruixiang Tang, Yugyung Lee, Majid Rabbani, Sohail Dianat, Raghuveer Rao, Lifu Huang, Dongfang Liu, Qifan Wang, Cheng Han
Title: Re-Imagining Multimodal Instruction Tuning: A Representation View
Abstract:
Multimodal instruction tuning has proven to be an effective strategy for achieving zero-shot generalization by fine-tuning pre-trained Large Multimodal Models (LMMs) with instruction-following data. However, as the scale of LMMs continues to grow, fully fine-tuning these models has become highly parameter-intensive. Although Parameter-Efficient Fine-Tuning (PEFT) methods have been introduced to reduce the number of tunable parameters, a significant performance gap remains compared to full fine-tuning. Furthermore, existing PEFT approaches are often highly parameterized, making them difficult to interpret and control. In light of this, we introduce Multimodal Representation Tuning (MRT), a novel approach that focuses on directly editing semantically rich multimodal representations to achieve strong performance and provide intuitive control over LMMs. Empirical results show that our method surpasses current state-of-the-art baselines with significant performance gains (e.g., 1580.40 MME score) while requiring substantially fewer tunable parameters (e.g., 0.03% parameters). Additionally, we conduct experiments on editing instrumental tokens within multimodal representations, demonstrating that direct manipulation of these representations enables simple yet effective control over network behavior.
中文: 多模态表示调优(MRT)是一种新颖的参数高效方法,通过直接编辑多模态表示来实现卓越性能和对大型多模态模型的直观控制,以极少的可调参数显著超越现有方法。
English: Multimodal Representation Tuning (MRT) is a novel parameter-efficient method that directly edits multimodal representations to achieve superior performance and intuitive control over large multimodal models, significantly outperforming existing approaches with minimal tunable parameters.

Authors:Zeeshan Memon, Chen Ling, Ruochen Kong, Vishwanath Seshagiri, Andreas Zufle, Liang Zhao
Title: Deep Identification of Propagation Trees
Abstract:
Understanding propagation structures in graph diffusion processes, such as epidemic spread or misinformation diffusion, is a fundamental yet challenging problem. While existing methods primarily focus on source localization, they cannot reconstruct the underlying propagation trees i.e., "who infected whom", which are substantial for tracking the propagation pathways and investigate diffusion mechanisms. In this work, we propose Deep Identification of Propagation Trees (DIPT), a probabilistic framework that infers propagation trees from observed diffused states. DIPT models local influence strengths between nodes and leverages an alternating optimization strategy to jointly learn the diffusion mechanism and reconstruct the propagation structure. Extensive experiments on five real-world datasets demonstrate the effectiveness of DIPT in accurately reconstructing propagation trees.
中文: 本研究提出DIPT概率框架,通过建模节点间影响力并联合学习扩散机制,能准确重构图扩散过程中的传播树结构,突破了现有方法仅关注溯源定位的局限。
English: This study introduces DIPT, a probabilistic framework that accurately reconstructs propagation trees in graph diffusion processes by modeling node influence and jointly learning diffusion mechanisms, outperforming existing methods focused solely on source localization.

Authors:Xindi Yang, Baolu Li, Yiming Zhang, Zhenfei Yin, Lei Bai, Liqian Ma, Zhiyong Wang, Jianfei Cai, Tien-Tsin Wong, Huchuan Lu, Xu Jia
Title: VLIPP: Towards Physically Plausible Video Generation with Vision and Language Informed Physical Prior
Abstract:
Video diffusion models (VDMs) have advanced significantly in recent years, enabling the generation of highly realistic videos and drawing the attention of the community in their potential as world simulators. However, despite their capabilities, VDMs often fail to produce physically plausible videos due to an inherent lack of understanding of physics, resulting in incorrect dynamics and event sequences. To address this limitation, we propose a novel two-stage image-to-video generation framework that explicitly incorporates physics with vision and language informed physical prior. In the first stage, we employ a Vision Language Model (VLM) as a coarse-grained motion planner, integrating chain-of-thought and physics-aware reasoning to predict a rough motion trajectories/changes that approximate real-world physical dynamics while ensuring the inter-frame consistency. In the second stage, we use the predicted motion trajectories/changes to guide the video generation of a VDM. As the predicted motion trajectories/changes are rough, noise is added during inference to provide freedom to the VDM in generating motion with more fine details. Extensive experimental results demonstrate that our framework can produce physically plausible motion, and comparative evaluations highlight the notable superiority of our approach over existing methods. More video results are available on our Project Page: https://madaoer.github.io/projects/physically_plausible_video_generation.
中文摘要:针对视频扩散模型常生成不符合物理规律内容的问题,本文提出两阶段生成框架,通过视觉语言模型进行物理感知推理来指导视频生成,实现更符合现实物理规律的运动效果。
English Summary: Video diffusion models often produce physically implausible content, so this paper introduces a two-stage framework that integrates physics-aware reasoning through vision-language models to guide video generation toward more realistic motion.

Authors:Yujie Chen, Haotong Qin, Zhang Zhang, Michelo Magno, Luca Benini, Yawei Li
Title: Q-MambaIR: Accurate Quantized Mamba for Efficient Image Restoration
Abstract:
State-Space Models (SSMs) have attracted considerable attention in Image Restoration (IR) due to their ability to scale linearly sequence length while effectively capturing long-distance dependencies. However, deploying SSMs to edge devices is challenging due to the constraints in memory, computing capacity, and power consumption, underscoring the need for efficient compression strategies. While low-bit quantization is an efficient model compression strategy for reducing size and accelerating IR tasks, SSM suffers substantial performance drops at ultra-low bit-widths (2-4 bits), primarily due to outliers that exacerbate quantization error. To address this challenge, we propose Q-MambaIR, an accurate, efficient, and flexible Quantized Mamba for IR tasks. Specifically, we introduce a Statistical Dynamic-balancing Learnable Scalar (DLS) to dynamically adjust the quantization mapping range, thereby mitigating the peak truncation loss caused by extreme values. Furthermore, we design a Range-floating Flexible Allocator (RFA) with an adaptive threshold to flexibly round values. This approach preserves high-frequency details and maintains the SSM's feature extraction capability. Notably, RFA also enables pre-deployment weight quantization, striking a balance between computational efficiency and model accuracy. Extensive experiments on IR tasks demonstrate that Q-MambaIR consistently outperforms existing quantized SSMs, achieving much higher state-of-the-art (SOTA) accuracy results with only a negligible increase in training computation and storage saving.
中文摘要:Q-MambaIR是一种用于图像复原的量化Mamba模型,通过动态标度调整和浮动范围分配技术,在超低比特位宽下保持高精度,以微小计算代价显著超越现有量化方法。
English Summary: Q-MambaIR is a quantized Mamba model for image restoration that introduces dynamic scaling and flexible allocation techniques to maintain high accuracy at ultra-low bit-widths, outperforming existing methods with minimal computational overhead.

Authors:Zhenhong Hu, Esra Adiyeke, Ziyuan Guan, Divya Vellanki, Jiahang Yu, Ruilin Zhu, Yuanfang Ren, Yingbo Ma, Annanya Sai Vedala, Tezcan Ozrazgat-Baslanti, Azra Bihorac
Title: Unlocking Health Insights with SDoH Data: A Comprehensive Open-Access Database and SDoH-EHR Linkage Tool
Abstract:
Background: Social determinants of health (SDoH) play a crucial role in influencing health outcomes, accounting for nearly 50% of modifiable health factors and bringing to light critical disparities among disadvantaged groups. Despite the significant impact of SDoH, existing data resources often fall short in terms of comprehensiveness, integration, and usability. Methods: To address these gaps, we developed an extensive Exposome database and a corresponding web application, aimed at enhancing data usability and integration with electronic health record (EHR) to foster personalized and informed healthcare. We created a robust database consisting of a wide array of SDoH indicators and an automated linkage tool designed to facilitate effortless integration with EHR. We emphasized a user-friendly interface to cater to researchers, clinicians, and public health professionals. Results: The resultant Exposome database and web application offer an extensive data catalog with enhanced usability features. The automated linkage tool has demonstrated efficiency in integrating SDoH data with EHRs, significantly improving data accessibility. Initial deployment has confirmed scalability and robust spatial data relationships, facilitating precise and contextually relevant healthcare insights. Conclusion: The development of an advanced Exposome database and linkage tool marks a significant step toward enhancing the accessibility and usability of SDoH data. By centralizing and integrating comprehensive SDoH indicators with EHRs, this tool empowers a wide range of users to access high-quality, standardized data. This resource will have a lasting impact on personalized healthcare and equitable health landscape.
中文摘要:本研究开发了一个暴露组数据库和网络应用程序,旨在加强社会健康决定因素数据与电子健康记录的整合和可用性,提高了数据可及性以促进个性化医疗并解决健康不平等问题。
English Summary: This study developed an Exposome database and web application to enhance the integration and usability of social determinants of health (SDoH) data with electronic health records, improving accessibility for personalized healthcare and addressing health disparities.

Authors:Chengjie Ge, Xueyang Fu, Peng He, Kunyu Wang, Chengzhi Cao, Zheng-Jun Zha
Title: EventMamba: Enhancing Spatio-Temporal Locality with State Space Models for Event-Based Video Reconstruction
Abstract:
Leveraging its robust linear global modeling capability, Mamba has notably excelled in computer vision. Despite its success, existing Mamba-based vision models have overlooked the nuances of event-driven tasks, especially in video reconstruction. Event-based video reconstruction (EBVR) demands spatial translation invariance and close attention to local event relationships in the spatio-temporal domain. Unfortunately, conventional Mamba algorithms apply static window partitions and standard reshape scanning methods, leading to significant losses in local connectivity. To overcome these limitations, we introduce EventMamba--a specialized model designed for EBVR tasks. EventMamba innovates by incorporating random window offset (RWO) in the spatial domain, moving away from the restrictive fixed partitioning. Additionally, it features a new consistent traversal serialization approach in the spatio-temporal domain, which maintains the proximity of adjacent events both spatially and temporally. These enhancements enable EventMamba to retain Mamba's robust modeling capabilities while significantly preserving the spatio-temporal locality of event data. Comprehensive testing on multiple datasets shows that EventMamba markedly enhances video reconstruction, drastically improving computation speed while delivering superior visual quality compared to Transformer-based methods.
中文: EventMamba通过引入随机窗口偏移和一致遍历序列化方法,克服了现有Mamba模型在基于事件的视频重建中的局限,相比基于Transformer的方法显著提升了计算速度和视觉质量。
English: EventMamba overcomes the limitations of existing Mamba models in event-based video reconstruction by introducing random window offsets and consistent traversal serialization, significantly improving computational speed and visual quality compared to Transformer-based approaches.

Authors:Saurav Sharma, Didier Mutter, Nicolas Padoy
Title: fine-CLIP: Enhancing Zero-Shot Fine-Grained Surgical Action Recognition with Vision-Language Models
Abstract:
While vision-language models like CLIP have advanced zero-shot surgical phase recognition, they struggle with fine-grained surgical activities, especially action triplets. This limitation arises because current CLIP formulations rely on global image features, which overlook the fine-grained semantics and contextual details crucial for complex tasks like zero-shot triplet recognition. Furthermore, these models do not explore the hierarchical structure inherent in triplets, reducing their ability to generalize to novel triplets. To address these challenges, we propose fine-CLIP, which learns object-centric features and leverages the hierarchy in triplet formulation. Our approach integrates three components: hierarchical prompt modeling to capture shared semantics, LoRA-based vision backbone adaptation for enhanced feature extraction, and a graph-based condensation strategy that groups similar patch features into meaningful object clusters. Since triplet classification is a challenging task, we introduce an alternative yet meaningful base-to-novel generalization benchmark with two settings on the CholecT50 dataset: Unseen-Target, assessing adaptability to triplets with novel anatomical structures, and Unseen-Instrument-Verb, where models need to generalize to novel instrument-verb interactions. fine-CLIP shows significant improvements in F1 and mAP, enhancing zero-shot recognition of novel surgical triplets.
中文: 提出的fine-CLIP模型通过整合以对象为中心的特征和层次结构,克服了CLIP在零样本外科手术三元组识别中的不足,在CholecT50数据集的新手术场景中展现出卓越性能。
English: The proposed fine-CLIP model overcomes CLIP's limitations in zero-shot surgical triplet recognition by incorporating object-centric features and hierarchical structures, demonstrating superior performance on novel surgical scenarios in the CholecT50 dataset.

Authors:Ninghui Feng, Songning Lai, Xin Zhou, Jiayu Yang, Kunlong Feng, Zhenxiao Yin, Fobao Zhou, Zhangyi Hu, Yutao Yue, Yuxuan Liang, Boyu Wang, Hang Zhao
Title: Towards Reliable Time Series Forecasting under Future Uncertainty: Ambiguity and Novelty Rejection Mechanisms
Abstract:
In real-world time series forecasting, uncertainty and lack of reliable evaluation pose significant challenges. Notably, forecasting errors often arise from underfitting in-distribution data and failing to handle out-of-distribution inputs. To enhance model reliability, we introduce a dual rejection mechanism combining ambiguity and novelty rejection. Ambiguity rejection, using prediction error variance, allows the model to abstain under low confidence, assessed through historical error variance analysis without future ground truth. Novelty rejection, employing Variational Autoencoders and Mahalanobis distance, detects deviations from training data. This dual approach improves forecasting reliability in dynamic environments by reducing errors and adapting to data changes, advancing reliability in complex scenarios.
中文: 该研究提出的双重拒绝机制结合基于预测误差方差的模糊拒绝和使用变分自编码器的新颖性拒绝,有效减少动态环境中的预测误差,提升了时间序列预测的可靠性。
English: The proposed dual rejection mechanism enhances forecasting reliability by combining ambiguity rejection based on prediction error variance and novelty rejection using Variational Autoencoders to reduce errors in dynamic environments.

Authors:Xueyao Zhang, Bo Yang, Xuelin Cao, Zhiwen Yu, George C. Alexandropoulos, Yan Zhang, Merouane Debbah, Chau Yuen
Title: Multi-Agent Deep Reinforcement Learning for Safe Autonomous Driving with RICS-Assisted MEC
Abstract:
Environment sensing and fusion via onboard sensors are envisioned to be widely applied in future autonomous driving networks. This paper considers a vehicular system with multiple self-driving vehicles that is assisted by multi-access edge computing (MEC), where image data collected by the sensors is offloaded from cellular vehicles to the MEC server using vehicle-to-infrastructure (V2I) links. Sensory data can also be shared among surrounding vehicles via vehicle-to-vehicle (V2V) communication links. To improve spectrum utilization, the V2V links may reuse the same frequency spectrum with V2I links, which may cause severe interference. To tackle this issue, we leverage reconfigurable intelligent computational surfaces (RICSs) to jointly enable V2I reflective links and mitigate interference appearing at the V2V links. Considering the limitations of traditional algorithms in addressing this problem, such as the assumption for quasi-static channel state information, which restricts their ability to adapt to dynamic environmental changes and leads to poor performance under frequently varying channel conditions, in this paper, we formulate the problem at hand as a Markov game. Our novel formulation is applied to time-varying channels subject to multi-user interference and introduces a collaborative learning mechanism among users. The considered optimization problem is solved via a driving safety-enabled multi-agent deep reinforcement learning (DS-MADRL) approach that capitalizes on the RICS presence. Our extensive numerical investigations showcase that the proposed reinforcement learning approach achieves faster convergence and significant enhancements in both data rate and driving safety, as compared to various state-of-the-art benchmarks.
中文摘要:本文提出了一种基于可重构智能计算表面的驾驶安全多智能体深度强化学习方法,用于优化自动驾驶网络中车对基础设施与车对车通信的频谱共享问题,在数据速率和安全性能方面均展现出显著优势。
English Summary: This paper proposes a driving safety-enabled multi-agent deep reinforcement learning approach that leverages reconfigurable intelligent computational surfaces to optimize spectrum sharing between vehicle-to-infrastructure and vehicle-to-vehicle communications in autonomous driving networks, demonstrating superior performance in data rate and safety metrics.

Authors:Yifei Zhang, Chang Liu, Jin Wei, Xiaomeng Yang, Yu Zhou, Can Ma, Xiangyang Ji
Title: Linguistics-aware Masked Image Modeling for Self-supervised Scene Text Recognition
Abstract:
Text images are unique in their dual nature, encompassing both visual and linguistic information. The visual component encompasses structural and appearance-based features, while the linguistic dimension incorporates contextual and semantic elements. In scenarios with degraded visual quality, linguistic patterns serve as crucial supplements for comprehension, highlighting the necessity of integrating both aspects for robust scene text recognition (STR). Contemporary STR approaches often use language models or semantic reasoning modules to capture linguistic features, typically requiring large-scale annotated datasets. Self-supervised learning, which lacks annotations, presents challenges in disentangling linguistic features related to the global context. Typically, sequence contrastive learning emphasizes the alignment of local features, while masked image modeling (MIM) tends to exploit local structures to reconstruct visual patterns, resulting in limited linguistic knowledge. In this paper, we propose a Linguistics-aware Masked Image Modeling (LMIM) approach, which channels the linguistic information into the decoding process of MIM through a separate branch. Specifically, we design a linguistics alignment module to extract vision-independent features as linguistic guidance using inputs with different visual appearances. As features extend beyond mere visual structures, LMIM must consider the global context to achieve reconstruction. Extensive experiments on various benchmarks quantitatively demonstrate our state-of-the-art performance, and attention visualizations qualitatively show the simultaneous capture of both visual and linguistic information.
中文摘要:本文提出的语言感知掩码图像建模(LMIM)方法通过独立分支将语言信息融入视觉重建过程,利用语言学对齐模块提取与视觉无关的特征作为指导,在多个基准测试中实现了最优性能。
English Summary: The proposed Linguistics-aware Masked Image Modeling (LMIM) method enhances scene text recognition by integrating linguistic context into visual reconstruction through a dedicated alignment module, achieving state-of-the-art results across multiple benchmarks.

Authors:Jiaxin Huang, Runnan Chen, Ziwen Li, Zhengqing Gao, Xiao He, Yandong Guo, Mingming Gong, Tongliang Liu
Title: MLLM-For3D: Adapting Multimodal Large Language Model for 3D Reasoning Segmentation
Abstract:
Reasoning segmentation aims to segment target objects in complex scenes based on human intent and spatial reasoning. While recent multimodal large language models (MLLMs) have demonstrated impressive 2D image reasoning segmentation, adapting these capabilities to 3D scenes remains underexplored. In this paper, we introduce MLLM-For3D, a simple yet effective framework that transfers knowledge from 2D MLLMs to 3D scene understanding. Specifically, we utilize MLLMs to generate multi-view pseudo segmentation masks and corresponding text embeddings, then unproject 2D masks into 3D space and align them with the text embeddings. The primary challenge lies in the absence of 3D context and spatial consistency across multiple views, causing the model to hallucinate objects that do not exist and fail to target objects consistently. Training the 3D model with such irrelevant objects leads to performance degradation. To address this, we introduce a spatial consistency strategy to enforce that segmentation masks remain coherent in the 3D space, effectively capturing the geometry of the scene. Moreover, we develop a Token-for-Query approach for multimodal semantic alignment, enabling consistent identification of the same object across different views. Extensive evaluations on various challenging indoor scene benchmarks demonstrate that, even without any labeled 3D training data, MLLM-For3D outperforms existing 3D reasoning segmentation methods, effectively interpreting user intent, understanding 3D scenes, and reasoning about spatial relationships.
中文摘要:MLLM-For3D通过多视角掩码生成与空间一致性策略,将二维多模态推理能力迁移至三维分割任务,在无需标注三维数据的情况下实现了最优性能。
English Summary: MLLM-For3D transfers 2D multimodal reasoning to 3D segmentation through multi-view mask generation and spatial consistency strategies, achieving state-of-the-art performance without labeled 3D training data.

Authors:Yanan Ma, Senkang Hu, Zhengru Fang, Yun Ji, Yiqin Deng, Yuguang Fang
Title: Sense4FL: Vehicular Crowdsensing Enhanced Federated Learning for Object Detection in Autonomous Driving
Abstract:
To accommodate constantly changing road conditions, real-time vision model training is essential for autonomous driving (AD). Federated learning (FL) serves as a promising paradigm to enable autonomous vehicles to train models collaboratively with their onboard computing resources. However, existing vehicle selection schemes for FL all assume predetermined and location-independent vehicles' datasets, neglecting the fact that vehicles collect training data along their routes, thereby resulting in suboptimal vehicle selection. In this paper, we focus on the fundamental perception problem and propose Sense4FL, a vehicular crowdsensing-enhanced FL framework featuring \textit{trajectory-dependent} vehicular \textit{training data collection} to \rev{improve the object detection quality} in AD for a region. To this end, we first derive the convergence bound of FL by considering the impact of both vehicles' uncertain trajectories and uploading probabilities, from which we discover that minimizing the training loss is equivalent to minimizing a weighted sum of local and global earth mover's distance (EMD) between vehicles' collected data distribution and global data distribution. Based on this observation, we formulate the trajectory-dependent vehicle selection and data collection problem for FL in AD. Given that the problem is NP-hard, we develop an efficient algorithm to find the solution with an approximation guarantee. Extensive simulation results have demonstrated the effectiveness of our approach in improving object detection performance compared with existing benchmarks.
中文摘要:本文提出Sense4FL框架,通过基于车辆轨迹的数据采集优化联邦学习中的车辆选择策略,有效提升自动驾驶系统的物体检测性能。
English Summary: This paper introduces Sense4FL, a federated learning framework that enhances autonomous driving perception by selecting vehicles based on their trajectories to optimize data collection and improve object detection accuracy.

Authors:Shu Pu, Yaochen Wang, Dongping Chen, Yuhang Chen, Guohao Wang, Qi Qin, Zhongyi Zhang, Zhiyuan Zhang, Zetong Zhou, Shuang Gong, Yi Gui, Yao Wan, Philip S. Yu
Title: Judge Anything: MLLM as a Judge Across Any Modality
Abstract:
Evaluating generative foundation models on open-ended multimodal understanding (MMU) and generation (MMG) tasks across diverse modalities (e.g., images, audio, video) poses significant challenges due to the complexity of cross-modal interactions. To this end, the idea of utilizing Multimodal LLMs (MLLMs) as automated judges has emerged, with encouraging results in assessing vision-language understanding tasks. Moving further, this paper extends MLLM-as-a-Judge across modalities to a unified manner by introducing two benchmarks, TaskAnything and JudgeAnything, to respectively evaluate the overall performance and judging capabilities of MLLMs across any-to-any modality tasks. Specifically, TaskAnything evaluates the MMU and MMG capabilities across 15 any-to-any modality categories, employing 1,500 queries curated from well-established benchmarks. Furthermore, JudgeAnything evaluates the judging capabilities of 5 advanced (e.g., GPT-4o and Gemini-2.0-Flash) from the perspectives of Pair Comparison and Score Evaluation, providing a standardized testbed that incorporates human judgments and detailed rubrics. Our extensive experiments reveal that while these MLLMs show promise in assessing MMU (i.e., achieving an average of 66.55% in Pair Comparison setting and 42.79% in Score Evaluation setting), they encounter significant challenges with MMG tasks (i.e., averaging only 53.37% in Pair Comparison setting and 30.05% in Score Evaluation setting), exposing cross-modality biases and hallucination issues. To address this, we present OmniArena, an automated platform for evaluating omni-models and multimodal reward models. Our work highlights the need for fairer evaluation protocols and stronger alignment with human preferences. The source code and dataset are publicly available at: https://urrealhero.github.io/judgeanythingweb/.
中文: 本文提出TaskAnything和JudgeAnything基准,用于评估多模态大语言模型在任意模态任务中的表现与评判能力,发现其在多模态理解方面表现良好但在生成任务中存在显著挑战,揭示了跨模态偏见问题。
English: This paper introduces TaskAnything and JudgeAnything benchmarks to evaluate multimodal LLMs' performance and judging capabilities across any-to-any modality tasks, revealing their strengths in multimodal understanding but significant challenges in generation tasks due to cross-modality biases.

Authors:Zhengqing Gao, Dongting Hu, Jia-Wang Bian, Huan Fu, Yan Li, Tongliang Liu, Mingming Gong, Kun Zhang
Title: ProtoGS: Efficient and High-Quality Rendering with 3D Gaussian Prototypes
Abstract:
3D Gaussian Splatting (3DGS) has made significant strides in novel view synthesis but is limited by the substantial number of Gaussian primitives required, posing challenges for deployment on lightweight devices. Recent methods address this issue by compressing the storage size of densified Gaussians, yet fail to preserve rendering quality and efficiency. To overcome these limitations, we propose ProtoGS to learn Gaussian prototypes to represent Gaussian primitives, significantly reducing the total Gaussian amount without sacrificing visual quality. Our method directly uses Gaussian prototypes to enable efficient rendering and leverage the resulting reconstruction loss to guide prototype learning. To further optimize memory efficiency during training, we incorporate structure-from-motion (SfM) points as anchor points to group Gaussian primitives. Gaussian prototypes are derived within each group by clustering of K-means, and both the anchor points and the prototypes are optimized jointly. Our experiments on real-world and synthetic datasets prove that we outperform existing methods, achieving a substantial reduction in the number of Gaussians, and enabling high rendering speed while maintaining or even enhancing rendering fidelity.
中文: ProtoGS通过引入高斯原型来大幅减少3D高斯泼溅所需的高斯基元数量,在保持甚至提升渲染质量的同时实现了高效渲染。
English: ProtoGS introduces Gaussian prototypes to significantly reduce the number of Gaussian primitives in 3D Gaussian Splatting, enabling efficient rendering while preserving or enhancing visual quality.

Authors:Xiaoyang Wu, Daniel DeTone, Duncan Frost, Tianwei Shen, Chris Xie, Nan Yang, Jakob Engel, Richard Newcombe, Hengshuang Zhao, Julian Straub
Title: Sonata: Self-Supervised Learning of Reliable Point Representations
Abstract:
In this paper, we question whether we have a reliable self-supervised point cloud model that can be used for diverse 3D tasks via simple linear probing, even with limited data and minimal computation. We find that existing 3D self-supervised learning approaches fall short when evaluated on representation quality through linear probing. We hypothesize that this is due to what we term the "geometric shortcut", which causes representations to collapse to low-level spatial features. This challenge is unique to 3D and arises from the sparse nature of point cloud data. We address it through two key strategies: obscuring spatial information and enhancing the reliance on input features, ultimately composing a Sonata of 140k point clouds through self-distillation. Sonata is simple and intuitive, yet its learned representations are strong and reliable: zero-shot visualizations demonstrate semantic grouping, alongside strong spatial reasoning through nearest-neighbor relationships. Sonata demonstrates exceptional parameter and data efficiency, tripling linear probing accuracy (from 21.8% to 72.5%) on ScanNet and nearly doubling performance with only 1% of the data compared to previous approaches. Full fine-tuning further advances SOTA across both 3D indoor and outdoor perception tasks.
中文: 本文提出Sonata自监督点云模型,通过遮蔽空间信息和增强特征依赖来克服三维学习中的"几何捷径"问题,在多种任务中显著提升了线性探测精度和数据效率。
English: This paper introduces Sonata, a self-supervised point cloud model that overcomes the "geometric shortcut" limitation in 3D learning by obscuring spatial information and enhancing feature reliance, achieving significant improvements in linear probing accuracy and data efficiency across diverse tasks.

Authors:Pin-Jie Lin, Ernie Chang, Yangyang Shi, Vikas Chandra
Title: Self-Vocabularizing Training for Neural Machine Translation
Abstract:
Past vocabulary learning techniques identify relevant vocabulary before training, relying on statistical and entropy-based assumptions that largely neglect the role of model training. Empirically, we observe that trained translation models are induced to use a byte-pair encoding (BPE) vocabulary subset distinct from the original BPE vocabulary, leading to performance improvements when retrained with the induced vocabulary. In this paper, we analyze this discrepancy in neural machine translation by examining vocabulary and entropy shifts during self-training--where each iteration generates a labeled dataset by pairing source sentences with the model's predictions to define a new vocabulary. Building on these insights, we propose self-vocabularizing training, an iterative method that self-selects a smaller, more optimal vocabulary, yielding up to a 1.49 BLEU improvement. Moreover, we find that deeper model architectures lead to both an increase in unique token usage and a 6-8% reduction in vocabulary size.
中文: 本文提出自词汇化训练方法,通过迭代优化模型训练过程中的词汇表,在减少6-8%词汇量的同时增加独特词符使用,实现了最高1.49 BLEU值的性能提升。
English: This paper introduces self-vocabularizing training, an iterative method that dynamically optimizes vocabulary during model training, achieving up to a 1.49 BLEU improvement by reducing vocabulary size by 6-8% while increasing unique token usage.

Authors:Forouzan Fallah, Maitreya Patel, Agneet Chatterjee, Vlad I. Morariu, Chitta Baral, Yezhou Yang
Title: TextInVision: Text and Prompt Complexity Driven Visual Text Generation Benchmark
Abstract:
Generating images with embedded text is crucial for the automatic production of visual and multimodal documents, such as educational materials and advertisements. However, existing diffusion-based text-to-image models often struggle to accurately embed text within images, facing challenges in spelling accuracy, contextual relevance, and visual coherence. Evaluating the ability of such models to embed text within a generated image is complicated due to the lack of comprehensive benchmarks. In this work, we introduce TextInVision, a large-scale, text and prompt complexity driven benchmark designed to evaluate the ability of diffusion models to effectively integrate visual text into images. We crafted a diverse set of prompts and texts that consider various attributes and text characteristics. Additionally, we prepared an image dataset to test Variational Autoencoder (VAE) models across different character representations, highlighting that VAE architectures can also pose challenges in text generation within diffusion frameworks. Through extensive analysis of multiple models, we identify common errors and highlight issues such as spelling inaccuracies and contextual mismatches. By pinpointing the failure points across different prompts and texts, our research lays the foundation for future advancements in AI-generated multimodal content.
中文: 本文提出了TextInVision基准,用于评估扩散模型在图像中嵌入文本的能力,通过识别拼写错误和上下文不匹配等常见问题,为AI生成多模态内容的未来发展奠定基础。
English: This paper introduces TextInVision, a comprehensive benchmark for evaluating diffusion models' ability to embed text in images, identifying common errors like spelling inaccuracies and contextual mismatches to advance AI-generated multimodal content.

Authors:Alexander Ku, Declan Campbell, Xuechunzi Bai, Jiayi Geng, Ryan Liu, Raja Marjieh, R. Thomas McCoy, Andrew Nam, Ilia Sucholutsky, Veniamin Veselovsky, Liyi Zhang, Jian-Qiao Zhu, Thomas L. Griffiths
Title: Levels of Analysis for Large Language Models
Abstract:
Modern artificial intelligence systems, such as large language models, are increasingly powerful but also increasingly hard to understand. Recognizing this problem as analogous to the historical difficulties in understanding the human mind, we argue that methods developed in cognitive science can be useful for understanding large language models. We propose a framework for applying these methods based on the levels of analysis that David Marr proposed for studying information processing systems. By revisiting established cognitive science techniques relevant to each level and illustrating their potential to yield insights into the behavior and internal organization of large language models, we aim to provide a toolkit for making sense of these new kinds of minds.
中文: 该摘要提出借鉴认知科学方法及David Marr的分析框架来理解大语言模型,旨在提供一套解析这类新型智能系统行为特征与内部机制的实用工具。
English: This abstract proposes using cognitive science methods and David Marr's analytical framework to understand large language models, offering a toolkit to decipher their complex behaviors and internal structures.

Authors:Nassim Ali Ousalah, Anis Kacem, Enjie Ghorbel, Emmanuel Koumandakis, Djamila Aouada
Title: Uncertainty-Aware Knowledge Distillation for Compact and Efficient 6DoF Pose Estimation
Abstract:
Compact and efficient 6DoF object pose estimation is crucial in applications such as robotics, augmented reality, and space autonomous navigation systems, where lightweight models are critical for real-time accurate performance. This paper introduces a novel uncertainty-aware end-to-end Knowledge Distillation (KD) framework focused on keypoint-based 6DoF pose estimation. Keypoints predicted by a large teacher model exhibit varying levels of uncertainty that can be exploited within the distillation process to enhance the accuracy of the student model while ensuring its compactness. To this end, we propose a distillation strategy that aligns the student and teacher predictions by adjusting the knowledge transfer based on the uncertainty associated with each teacher keypoint prediction. Additionally, the proposed KD leverages this uncertainty-aware alignment of keypoints to transfer the knowledge at key locations of their respective feature maps. Experiments on the widely-used LINEMOD benchmark demonstrate the effectiveness of our method, achieving superior 6DoF object pose estimation with lightweight models compared to state-of-the-art approaches. Further validation on the SPEED+ dataset for spacecraft pose estimation highlights the robustness of our approach under diverse 6DoF pose estimation scenarios.
本文提出了一种不确定性感知的知识蒸馏框架,通过基于关键点预测不确定性自适应地从教师模型转移知识,从而提升轻量级6DoF姿态估计模型的精度。
This paper presents an uncertainty-aware knowledge distillation framework that enhances the accuracy of lightweight 6DoF pose estimation models by adaptively transferring knowledge from a teacher model based on keypoint prediction uncertainties.

Authors:Xinyu Tang, Xiaolei Wang, Zhihao Lv, Yingqian Min, Wayne Xin Zhao, Binbin Hu, Ziqi Liu, Zhiqiang Zhang
Title: Unlocking General Long Chain-of-Thought Reasoning Capabilities of Large Language Models via Representation Engineering
Abstract:
Recent advancements in long chain-of-thoughts(long CoTs) have significantly improved the reasoning capabilities of large language models(LLMs). Existing work finds that the capability of long CoT reasoning can be efficiently elicited by tuning on only a few examples and can easily transfer to other tasks. This motivates us to investigate whether long CoT reasoning is a general capability for LLMs. In this work, we conduct an empirical analysis for this question from the perspective of representation. We find that LLMs do encode long CoT reasoning as a general capability, with a clear distinction from vanilla CoTs. Furthermore, domain-specific representations are also required for the effective transfer of long CoT reasoning. Inspired by these findings, we propose GLoRE, a novel representation engineering method to unleash the general long CoT reasoning capabilities of LLMs. Extensive experiments demonstrate the effectiveness and efficiency of GLoRE in both in-domain and cross-domain scenarios.
Chinese: 最新研究表明,长思维链推理是大语言模型中编码的一种通用能力,与标准推理不同,而提出的GLoRE方法能有效激活该能力,在不同任务中提升性能。
English: Recent research reveals that long chain-of-thoughts reasoning is a general capability encoded in large language models, distinct from standard reasoning, and the proposed GLoRE method effectively activates this capability for improved performance across various tasks.

Authors:Bryan Wilie, Samuel Cahyawijaya, Junxian He, Pascale Fung
Title: High-Dimensional Interlingual Representations of Large Language Models
Abstract:
Large language models (LLMs) trained on massive multilingual datasets hint at the formation of interlingual constructs--a shared subspace in the representation space. However, evidence regarding this phenomenon is mixed, leaving it unclear whether these models truly develop unified interlingual representations, or present a partially aligned constructs. We explore 31 diverse languages varying on their resource-levels, typologies, and geographical regions; and find that multilingual LLMs exhibit inconsistent cross-lingual alignments. To address this, we propose an interlingual representation framework identifying both the shared interlingual semantic subspace and fragmented components, existed due to representational limitations. We introduce Interlingual Local Overlap (ILO) score to quantify interlingual alignment by comparing the local neighborhood structures of high-dimensional representations. We utilize ILO to investigate the impact of single-language fine-tuning on the interlingual representations in multilingual LLMs. Our results indicate that training exclusively on a single language disrupts the alignment in early layers, while freezing these layers preserves the alignment of interlingual representations, leading to improved cross-lingual generalization. These results validate our framework and metric for evaluating interlingual representation, and further underscore that interlingual alignment is crucial for scalable multilingual learning.
中文摘要:多语言大语言模型表现出不一致的跨语言对齐,但提出的语际局部重叠评分和框架表明,在单语言微调期间保持早期层对齐可增强跨语言泛化能力。
English Summary: Multilingual large language models exhibit inconsistent cross-lingual alignments, but the proposed Interlingual Local Overlap score and framework reveal that preserving early layer alignment during single-language fine-tuning enhances cross-lingual generalization.

Authors:Chi Xu, Gefei Zhang, Yantong Zhu, Luca Benini, Guosheng Hu, Yawei Li, Zhihong Zhang
Title: Towards Extreme Pruning of LLMs with Plug-and-Play Mixed Sparsity
Abstract:
N:M structured pruning is essential for large language models (LLMs) because it can remove less important network weights and reduce the memory and computation requirements. Existing pruning methods mainly focus on designing metrics to measure the importance of network components to guide pruning. Apart from the impact of these metrics, we observe that different layers have different sensitivities over the network performance. Thus, we propose an efficient method based on the trace of Fisher Information Matrix (FIM) to quantitatively measure and verify the different sensitivities across layers. Based on this, we propose Mixed Sparsity Pruning (MSP) which uses a pruning-oriented evolutionary algorithm (EA) to determine the optimal sparsity levels for different layers. To guarantee fast convergence and achieve promising performance, we utilize efficient FIM-inspired layer-wise sensitivity to initialize the population of EA. In addition, our MSP can work as a plug-and-play module, ready to be integrated into existing pruning methods. Extensive experiments on LLaMA and LLaMA-2 on language modeling and zero-shot tasks demonstrate our superior performance. In particular, in extreme pruning ratio (e.g. 75%), our method significantly outperforms existing methods in terms of perplexity (PPL) by orders of magnitude (Figure 1).
Chinese: 本研究提出了混合稀疏剪枝(MSP)方法,通过Fisher信息矩阵的迹量化各层敏感性,并采用进化算法优化不同层的稀疏度,在极端剪枝条件下显著超越了现有方法的性能表现。
English: The study introduces Mixed Sparsity Pruning (MSP), an efficient method that leverages the trace of the Fisher Information Matrix to measure layer sensitivities and employs an evolutionary algorithm to optimize sparsity levels across different layers, significantly outperforming existing approaches in extreme pruning scenarios.

Authors:Yangyang Xie, Cheng Hu, Nicolas Baumann, Edoardo Ghignone, Michele Magno, Lei Xie
Title: GP-enhanced Autonomous Drifting Framework using ADMM-based iLQR
Abstract:
Autonomous drifting is a complex challenge due to the highly nonlinear dynamics and the need for precise real-time control, especially in uncertain environments. To address these limitations, this paper presents a hierarchical control framework for autonomous vehicles drifting along general paths, primarily focusing on addressing model inaccuracies and mitigating computational challenges in real-time control. The framework integrates Gaussian Process (GP) regression with an Alternating Direction Method of Multipliers (ADMM)-based iterative Linear Quadratic Regulator (iLQR). GP regression effectively compensates for model residuals, improving accuracy in dynamic conditions. ADMM-based iLQR not only combines the rapid trajectory optimization of iLQR but also utilizes ADMM's strength in decomposing the problem into simpler sub-problems. Simulation results demonstrate the effectiveness of the proposed framework, with significant improvements in both drift trajectory tracking and computational efficiency. Our approach resulted in a 38$\%$ reduction in RMSE lateral error and achieved an average computation time that is 75$\%$ lower than that of the Interior Point OPTimizer (IPOPT).
中文摘要:本文提出一种结合高斯过程回归与ADMM-iLQR的分层控制框架,用于提升自动驾驶车辆的漂移性能,实现了38%的横向误差降低和75%的计算速度提升。
English Summary: This paper introduces a hierarchical control framework combining Gaussian Process regression and ADMM-based iLQR to enhance autonomous vehicle drifting, achieving 38% lower lateral error and 75% faster computation than traditional methods.

Authors:Yixuan Zhang, Qing Chang, Yuxi Wang, Guang Chen, Zhaoxiang Zhang, Junran Peng
Title: EmoDiffusion: Enhancing Emotional 3D Facial Animation with Latent Diffusion Models
Abstract:
Speech-driven 3D facial animation seeks to produce lifelike facial expressions that are synchronized with the speech content and its emotional nuances, finding applications in various multimedia fields. However, previous methods often overlook emotional facial expressions or fail to disentangle them effectively from the speech content. To address these challenges, we present EmoDiffusion, a novel approach that disentangles different emotions in speech to generate rich 3D emotional facial expressions. Specifically, our method employs two Variational Autoencoders (VAEs) to separately generate the upper face region and mouth region, thereby learning a more refined representation of the facial sequence. Unlike traditional methods that use diffusion models to connect facial expression sequences with audio inputs, we perform the diffusion process in the latent space. Furthermore, we introduce an Emotion Adapter to evaluate upper face movements accurately. Given the paucity of 3D emotional talking face data in the animation industry, we capture facial expressions under the guidance of animation experts using LiveLinkFace on an iPhone. This effort results in the creation of an innovative 3D blendshape emotional talking face dataset (3D-BEF) used to train our network. Extensive experiments and perceptual evaluations validate the effectiveness of our approach, confirming its superiority in generating realistic and emotionally rich facial animations.
Chinese: EmoDiffusion是一种创新方法,通过使用两个变分自编码器分别处理上脸和嘴部区域,在潜在空间进行扩散,并引入情感适配器,有效分离语音中的情感以生成逼真的3D面部动画,其效果通过新型3D-BEF数据集和实验得到验证。
English: EmoDiffusion is a novel method that disentangles emotions from speech to generate realistic 3D facial animations by using two VAEs for upper face and mouth regions, performing diffusion in latent space, and introducing an Emotion Adapter, validated through a new 3D-BEF dataset and experiments.

Authors:Piyush Gupta, Sangjae Bae, David Isele
Title: Graph-Grounded LLMs: Leveraging Graphical Function Calling to Minimize LLM Hallucinations
Abstract:
The adoption of Large Language Models (LLMs) is rapidly expanding across various tasks that involve inherent graphical structures. Graphs are integral to a wide range of applications, including motion planning for autonomous vehicles, social networks, scene understanding, and knowledge graphs. Many problems, even those not initially perceived as graph-based, can be effectively addressed through graph theory. However, when applied to these tasks, LLMs often encounter challenges, such as hallucinations and mathematical inaccuracies. To overcome these limitations, we propose Graph-Grounded LLMs, a system that improves LLM performance on graph-related tasks by integrating a graph library through function calls. By grounding LLMs in this manner, we demonstrate significant reductions in hallucinations and improved mathematical accuracy in solving graph-based problems, as evidenced by the performance on the NLGraph benchmark. Finally, we showcase a disaster rescue application where the Graph-Grounded LLM acts as a decision-support system.
中文摘要:提出的图基大语言模型系统通过集成图库来增强大语言模型在图相关任务中的表现,有效减少幻觉并提高数学准确性,这在NLGraph基准测试中得到了验证。
English Summary: The proposed Graph-Grounded LLMs system enhances LLM performance on graph-related tasks by integrating a graph library, effectively reducing hallucinations and improving mathematical accuracy as demonstrated on the NLGraph benchmark.

Authors:Yang Zheng, Menglei Chai, Delio Vicini, Yuxiao Zhou, Yinghao Xu, Leonidas Guibas, Gordon Wetzstein, Thabo Beeler
Title: GroomLight: Hybrid Inverse Rendering for Relightable Human Hair Appearance Modeling
Abstract:
We present GroomLight, a novel method for relightable hair appearance modeling from multi-view images. Existing hair capture methods struggle to balance photorealistic rendering with relighting capabilities. Analytical material models, while physically grounded, often fail to fully capture appearance details. Conversely, neural rendering approaches excel at view synthesis but generalize poorly to novel lighting conditions. GroomLight addresses this challenge by combining the strengths of both paradigms. It employs an extended hair BSDF model to capture primary light transport and a light-aware residual model to reconstruct the remaining details. We further propose a hybrid inverse rendering pipeline to optimize both components, enabling high-fidelity relighting, view synthesis, and material editing. Extensive evaluations on real-world hair data demonstrate state-of-the-art performance of our method.
Chinese: GroomLight提出了一种混合方法,结合扩展头发BSDF模型处理主要光线传输和光感知残差模型重建细节,通过优化的逆向渲染流程实现了高保真的重光照与视角合成。
English: GroomLight introduces a hybrid approach combining an extended hair BSDF model for primary light transport and a light-aware residual model for detail reconstruction, enabling high-fidelity relighting and view synthesis through an optimized inverse rendering pipeline.

Authors:Jun Zhu, Yin Xu, Dazhi He, Haoyang Li, Yunfeng Guan, Wenjun Zhang, Tianyao Ma, Haozhi Yuan
Title: Efficient Precoding in XL-MIMO-AFDM System
Abstract:
This paper explores the potential of affine frequency division multiplexing (AFDM) to mitigate the multiuser interference (MUI) problem by employing time-domain precoding in extremely-large-scale multiple-input multiple-output (XL-MIMO) systems. In XL-MIMO systems, user mobility significantly improves network capacity and transmission quality. Meanwhile, the robustness of AFDM to Doppler shift is enhanced in user mobility scenarios, which further improves the system performance. However, the multicarrier nature of AFDM has attracted much attention, and it leads to a significant increase in precoding complexity. However, the serious problem is that the multicarrier use of AFDM leads to a sharp increase in precoding complexity. Therefore, we employ efficient precoding randomized Kaczmarz (rKA) to reduce the complexity overhead. Through simulation analysis, we compare the performance of XL-MIMO-AFDM and XL-MIMO orthogonal frequency division multiplexing (XL-MIMO-OFDM) in mobile scenarios, and the results show that our proposed AFDM-based XL-MIMO precoding design can be more efficient.
中文: 本文研究表明,在XL-MIMO系统中采用仿射频率分割复用(AFDM)结合时域预编码可有效减轻多用户干扰,通过随机Kaczmarz算法降低预编码复杂度,且在移动场景下性能优于正交频分复用(OFDM)方案。
English: This paper demonstrates that affine frequency division multiplexing (AFDM) with time-domain precoding effectively reduces multiuser interference in XL-MIMO systems, and employing the randomized Kaczmarz algorithm lowers precoding complexity while outperforming OFDM in mobile scenarios.

Authors:Xingxin Xu, Bing Cao, Yinan Xia, Pengfei Zhu, Qinghua Hu
Title: Dream-IF: Dynamic Relative EnhAnceMent for Image Fusion
Abstract:
Image fusion aims to integrate comprehensive information from images acquired through multiple sources. However, images captured by diverse sensors often encounter various degradations that can negatively affect fusion quality. Traditional fusion methods generally treat image enhancement and fusion as separate processes, overlooking the inherent correlation between them; notably, the dominant regions in one modality of a fused image often indicate areas where the other modality might benefit from enhancement. Inspired by this observation, we introduce the concept of dominant regions for image enhancement and present a Dynamic Relative EnhAnceMent framework for Image Fusion (Dream-IF). This framework quantifies the relative dominance of each modality across different layers and leverages this information to facilitate reciprocal cross-modal enhancement. By integrating the relative dominance derived from image fusion, our approach supports not only image restoration but also a broader range of image enhancement applications. Furthermore, we employ prompt-based encoding to capture degradation-specific details, which dynamically steer the restoration process and promote coordinated enhancement in both multi-modal image fusion and image enhancement scenarios. Extensive experimental results demonstrate that Dream-IF consistently outperforms its counterparts.
Chinese: Dream-IF框架提出了一种动态跨模态增强方法,利用图像模态间的相对主导性将修复与融合过程相结合,并通过基于提示的退化感知机制实现卓越性能。
English: The Dream-IF framework introduces a dynamic cross-modal enhancement approach that leverages relative dominance between image modalities to integrate restoration and fusion processes, achieving superior performance through degradation-aware prompts.

Authors:Jonathan Zheng, Sauvik Das, Alan Ritter, Wei Xu
Title: Probabilistic Reasoning with LLMs for k-anonymity Estimation
Abstract:
Probabilistic reasoning is a key aspect of both human and artificial intelligence that allows for handling uncertainty and ambiguity in decision-making. In this paper, we introduce a new numerical reasoning task under uncertainty for large language models, focusing on estimating the privacy risk of user-generated documents containing privacy-sensitive information. We propose BRANCH, a new LLM methodology that estimates the k-privacy value of a text-the size of the population matching the given information. BRANCH factorizes a joint probability distribution of personal information as random variables. The probability of each factor in a population is estimated separately using a Bayesian network and combined to compute the final k-value. Our experiments show that this method successfully estimates the k-value 73% of the time, a 13% increase compared to o3-mini with chain-of-thought reasoning. We also find that LLM uncertainty is a good indicator for accuracy, as high-variance predictions are 37.47% less accurate on average.
中文摘要:本文提出BRANCH这一新型大语言模型方法,通过贝叶斯网络分解联合概率分布来评估文本的k-隐私值,相比现有方法将准确率提升了13%。
English Summary: This paper introduces BRANCH, a novel large language model methodology that estimates the k-privacy value of texts by factorizing joint probability distributions using Bayesian networks, achieving a 13% improvement in accuracy over existing methods.

Authors:Jonathan Zheng, Sauvik Das, Alan Ritter, Wei Xu
Title: Probabilistic Reasoning with LLMs for k-anonymity Estimation
Abstract:
Probabilistic reasoning is a key aspect of both human and artificial intelligence that allows for handling uncertainty and ambiguity in decision-making. In this paper, we introduce a new numerical reasoning task under uncertainty for large language models, focusing on estimating the privacy risk of user-generated documents containing privacy-sensitive information. We propose BRANCH, a new LLM methodology that estimates the k-privacy value of a text-the size of the population matching the given information. BRANCH factorizes a joint probability distribution of personal information as random variables. The probability of each factor in a population is estimated separately using a Bayesian network and combined to compute the final k-value. Our experiments show that this method successfully estimates the k-value 73% of the time, a 13% increase compared to o3-mini with chain-of-thought reasoning. We also find that LLM uncertainty is a good indicator for accuracy, as high-variance predictions are 37.47% less accurate on average.
中文摘要:本文提出BRANCH这一新型大语言模型方法,通过贝叶斯网络分解联合概率分布来评估文本的k-隐私值,相比现有方法将准确率提升了13%。
English Summary: This paper introduces BRANCH, a novel large language model methodology that estimates the k-privacy value of texts by factorizing joint probability distributions using Bayesian networks, achieving a 13% improvement in accuracy over existing methods.

Authors:Ruohao Guo, Wei Xu, Alan Ritter
Title: How to Protect Yourself from 5G Radiation? Investigating LLM Responses to Implicit Misinformation
Abstract:
As Large Language Models (LLMs) are widely deployed in diverse scenarios, the extent to which they could tacitly spread misinformation emerges as a critical safety concern. Current research primarily evaluates LLMs on explicit false statements, overlooking how misinformation often manifests subtly as unchallenged premises in real-world interactions. We curated EchoMist, the first comprehensive benchmark for implicit misinformation, where false assumptions are embedded in the query to LLMs. EchoMist targets circulated, harmful, and ever-evolving implicit misinformation from diverse sources, including realistic human-AI conversations and social media interactions. Through extensive empirical studies on 15 state-of-the-art LLMs, we find that current models perform alarmingly poorly on this task, often failing to detect false premises and generating counterfactual explanations. We also investigate two mitigation methods, i.e., Self-Alert and RAG, to enhance LLMs' capability to counter implicit misinformation. Our findings indicate that EchoMist remains a persistent challenge and underscore the critical need to safeguard against the risk of implicit misinformation.
中文摘要:本研究推出首个评估大语言模型处理查询中隐含错误信息能力的基准EchoMist,发现现有模型表现欠佳,并提出应对这一关键安全漏洞的缓解方法。
English Summary: This study introduces EchoMist, the first benchmark for evaluating how large language models handle implicit misinformation embedded in queries, revealing that current models perform poorly and proposing mitigation methods to address this critical safety gap.

Authors:Ruohao Guo, Wei Xu, Alan Ritter
Title: How to Protect Yourself from 5G Radiation? Investigating LLM Responses to Implicit Misinformation
Abstract:
As Large Language Models (LLMs) are widely deployed in diverse scenarios, the extent to which they could tacitly spread misinformation emerges as a critical safety concern. Current research primarily evaluates LLMs on explicit false statements, overlooking how misinformation often manifests subtly as unchallenged premises in real-world interactions. We curated EchoMist, the first comprehensive benchmark for implicit misinformation, where false assumptions are embedded in the query to LLMs. EchoMist targets circulated, harmful, and ever-evolving implicit misinformation from diverse sources, including realistic human-AI conversations and social media interactions. Through extensive empirical studies on 15 state-of-the-art LLMs, we find that current models perform alarmingly poorly on this task, often failing to detect false premises and generating counterfactual explanations. We also investigate two mitigation methods, i.e., Self-Alert and RAG, to enhance LLMs' capability to counter implicit misinformation. Our findings indicate that EchoMist remains a persistent challenge and underscore the critical need to safeguard against the risk of implicit misinformation.
中文摘要:本研究推出首个评估大语言模型处理查询中隐含错误信息能力的基准EchoMist,发现现有模型表现欠佳,并提出应对这一关键安全漏洞的缓解方法。
English Summary: This study introduces EchoMist, the first benchmark for evaluating how large language models handle implicit misinformation embedded in queries, revealing that current models perform poorly and proposing mitigation methods to address this critical safety gap.

Authors:Ruihai Wu, Ziyu Zhu, Yuran Wang, Yue Chen, Jiarui Wang, Hao Dong
Title: GarmentPile: Point-Level Visual Affordance Guided Retrieval and Adaptation for Cluttered Garments Manipulation
Abstract:
Cluttered garments manipulation poses significant challenges due to the complex, deformable nature of garments and intricate garment relations. Unlike single-garment manipulation, cluttered scenarios require managing complex garment entanglements and interactions, while maintaining garment cleanliness and manipulation stability. To address these demands, we propose to learn point-level affordance, the dense representation modeling the complex space and multi-modal manipulation candidates, while being aware of garment geometry, structure, and inter-object relations. Additionally, as it is difficult to directly retrieve a garment in some extremely entangled clutters, we introduce an adaptation module, guided by learned affordance, to reorganize highly-entangled garments into states plausible for manipulation. Our framework demonstrates effectiveness over environments featuring diverse garment types and pile configurations in both simulation and the real world. Project page: https://garmentpile.github.io/.
中文: 本研究提出一种学习点级可供性框架,通过感知衣物几何与相互关系,在杂乱环境中有效重组纠缠衣物,实现稳定操作,适用于多种衣物类型与堆叠配置。
English: This study introduces a framework that learns point-level affordance to manage complex garment interactions in cluttered environments, effectively reorganizing entangled garments for stable manipulation across diverse scenarios.

Authors:Zhenchen Wan, Yanwu xu, Dongting Hu, Weilun Cheng, Tianxi Chen, Zhaoqing Wang, Feng Liu, Tongliang Liu, Mingming Gong
Title: MF-VITON: High-Fidelity Mask-Free Virtual Try-On with Minimal Input
Abstract:
Recent advancements in Virtual Try-On (VITON) have significantly improved image realism and garment detail preservation, driven by powerful text-to-image (T2I) diffusion models. However, existing methods often rely on user-provided masks, introducing complexity and performance degradation due to imperfect inputs, as shown in Fig.1(a). To address this, we propose a Mask-Free VITON (MF-VITON) framework that achieves realistic VITON using only a single person image and a target garment, eliminating the requirement for auxiliary masks. Our approach introduces a novel two-stage pipeline: (1) We leverage existing Mask-based VITON models to synthesize a high-quality dataset. This dataset contains diverse, realistic pairs of person images and corresponding garments, augmented with varied backgrounds to mimic real-world scenarios. (2) The pre-trained Mask-based model is fine-tuned on the generated dataset, enabling garment transfer without mask dependencies. This stage simplifies the input requirements while preserving garment texture and shape fidelity. Our framework achieves state-of-the-art (SOTA) performance regarding garment transfer accuracy and visual realism. Notably, the proposed Mask-Free model significantly outperforms existing Mask-based approaches, setting a new benchmark and demonstrating a substantial lead over previous approaches. For more details, visit our project page: https://zhenchenwan.github.io/MF-VITON/.
中文: 提出的无掩码虚拟试穿(MF-VITON)框架通过两阶段流程,合成真实数据集并微调预训练模型,无需用户提供掩码,在服装转移精度和视觉真实感上达到领先水平。
English: The proposed Mask-Free VITON (MF-VITON) framework eliminates the need for user-provided masks by using a two-stage pipeline that synthesizes a realistic dataset and fine-tunes a pre-trained model, achieving state-of-the-art performance in garment transfer accuracy and visual realism.

Authors:Han Cao, Lingwei Wei, Wei Zhou, Songlin Hu
Title: Enhancing Multi-Hop Fact Verification with Structured Knowledge-Augmented Large Language Models
Abstract:
The rapid development of social platforms exacerbates the dissemination of misinformation, which stimulates the research in fact verification. Recent studies tend to leverage semantic features to solve this problem as a single-hop task. However, the process of verifying a claim requires several pieces of evidence with complicated inner logic and relations to verify the given claim in real-world situations. Recent studies attempt to improve both understanding and reasoning abilities to enhance the performance, but they overlook the crucial relations between entities that benefit models to understand better and facilitate the prediction. To emphasize the significance of relations, we resort to Large Language Models (LLMs) considering their excellent understanding ability. Instead of other methods using LLMs as the predictor, we take them as relation extractors, for they do better in understanding rather than reasoning according to the experimental results. Thus, to solve the challenges above, we propose a novel Structured Knowledge-Augmented LLM-based Network (LLM-SKAN) for multi-hop fact verification. Specifically, we utilize an LLM-driven Knowledge Extractor to capture fine-grained information, including entities and their complicated relations. Besides, we leverage a Knowledge-Augmented Relation Graph Fusion module to interact with each node and learn better claim-evidence representations comprehensively. The experimental results on four common-used datasets demonstrate the effectiveness and superiority of our model.
Chinese: 本研究提出了一种新颖的结构化知识增强大语言模型网络(LLM-SKAN),利用大语言模型作为关系提取器来捕捉复杂的实体关系,通过整合结构化知识和改进声明-证据表征,从而提升了多跳事实核查的性能。
English: This study introduces a novel Structured Knowledge-Augmented LLM-based Network (LLM-SKAN) that leverages large language models as relation extractors to capture complex entity relationships, enhancing multi-hop fact verification by integrating structured knowledge and improving claim-evidence representation.

Authors:Maximilian Tölle, Theo Gruner, Daniel Palenicek, Jonas Günster, Puze Liu, Joe Watson, Davide Tateo, Jan Peters
Title: Towards Safe Robot Foundation Models
Abstract:
Robot foundation models hold the potential for deployment across diverse environments, from industrial applications to household tasks. While current research focuses primarily on the policies' generalization capabilities across a variety of tasks, it fails to address safety, a critical requirement for deployment on real-world systems. In this paper, we introduce a safety layer designed to constrain the action space of any generalist policy appropriately. Our approach uses ATACOM, a safe reinforcement learning algorithm that creates a safe action space and, therefore, ensures safe state transitions. By extending ATACOM to generalist policies, our method facilitates their deployment in safety-critical scenarios without requiring any specific safety fine-tuning. We demonstrate the effectiveness of this safety layer in an air hockey environment, where it prevents a puck-hitting agent from colliding with its surroundings, a failure observed in generalist policies.
中文: 本文基于ATACOM算法提出安全层机制,通过约束通用策略的动作空间确保状态转换安全,实现在安全关键场景中的直接部署,并在空气曲棍球环境中验证了其防碰撞有效性。
English: This paper introduces a safety layer based on the ATACOM algorithm to ensure safe state transitions for generalist robot policies, enabling their deployment in safety-critical scenarios without specific fine-tuning, as demonstrated in an air hockey environment.

Authors:Haoyue Dai, Ignavier Ng, Jianle Sun, Zeyu Tang, Gongxu Luo, Xinshuai Dong, Peter Spirtes, Kun Zhang
Title: When Selection Meets Intervention: Additional Complexities in Causal Discovery
Abstract:
We address the common yet often-overlooked selection bias in interventional studies, where subjects are selectively enrolled into experiments. For instance, participants in a drug trial are usually patients of the relevant disease; A/B tests on mobile applications target existing users only, and gene perturbation studies typically focus on specific cell types, such as cancer cells. Ignoring this bias leads to incorrect causal discovery results. Even when recognized, the existing paradigm for interventional causal discovery still fails to address it. This is because subtle differences in when and where interventions happen can lead to significantly different statistical patterns. We capture this dynamic by introducing a graphical model that explicitly accounts for both the observed world (where interventions are applied) and the counterfactual world (where selection occurs while interventions have not been applied). We characterize the Markov property of the model, and propose a provably sound algorithm to identify causal relations as well as selection mechanisms up to the equivalence class, from data with soft interventions and unknown targets. Through synthetic and real-world experiments, we demonstrate that our algorithm effectively identifies true causal relations despite the presence of selection bias.
中文摘要:本研究通过构建区分观测世界与反事实世界的图模型,有效解决了干预研究中的选择偏误问题,并提出经实验验证的可靠算法,实现了存在选择偏倚时的准确因果发现。
English Summary: This study tackles selection bias in interventional studies by introducing a graphical model that distinguishes between observed and counterfactual worlds, enabling accurate causal discovery through a proven algorithm validated in experiments.

Authors:Hyeonsoo Jo, Jongha Lee, Fanchen Bu, Kijung Shin
Title: TiGer: Self-Supervised Purification for Time-evolving Graphs
Abstract:
Time-evolving graphs, such as social and citation networks, often contain noise that distorts structural and temporal patterns, adversely affecting downstream tasks, such as node classification. Existing purification methods focus on static graphs, limiting their ability to account for critical temporal dependencies in dynamic graphs. In this work, we propose TiGer (Time-evolving Graph purifier), a self-supervised method explicitly designed for time-evolving graphs. TiGer assigns two different sub-scores to edges using (1) self-attention for capturing long-term contextual patterns shaped by both adjacent and distant past events of varying significance and (2) statistical distance measures for detecting inconsistency over a short-term period. These sub-scores are used to identify and filter out suspicious (i.e., noise-like) edges through an ensemble strategy, ensuring robustness without requiring noise labels. Our experiments on five real-world datasets show TiGer filters out noise with up to 10.2% higher accuracy and improves node classification performance by up to 5.3%, compared to state-of-the-art methods.
Chinese: TiGer是一种自监督方法,通过结合捕捉长期上下文的自注意力机制和检测短期不一致的统计度量,有效净化时序演化图,在无需噪声标签的情况下实现了更优的噪声过滤和节点分类性能提升。
English: TiGer is a self-supervised method that purifies time-evolving graphs by combining self-attention for long-term context and statistical measures for short-term inconsistency, achieving superior noise removal and node classification improvements without requiring noise labels.

Authors:Weihao Cui, Ziyi Xu, Han Zhao, Quan Chen, Zijun Li, Bingsheng He, Minyi Guo
Title: Efficient Function-as-a-Service for Large Language Models with TIDAL
Abstract:
Large Language Model (LLM) applications have emerged as a prominent use case for Function-as-a-Service (FaaS) due to their high computational demands and sporadic invocation patterns. However, serving LLM functions within FaaS frameworks faces significant GPU-side cold start. A fundamental approach involves leveraging a template with function state saved on GPUs to bypass the cold start for new invocations. Yet, this approach struggles with the high GPU footprint, dynamic initialization behaviors, and lazy GPU kernel loading inherent in LLM functions, primarily due to a lack of insight into the underlying execution details. In this paper, we introduce TIDAL, an optimized FaaS framework for LLM applications that achieves fast startups by tracing fine-grained execution paths. By utilizing the traced execution details, TIDAL generates adaptive function templates, effectively breaking startup barriers for LLM functions. Extensive evaluations demonstrate that TIDAL reduces cold start latency by $1.79\times\text{\textasciitilde}2.11\times$ and improves the $95\%$-ile time-to-first-token by $76.0\%$, surpassing state-of-the-art methods.
Chinese: TIDAL作为一种优化的FaaS框架,通过追踪细粒度执行路径生成自适应模板,显著降低LLM函数的冷启动延迟达2.11倍,并将首令牌响应时间提升76.0%。
English: TIDAL is an optimized FaaS framework that reduces LLM function cold starts by tracing execution paths and generating adaptive templates, cutting latency by up to 2.11× and improving first-token response time by 76.0%.

Authors:Xuexin Chen, Ruichu Cai, Zhengting Huang, Zijian Li, Jie Zheng, Min Wu
Title: Interpretable High-order Knowledge Graph Neural Network for Predicting Synthetic Lethality in Human Cancers
Abstract:
Synthetic lethality (SL) is a promising gene interaction for cancer therapy. Recent SL prediction methods integrate knowledge graphs (KGs) into graph neural networks (GNNs) and employ attention mechanisms to extract local subgraphs as explanations for target gene pairs. However, attention mechanisms often lack fidelity, typically generate a single explanation per gene pair, and fail to ensure trustworthy high-order structures in their explanations. To overcome these limitations, we propose Diverse Graph Information Bottleneck for Synthetic Lethality (DGIB4SL), a KG-based GNN that generates multiple faithful explanations for the same gene pair and effectively encodes high-order structures. Specifically, we introduce a novel DGIB objective, integrating a Determinant Point Process (DPP) constraint into the standard IB objective, and employ 13 motif-based adjacency matrices to capture high-order structures in gene representations. Experimental results show that DGIB4SL outperforms state-of-the-art baselines and provides multiple explanations for SL prediction, revealing diverse biological mechanisms underlying SL inference.
中文: DGIB4SL提出了一种新颖的图神经网络方法,通过生成多个可靠解释并有效捕捉高阶结构来克服注意力机制在合成致死性预测中的局限性,展现出优越性能并揭示了多样化的生物学机制。
English: DGIB4SL introduces a novel graph neural network approach that overcomes limitations of attention mechanisms by generating multiple faithful explanations and effectively capturing high-order structures for synthetic lethality prediction, demonstrating superior performance and revealing diverse biological mechanisms.

Authors:Shaona Ghosh, Heather Frase, Adina Williams, Sarah Luger, Paul Röttger, Fazl Barez, Sean McGregor, Kenneth Fricklas, Mala Kumar, Quentin Feuillade--Montixi, Kurt Bollacker, Felix Friedrich, Ryan Tsang, Bertie Vidgen, Alicia Parrish, Chris Knotz, Eleonora Presani, Jonathan Bennion, Marisa Ferrara Boston, Mike Kuniavsky, Wiebke Hutiri, James Ezick, Malek Ben Salem, Rajat Sahay, Sujata Goswami, Usman Gohar, Ben Huang, Supheakmungkol Sarin, Elie Alhajjar, Canyu Chen, Roman Eng, Kashyap Ramanandula Manjusha, Virendra Mehta, Eileen Long, Murali Emani, Natan Vidra, Benjamin Rukundo, Abolfazl Shahbazi, Kongtao Chen, Rajat Ghosh, Vithursan Thangarasa, Pierre Peigné, Abhinav Singh, Max Bartolo, Satyapriya Krishna, Mubashara Akhtar, Rafael Gold, Cody Coleman, Luis Oala, Vassil Tashev, Joseph Marvin Imperial, Amy Russ, Sasidhar Kunapuli, Nicolas Miailhe, Julien Delaunay, Bhaktipriya Radharapu, Rajat Shinde, Tuesday, Debojyoti Dutta, Declan Grabb, Ananya Gangavarapu, Saurav Sahay, Agasthya Gangavarapu, Patrick Schramowski, Stephen Singam, Tom David, Xudong Han, Priyanka Mary Mammen, Tarunima Prabhakar, Venelin Kovatchev, Rebecca Weiss, Ahmed Ahmed, Kelvin N. Manyeki, Sandeep Madireddy, Foutse Khomh, Fedor Zhdanov, Joachim Baumann, Nina Vasan, Xianjun Yang, Carlos Mougn, Jibin Rajan Varghese, Hussain Chinoy, Seshakrishna Jitendar, Manil Maskey, Claire V. Hardgrove, Tianhao Li, Aakash Gupta, Emil Joswin, Yifan Mai, Shachi H Kumar, Cigdem Patlak, Kevin Lu, Vincent Alessi, Sree Bhargavi Balija, Chenhe Gu, Robert Sullivan, James Gealy, Matt Lavrisa, James Goel, Peter Mattson, Percy Liang, Joaquin Vanschoren
Title: AILuminate: Introducing v1.0 of the AI Risk and Reliability Benchmark from MLCommons
Abstract:
The rapid advancement and deployment of AI systems have created an urgent need for standard safety-evaluation frameworks. This paper introduces AILuminate v1.0, the first comprehensive industry-standard benchmark for assessing AI-product risk and reliability. Its development employed an open process that included participants from multiple fields. The benchmark evaluates an AI system's resistance to prompts designed to elicit dangerous, illegal, or undesirable behavior in 12 hazard categories, including violent crimes, nonviolent crimes, sex-related crimes, child sexual exploitation, indiscriminate weapons, suicide and self-harm, intellectual property, privacy, defamation, hate, sexual content, and specialized advice (election, financial, health, legal). Our method incorporates a complete assessment standard, extensive prompt datasets, a novel evaluation framework, a grading and reporting system, and the technical as well as organizational infrastructure for long-term support and evolution. In particular, the benchmark employs an understandable five-tier grading scale (Poor to Excellent) and incorporates an innovative entropy-based system-response evaluation. In addition to unveiling the benchmark, this report also identifies limitations of our method and of building safety benchmarks generally, including evaluator uncertainty and the constraints of single-turn interactions. This work represents a crucial step toward establishing global standards for AI risk and reliability evaluation while acknowledging the need for continued development in areas such as multiturn interactions, multimodal understanding, coverage of additional languages, and emerging hazard categories. Our findings provide valuable insights for model developers, system integrators, and policymakers working to promote safer AI deployment.
中文: 本文推出首个行业标准AI安全评估基准AILuminate v1.0,通过五级评分体系和基于熵的创新评估方法,对12类危险领域进行系统化测试,为建立全球AI风险评价标准迈出关键一步,同时指出当前单轮交互等局限性需持续改进。
English: This paper introduces AILuminate v1.0, the first comprehensive industry-standard benchmark for evaluating AI system safety across 12 hazard categories, featuring a five-tier grading scale and an innovative entropy-based evaluation method to address urgent needs for standardized risk assessment.

Authors:Dong Shu, Xuansheng Wu, Haiyan Zhao, Daking Rai, Ziyu Yao, Ninghao Liu, Mengnan Du
Title: A Survey on Sparse Autoencoders: Interpreting the Internal Mechanisms of Large Language Models
Abstract:
Large Language Models (LLMs) have transformed natural language processing, yet their internal mechanisms remain largely opaque. Recently, mechanistic interpretability has attracted significant attention from the research community as a means to understand the inner workings of LLMs. Among various mechanistic interpretability approaches, Sparse Autoencoders (SAEs) have emerged as a promising method due to their ability to disentangle the complex, superimposed features within LLMs into more interpretable components. This paper presents a comprehensive survey of SAEs for interpreting and understanding the internal workings of LLMs. Our major contributions include: (1) exploring the technical framework of SAEs, covering basic architecture, design improvements, and effective training strategies; (2) examining different approaches to explaining SAE features, categorized into input-based and output-based explanation methods; (3) discussing evaluation methods for assessing SAE performance, covering both structural and functional metrics; and (4) investigating real-world applications of SAEs in understanding and manipulating LLM behaviors.
Chinese: 本文全面综述了稀疏自编码器作为理解大语言模型内部机制的关键可解释性方法,涵盖其技术框架、特征解释途径、评估指标及实际应用。
English: This paper provides a comprehensive survey of Sparse Autoencoders (SAEs) as a key mechanistic interpretability method for understanding the internal mechanisms of Large Language Models, covering their technical framework, feature explanation approaches, evaluation metrics, and real-world applications.

Authors:Adrien Meyer, Lorenzo Arboit, Giuseppe Massimiani, Francesco Brucchi, Luca Emanuele Amodio, Didier Mutter, Nicolas Padoy
Title: S4M: Segment Anything with 4 Extreme Points
Abstract:
The Segment Anything Model (SAM) has revolutionized open-set interactive image segmentation, inspiring numerous adapters for the medical domain. However, SAM primarily relies on sparse prompts such as point or bounding box, which may be suboptimal for fine-grained instance segmentation, particularly in endoscopic imagery, where precise localization is critical and existing prompts struggle to capture object boundaries effectively. To address this, we introduce S4M (Segment Anything with 4 Extreme Points), which augments SAM by leveraging extreme points -- the top-, bottom-, left-, and right-most points of an instance -- prompts. These points are intuitive to identify and provide a faster, structured alternative to box prompts. However, a naïve use of extreme points degrades performance, due to SAM's inability to interpret their semantic roles. To resolve this, we introduce dedicated learnable embeddings, enabling the model to distinguish extreme points from generic free-form points and better reason about their spatial relationships. We further propose an auxiliary training task through the Canvas module, which operates solely on prompts -- without vision input -- to predict a coarse instance mask. This encourages the model to internalize the relationship between extreme points and mask distributions, leading to more robust segmentation. S4M outperforms other SAM-based approaches on three endoscopic surgical datasets, demonstrating its effectiveness in complex scenarios. Finally, we validate our approach through a human annotation study on surgical endoscopic videos, confirming that extreme points are faster to acquire than bounding boxes.
中文: S4M模型通过引入极值点提示、可学习嵌入和辅助训练任务,改进了Segment Anything模型在复杂内窥镜图像中的细粒度实例分割能力,在多个数据集上表现优异,且人工标注速度优于边界框。
English: The S4M model enhances the Segment Anything Model by using extreme points as prompts, incorporating learnable embeddings and an auxiliary training task to improve fine-grained instance segmentation in endoscopic imagery, outperforming other methods across multiple datasets and proving faster for human annotation than bounding boxes.

Authors:Cixiao Zhang, Yin Xu, Size Peng, Xinghao Guo, Xiaowu Ou, Hanjiang Hong, Dazhi He, Wenjun Zhang
Title: Fluid Antenna-Aided Robust Secure Transmission for RSMA-ISAC Systems
Abstract:
This paper leverages fluid antenna (FA) and rate-splitting multiple access (RSMA) to enhance the physical layer security (PLS) of an integrated sensing and communication (ISAC) system. We consider a practical multi-user multi-input single-output (MU-MISO) system, where a base station (BS) equipped with fixed position antennas (FPAs) employs RSMA to communicate with multiple single-FA users, while an eavesdropping target may potentially wiretap the signals. The system adopts a novel rate splitting (RS) scheme, where the common layer stream serves a dual purpose: it conveys valid data to legitimate users (LUs) while simultaneously generating jamming signals to confuse potential eavesdroppers. We establish the problem and propose the optimization algorithm under two conditions: perfect and imperfect channel state information (CSI) conditions. Specifically, under perfect the CSI condition, we address the non-convex optimization problem by proposing an alternating optimization (AO) algorithm, which decomposes the problem into two subproblems: beamforming matrix optimization and the adjustment of FA positions. For beamforming optimization, we utilize semidefinite programming (SDP) and successive convex approximation (SCA) to convert the problem into a more tractable convex form. Given a fixed beamforming matrix, SCA is applied to handle the surrogate upper bound of the constraints. In the case of imperfect CSI, the continuous nature of CSI errors leads to an infinite number of constraints. To overcome this challenge, we propose an AO-based algorithm that incorporates the S-Procedure and SCA to obtain a high-quality beamforming matrix and effective FA positions. Extensive simulation results demonstrate that the proposed FA-aided RSMA-ISAC system significantly enhances security compared to traditional FPA-based and SDMA-based systems.
中文摘要:本文提出了一种基于流体天线和速率分割多址接入的系统,通过联合优化波束成形和天线位置,在完美与非完美信道状态下显著提升了集成传感与通信的物理层安全性。
English Summary: This paper proposes a fluid antenna-assisted rate-splitting multiple access system that enhances physical layer security in integrated sensing and communication by jointly optimizing beamforming and antenna positioning under both perfect and imperfect channel conditions.

Authors:Qijiong Liu, Jieming Zhu, Lu Fan, Kun Wang, Hengchang Hu, Wei Guo, Yong Liu, Xiao-Ming Wu
Title: Benchmarking LLMs in Recommendation Tasks: A Comparative Evaluation with Conventional Recommenders
Abstract:
In recent years, integrating large language models (LLMs) into recommender systems has created new opportunities for improving recommendation quality. However, a comprehensive benchmark is needed to thoroughly evaluate and compare the recommendation capabilities of LLMs with traditional recommender systems. In this paper, we introduce RecBench, which systematically investigates various item representation forms (including unique identifier, text, semantic embedding, and semantic identifier) and evaluates two primary recommendation tasks, i.e., click-through rate prediction (CTR) and sequential recommendation (SeqRec). Our extensive experiments cover up to 17 large models and are conducted across five diverse datasets from fashion, news, video, books, and music domains. Our findings indicate that LLM-based recommenders outperform conventional recommenders, achieving up to a 5% AUC improvement in the CTR scenario and up to a 170% NDCG@10 improvement in the SeqRec scenario. However, these substantial performance gains come at the expense of significantly reduced inference efficiency, rendering the LLM-as-RS paradigm impractical for real-time recommendation environments. We aim for our findings to inspire future research, including recommendation-specific model acceleration methods. We will release our code, data, configurations, and platform to enable other researchers to reproduce and build upon our experimental results.
中文摘要:本文提出RecBench基准测试,发现基于大语言模型的推荐系统在准确性上显著优于传统推荐系统,但其推理效率较低,难以适用于实时推荐场景。
English Summary: This paper introduces RecBench, a comprehensive benchmark demonstrating that LLM-based recommenders significantly outperform traditional systems in accuracy but suffer from reduced inference efficiency, making them impractical for real-time applications.

Authors:Tao Yang, Yang Hu, Feihong Lu, Ziwei Zhang, Qingyun Sun, Jianxin Li
Title: BotUmc: An Uncertainty-Aware Twitter Bot Detection with Multi-view Causal Inference
Abstract:
Social bots have become widely known by users of social platforms. To prevent social bots from spreading harmful speech, many novel bot detections are proposed. However, with the evolution of social bots, detection methods struggle to give high-confidence answers for samples. This motivates us to quantify the uncertainty of the outputs, informing the confidence of the results. Therefore, we propose an uncertainty-aware bot detection method to inform the confidence and use the uncertainty score to pick a high-confidence decision from multiple views of a social network under different environments. Specifically, our proposed BotUmc uses LLM to extract information from tweets. Then, we construct a graph based on the extracted information, the original user information, and the user relationship and generate multiple views of the graph by causal interference. Lastly, an uncertainty loss is used to force the model to quantify the uncertainty of results and select the result with low uncertainty in one view as the final decision. Extensive experiments show the superiority of our method.
中文: 针对社交机器人进化导致检测置信度下降的问题,本研究提出BotUmc不确定性感知检测方法,利用大语言模型分析推文,通过因果干预构建多视角图,并采用不确定性损失选择高置信度决策,实验证明该方法具有优越性能。
English: To address the challenge of evolving social bots that reduce detection confidence, this study introduces BotUmc, an uncertainty-aware detection method that leverages LLMs to analyze tweets, constructs multi-view graphs with causal interference, and uses uncertainty loss to select high-confidence decisions, demonstrating superior performance in experiments.

Authors:Leixian Shen, Haotian Li, Yifang Wang, Xing Xie, Huamin Qu
Title: Prompting Generative AI with Interaction-Augmented Instructions
Abstract:
The emergence of generative AI (GenAI) models, including large language models and text-to-image models, has significantly advanced the synergy between humans and AI with not only their outstanding capability but more importantly, the intuitive communication method with text prompts. Though intuitive, text-based instructions suffer from natural languages' ambiguous and redundant nature. To address the issue, researchers have explored augmenting text-based instructions with interactions that facilitate precise and effective human intent expression, such as direct manipulation. However, the design strategy of interaction-augmented instructions lacks systematic investigation, hindering our understanding and application. To provide a panorama of interaction-augmented instructions, we propose a framework to analyze related tools from why, when, who, what, and how interactions are applied to augment text-based instructions. Notably, we identify four purposes for applying interactions, including restricting, expanding, organizing, and refining text instructions. The design paradigms for each purpose are also summarized to benefit future researchers and practitioners.
中文摘要:生成式AI模型通过文本提示实现直观的人机交互,但存在语言模糊性问题,研究者通过结合直接操作等交互方式增强指令表达,提出了分析交互目的与设计范式的框架以指导未来研究。
English Summary: Generative AI models enable intuitive human-AI communication through text prompts, but face ambiguity issues that researchers address by augmenting instructions with interactive methods like direct manipulation, leading to a proposed framework analyzing interaction purposes and design paradigms.

Authors:Yunfan Zhou, Xiwen Cai, Qiming Shi, Yanwei Huang, Haotian Li, Huamin Qu, Di Weng, Yingcai Wu
Title: Xavier: Toward Better Coding Assistance in Authoring Tabular Data Wrangling Scripts
Abstract:
Data analysts frequently employ code completion tools in writing custom scripts to tackle complex tabular data wrangling tasks. However, existing tools do not sufficiently link the data contexts such as schemas and values with the code being edited. This not only leads to poor code suggestions, but also frequent interruptions in coding processes as users need additional code to locate and understand relevant data. We introduce Xavier, a tool designed to enhance data wrangling script authoring in computational notebooks. Xavier maintains users' awareness of data contexts while providing data-aware code suggestions. It automatically highlights the most relevant data based on the user's code, integrates both code and data contexts for more accurate suggestions, and instantly previews data transformation results for easy verification. To evaluate the effectiveness and usability of Xavier, we conducted a user study with 16 data analysts, showing its potential to streamline data wrangling scripts authoring.
Chinese: Xavier是一种数据感知的代码补全工具,通过整合数据上下文提供精准建议、自动高亮相关数据并即时预览转换结果,经16位分析师用户研究验证,可有效提升计算笔记本中数据整理脚本的编写效率。
English: Xavier is a data-aware code completion tool that enhances data wrangling in computational notebooks by integrating data contexts for accurate suggestions, automatic highlighting of relevant data, and instant previews of transformations, as validated by a user study with 16 analysts.

Authors:Haotian Li, Yun Wang, Huamin Qu
Title: Reflection on Data Storytelling Tools in the Generative AI Era from the Human-AI Collaboration Perspective
Abstract:
Human-AI collaborative tools attract attentions from the data storytelling community to lower the barrier of expertise and streamline the workflow. The recent advance in large-scale generative AI techniques, e.g., large language models (LLMs) and text-to-image models, has the potential to enhance data storytelling with their power in visual and narration generation. After two years since these techniques were publicly available, it is important to reflect our progress of applying them and have an outlook for future opportunities. To achieve the goal, we compare the collaboration patterns of the latest tools with those of earlier ones using a dedicated framework for understanding human-AI collaboration in data storytelling. Through comparison, we identify persistent collaboration patterns, e.g., human-creator + AI-assistant, and emerging ones, e.g., AI-creator + human-reviewer. The benefits of these AI techniques and other implications to human-AI collaboration are also revealed. We further propose future directions to hopefully ignite innovations.
中文: 人机协作工具利用生成式AI提升数据叙事能力,通过比较新旧协作模式揭示了如AI主导创作与人类审核等新兴趋势,并提出了推动创新的未来研究方向。
English: Human-AI collaborative tools are advancing data storytelling by leveraging generative AI to streamline workflows and introduce new partnership models, such as AI as creator with human oversight, while outlining future research directions.

Authors:Zahra Mirzaiyan, Michele Girfoglio, Gianluigi Rozza
Title: On the choice of proper outlet boundary conditions for numerical simulation of cardiovascular flows
Abstract:
It is well known that in the computational fluid dynamics simulations related to the cardiovascular system the enforcement of outflow boundary conditions is a crucial point. In fact, they highly affect the computed flow and a wrong setup could lead to unphysical results. In this chapter we discuss the main features of two different ways for the estimation of proper outlet boundary conditions in the context of hemodynamics simulations: on one side, a lumped parameter model of the downstream circulation and, on the other one, a technique based on optimal control.
中文: 在心血管流体动力学模拟中,正确设置流出边界条件对避免非物理结果至关重要,本章比较了集总参数模型和最优控制技术两种估算方法。
English: In cardiovascular fluid dynamics simulations, accurately setting outflow boundary conditions is critical to avoid unphysical results, with this chapter comparing lumped parameter models and optimal control techniques for proper estimation.

Authors:Tianyi Ma, Yiyue Qian, Zehong Wang, Zheyuan Zhang, Chuxu Zhang, Yanfang Ye
Title: LLM-Empowered Class Imbalanced Graph Prompt Learning for Online Drug Trafficking Detection
Abstract:
As the market for illicit drugs remains extremely profitable, major online platforms have become direct-to-consumer intermediaries for illicit drug trafficking participants. These online activities raise significant social concerns that require immediate actions. Existing approaches to combating this challenge are generally impractical, due to the imbalance of classes and scarcity of labeled samples in real-world applications. To this end, we propose a novel Large Language Model-empowered Heterogeneous Graph Prompt Learning framework for illicit Drug Trafficking detection, called LLM-HetGDT, that leverages LLM to facilitate heterogeneous graph neural networks (HGNNs) to effectively identify drug trafficking activities in the class-imbalanced scenarios. Specifically, we first pre-train HGNN over a contrastive pretext task to capture the inherent node and structure information over the unlabeled drug trafficking heterogeneous graph (HG). Afterward, we employ LLM to augment the HG by generating high-quality synthetic user nodes in minority classes. Then, we fine-tune the soft prompts on the augmented HG to capture the important information in the minority classes for the downstream drug trafficking detection task. To comprehensively study online illicit drug trafficking activities, we collect a new HG dataset over Twitter, called Twitter-HetDrug. Extensive experiments on this dataset demonstrate the effectiveness, efficiency, and applicability of LLM-HetGDT.
中文摘要:提出的LLM-HetGDT框架利用大语言模型增强异质图神经网络,在类别不平衡场景下有效检测非法毒品交易,并在新建的Twitter数据集上展现出优越性能。
English Summary: The proposed LLM-HetGDT framework leverages large language models to enhance heterogeneous graph neural networks for detecting illicit drug trafficking in class-imbalanced scenarios, demonstrating superior performance on a newly collected Twitter dataset.

Authors:Xueyang Feng, Bo Lan, Quanyu Dai, Lei Wang, Jiakai Tang, Xu Chen, Zhenhua Dong, Ji-Rong Wen
Title: Improving Retrospective Language Agents via Joint Policy Gradient Optimization
Abstract:
In recent research advancements within the community, large language models (LLMs) have sparked great interest in creating autonomous agents. However, current prompt-based agents often heavily rely on large-scale LLMs. Meanwhile, although fine-tuning methods significantly enhance the capabilities of smaller LLMs, the fine-tuned agents often lack the potential for self-reflection and self-improvement. To address these challenges, we introduce a novel agent framework named RetroAct, which is a framework that jointly optimizes both task-planning and self-reflective evolution capabilities in language agents. Specifically, we develop a two-stage joint optimization process that integrates imitation learning and reinforcement learning, and design an off-policy joint policy gradient optimization algorithm with imitation learning regularization to enhance the data efficiency and training stability in agent tasks. RetroAct significantly improves the performance of open-source models, reduces dependency on closed-source LLMs, and enables fine-tuned agents to learn and evolve continuously. We conduct extensive experiments across various testing environments, demonstrating RetroAct has substantial improvements in task performance and decision-making processes.
Chinese: RetroAct框架通过结合模仿学习与强化学习的双阶段联合优化,增强了语言代理的任务规划与自我反思进化能力,显著提升了开源模型性能并降低了对闭源大语言模型的依赖。
English: The RetroAct framework introduces a two-stage joint optimization process combining imitation and reinforcement learning to enhance task-planning and self-reflective evolution in language agents, significantly improving open-source model performance while reducing reliance on closed-source LLMs.

Authors:Chong Bao, Xiyu Zhang, Zehao Yu, Jiale Shi, Guofeng Zhang, Songyou Peng, Zhaopeng Cui
Title: Free360: Layered Gaussian Splatting for Unbounded 360-Degree View Synthesis from Extremely Sparse and Unposed Views
Abstract:
Neural rendering has demonstrated remarkable success in high-quality 3D neural reconstruction and novel view synthesis with dense input views and accurate poses. However, applying it to extremely sparse, unposed views in unbounded 360° scenes remains a challenging problem. In this paper, we propose a novel neural rendering framework to accomplish the unposed and extremely sparse-view 3D reconstruction in unbounded 360° scenes. To resolve the spatial ambiguity inherent in unbounded scenes with sparse input views, we propose a layered Gaussian-based representation to effectively model the scene with distinct spatial layers. By employing a dense stereo reconstruction model to recover coarse geometry, we introduce a layer-specific bootstrap optimization to refine the noise and fill occluded regions in the reconstruction. Furthermore, we propose an iterative fusion of reconstruction and generation alongside an uncertainty-aware training approach to facilitate mutual conditioning and enhancement between these two processes. Comprehensive experiments show that our approach outperforms existing state-of-the-art methods in terms of rendering quality and surface reconstruction accuracy. Project page: https://zju3dv.github.io/free360/
中文: 本文提出了一种新颖的神经渲染框架,通过分层高斯表示和引导优化,在无边界360°场景中实现了从极稀疏、无位姿输入视图的高质量三维重建和新视角合成。
English: This paper introduces a novel neural rendering framework that achieves high-quality 3D reconstruction and novel view synthesis from extremely sparse, unposed views in unbounded 360° scenes through layered Gaussian representation and bootstrap optimization.

Authors:Abhiram Maddukuri, Zhenyu Jiang, Lawrence Yunliang Chen, Soroush Nasiriany, Yuqi Xie, Yu Fang, Wenqi Huang, Zu Wang, Zhenjia Xu, Nikita Chernyadev, Scott Reed, Ken Goldberg, Ajay Mandlekar, Linxi Fan, Yuke Zhu
Title: Sim-and-Real Co-Training: A Simple Recipe for Vision-Based Robotic Manipulation
Abstract:
Large real-world robot datasets hold great potential to train generalist robot models, but scaling real-world human data collection is time-consuming and resource-intensive. Simulation has great potential in supplementing large-scale data, especially with recent advances in generative AI and automated data generation tools that enable scalable creation of robot behavior datasets. However, training a policy solely in simulation and transferring it to the real world often demands substantial human effort to bridge the reality gap. A compelling alternative is to co-train the policy on a mixture of simulation and real-world datasets. Preliminary studies have recently shown this strategy to substantially improve the performance of a policy over one trained on a limited amount of real-world data. Nonetheless, the community lacks a systematic understanding of sim-and-real co-training and what it takes to reap the benefits of simulation data for real-robot learning. This work presents a simple yet effective recipe for utilizing simulation data to solve vision-based robotic manipulation tasks. We derive this recipe from comprehensive experiments that validate the co-training strategy on various simulation and real-world datasets. Using two domains--a robot arm and a humanoid--across diverse tasks, we demonstrate that simulation data can enhance real-world task performance by an average of 38%, even with notable differences between the simulation and real-world data. Videos and additional results can be found at https://co-training.github.io/
中文: 本研究提出了一种结合仿真与现实数据的有效方法,通过大量实验证明这种协同训练策略能将真实世界任务表现平均提升38%,即使仿真环境与现实数据存在显著差异。
English: This research introduces an effective method for combining simulation and real-world data to enhance robot policy learning, demonstrating through extensive experiments that such co-training can boost real-world task performance by an average of 38% despite differences between simulated and actual environments.

Authors:Rong Kang, Shuai Wang, Tieying Zhang, Xianghong Xu, Linhui Xu, Zhimin Liang, Lei Zhang, Rui Shi, Jianjun Chen
Title: VIDEX: A Disaggregated and Extensible Virtual Index for the Cloud and AI Era
Abstract:
Virtual index, also known as hypothetical indexes, play a crucial role in database query optimization. However, with the rapid advancement of cloud computing and AI-driven models for database optimization, traditional virtual index approaches face significant challenges. Cloud-native environments often prohibit direct conducting query optimization process on production databases due to stability requirements and data privacy concerns. Moreover, while AI models show promising progress, their integration with database systems poses challenges in system complexity, inference acceleration, and model hot updates. In this paper, we present VIDEX, a three-layer disaggregated architecture that decouples database instances, the virtual index optimizer, and algorithm services, providing standardized interfaces for AI model integration. Users can configure VIDEX by either collecting production statistics or by loading from a prepared file; this setup allows for high-accurate what-if analyses based on virtual indexes, achieving query plans that are identical to those of the production instance. Additionally, users can freely integrate new AI-driven algorithms into VIDEX. VIDEX has been successfully deployed at ByteDance, serving thousands of MySQL instances daily and over millions of SQL queries for index optimization tasks.
Chinese: VIDEX提出三层解耦架构,在实现高精度虚拟索引优化的同时支持灵活集成AI算法,已成功部署于字节跳动,每日处理数百万SQL查询的索引优化任务。
English: VIDEX introduces a three-layer disaggregated architecture that enables high-accuracy virtual index optimization while allowing flexible AI algorithm integration, successfully deployed at ByteDance to handle millions of SQL queries daily.

Authors:Xiang Hu, Yuhao Wang, Pingping Zhang, Huchuan Lu
Title: LATex: Leveraging Attribute-based Text Knowledge for Aerial-Ground Person Re-Identification
Abstract:
As an important task in intelligent transportation systems, Aerial-Ground person Re-IDentification (AG-ReID) aims to retrieve specific persons across heterogeneous cameras in different viewpoints. Previous methods typically adopt deep learning-based models, focusing on extracting view-invariant features. However, they usually overlook the semantic information in person attributes. In addition, existing training strategies often rely on full fine-tuning large-scale models, which significantly increases training costs. To address these issues, we propose a novel framework named LATex for AG-ReID, which adopts prompt-tuning strategies to leverage attribute-based text knowledge. More specifically, we first introduce the Contrastive Language-Image Pre-training (CLIP) model as the backbone, and propose an Attribute-aware Image Encoder (AIE) to extract both global semantic features and attribute-aware features from input images. Then, with these features, we propose a Prompted Attribute Classifier Group (PACG) to predict person attributes and obtain attribute representations. Finally, we design a Coupled Prompt Template (CPT) to transform attribute representations and view information into structured sentences. These sentences are processed by the text encoder of CLIP to generate more discriminative features. As a result, our framework can fully leverage attribute-based text knowledge to improve AG-ReID performance. Extensive experiments on three AG-ReID benchmarks demonstrate the effectiveness of our proposed methods. The source code will be available.
Chinese: LATex框架通过提示调优策略整合基于属性的文本知识,利用CLIP进行特征提取和结构化句子生成,在降低训练成本的同时有效提升了空地行人重识别的性能。
English: The LATex framework enhances Aerial-Ground person Re-Identification by integrating attribute-based text knowledge through prompt-tuning strategies, utilizing CLIP for feature extraction and structured sentence generation to improve performance while reducing training costs.

Authors:Zongwei Wang, Min Gao, Junliang Yu, Yupeng Hou, Shazia Sadiq, Hongzhi Yin
Title: RuleAgent: Discovering Rules for Recommendation Denoising with Autonomous Language Agents
Abstract:
The implicit feedback (e.g., clicks) in real-world recommender systems is often prone to severe noise caused by unintentional interactions, such as misclicks or curiosity-driven behavior. A common approach to denoising this feedback is manually crafting rules based on observations of training loss patterns. However, this approach is labor-intensive and the resulting rules often lack generalization across diverse scenarios. To overcome these limitations, we introduce RuleAgent, a language agent based framework which mimics real-world data experts to autonomously discover rules for recommendation denoising. Unlike the high-cost process of manual rule mining, RuleAgent offers rapid and dynamic rule discovery, ensuring adaptability to evolving data and varying scenarios. To achieve this, RuleAgent is equipped with tailored profile, memory, planning, and action modules and leverages reflection mechanisms to enhance its reasoning capabilities for rule discovery. Furthermore, to avoid the frequent retraining in rule discovery, we propose LossEraser-an unlearning strategy that streamlines training without compromising denoising performance. Experiments on benchmark datasets demonstrate that, compared with existing denoising methods, RuleAgent not only derives the optimal recommendation performance but also produces generalizable denoising rules, assisting researchers in efficient data cleaning.
Chinese Summary: RuleAgent是一种基于语言智能体的框架,通过模拟真实数据专家自主发现推荐系统去噪规则,利用定制化模块和反学习策略提升适应性与效率,克服了人工规则制定的局限性。
English Summary: RuleAgent is a language agent framework that autonomously discovers denoising rules for recommender systems, overcoming the limitations of manual rule crafting by using tailored modules and an unlearning strategy to enhance adaptability and efficiency.

Authors:Yufan Ren, Konstantinos Tertikas, Shalini Maiti, Junlin Han, Tong Zhang, Sabine Süsstrunk, Filippos Kokkinos
Title: VGRP-Bench: Visual Grid Reasoning Puzzle Benchmark for Large Vision-Language Models
Abstract:
Large Vision-Language Models (LVLMs) struggle with puzzles, which require precise perception, rule comprehension, and logical reasoning. Assessing and enhancing their performance in this domain is crucial, as it reflects their ability to engage in structured reasoning - an essential skill for real-world problem-solving. However, existing benchmarks primarily evaluate pre-trained models without additional training or fine-tuning, often lack a dedicated focus on reasoning, and fail to establish a systematic evaluation framework. To address these limitations, we introduce VGRP-Bench, a Visual Grid Reasoning Puzzle Benchmark featuring 20 diverse puzzles. VGRP-Bench spans multiple difficulty levels, and includes extensive experiments not only on existing chat LVLMs (e.g., GPT-4o), but also on reasoning LVLMs (e.g., Gemini-Thinking). Our results reveal that even the state-of-the-art LVLMs struggle with these puzzles, highlighting fundamental limitations in their puzzle-solving capabilities. Most importantly, through systematic experiments, we identify and analyze key factors influencing LVLMs' puzzle-solving performance, including the number of clues, grid size, and rule complexity. Furthermore, we explore two Supervised Fine-Tuning (SFT) strategies that can be used in post-training: SFT on solutions (S-SFT) and SFT on synthetic reasoning processes (R-SFT). While both methods significantly improve performance on trained puzzles, they exhibit limited generalization to unseen ones. We will release VGRP-Bench to facilitate further research on LVLMs for complex, real-world problem-solving. Project page: https://yufan-ren.com/subpage/VGRP-Bench/.
中文: 大型视觉语言模型在解决需要精确感知、规则理解和逻辑推理的谜题时存在困难,为此我们推出了VGRP-Bench基准测试,通过系统实验和微调策略来评估并提升其解谜能力。
English: Large Vision-Language Models face challenges in solving puzzles due to deficiencies in perception, rule comprehension, and logical reasoning, prompting the introduction of VGRP-Bench to systematically evaluate and enhance their capabilities through targeted fine-tuning strategies.

Authors:Yuanyuan Gao, Hao Li, Jiaqi Chen, Zhengyu Zou, Zhihang Zhong, Dingwen Zhang, Xiao Sun, Junwei Han
Title: CityGS-X: A Scalable Architecture for Efficient and Geometrically Accurate Large-Scale Scene Reconstruction
Abstract:
Despite its significant achievements in large-scale scene reconstruction, 3D Gaussian Splatting still faces substantial challenges, including slow processing, high computational costs, and limited geometric accuracy. These core issues arise from its inherently unstructured design and the absence of efficient parallelization. To overcome these challenges simultaneously, we introduce CityGS-X, a scalable architecture built on a novel parallelized hybrid hierarchical 3D representation (PH^2-3D). As an early attempt, CityGS-X abandons the cumbersome merge-and-partition process and instead adopts a newly-designed batch-level multi-task rendering process. This architecture enables efficient multi-GPU rendering through dynamic Level-of-Detail voxel allocations, significantly improving scalability and performance. Through extensive experiments, CityGS-X consistently outperforms existing methods in terms of faster training times, larger rendering capacities, and more accurate geometric details in large-scale scenes. Notably, CityGS-X can train and render a scene with 5,000+ images in just 5 hours using only 4 * 4090 GPUs, a task that would make other alternative methods encounter Out-Of-Memory (OOM) issues and fail completely. This implies that CityGS-X is far beyond the capacity of other existing methods.
中文: CityGS-X通过创新的并行混合分层架构解决了3D高斯泼溅技术处理速度慢、计算成本高的核心问题,在大规模场景中实现了更快的训练速度、更高的几何精度和高效的多GPU渲染能力。
English: CityGS-X overcomes 3D Gaussian Splatting's limitations of slow processing and high computational costs through a novel parallelized hybrid hierarchical architecture, achieving faster training and superior geometric accuracy in large-scale scenes with efficient multi-GPU rendering.

Authors:Xianglong He, Junyi Chen, Di Huang, Zexiang Liu, Xiaoshui Huang, Wanli Ouyang, Chun Yuan, Yangguang Li
Title: MeshCraft: Exploring Efficient and Controllable Mesh Generation with Flow-based DiTs
Abstract:
In the domain of 3D content creation, achieving optimal mesh topology through AI models has long been a pursuit for 3D artists. Previous methods, such as MeshGPT, have explored the generation of ready-to-use 3D objects via mesh auto-regressive techniques. While these methods produce visually impressive results, their reliance on token-by-token predictions in the auto-regressive process leads to several significant limitations. These include extremely slow generation speeds and an uncontrollable number of mesh faces. In this paper, we introduce MeshCraft, a novel framework for efficient and controllable mesh generation, which leverages continuous spatial diffusion to generate discrete triangle faces. Specifically, MeshCraft consists of two core components: 1) a transformer-based VAE that encodes raw meshes into continuous face-level tokens and decodes them back to the original meshes, and 2) a flow-based diffusion transformer conditioned on the number of faces, enabling the generation of high-quality 3D meshes with a predefined number of faces. By utilizing the diffusion model for the simultaneous generation of the entire mesh topology, MeshCraft achieves high-fidelity mesh generation at significantly faster speeds compared to auto-regressive methods. Specifically, MeshCraft can generate an 800-face mesh in just 3.2 seconds (35$\times$ faster than existing baselines). Extensive experiments demonstrate that MeshCraft outperforms state-of-the-art techniques in both qualitative and quantitative evaluations on ShapeNet dataset and demonstrates superior performance on Objaverse dataset. Moreover, it integrates seamlessly with existing conditional guidance strategies, showcasing its potential to relieve artists from the time-consuming manual work involved in mesh creation.
Chinese: MeshCraft提出了一种基于扩散的框架,能够以预定义面数快速生成高质量3D网格,相比传统自回归方法速度提升显著,在多个基准数据集上展现出优越性能。
English: MeshCraft introduces a diffusion-based framework that generates high-quality 3D meshes with predefined face counts at significantly faster speeds than previous auto-regressive methods, demonstrating superior performance on benchmark datasets.

Authors:Bin Han, Di Feng, Jie Wang, Hans D. Schotten
Title: Buyer-Initiated Auction Mechanism for Data Redemption in Machine Unlearning
Abstract:
The rapid growth of artificial intelligence (AI) has raised privacy concerns over user data, leading to regulations like the General Data Protection Regulation (GDPR) and the California Consumer Privacy Act (CCPA). With the essential toolbox provided by machine unlearning, AI service providers are now able to remove user data from their trained models as well as the training datasets, so as to comply with such regulations. However, extensive data redemption can be costly and degrade model accuracy. To balance the cost of unlearning and the privacy protection, we propose a buyer-initiated auction mechanism for data redemption, enabling the service provider to purchase data from willing users with appropriate compensation. This approach does not require the server to have any a priori knowledge about the users' privacy preference, and provides an efficient solution for maximizing the social welfare in the investigated problem.
中文: 提出的买方发起拍卖机制使AI服务提供商能够高效地从用户处购买数据赎回,在无需了解用户偏好先验知识的情况下平衡遗忘成本与隐私保护。
English: The proposed buyer-initiated auction mechanism enables AI service providers to efficiently purchase data redemption from users, balancing unlearning costs with privacy protection without requiring prior knowledge of user preferences.

Authors:Jiakai Tang, Sunhao Dai, Teng Shi, Jun Xu, Xu Chen, Wen Chen, Jian Wu, Yuning Jiang
Title: Think Before Recommend: Unleashing the Latent Reasoning Power for Sequential Recommendation
Abstract:
Sequential Recommendation (SeqRec) aims to predict the next item by capturing sequential patterns from users' historical interactions, playing a crucial role in many real-world recommender systems. However, existing approaches predominantly adopt a direct forward computation paradigm, where the final hidden state of the sequence encoder serves as the user representation. We argue that this inference paradigm, due to its limited computational depth, struggles to model the complex evolving nature of user preferences and lacks a nuanced understanding of long-tail items, leading to suboptimal performance. To address this issue, we propose \textbf{ReaRec}, the first inference-time computing framework for recommender systems, which enhances user representations through implicit multi-step reasoning. Specifically, ReaRec autoregressively feeds the sequence's last hidden state into the sequential recommender while incorporating special reasoning position embeddings to decouple the original item encoding space from the multi-step reasoning space. Moreover, we introduce two lightweight reasoning-based learning methods, Ensemble Reasoning Learning (ERL) and Progressive Reasoning Learning (PRL), to further effectively exploit ReaRec's reasoning potential. Extensive experiments on five public real-world datasets and different SeqRec architectures demonstrate the generality and effectiveness of our proposed ReaRec. Remarkably, post-hoc analyses reveal that ReaRec significantly elevates the performance ceiling of multiple sequential recommendation backbones by approximately 30\%-50\%. Thus, we believe this work can open a new and promising avenue for future research in inference-time computing for sequential recommendation.
中文: 本文提出ReaRec推理时计算框架,通过多步隐式推理增强序列推荐效果,在多种架构上实现30%-50%的性能提升。
English: The authors propose ReaRec, a novel inference-time computing framework that enhances sequential recommendation by enabling implicit multi-step reasoning, significantly boosting performance by 30%-50% across various architectures.

Authors:Kibon Ku, Talukder Z Jubery, Elijah Rodriguez, Aditya Balu, Soumik Sarkar, Adarsh Krishnamurthy, Baskar Ganapathysubramanian
Title: SC-NeRF: NeRF-based Point Cloud Reconstruction using a Stationary Camera for Agricultural Applications
Abstract:
This paper presents a NeRF-based framework for point cloud (PCD) reconstruction, specifically designed for indoor high-throughput plant phenotyping facilities. Traditional NeRF-based reconstruction methods require cameras to move around stationary objects, but this approach is impractical for high-throughput environments where objects are rapidly imaged while moving on conveyors or rotating pedestals. To address this limitation, we develop a variant of NeRF-based PCD reconstruction that uses a single stationary camera to capture images as the object rotates on a pedestal. Our workflow comprises COLMAP-based pose estimation, a straightforward pose transformation to simulate camera movement, and subsequent standard NeRF training. A defined Region of Interest (ROI) excludes irrelevant scene data, enabling the generation of high-resolution point clouds (10M points). Experimental results demonstrate excellent reconstruction fidelity, with precision-recall analyses yielding an F-score close to 100.00 across all evaluated plant objects. Although pose estimation remains computationally intensive with a stationary camera setup, overall training and reconstruction times are competitive, validating the method's feasibility for practical high-throughput indoor phenotyping applications. Our findings indicate that high-quality NeRF-based 3D reconstructions are achievable using a stationary camera, eliminating the need for complex camera motion or costly imaging equipment. This approach is especially beneficial when employing expensive and delicate instruments, such as hyperspectral cameras, for 3D plant phenotyping. Future work will focus on optimizing pose estimation techniques and further streamlining the methodology to facilitate seamless integration into automated, high-throughput 3D phenotyping pipelines.
Chinese: 本文提出了一种基于NeRF的静态相机重建方法,通过旋转植株对象生成高精度点云,在室内高通量表型分析中实现了近乎完美的重建效果,无需复杂相机运动系统。
English: This paper introduces a NeRF-based method for high-resolution point cloud reconstruction using a stationary camera on rotating plant objects, achieving near-perfect accuracy and practicality for indoor high-throughput phenotyping without complex camera setups.

Authors:Ming Yan, Xincheng Lin, Yuhua Luo, Shuqi Fan, Yudi Dai, Qixin Zhong, Lincai Zhong, Yuexin Ma, Lan Xu, Chenglu Wen, Siqi Shen, Cheng Wang
Title: ClimbingCap: Multi-Modal Dataset and Method for Rock Climbing in World Coordinate
Abstract:
Human Motion Recovery (HMR) research mainly focuses on ground-based motions such as running. The study on capturing climbing motion, an off-ground motion, is sparse. This is partly due to the limited availability of climbing motion datasets, especially large-scale and challenging 3D labeled datasets. To address the insufficiency of climbing motion datasets, we collect AscendMotion, a large-scale well-annotated, and challenging climbing motion dataset. It consists of 412k RGB, LiDAR frames, and IMU measurements, including the challenging climbing motions of 22 skilled climbing coaches across 12 different rock walls. Capturing the climbing motions is challenging as it requires precise recovery of not only the complex pose but also the global position of climbers. Although multiple global HMR methods have been proposed, they cannot faithfully capture climbing motions. To address the limitations of HMR methods for climbing, we propose ClimbingCap, a motion recovery method that reconstructs continuous 3D human climbing motion in a global coordinate system. One key insight is to use the RGB and LiDAR modalities to separately reconstruct motions in camera coordinates and global coordinates and to optimize them jointly. We demonstrate the quality of the AscendMotion dataset and present promising results from ClimbingCap. The AscendMotion dataset and source code release publicly at \href{this link}{http://www.lidarhumanmotion.net/climbingcap/}
中文摘要:本研究提出了用于攀爬运动的全面数据集AscendMotion,并开发了ClimbingCap方法,通过融合RGB和激光雷达数据有效重建三维人体攀爬动作。
English Summary: This study introduces AscendMotion, a comprehensive dataset for climbing motion, and proposes ClimbingCap, a novel method that effectively reconstructs 3D human climbing motions by integrating RGB and LiDAR data.

Authors:Size Zheng, Jin Fang, Xuegui Zheng, Qi Hou, Wenlei Bao, Ningxin Zheng, Ziheng Jiang, Dongyang Wang, Jianxi Ye, Haibin Lin, Li-Wen Chang, Xin Liu
Title: TileLink: Generating Efficient Compute-Communication Overlapping Kernels using Tile-Centric Primitives
Abstract:
Large deep learning models have achieved state-of-the-art performance in a wide range of tasks. These models often necessitate distributed systems for efficient training and inference. The fundamental building blocks for distributed model execution are intra-layer parallel operators. The most effective approach to enhancing the performance of intra-layer parallel operators involves overlapping computation with communication. The overlapping can be achieved through either operator decomposition or kernel fusion. While decomposing operators is straightforward to implement, it often results in suboptimal performance. On the other hand, fusing communication kernels with compute kernels demands significant expertise and is error-prone. In this paper, we propose TileLink to enable efficient compilation and generation of overlapped compute-communication kernels. TileLink is composed of frontend and backend. In the frontend, TileLink decouples the design space of communication and computation, linking these two parts via tile-centric primitives. In the backend, TileLink translates these primitives into low-level communication instructions, integrating the communication and computation components to achieve overlapped execution. In experiments, TileLink achieves from $1.17\times$ to $20.76\times$ speedup to non-overlapping baseline and achieves performance comparable to state-of-the-art overlapping libraries on GPUs.
Chinese Summary: TileLink是一种通过前端解耦与后端集成实现高效编译和生成重叠计算-通信内核的新系统,相比非重叠基准实现了显著加速,并在GPU上达到了与先进重叠库相当的性能。
English Summary: TileLink is a novel system that efficiently compiles and generates overlapped compute-communication kernels through frontend decoupling and backend integration, achieving significant speedup over non-overlapping baselines and comparable performance to state-of-the-art libraries on GPUs.

Authors:Zitian Wang, Yue Liao, Kang Rong, Fengyun Rao, Yibo Yang, Si Liu
Title: Instruction-Oriented Preference Alignment for Enhancing Multi-Modal Comprehension Capability of MLLMs
Abstract:
Preference alignment has emerged as an effective strategy to enhance the performance of Multimodal Large Language Models (MLLMs) following supervised fine-tuning. While existing preference alignment methods predominantly target hallucination factors, they overlook the factors essential for multi-modal comprehension capabilities, often narrowing their improvements on hallucination mitigation. To bridge this gap, we propose Instruction-oriented Preference Alignment (IPA), a scalable framework designed to automatically construct alignment preferences grounded in instruction fulfillment efficacy. Our method involves an automated preference construction coupled with a dedicated verification process that identifies instruction-oriented factors, avoiding significant variability in response representations. Additionally, IPA incorporates a progressive preference collection pipeline, further recalling challenging samples through model self-evolution and reference-guided refinement. Experiments conducted on Qwen2VL-7B demonstrate IPA's effectiveness across multiple benchmarks, including hallucination evaluation, visual question answering, and text understanding tasks, highlighting its capability to enhance general comprehension.
中文: 偏好对齐被提出作为一种可扩展的框架,通过聚焦于指令执行效能来增强多模态大语言模型,在幻觉缓解、视觉问答和文本理解任务中均提升了性能。
English: Preference alignment is proposed as a scalable framework to enhance Multimodal Large Language Models by focusing on instruction fulfillment efficacy, improving performance across hallucination mitigation, visual question answering, and text understanding tasks.

Authors:Xinghao Wang, Tao Gong, Qi Chu, Bin Liu, Nenghai Yu
Title: Context-Aware Weakly Supervised Image Manipulation Localization with SAM Refinement
Abstract:
Malicious image manipulation poses societal risks, increasing the importance of effective image manipulation detection methods. Recent approaches in image manipulation detection have largely been driven by fully supervised approaches, which require labor-intensive pixel-level annotations. Thus, it is essential to explore weakly supervised image manipulation localization methods that only require image-level binary labels for training. However, existing weakly supervised image manipulation methods overlook the importance of edge information for accurate localization, leading to suboptimal localization performance. To address this, we propose a Context-Aware Boundary Localization (CABL) module to aggregate boundary features and learn context-inconsistency for localizing manipulated areas. Furthermore, by leveraging Class Activation Mapping (CAM) and Segment Anything Model (SAM), we introduce the CAM-Guided SAM Refinement (CGSR) module to generate more accurate manipulation localization maps. By integrating two modules, we present a novel weakly supervised framework based on a dual-branch Transformer-CNN architecture. Our method achieves outstanding localization performance across multiple datasets.
中文摘要:本研究提出了一种新颖的弱监督图像篡改定位框架,通过结合上下文感知边界定位模块和CAM引导的SAM优化模块,仅使用图像级标签即可在多个数据集上实现卓越的定位性能。
English Summary: This study introduces a novel weakly supervised framework for image manipulation localization, combining a Context-Aware Boundary Localization module with a CAM-Guided SAM Refinement module to achieve superior performance across multiple datasets using only image-level labels.

Authors:Kartik Thakral, Tamar Glaser, Tal Hassner, Mayank Vatsa, Richa Singh
Title: Fine-Grained Erasure in Text-to-Image Diffusion-based Foundation Models
Abstract:
Existing unlearning algorithms in text-to-image generative models often fail to preserve the knowledge of semantically related concepts when removing specific target concepts: a challenge known as adjacency. To address this, we propose FADE (Fine grained Attenuation for Diffusion Erasure), introducing adjacency aware unlearning in diffusion models. FADE comprises two components: (1) the Concept Neighborhood, which identifies an adjacency set of related concepts, and (2) Mesh Modules, employing a structured combination of Expungement, Adjacency, and Guidance loss components. These enable precise erasure of target concepts while preserving fidelity across related and unrelated concepts. Evaluated on datasets like Stanford Dogs, Oxford Flowers, CUB, I2P, Imagenette, and ImageNet1k, FADE effectively removes target concepts with minimal impact on correlated concepts, achieving atleast a 12% improvement in retention performance over state-of-the-art methods.
中文: FADE提出了一种邻接感知的遗忘方法,能在扩散模型中精确擦除目标概念,同时保留相关概念,其保留性能比现有最优方法至少提升12%。
English: FADE introduces an adjacency-aware unlearning approach for diffusion models, effectively removing target concepts while preserving related ones with at least a 12% improvement in retention over existing methods.

Authors:Haoran Yin, Anna V. Kononova, Thomas Bäck, Niki van Stein
Title: Optimizing Photonic Structures with Large Language Model Driven Algorithm Discovery
Abstract:
We study how large language models can be used in combination with evolutionary computation techniques to automatically discover optimization algorithms for the design of photonic structures. Building on the Large Language Model Evolutionary Algorithm (LLaMEA) framework, we introduce structured prompt engineering tailored to multilayer photonic problems such as Bragg mirror, ellipsometry inverse analysis, and solar cell antireflection coatings. We systematically explore multiple evolutionary strategies, including (1+1), (1+5), (2+10), and others, to balance exploration and exploitation. Our experiments show that LLM-generated algorithms, generated using small-scale problem instances, can match or surpass established methods like quasi-oppositional differential evolution on large-scale realistic real-world problem instances. Notably, LLaMEA's self-debugging mutation loop, augmented by automatically extracted problem-specific insights, achieves strong anytime performance and reliable convergence across diverse problem scales. This work demonstrates the feasibility of domain-focused LLM prompts and evolutionary approaches in solving optical design tasks, paving the way for rapid, automated photonic inverse design.
中文: 本研究证明,将大型语言模型与进化计算相结合,可为光子结构设计生成媲美甚至超越传统方法的优化算法,实现高效的自动化逆向设计。
English: This research demonstrates that combining large language models with evolutionary computation can generate optimization algorithms that match or exceed traditional methods in designing photonic structures, enabling efficient automated inverse design.

Authors:Maryam Bala, Amina Imam Abubakar, Abdulhamid Abubakar, Abdulkadir Shehu Bichi, Hafsa Kabir Ahmad, Sani Abdullahi Sani, Idris Abdulmumin, Shamsuddeen Hassan Muhamad, Ibrahim Said Ahmad
Title: HausaNLP at SemEval-2025 Task 3: Towards a Fine-Grained Model-Aware Hallucination Detection
Abstract:
This paper presents our findings of the Multilingual Shared Task on Hallucinations and Related Observable Overgeneration Mistakes, MU-SHROOM, which focuses on identifying hallucinations and related overgeneration errors in large language models (LLMs). The shared task involves detecting specific text spans that constitute hallucinations in the outputs generated by LLMs in 14 languages. To address this task, we aim to provide a nuanced, model-aware understanding of hallucination occurrences and severity in English. We used natural language inference and fine-tuned a ModernBERT model using a synthetic dataset of 400 samples, achieving an Intersection over Union (IoU) score of 0.032 and a correlation score of 0.422. These results indicate a moderately positive correlation between the model's confidence scores and the actual presence of hallucinations. The IoU score indicates that our model has a relatively low overlap between the predicted hallucination span and the truth annotation. The performance is unsurprising, given the intricate nature of hallucination detection. Hallucinations often manifest subtly, relying on context, making pinpointing their exact boundaries formidable.
中文摘要:本研究介绍了MU-SHROOM多语言大模型幻觉检测任务,通过微调ModernBERT模型发现其置信度与幻觉存在存在中等相关性,但在精确定位幻觉范围方面仍面临挑战。
English Summary: This study introduces the MU-SHROOM task for detecting hallucinations in multilingual LLM outputs, employing a fine-tuned ModernBERT model that shows moderate correlation with actual hallucinations but limited precision in identifying exact error spans.

Authors:Yufan Ren, Zicong Jiang, Tong Zhang, Søren Forchhammer, Sabine Süsstrunk
Title: FDS: Frequency-Aware Denoising Score for Text-Guided Latent Diffusion Image Editing
Abstract:
Text-guided image editing using Text-to-Image (T2I) models often fails to yield satisfactory results, frequently introducing unintended modifications, such as the loss of local detail and color changes. In this paper, we analyze these failure cases and attribute them to the indiscriminate optimization across all frequency bands, even though only specific frequencies may require adjustment. To address this, we introduce a simple yet effective approach that enables the selective optimization of specific frequency bands within localized spatial regions for precise edits. Our method leverages wavelets to decompose images into different spatial resolutions across multiple frequency bands, enabling precise modifications at various levels of detail. To extend the applicability of our approach, we provide a comparative analysis of different frequency-domain techniques. Additionally, we extend our method to 3D texture editing by performing frequency decomposition on the triplane representation, enabling frequency-aware adjustments for 3D textures. Quantitative evaluations and user studies demonstrate the effectiveness of our method in producing high-quality and precise edits.
中文: 本文提出一种基于频率感知的图像编辑方法,通过小波分解对局部区域特定频段进行选择性优化,有效解决了文本引导编辑中的失真问题,并可扩展至三维纹理编辑,实验证明其具有高精度优势。
English: This paper introduces a frequency-aware image editing method that uses wavelet decomposition to selectively optimize specific frequency bands in localized regions, effectively addressing unintended modifications in text-guided editing and extending to 3D texture adjustments with demonstrated precision.

Authors:Haebin Shin, Lei Ji, Xiao Liu, Yeyun Gong
Title: Overcoming Vocabulary Mismatch: Vocabulary-agnostic Teacher Guided Language Modeling
Abstract:
Using large teacher models to guide the training of smaller student models has become the prevailing paradigm for efficient and effective learning. However, vocabulary mismatches between teacher and student language models pose significant challenges in language modeling, resulting in divergent token sequences and output distributions. To overcome these limitations, we propose Vocabulary-agnostic Teacher Guided Language Modeling (VocAgnoLM), a novel approach that bridges the gap caused by vocabulary mismatch through two key methods: (1) Token-level Lexical Alignment, which aligns token sequences across mismatched vocabularies, and (2) Teacher Guided Loss, which leverages the loss of teacher model to guide effective student training. We demonstrate its effectiveness in language modeling with 1B student model using various 7B teacher models with different vocabularies. Notably, with Qwen2.5-Math-Instruct, a teacher model sharing only about 6% of its vocabulary with TinyLlama, VocAgnoLM achieves a 46% performance improvement compared to naive continual pretraining. Furthermore, we demonstrate that VocAgnoLM consistently benefits from stronger teacher models, providing a robust solution to vocabulary mismatches in language modeling.
中文摘要:VocAgnoLM通过词汇对齐和教师引导损失解决师生模型间的词汇不匹配问题,在词汇重叠率极低的情况下仍能实现显著性能提升。
English Summary: VocAgnoLM addresses vocabulary mismatches between teacher and student language models through token-level lexical alignment and teacher-guided loss, achieving significant performance improvements even with minimal vocabulary overlap.

Authors:Dhruv Sahnan, David Corney, Irene Larraz, Giovanni Zagni, Ruben Miguez, Zhuohan Xie, Iryna Gurevych, Elizabeth Churchill, Tanmoy Chakraborty, Preslav Nakov
Title: Can LLMs Automate Fact-Checking Article Writing?
Abstract:
Automatic fact-checking aims to support professional fact-checkers by offering tools that can help speed up manual fact-checking. Yet, existing frameworks fail to address the key step of producing output suitable for broader dissemination to the general public: while human fact-checkers communicate their findings through fact-checking articles, automated systems typically produce little or no justification for their assessments. Here, we aim to bridge this gap. We argue for the need to extend the typical automatic fact-checking pipeline with automatic generation of full fact-checking articles. We first identify key desiderata for such articles through a series of interviews with experts from leading fact-checking organizations. We then develop QRAFT, an LLM-based agentic framework that mimics the writing workflow of human fact-checkers. Finally, we assess the practical usefulness of QRAFT through human evaluations with professional fact-checkers. Our evaluation shows that while QRAFT outperforms several previously proposed text-generation approaches, it lags considerably behind expert-written articles. We hope that our work will enable further research in this new and important direction.
中文: 当前自动事实核查系统尚无法生成面向公众的完整核查文章,为此开发的QRAFT框架虽能模拟人工流程并优于以往方法,但仍显著落后于专家撰写的文章。
English: Automatic fact-checking systems currently lack the ability to generate comprehensive articles for public dissemination, leading to the development of QRAFT, an LLM-based framework that mimics human workflows but still falls short of expert-written content despite outperforming previous methods.

Authors:Ali Rabeh, Adarsh Krishnamurthy, Baskar Ganapathysubramanian
Title: 3D Neural Operator-Based Flow Surrogates around 3D geometries: Signed Distance Functions and Derivative Constraints
Abstract:
Accurate modeling of fluid dynamics around complex geometries is critical for applications such as aerodynamic optimization and biomedical device design. While advancements in numerical methods and high-performance computing have improved simulation capabilities, the computational cost of high-fidelity 3D flow simulations remains a significant challenge. Scientific machine learning (SciML) offers an efficient alternative, enabling rapid and reliable flow predictions. In this study, we evaluate Deep Operator Networks (DeepONet) and Geometric-DeepONet, a variant that incorporates geometry information via signed distance functions (SDFs), on steady-state 3D flow over complex objects. Our dataset consists of 1,000 high-fidelity simulations spanning Reynolds numbers from 10 to 1,000, enabling comprehensive training and evaluation across a range of flow regimes. To assess model generalization, we test our models on a random and extrapolatory train-test splitting. Additionally, we explore a derivative-informed training strategy that augments standard loss functions with velocity gradient penalties and incompressibility constraints, improving physics consistency in 3D flow prediction. Our results show that Geometric-DeepONet improves boundary-layer accuracy by up to 32% compared to standard DeepONet. Moreover, incorporating derivative constraints enhances gradient accuracy by 25% in interpolation tasks and up to 45% in extrapolatory test scenarios, suggesting significant improvement in generalization capabilities to unseen 3D Reynolds numbers.
中文: 本研究证明,结合几何符号距离函数和导数约束训练的Geometric-DeepONet模型,在复杂几何体三维流动预测中显著提升了精度和泛化能力,边界层精度最高提升32%,外推场景中梯度精度最高提升45%。
English: This study demonstrates that Geometric-DeepONet, enhanced with geometry-informed signed distance functions and derivative-informed training, significantly improves accuracy and generalization in 3D flow predictions over complex objects, achieving up to 32% better boundary-layer accuracy and 45% enhanced gradient accuracy in extrapolation scenarios.

Authors:Jian Zhang, Zhiyuan Wang, Zhangqi Wang, Xinyu Zhang, Fangzhi Xu, Qika Lin, Rui Mao, Erik Cambria, Jun Liu
Title: MAPS: A Multi-Agent Framework Based on Big Seven Personality and Socratic Guidance for Multimodal Scientific Problem Solving
Abstract:
Multimodal scientific problems (MSPs) involve complex issues that require the integration of multiple modalities, such as text and diagrams, presenting a significant challenge in artificial intelligence. While progress has been made in addressing traditional scientific problems, MSPs still face two primary issues: the challenge of multi-modal comprehensive reasoning in scientific problem-solving and the lack of reflective and rethinking capabilities. To address these issues, we introduce a Multi-Agent framework based on the Big Seven Personality and Socratic guidance (MAPS). This framework employs seven distinct agents that leverage feedback mechanisms and the Socratic method to guide the resolution of MSPs. To tackle the first issue, we propose a progressive four-agent solving strategy, where each agent focuses on a specific stage of the problem-solving process. For the second issue, we introduce a Critic agent, inspired by Socratic questioning, which prompts critical thinking and stimulates autonomous learning. We conduct extensive experiments on the EMMA, Olympiad, and MathVista datasets, achieving promising results that outperform the current SOTA model by 15.84% across all tasks. Meanwhile, the additional analytical experiments also verify the model's progress as well as generalization ability.
中文摘要:MAPS框架通过七个专业智能体,采用渐进式解决策略和苏格拉底式批判提问来处理多模态科学问题,在多个数据集上性能超越现有最佳模型15.84%,同时验证了模型的泛化能力。
English Summary: The MAPS framework addresses multimodal scientific problems by employing seven specialized agents that enhance reasoning through progressive strategies and Socratic-inspired critical questioning, achieving a 15.84% performance improvement over state-of-the-art models.

Authors:Koki Hirooka, Abu Saleh Musa Miah, Tatsuya Murakami, Yuto Akiba, Yong Seok Hwang, Jungpil Shin
Title: Stack Transformer Based Spatial-Temporal Attention Model for Dynamic Multi-Culture Sign Language Recognition
Abstract:
Hand gesture-based Sign Language Recognition (SLR) serves as a crucial communication bridge between deaf and non-deaf individuals. Existing SLR systems perform well for their cultural SL but may struggle with multi-cultural sign languages (McSL). To address these challenges, this paper proposes a Stack Spatial-Temporal Transformer Network that leverages multi-head attention mechanisms to capture both spatial and temporal dependencies with hierarchical features using the Stack Transfer concept. In the proceed, firstly, we applied a fully connected layer to make a embedding vector which has high expressive power from the original dataset, then fed them a stack newly proposed transformer to achieve hierarchical features with short-range and long-range dependency. The network architecture is composed of several stages that process spatial and temporal relationships sequentially, ensuring effective feature extraction. After making the fully connected layer, the embedding vector is processed by the Spatial Multi-Head Attention Transformer, which captures spatial dependencies between joints. In the next stage, the Temporal Multi-Head Attention Transformer captures long-range temporal dependencies, and again, the features are concatenated with the output using another skip connection. The processed features are then passed to the Feed-Forward Network (FFN), which refines the feature representations further. After the FFN, additional skip connections are applied to combine the output with earlier layers, followed by a final normalization layer to produce the final output feature tensor. This process is repeated for 10 transformer blocks. The extensive experiment shows that the JSL, KSL and ASL datasets achieved good performance accuracy. Our approach demonstrates improved performance in McSL, and it will be consider as a novel work in this domain.
中文: 本文提出了一种堆叠时空变换器网络,通过多头注意力机制捕获层次化时空特征,在日语、韩语和美国手语数据集上实现了对多文化手语识别的性能提升。
English: This paper introduces a Stack Spatial-Temporal Transformer Network that uses multi-head attention to capture hierarchical spatial-temporal features, demonstrating improved performance for multi-cultural sign language recognition across JSL, KSL, and ASL datasets.

Authors:Maoji Zheng, Ziyu Xu, Qiming Xia, Hai Wu, Chenglu Wen, Cheng Wang
Title: Seg2Box: 3D Object Detection by Point-Wise Semantics Supervision
Abstract:
LiDAR-based 3D object detection and semantic segmentation are critical tasks in 3D scene understanding. Traditional detection and segmentation methods supervise their models through bounding box labels and semantic mask labels. However, these two independent labels inherently contain significant redundancy. This paper aims to eliminate the redundancy by supervising 3D object detection using only semantic labels. However, the challenge arises due to the incomplete geometry structure and boundary ambiguity of point-cloud instances, leading to inaccurate pseudo labels and poor detection results. To address these challenges, we propose a novel method, named Seg2Box. We first introduce a Multi-Frame Multi-Scale Clustering (MFMS-C) module, which leverages the spatio-temporal consistency of point clouds to generate accurate box-level pseudo-labels. Additionally, the Semantic?Guiding Iterative-Mining Self-Training (SGIM-ST) module is proposed to enhance the performance by progressively refining the pseudo-labels and mining the instances without generating pseudo-labels. Experiments on the Waymo Open Dataset and nuScenes Dataset show that our method significantly outperforms other competitive methods by 23.7\% and 10.3\% in mAP, respectively. The results demonstrate the great label-efficient potential and advancement of our method.
本文提出Seg2Box方法,通过仅使用语义标签进行3D物体检测来消除标签冗余,采用聚类和自训练模块在主流数据集上实现了卓越性能。
This paper introduces Seg2Box, a novel method that eliminates label redundancy by using only semantic labels for 3D object detection, employing clustering and self-training modules to achieve superior performance on major datasets.

Authors:Niki van Stein, Anna V. Kononova, Lars Kotthoff, Thomas Bäck
Title: Code Evolution Graphs: Understanding Large Language Model Driven Design of Algorithms
Abstract:
Large Language Models (LLMs) have demonstrated great promise in generating code, especially when used inside an evolutionary computation framework to iteratively optimize the generated algorithms. However, in some cases they fail to generate competitive algorithms or the code optimization stalls, and we are left with no recourse because of a lack of understanding of the generation process and generated codes. We present a novel approach to mitigate this problem by enabling users to analyze the generated codes inside the evolutionary process and how they evolve over repeated prompting of the LLM. We show results for three benchmark problem classes and demonstrate novel insights. In particular, LLMs tend to generate more complex code with repeated prompting, but additional complexity can hurt algorithmic performance in some cases. Different LLMs have different coding ``styles'' and generated code tends to be dissimilar to other LLMs. These two findings suggest that using different LLMs inside the code evolution frameworks might produce higher performing code than using only one LLM.
中文: 大语言模型在进化计算框架中生成代码具有潜力,但常因性能不足或优化停滞而受限;新方法通过分析代码演变和提示效果应对此问题,发现复杂度增加可能损害性能,且使用不同模型可提升代码质量。
English: Large Language Models show promise in code generation within evolutionary frameworks but face challenges with performance and optimization, which a new approach addresses by allowing users to analyze code evolution and prompting effects, revealing that increased complexity can harm performance and that using diverse LLMs may yield better results.

Authors:Shijing Chen, Shoaib Jameel, Mohamed Reda Bouadjenek, Feilong Tang, Usman Naseem, Basem Suleiman, Hakim Hacid, Flora D. Salim, Imran Razzak
Title: Enforcing Consistency and Fairness in Multi-level Hierarchical Classification with a Mask-based Output Layer
Abstract:
Traditional Multi-level Hierarchical Classification (MLHC) classifiers often rely on backbone models with $n$ independent output layers. This structure tends to overlook the hierarchical relationships between classes, leading to inconsistent predictions that violate the underlying taxonomy. Additionally, once a backbone architecture for an MLHC classifier is selected, adapting the model to accommodate new tasks can be challenging. For example, incorporating fairness to protect sensitive attributes within a hierarchical classifier necessitates complex adjustments to maintain the class hierarchy while enforcing fairness constraints. In this paper, we extend this concept to hierarchical classification by introducing a fair, model-agnostic layer designed to enforce taxonomy and optimize specific objectives, including consistency, fairness, and exact match. Our evaluations demonstrate that the proposed layer not only improves the fairness of predictions but also enforces the taxonomy, resulting in consistent predictions and superior performance. Compared to Large Language Models (LLMs) employing in-processing de-biasing techniques and models without any bias correction, our approach achieves better outcomes in both fairness and accuracy, making it particularly valuable in sectors like e-commerce, healthcare, and education, where predictive reliability is crucial.
中文: 传统多层分级分类器常忽略类间层级关系导致预测不一致,而我们提出的公平、模型无关层能强化分类体系并优化公平性、一致性等目标,相比现有方法在公平性和准确性上均表现更优。
English: Traditional MLHC classifiers often ignore hierarchical relationships, causing inconsistent predictions, but our proposed fair, model-agnostic layer enforces taxonomy and optimizes objectives like fairness and consistency, achieving superior results in both fairness and accuracy compared to existing methods.

Authors:Kazuhiro Sasabuchi, Naoki Wake, Atsushi Kanehira, Jun Takamatsu, Katsushi Ikeuchi
Title: Agreeing to Interact in Human-Robot Interaction using Large Language Models and Vision Language Models
Abstract:
In human-robot interaction (HRI), the beginning of an interaction is often complex. Whether the robot should communicate with the human is dependent on several situational factors (e.g., the current human's activity, urgency of the interaction, etc.). We test whether large language models (LLM) and vision language models (VLM) can provide solutions to this problem. We compare four different system-design patterns using LLMs and VLMs, and test on a test set containing 84 human-robot situations. The test set mixes several publicly available datasets and also includes situations where the appropriate action to take is open-ended. Our results using the GPT-4o and Phi-3 Vision model indicate that LLMs and VLMs are capable of handling interaction beginnings when the desired actions are clear, however, challenge remains in the open-ended situations where the model must balance between the human and robot situation.
中文摘要:大型语言模型和视觉语言模型(如GPT-4o和Phi-3 Vision)能有效处理目标明确的人机交互启动,但在需要平衡人类与机器人情境的开放式场景中仍面临挑战。
English Summary: Large language and vision models like GPT-4o and Phi-3 Vision can effectively initiate human-robot interactions in clear scenarios but struggle with open-ended situations requiring nuanced balance between human and robot contexts.

Authors:Prashant Kumar Choudhary, Nouhaila Innan, Muhammad Shafique, Rajeev Singh
Title: HQNN-FSP: A Hybrid Classical-Quantum Neural Network for Regression-Based Financial Stock Market Prediction
Abstract:
Financial time-series forecasting remains a challenging task due to complex temporal dependencies and market fluctuations. This study explores the potential of hybrid quantum-classical approaches to assist in financial trend prediction by leveraging quantum resources for improved feature representation and learning. A custom Quantum Neural Network (QNN) regressor is introduced, designed with a novel ansatz tailored for financial applications. Two hybrid optimization strategies are proposed: (1) a sequential approach where classical recurrent models (RNN/LSTM) extract temporal dependencies before quantum processing, and (2) a joint learning framework that optimizes classical and quantum parameters simultaneously. Systematic evaluation using TimeSeriesSplit, k-fold cross-validation, and predictive error analysis highlights the ability of these hybrid models to integrate quantum computing into financial forecasting workflows. The findings demonstrate how quantum-assisted learning can contribute to financial modeling, offering insights into the practical role of quantum resources in time-series analysis.
金融时间序列预测具有挑战性,本研究通过引入混合量子经典模型和定制量子神经网络,利用优化的特征学习和时序依赖提取来提升趋势预测能力。
Financial time-series forecasting is challenging, but this study introduces hybrid quantum-classical models with a custom Quantum Neural Network to improve trend prediction through optimized feature learning and temporal dependency extraction.

Authors:Qingsen Yan, Tao Hu, Genggeng Chen, Wei Dong, Yanning Zhang
Title: Boosting HDR Image Reconstruction via Semantic Knowledge Transfer
Abstract:
Recovering High Dynamic Range (HDR) images from multiple Low Dynamic Range (LDR) images becomes challenging when the LDR images exhibit noticeable degradation and missing content. Leveraging scene-specific semantic priors offers a promising solution for restoring heavily degraded regions. However, these priors are typically extracted from sRGB Standard Dynamic Range (SDR) images, the domain/format gap poses a significant challenge when applying it to HDR imaging. To address this issue, we propose a general framework that transfers semantic knowledge derived from SDR domain via self-distillation to boost existing HDR reconstruction. Specifically, the proposed framework first introduces the Semantic Priors Guided Reconstruction Model (SPGRM), which leverages SDR image semantic knowledge to address ill-posed problems in the initial HDR reconstruction results. Subsequently, we leverage a self-distillation mechanism that constrains the color and content information with semantic knowledge, aligning the external outputs between the baseline and SPGRM. Furthermore, to transfer the semantic knowledge of the internal features, we utilize a semantic knowledge alignment module (SKAM) to fill the missing semantic contents with the complementary masks. Extensive experiments demonstrate that our method can significantly improve the HDR imaging quality of existing methods.
中文: 该框架通过自蒸馏和语义对齐模块,将来自标准动态范围图像的语义知识迁移至高动态范围重建,有效提升现有方法对退化低动态范围图像的成像质量。
English: The proposed framework enhances HDR image reconstruction from degraded LDR images by transferring semantic knowledge from SDR domains through self-distillation and a semantic alignment module, significantly improving imaging quality.

Authors:Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, Yu Yue, Weinan Dai, Tiantian Fan, Gaohong Liu, Lingjun Liu, Xin Liu, Haibin Lin, Zhiqi Lin, Bole Ma, Guangming Sheng, Yuxuan Tong, Chi Zhang, Mofan Zhang, Wang Zhang, Hang Zhu, Jinhua Zhu, Jiaze Chen, Jiangjie Chen, Chengyi Wang, Hongli Yu, Yuxuan Song, Xiangpeng Wei, Hao Zhou, Jingjing Liu, Wei-Ying Ma, Ya-Qin Zhang, Lin Yan, Mu Qiao, Yonghui Wu, Mingxuan Wang
Title: DAPO: An Open-Source LLM Reinforcement Learning System at Scale
Abstract:
Inference scaling empowers LLMs with unprecedented reasoning ability, with reinforcement learning as the core technique to elicit complex reasoning. However, key technical details of state-of-the-art reasoning LLMs are concealed (such as in OpenAI o1 blog and DeepSeek R1 technical report), thus the community still struggles to reproduce their RL training results. We propose the $\textbf{D}$ecoupled Clip and $\textbf{D}$ynamic s$\textbf{A}$mpling $\textbf{P}$olicy $\textbf{O}$ptimization ($\textbf{DAPO}$) algorithm, and fully open-source a state-of-the-art large-scale RL system that achieves 50 points on AIME 2024 using Qwen2.5-32B base model. Unlike previous works that withhold training details, we introduce four key techniques of our algorithm that make large-scale LLM RL a success. In addition, we open-source our training code, which is built on the verl framework, along with a carefully curated and processed dataset. These components of our open-source system enhance reproducibility and support future research in large-scale LLM RL.
中文: 作者提出了DAPO算法和开源强化学习系统,在Qwen2.5-32B上实现了顶尖推理性能,同时完整公开关键技术细节与代码,以提升大语言模型强化学习的可复现性。
English: The authors introduce the DAPO algorithm and an open-source reinforcement learning system that achieves state-of-the-art reasoning performance on Qwen2.5-32B, while fully disclosing key technical details and code to enhance reproducibility in large-scale LLM training.

Authors:Hongyu Zhang, Yufan Deng, Shenghai Yuan, Peng Jin, Zesen Cheng, Yian Zhao, Chang Liu, Jie Chen
Title: MagicComp: Training-free Dual-Phase Refinement for Compositional Video Generation
Abstract:
Text-to-video (T2V) generation has made significant strides with diffusion models. However, existing methods still struggle with accurately binding attributes, determining spatial relationships, and capturing complex action interactions between multiple subjects. To address these limitations, we propose MagicComp, a training-free method that enhances compositional T2V generation through dual-phase refinement. Specifically, (1) During the Conditioning Stage: We introduce the Semantic Anchor Disambiguation to reinforces subject-specific semantics and resolve inter-subject ambiguity by progressively injecting the directional vectors of semantic anchors into original text embedding; (2) During the Denoising Stage: We propose Dynamic Layout Fusion Attention, which integrates grounding priors and model-adaptive spatial perception to flexibly bind subjects to their spatiotemporal regions through masked attention modulation. Furthermore, MagicComp is a model-agnostic and versatile approach, which can be seamlessly integrated into existing T2V architectures. Extensive experiments on T2V-CompBench and VBench demonstrate that MagicComp outperforms state-of-the-art methods, highlighting its potential for applications such as complex prompt-based and trajectory-controllable video generation. Project page: https://hong-yu-zhang.github.io/MagicComp-Page/.
中文: MagicComp是一种无需训练的方法,通过双阶段优化解决多主体间的语义模糊性并动态绑定时空区域,显著提升了文本到视频生成的组合能力,性能优于现有技术。
English: MagicComp is a training-free method that enhances text-to-video generation by resolving subject ambiguity and dynamically binding subjects to spatiotemporal regions through dual-phase refinement, outperforming existing methods.

Authors:Yongqi Li, Lu Yang, Jian Wang, Runyang You, Wenjie Li, Liqiang Nie
Title: Towards Harmless Multimodal Assistants with Blind Preference Optimization
Abstract:
Multimodal Large Language Models (MLLMs) have demonstrated impressive capabilities in multimodal understanding, reasoning, and interaction. Given the extensive applications of MLLMs, the associated safety issues have become increasingly critical. Due to the effectiveness of preference optimization in aligning MLLMs with human preferences, there is an urgent need for safety-related preference data for MLLMs. To address this, we construct the MMSafe-PO preference dataset towards harmless multimodal assistants, featuring multimodal instructions, the conversational format, and ranked paired responses from human feedback. We also identify two insightful observations: modality co-defense and modality cheating, which illustrate that MLLMs possess a certain level of inherent defense while still presenting unique safety challenges. Based on these observations, we propose the Blind Preference Optimization (BPO) approach. Comprehensive experiments on three benchmarks show that BPO effectively enhances the safety capabilities of MLLMs. Notably, BPO significantly improves the safety rate of the base MLLM by 45.0%, outperforming the DPO approach. Additionally, applying BPO to the MMSafe-PO dataset greatly reduces the base MLLM's unsafe rate on other safety benchmarks (14.5% on MM-SafetyBench and 82.9% on HarmEval, demonstrating the effectiveness and robustness of both the dataset and the approach. We release code and data at https://lu-yang666.github.io/MMsafe-PO-Web/.
中文摘要:本研究提出MMSafe-PO数据集和盲偏好优化方法,显著提升多模态大语言模型的安全性能,在基准测试中实现45%的安全率提升并展现卓越鲁棒性。
English Summary: The study introduces the MMSafe-PO dataset and Blind Preference Optimization (BPO) method to enhance multimodal large language models' safety, achieving a 45% safety improvement and demonstrating robustness across benchmarks.

Authors:Yang Ye, Junliang Guo, Haoyu Wu, Tianyu He, Tim Pearce, Tabish Rashid, Katja Hofmann, Jiang Bian
Title: Fast Autoregressive Video Generation with Diagonal Decoding
Abstract:
Autoregressive Transformer models have demonstrated impressive performance in video generation, but their sequential token-by-token decoding process poses a major bottleneck, particularly for long videos represented by tens of thousands of tokens. In this paper, we propose Diagonal Decoding (DiagD), a training-free inference acceleration algorithm for autoregressively pre-trained models that exploits spatial and temporal correlations in videos. Our method generates tokens along diagonal paths in the spatial-temporal token grid, enabling parallel decoding within each frame as well as partially overlapping across consecutive frames. The proposed algorithm is versatile and adaptive to various generative models and tasks, while providing flexible control over the trade-off between inference speed and visual quality. Furthermore, we propose a cost-effective finetuning strategy that aligns the attention patterns of the model with our decoding order, further mitigating the training-inference gap on small-scale models. Experiments on multiple autoregressive video generation models and datasets demonstrate that DiagD achieves up to $10\times$ speedup compared to naive sequential decoding, while maintaining comparable visual fidelity.
Chinese: 提出的对角线解码(DiagD)方法通过沿时空对角线并行解码令牌,将自回归视频生成速度提升高达10倍,同时保持与顺序解码相当的视觉质量。
English: The proposed Diagonal Decoding (DiagD) method accelerates autoregressive video generation by enabling parallel token decoding along spatial-temporal diagonals, achieving up to 10× speedup while maintaining visual quality comparable to sequential decoding.

Authors:Kartik Thakral, Tamar Glaser, Tal Hassner, Mayank Vatsa, Richa Singh
Title: Continual Unlearning for Foundational Text-to-Image Models without Generalization Erosion
Abstract:
How can we effectively unlearn selected concepts from pre-trained generative foundation models without resorting to extensive retraining? This research introduces `continual unlearning', a novel paradigm that enables the targeted removal of multiple specific concepts from foundational generative models, incrementally. We propose Decremental Unlearning without Generalization Erosion (DUGE) algorithm which selectively unlearns the generation of undesired concepts while preserving the generation of related, non-targeted concepts and alleviating generalization erosion. For this, DUGE targets three losses: a cross-attention loss that steers the focus towards images devoid of the target concept; a prior-preservation loss that safeguards knowledge related to non-target concepts; and a regularization loss that prevents the model from suffering from generalization erosion. Experimental results demonstrate the ability of the proposed approach to exclude certain concepts without compromising the overall integrity and performance of the model. This offers a pragmatic solution for refining generative models, adeptly handling the intricacies of model training and concept management lowering the risks of copyright infringement, personal or licensed material misuse, and replication of distinctive artistic styles. Importantly, it maintains the non-targeted concepts, thereby safeguarding the model's core capabilities and effectiveness.
Chinese: 本研究提出DUGE算法,通过持续遗忘机制选择性消除生成模型中特定概念,同时保护非目标概念并利用针对性损失防止泛化能力下降。
English: This research introduces the DUGE algorithm for continual unlearning, which selectively removes undesired concepts from generative models while preserving related ones and preventing generalization erosion through targeted losses.

Authors:Zi Haur Pang, Yahui Fu, Divesh Lala, Mikey Elmers, Koji Inoue, Tatsuya Kawahara
Title: Does the Appearance of Autonomous Conversational Robots Affect User Spoken Behaviors in Real-World Conference Interactions?
Abstract:
We investigate the impact of robot appearance on users' spoken behavior during real-world interactions by comparing a human-like android, ERICA, with a less anthropomorphic humanoid, TELECO. Analyzing data from 42 participants at SIGDIAL 2024, we extracted linguistic features such as disfluencies and syntactic complexity from conversation transcripts. The results showed moderate effect sizes, suggesting that participants produced fewer disfluencies and employed more complex syntax when interacting with ERICA. Further analysis involving training classification models like Naïve Bayes, which achieved an F1-score of 71.60\%, and conducting feature importance analysis, highlighted the significant role of disfluencies and syntactic complexity in interactions with robots of varying human-like appearances. Discussing these findings within the frameworks of cognitive load and Communication Accommodation Theory, we conclude that designing robots to elicit more structured and fluent user speech can enhance their communicative alignment with humans.
中文: 研究发现,与类人程度较低的机器人TELECO相比,参与者与高度仿真的机器人ERICA互动时言语更流畅、句法更复杂,表明机器人外观通过影响用户言语模式可促进人机沟通协调。
English: This study found that interacting with the more human-like robot ERICA led users to produce fewer disfluencies and more complex syntax compared to the less anthropomorphic TELECO, suggesting that robot appearance influences speech patterns and can enhance human-robot communicative alignment.

Authors:Chi Han, Heng Ji
Title: Computation Mechanism Behind LLM Position Generalization
Abstract:
Most written natural languages are composed of sequences of words and sentences. Similar to humans, large language models (LLMs) exhibit flexibility in handling textual positions - a phenomenon we term position generalization. They can understand texts with position perturbations and generalize to longer texts than those encountered during training with the latest techniques. These phenomena suggest that LLMs handle positions tolerantly, but how LLMs computationally process positional relevance remains largely unexplored. This work connects the linguistic phenomenon with LLMs' computational mechanisms. We show how LLMs enforce certain computational mechanisms for the aforementioned tolerance in position perturbations. Despite the complex design of the self-attention mechanism, this work reveals that LLMs learn a counterintuitive disentanglement of attention logits. Their values show a 0.959 linear correlation with an approximation of the arithmetic sum of positional relevance and semantic importance. Furthermore, we identify a prevalent pattern in intermediate features, which we prove theoretically enables this effect. The pattern, which is different from how randomly initialized parameters would behave, suggests that it is a learned behavior rather than a natural result of the model architecture. Based on these findings, we provide computational explanations and criteria for LLMs' position flexibilities. This work takes a pioneering step in linking position generalization with modern LLMs' internal mechanisms.
中文摘要:本研究揭示大型语言模型通过习得的计算机制实现位置泛化能力,其中注意力对数与位置相关性和语义重要性的算术近似值呈现0.959的线性相关性,为理解其内部文本位置处理机制提供了新见解。
English Summary: This study reveals that large language models achieve position generalization through learned computational mechanisms, where attention logits show strong linear correlation with the combined effects of positional relevance and semantic importance, providing new insights into their internal processing of textual positions.

Authors:Yunqi Shi, Chengrui Gao, Wanqi Ren, Siyuan Xu, Ke Xue, Mingxuan Yuan, Chao Qian, Zhi-Hua Zhou
Title: Open3DBench: Open-Source Benchmark for 3D-IC Backend Implementation and PPA Evaluation
Abstract:
This work introduces Open3DBench, an open-source 3D-IC backend implementation benchmark built upon the OpenROAD-flow-scripts framework, enabling comprehensive evaluation of power, performance, area, and thermal metrics. Our proposed flow supports modular integration of 3D partitioning, placement, 3D routing, RC extraction, and thermal simulation, aligning with advanced 3D flows that rely on commercial tools and in-house scripts. We present two foundational 3D placement algorithms: Open3D-Tiling, which emphasizes regular macro placement, and Open3D-DMP, which enhances wirelength optimization through cross-die co-placement with analytical placer DREAMPlace. Experimental results show significant improvements in area (51.19%), wirelength (24.06%), timing (30.84%), and power (5.72%) compared to 2D flows. The results also highlight that better wirelength does not necessarily lead to PPA gain, emphasizing the need of developing PPA-driven methods. Open3DBench offers a standardized, reproducible platform for evaluating 3D EDA methods, effectively bridging the gap between open-source tools and commercial solutions in 3D-IC design.
中文: Open3DBench是一个基于开源框架的3D-IC基准测试平台,通过集成完整设计流程实现了对功耗、性能、面积和热效应的综合评估,相比传统2D方案在多项指标上取得显著提升,为3D EDA方法提供了标准化评估基准。
English: Open3DBench is an open-source 3D-IC benchmark that enables comprehensive evaluation of power, performance, area, and thermal metrics, demonstrating significant improvements over 2D flows and providing a standardized platform for 3D EDA method assessment.

Authors:Jingyi Zhang, Jiaxing Huang, Huanjin Yao, Shunyu Liu, Xikun Zhang, Shijian Lu, Dacheng Tao
Title: R1-VL: Learning to Reason with Multimodal Large Language Models via Step-wise Group Relative Policy Optimization
Abstract:
Recent studies generally enhance MLLMs' reasoning capabilities via supervised fine-tuning on high-quality chain-of-thought reasoning data, which often leads models to merely imitate successful reasoning paths without understanding what the wrong reasoning paths are. In this work, we aim to enhance the MLLMs' reasoning ability beyond passively imitating positive reasoning paths. To this end, we design Step-wise Group Relative Policy Optimization (StepGRPO), a new online reinforcement learning framework that enables MLLMs to self-improve reasoning ability via simple, effective and dense step-wise rewarding. Specifically, StepGRPO introduces two novel rule-based reasoning rewards: Step-wise Reasoning Accuracy Reward (StepRAR) and Step-wise Reasoning Validity Reward (StepRVR). StepRAR rewards the reasoning paths that contain necessary intermediate reasoning steps via a soft key-step matching technique, while StepRAR rewards reasoning paths that follow a well-structured and logically consistent reasoning process through a reasoning completeness and logic evaluation strategy. With the proposed StepGRPO, we introduce R1-VL, a series of MLLMs with outstanding capabilities in step-by-step reasoning. Extensive experiments over 8 benchmarks demonstrate the superiority of our methods.
中文: 本研究提出StepGRPO强化学习框架,通过逐步奖励推理准确性和有效性来增强MLLMs的推理能力,超越简单模仿正确推理路径,开发出具有卓越逐步推理能力的R1-VL模型系列。
English: This research introduces StepGRPO, a reinforcement learning framework that enhances MLLMs' reasoning by rewarding step-wise accuracy and validity, moving beyond mere imitation of correct reasoning paths to develop superior step-by-step reasoning models called R1-VL.

Authors:Micol Spitale, Srikar Babu, Serhan Cakmak, Jiaee Cheong, Hatice Gunes
Title: Exploring Causality for HRI: A Case Study on Robotic Mental Well-being Coaching
Abstract:
One of the primary goals of Human-Robot Interaction (HRI) research is to develop robots that can interpret human behavior and adapt their responses accordingly. Adaptive learning models, such as continual and reinforcement learning, play a crucial role in improving robots' ability to interact effectively in real-world settings. However, these models face significant challenges due to the limited availability of real-world data, particularly in sensitive domains like healthcare and well-being. This data scarcity can hinder a robot's ability to adapt to new situations. To address these challenges, causality provides a structured framework for understanding and modeling the underlying relationships between actions, events, and outcomes. By moving beyond mere pattern recognition, causality enables robots to make more explainable and generalizable decisions. This paper presents an exploratory causality-based analysis through a case study of an adaptive robotic coach delivering positive psychology exercises over four weeks in a workplace setting. The robotic coach autonomously adapts to multimodal human behaviors, such as facial valence and speech duration. By conducting both macro- and micro-level causal analyses, this study aims to gain deeper insights into how adaptability can enhance well-being during interactions. Ultimately, this research seeks to advance our understanding of how causality can help overcome challenges in HRI, particularly in real-world applications.
中文: 人机交互研究通过因果分析开发自适应机器人,以解决现实数据稀缺问题并提升决策能力,如自主机器人教练案例所示,旨在增进交互中的幸福感。
English: Human-Robot Interaction research aims to develop adaptive robots using causal analysis to overcome data scarcity and enhance decision-making in sensitive domains like well-being, as demonstrated through a case study of an autonomous robotic coach.

Authors:Yunqi Shi, Siyuan Xu, Shixiong Kai, Xi Lin, Ke Xue, Mingxuan Yuan, Chao Qian
Title: Timing-Driven Global Placement by Efficient Critical Path Extraction
Abstract:
Timing optimization during the global placement of integrated circuits has been a significant focus for decades, yet it remains a complex, unresolved issue. Recent analytical methods typically use pin-level timing information to adjust net weights, which is fast and simple but neglects the path-based nature of the timing graph. The existing path-based methods, however, cannot balance the accuracy and efficiency due to the exponential growth of number of critical paths. In this work, we propose a GPU-accelerated timing-driven global placement framework, integrating accurate path-level information into the efficient DREAMPlace infrastructure. It optimizes the fine-grained pin-to-pin attraction objective and is facilitated by efficient critical path extraction. We also design a quadratic distance loss function specifically to align with the RC timing model. Experimental results demonstrate that our method significantly outperforms the current leading timing-driven placers, achieving an average improvement of 40.5% in total negative slack (TNS) and 8.3% in worst negative slack (WNS), as well as an improvement in half-perimeter wirelength (HPWL).
中文: 本研究提出了一种GPU加速的时序驱动全局布局框架,将路径级时序信息整合到DREAMPlace中,相比现有方法在时序指标和线长方面均实现了显著提升。
English: This study introduces a GPU-accelerated timing-driven global placement framework that integrates path-level timing information into DREAMPlace, achieving significant improvements in timing metrics and wirelength compared to existing methods.

Authors:Chonghao Sima, Kashyap Chitta, Zhiding Yu, Shiyi Lan, Ping Luo, Andreas Geiger, Hongyang Li, Jose M. Alvarez
Title: Centaur: Robust End-to-End Autonomous Driving with Test-Time Training
Abstract:
How can we rely on an end-to-end autonomous vehicle's complex decision-making system during deployment? One common solution is to have a ``fallback layer'' that checks the planned trajectory for rule violations and replaces it with a pre-defined safe action if necessary. Another approach involves adjusting the planner's decisions to minimize a pre-defined ``cost function'' using additional system predictions such as road layouts and detected obstacles. However, these pre-programmed rules or cost functions cannot learn and improve with new training data, often resulting in overly conservative behaviors. In this work, we propose Centaur (Cluster Entropy for Test-time trAining using Uncertainty) which updates a planner's behavior via test-time training, without relying on hand-engineered rules or cost functions. Instead, we measure and minimize the uncertainty in the planner's decisions. For this, we develop a novel uncertainty measure, called Cluster Entropy, which is simple, interpretable, and compatible with state-of-the-art planning algorithms. Using data collected at prior test-time time-steps, we perform an update to the model's parameters using a gradient that minimizes the Cluster Entropy. With only this sole gradient update prior to inference, Centaur exhibits significant improvements, ranking first on the navtest leaderboard with notable gains in safety-critical metrics such as time to collision. To provide detailed insights on a per-scenario basis, we also introduce navsafe, a challenging new benchmark, which highlights previously undiscovered failure modes of driving models.
中文: Centaur提出了一种测试时训练方法,通过最小化聚类熵这一新型不确定性度量来更新规划器的行为,无需依赖预设规则或成本函数,从而提升自动驾驶决策能力,在安全关键指标上表现卓越。
English: Centaur introduces a test-time training method that updates a planner's behavior by minimizing cluster entropy, a novel uncertainty measure, to enhance autonomous vehicle decision-making without relying on pre-set rules or cost functions, achieving top performance in safety metrics.

Authors:Siyuan Huang, Yue Liao, Siyuan Feng, Shu Jiang, Si Liu, Hongsheng Li, Maoqing Yao, Guanghui Ren
Title: Adversarial Data Collection: Human-Collaborative Perturbations for Efficient and Robust Robotic Imitation Learning
Abstract:
The pursuit of data efficiency, where quality outweighs quantity, has emerged as a cornerstone in robotic manipulation, especially given the high costs associated with real-world data collection. We propose that maximizing the informational density of individual demonstrations can dramatically reduce reliance on large-scale datasets while improving task performance. To this end, we introduce Adversarial Data Collection, a Human-in-the-Loop (HiL) framework that redefines robotic data acquisition through real-time, bidirectional human-environment interactions. Unlike conventional pipelines that passively record static demonstrations, ADC adopts a collaborative perturbation paradigm: during a single episode, an adversarial operator dynamically alters object states, environmental conditions, and linguistic commands, while the tele-operator adaptively adjusts actions to overcome these evolving challenges. This process compresses diverse failure-recovery behaviors, compositional task variations, and environmental perturbations into minimal demonstrations. Our experiments demonstrate that ADC-trained models achieve superior compositional generalization to unseen task instructions, enhanced robustness to perceptual perturbations, and emergent error recovery capabilities. Strikingly, models trained with merely 20% of the demonstration volume collected through ADC significantly outperform traditional approaches using full datasets. These advances bridge the gap between data-centric learning paradigms and practical robotic deployment, demonstrating that strategic data acquisition, not merely post-hoc processing, is critical for scalable, real-world robot learning. Additionally, we are curating a large-scale ADC-Robotics dataset comprising real-world manipulation tasks with adversarial perturbations. This benchmark will be open-sourced to facilitate advancements in robotic imitation learning.
中文: 该研究提出对抗性数据收集这一人机交互框架,通过将多样化挑战压缩到少量演示中提升机器人操作性能,仅需传统数据量的20%即可实现更优效果。
English: The study introduces Adversarial Data Collection, a human-in-the-loop framework that enhances robotic manipulation by compressing diverse challenges into minimal demonstrations, achieving superior performance with only 20% of traditional data volume.

Authors:Avinash Madasu, Vasudev Lal, Phillip Howard
Title: Pruning the Paradox: How CLIP's Most Informative Heads Enhance Performance While Amplifying Bias
Abstract:
CLIP is one of the most popular foundation models and is heavily used for many vision-language tasks, yet little is known about its inner workings. As CLIP is increasingly deployed in real-world applications, it is becoming even more critical to understand its limitations and embedded social biases to mitigate potentially harmful downstream consequences. However, the question of what internal mechanisms drive both the impressive capabilities as well as problematic shortcomings of CLIP has largely remained unanswered. To bridge this gap, we study the conceptual consistency of text descriptions for attention heads in CLIP-like models. Specifically, we propose Concept Consistency Score (CCS), a novel interpretability metric that measures how consistently individual attention heads in CLIP models align with specific concepts. Our soft-pruning experiments reveal that high CCS heads are critical for preserving model performance, as pruning them leads to a significantly larger performance drop than pruning random or low CCS heads. Notably, we find that high CCS heads capture essential concepts and play a key role in out-of-domain detection, concept-specific reasoning, and video-language understanding. Moreover, we prove that high CCS heads learn spurious correlations which amplify social biases. These results position CCS as a powerful interpretability metric exposing the paradox of performance and social biases in CLIP models.
Chinese: 尽管CLIP模型被广泛应用,其内部机制仍不明确,为此我们提出概念一致性评分(CCS)来衡量注意力头与概念的对齐程度,发现高CCS头对性能至关重要但会放大社会偏见。
English: CLIP's internal mechanisms remain largely unknown despite its widespread use, prompting the development of the Concept Consistency Score (CCS) to measure how consistently attention heads align with concepts, revealing that high-CCS heads are crucial for performance but also amplify social biases.

Authors:Yuhao Wang, Yongfeng Lv, Pingping Zhang, Huchuan Lu
Title: IDEA: Inverted Text with Cooperative Deformable Aggregation for Multi-modal Object Re-Identification
Abstract:
Multi-modal object Re-IDentification (ReID) aims to retrieve specific objects by utilizing complementary information from various modalities. However, existing methods focus on fusing heterogeneous visual features, neglecting the potential benefits of text-based semantic information. To address this issue, we first construct three text-enhanced multi-modal object ReID benchmarks. To be specific, we propose a standardized multi-modal caption generation pipeline for structured and concise text annotations with Multi-modal Large Language Models (MLLMs). Besides, current methods often directly aggregate multi-modal information without selecting representative local features, leading to redundancy and high complexity. To address the above issues, we introduce IDEA, a novel feature learning framework comprising the Inverted Multi-modal Feature Extractor (IMFE) and Cooperative Deformable Aggregation (CDA). The IMFE utilizes Modal Prefixes and an InverseNet to integrate multi-modal information with semantic guidance from inverted text. The CDA adaptively generates sampling positions, enabling the model to focus on the interplay between global features and discriminative local features. With the constructed benchmarks and the proposed modules, our framework can generate more robust multi-modal features under complex scenarios. Extensive experiments on three multi-modal object ReID benchmarks demonstrate the effectiveness of our proposed method.
中文摘要:本研究提出IDEA框架,通过逆向多模态特征提取器和协作可变形聚合机制,整合文本语义并自适应关注判别性特征,从而增强多模态目标重识别性能,并在新建基准上验证了其有效性。
English Summary: The study introduces IDEA, a feature learning framework with an Inverted Multi-modal Feature Extractor and Cooperative Deformable Aggregation, to enhance multi-modal object ReID by integrating text semantics and adaptively focusing on discriminative features, validated through newly constructed benchmarks.

Authors:Xiaobo Xia, Xiaofeng Liu, Jiale Liu, Kuai Fang, Lu Lu, Samet Oymak, William S. Currie, Tongliang Liu
Title: Identifying Trustworthiness Challenges in Deep Learning Models for Continental-Scale Water Quality Prediction
Abstract:
Water quality is foundational to environmental sustainability, ecosystem resilience, and public health. Deep learning models, particularly Long Short-Term Memory (LSTM) networks, offer transformative potential for large-scale water quality prediction and scientific insights generation. However, their widespread adoption in high-stakes decision-making, such as pollution mitigation and equitable resource allocation, is prevented by unresolved trustworthiness challenges including fairness, uncertainty, interpretability, robustness, generalizability, and reproducibility. In this work, we present the first comprehensive evaluation of trustworthiness in a continental-scale multi-task LSTM model predicting 20 water quality variables (encompassing physical/chemical processes, geochemical weathering, and nutrient cycling) across 482 U.S. basins. Our investigation uncovers systematic patterns of model performance disparities linked to basin characteristics, the inherent complexity of biogeochemical processes, and variable predictability, emphasizing critical performance fairness concerns. We further propose methodological frameworks for quantitatively evaluating critical aspects of trustworthiness, including uncertainty, interpretability, and robustness, identifying key limitations that could challenge reliable real-world deployment. This work serves as a timely call to action for advancing trustworthy data-driven methods for water resources management and provides a pathway to offering critical insights for researchers, decision-makers, and practitioners seeking to leverage artificial intelligence (AI) responsibly in environmental management.
中文: 深度学习模型在大规模水质预测中潜力巨大,但面临公平性、不确定性和可解释性等可信度挑战,本研究对此进行评估并提出方法框架,以推动人工智能在环境管理中的负责任应用。
English: Deep learning models like LSTM show great promise for large-scale water quality prediction but face trustworthiness challenges in fairness, uncertainty, and interpretability, which this study evaluates and addresses to promote responsible AI use in environmental management.

Authors:Ryan Quek Wei Heng, Edoardo Vittori, Keane Ong, Rui Mao, Erik Cambria, Gianmarco Mengaldo
Title: Leveraging LLMS for Top-Down Sector Allocation In Automated Trading
Abstract:
This paper introduces a methodology leveraging Large Language Models (LLMs) for sector-level portfolio allocation through systematic analysis of macroeconomic conditions and market sentiment. Our framework emphasizes top-down sector allocation by processing multiple data streams simultaneously, including policy documents, economic indicators, and sentiment patterns. Empirical results demonstrate superior risk-adjusted returns compared to traditional cross momentum strategies, achieving a Sharpe ratio of 2.51 and portfolio return of 8.79% versus -0.61 and -1.39% respectively. These results suggest that LLM-based systematic macro analysis presents a viable approach for enhancing automated portfolio allocation decisions at the sector level.
中文: 本文提出了一种利用大型语言模型(LLMs)通过分析宏观经济状况和市场情绪进行行业层面投资组合配置的方法,相比传统策略实现了更优的风险调整后收益。
English: This paper presents a methodology using Large Language Models (LLMs) for sector-level portfolio allocation by analyzing macroeconomic conditions and market sentiment, achieving superior risk-adjusted returns compared to traditional strategies.

Authors:Siddhant Dutta, Nouhaila Innan, Khadijeh Najafi, Sadok Ben Yahia, Muhammad Shafique
Title: QUIET-SR: Quantum Image Enhancement Transformer for Single Image Super-Resolution
Abstract:
Recent advancements in Single-Image Super-Resolution (SISR) using deep learning have significantly improved image restoration quality. However, the high computational cost of processing high-resolution images due to the large number of parameters in classical models, along with the scalability challenges of quantum algorithms for image processing, remains a major obstacle. In this paper, we propose the Quantum Image Enhancement Transformer for Super-Resolution (QUIET-SR), a hybrid framework that extends the Swin transformer architecture with a novel shifted quantum window attention mechanism, built upon variational quantum neural networks. QUIET-SR effectively captures complex residual mappings between low-resolution and high-resolution images, leveraging quantum attention mechanisms to enhance feature extraction and image restoration while requiring a minimal number of qubits, making it suitable for the Noisy Intermediate-Scale Quantum (NISQ) era. We evaluate our framework in MNIST (30.24 PSNR, 0.989 SSIM), FashionMNIST (29.76 PSNR, 0.976 SSIM) and the MedMNIST dataset collection, demonstrating that QUIET-SR achieves PSNR and SSIM scores comparable to state-of-the-art methods while using fewer parameters. These findings highlight the potential of scalable variational quantum machine learning models for SISR, marking a step toward practical quantum-enhanced image super-resolution.
中文摘要:提出的QUIET-SR框架将量子计算与变换器架构相结合,在显著减少参数使用的同时,实现了与传统方法相当的图像超分辨率性能。
English Summary: The proposed QUIET-SR framework combines quantum computing with transformer architecture to achieve efficient super-resolution performance comparable to classical methods while using significantly fewer parameters.

Authors:Samundra Karki, Mehdi Shadkah, Cheng-Hau Yang, Aditya Balu, Guglielmo Scovazzi, Adarsh Krishnamurthy, Baskar Ganapathysubramanian
Title: Direct Flow Simulations with Implicit Neural Representation of Complex Geometry
Abstract:
Implicit neural representations have emerged as a powerful approach for encoding complex geometries as continuous functions. These implicit models are widely used in computer vision and 3D content creation, but their integration into scientific computing workflows, such as finite element or finite volume simulations, remains limited. One reason is that conventional simulation pipelines require explicit geometric inputs (meshes), forcing INR-based shapes to be converted to meshes--a step that introduces approximation errors, computational overhead, and significant manual effort. Immersed boundary methods partially alleviate this issue by allowing simulations on background grids without body-fitted meshes. However, they still require an explicit boundary description and can suffer from numerical artifacts, such as sliver cut cells. The shifted boundary method (SBM) eliminates the need for explicit geometry by using grid-aligned surrogate boundaries, making it inherently compatible with implicit shape representations. Here, we present a framework that directly couples neural implicit geometries with SBM to perform high-fidelity fluid flow simulations without any intermediate mesh generation. By leveraging neural network inference, our approach computes the surrogate boundary and distance vectors required by SBM on-the-fly directly from the INR, thus completely bypassing traditional geometry processing. We demonstrate this approach on canonical 2D and 3D flow benchmarks (lid-driven cavity flows) and complex geometries (gyroids, the Stanford bunny, and AI-generated shapes), achieving simulation accuracy comparable to conventional mesh-based methods. This work highlights a novel pathway for integrating AI-driven geometric representations into computational physics, establishing INRs as a versatile and scalable tool for simulations and removing a long-standing bottleneck in geometry handling.
中文: 本研究提出了一种将神经隐式几何与移位边界法直接结合的框架,无需网格生成即可实现高保真流体流动模拟,在绕过几何处理瓶颈的同时达到了与传统方法相当的精度。
English: This study introduces a framework that directly integrates neural implicit geometries with the shifted boundary method to enable high-fidelity fluid flow simulations without mesh generation, achieving accuracy comparable to traditional methods while bypassing geometry processing bottlenecks.

Authors:Tristan Tomilin, Meng Fang, Mykola Pechenizkiy
Title: HASARD: A Benchmark for Vision-Based Safe Reinforcement Learning in Embodied Agents
Abstract:
Advancing safe autonomous systems through reinforcement learning (RL) requires robust benchmarks to evaluate performance, analyze methods, and assess agent competencies. Humans primarily rely on embodied visual perception to safely navigate and interact with their surroundings, making it a valuable capability for RL agents. However, existing vision-based 3D benchmarks only consider simple navigation tasks. To address this shortcoming, we introduce \textbf{HASARD}, a suite of diverse and complex tasks to $\textbf{HA}$rness $\textbf{SA}$fe $\textbf{R}$L with $\textbf{D}$oom, requiring strategic decision-making, comprehending spatial relationships, and predicting the short-term future. HASARD features three difficulty levels and two action spaces. An empirical evaluation of popular baseline methods demonstrates the benchmark's complexity, unique challenges, and reward-cost trade-offs. Visualizing agent navigation during training with top-down heatmaps provides insight into a method's learning process. Incrementally training across difficulty levels offers an implicit learning curriculum. HASARD is the first safe RL benchmark to exclusively target egocentric vision-based learning, offering a cost-effective and insightful way to explore the potential and boundaries of current and future safe RL methods. The environments and baseline implementations are open-sourced at https://sites.google.com/view/hasard-bench/.
中文:HASARD基准测试套件通过提供需要战略决策和空间推理的复杂视觉3D任务,弥补了现有简单导航基准的不足,旨在推动安全强化学习的发展。
English: The HASARD benchmark is introduced to advance safe reinforcement learning by providing complex, vision-based 3D tasks that require strategic decision-making and spatial reasoning, addressing the limitations of existing simple navigation benchmarks.

Authors:Elvis Kimara, Mozhgan Hadadi, Jackson Godbersen, Aditya Balu, Talukder Jubery, Yawei Li, Adarsh Krishnamurthy, Patrick S. Schnable, Baskar Ganapathysubramanian
Title: MaizeField3D: A Curated 3D Point Cloud and Procedural Model Dataset of Field-Grown Maize from a Diversity Panel
Abstract:
The development of artificial intelligence (AI) and machine learning (ML) based tools for 3D phenotyping, especially for maize, has been limited due to the lack of large and diverse 3D datasets. 2D image datasets fail to capture essential structural details such as leaf architecture, plant volume, and spatial arrangements that 3D data provide. To address this limitation, we present MaizeField3D (https://baskargroup.github.io/MaizeField3D/), a curated dataset of 3D point clouds of field-grown maize plants from a diverse genetic panel, designed to be AI-ready for advancing agricultural research. Our dataset includes 1,045 high-quality point clouds of field-grown maize collected using a terrestrial laser scanner (TLS). Point clouds of 520 plants from this dataset were segmented and annotated using a graph-based segmentation method to isolate individual leaves and stalks, ensuring consistent labeling across all samples. This labeled data was then used for fitting procedural models that provide a structured parametric representation of the maize plants. The leaves of the maize plants in the procedural models are represented using Non-Uniform Rational B-Spline (NURBS) surfaces that were generated using a two-step optimization process combining gradient-free and gradient-based methods. We conducted rigorous manual quality control on all datasets, correcting errors in segmentation, ensuring accurate leaf ordering, and validating metadata annotations. The dataset also includes metadata detailing plant morphology and quality, alongside multi-resolution subsampled point cloud data (100k, 50k, 10k points), which can be readily used for different downstream computational tasks. MaizeField3D will serve as a comprehensive foundational dataset for AI-driven phenotyping, plant structural analysis, and 3D applications in agricultural research.
中文: MaizeField3D数据集通过提供1045个带标注的田间玉米三维点云数据,解决了三维表型分析数据匮乏的问题,其分割的叶片茎秆结构和多分辨率数据为农业人工智能研究奠定了基础。
English: The MaizeField3D dataset addresses the scarcity of diverse 3D maize data by providing 1,045 annotated point clouds with segmented leaves and stalks, enabling AI-driven agricultural research through structured parametric models and multi-resolution data.

Authors:José Gonçalves, Miguel Silva, Bernardo Cabral, Tiago Dias, Eva Maia, Isabel Praça, Ricardo Severino, Luís Lino Ferreira
Title: Evaluating LLaMA 3.2 for Software Vulnerability Detection
Abstract:
Deep Learning (DL) has emerged as a powerful tool for vulnerability detection, often outperforming traditional solutions. However, developing effective DL models requires large amounts of real-world data, which can be difficult to obtain in sufficient quantities. To address this challenge, DiverseVul dataset has been curated as the largest dataset of vulnerable and non-vulnerable C/C++ functions extracted exclusively from real-world projects. Its goal is to provide high-quality, large-scale samples for training DL models. However, during our study several inconsistencies were identified in the raw dataset while applying pre-processing techniques, highlighting the need for a refined version. In this work, we present a refined version of DiverseVul dataset, which is used to fine-tune a large language model, LLaMA 3.2, for vulnerability detection. Experimental results show that the use of pre-processing techniques led to an improvement in performance, with the model achieving an F1-Score of 66%, a competitive result when compared to our baseline, which achieved a 47% F1-Score in software vulnerability detection.
中文: 本研究提出了经过预处理的优化版DiverseVul数据集,通过解决原始数据不一致性问题,成功微调LLaMA 3.2模型,在漏洞检测任务中取得了66%的F1值,较47%的基准表现实现显著提升。
English: This work presents a refined version of the DiverseVul dataset that addresses inconsistencies through preprocessing, enabling fine-tuning of LLaMA 3.2 to achieve a competitive 66% F1-score in vulnerability detection—a significant improvement over the 47% baseline.

Authors:Huiyang Shao, Xin Xia, Yuhong Yang, Yuxi Ren, Xing Wang, Xuefeng Xiao
Title: RayFlow: Instance-Aware Diffusion Acceleration via Adaptive Flow Trajectories
Abstract:
Diffusion models have achieved remarkable success across various domains. However, their slow generation speed remains a critical challenge. Existing acceleration methods, while aiming to reduce steps, often compromise sample quality, controllability, or introduce training complexities. Therefore, we propose RayFlow, a novel diffusion framework that addresses these limitations. Unlike previous methods, RayFlow guides each sample along a unique path towards an instance-specific target distribution. This method minimizes sampling steps while preserving generation diversity and stability. Furthermore, we introduce Time Sampler, an importance sampling technique to enhance training efficiency by focusing on crucial timesteps. Extensive experiments demonstrate RayFlow's superiority in generating high-quality images with improved speed, control, and training efficiency compared to existing acceleration techniques.
中文: RayFlow是一种新颖的扩散框架,通过引导样本沿独特路径到达实例特定目标来加速生成,在保持质量和多样性的同时,引入时间采样器以提高训练效率。
English: RayFlow is a novel diffusion framework that accelerates generation by guiding samples along unique paths to instance-specific targets, preserving quality and diversity while introducing a Time Sampler for efficient training.

Authors:Luyi Jiang, Jiayuan Chen, Lu Lu, Xinwei Peng, Lihao Liu, Junjun He, Jie Xu
Title: Benchmarking Chinese Medical LLMs: A Medbench-based Analysis of Performance Gaps and Hierarchical Optimization Strategies
Abstract:
The evaluation and improvement of medical large language models (LLMs) are critical for their real-world deployment, particularly in ensuring accuracy, safety, and ethical alignment. Existing frameworks inadequately dissect domain-specific error patterns or address cross-modal challenges. This study introduces a granular error taxonomy through systematic analysis of top 10 models on MedBench, categorizing incorrect responses into eight types: Omissions, Hallucination, Format Mismatch, Causal Reasoning Deficiency, Contextual Inconsistency, Unanswered, Output Error, and Deficiency in Medical Language Generation. Evaluation of 10 leading models reveals vulnerabilities: despite achieving 0.86 accuracy in medical knowledge recall, critical reasoning tasks show 96.3% omission, while safety ethics evaluations expose alarming inconsistency (robustness score: 0.79) under option shuffled. Our analysis uncovers systemic weaknesses in knowledge boundary enforcement and multi-step reasoning. To address these, we propose a tiered optimization strategy spanning four levels, from prompt engineering and knowledge-augmented retrieval to hybrid neuro-symbolic architectures and causal reasoning frameworks. This work establishes an actionable roadmap for developing clinically robust LLMs while redefining evaluation paradigms through error-driven insights, ultimately advancing the safety and trustworthiness of AI in high-stakes medical environments.
中文摘要:本研究建立了医疗大语言模型的细粒度错误分类体系,揭示了八类核心缺陷并提出四层优化策略,旨在提升临床应用的可靠性与安全性。
English Summary: This study develops a granular error taxonomy for medical LLMs, identifying eight critical failure types and proposing a four-level optimization strategy to enhance clinical robustness and safety.

Authors:Xiaoyi Liang, Mouxiao Bian, Moxin Chen, Lihao Liu, Junjun He, Jie Xu, Lin Li
Title: A Novel Ophthalmic Benchmark for Evaluating Multimodal Large Language Models with Fundus Photographs and OCT Images
Abstract:
In recent years, large language models (LLMs) have demonstrated remarkable potential across various medical applications. Building on this foundation, multimodal large language models (MLLMs) integrate LLMs with visual models to process diverse inputs, including clinical data and medical images. In ophthalmology, LLMs have been explored for analyzing optical coherence tomography (OCT) reports, assisting in disease classification, and even predicting treatment outcomes. However, existing MLLM benchmarks often fail to capture the complexities of real-world clinical practice, particularly in the analysis of OCT images. Many suffer from limitations such as small sample sizes, a lack of diverse OCT datasets, and insufficient expert validation. These shortcomings hinder the accurate assessment of MLLMs' ability to interpret OCT scans and their broader applicability in ophthalmology. Our dataset, curated through rigorous quality control and expert annotation, consists of 439 fundus images and 75 OCT images. Using a standardized API-based framework, we assessed seven mainstream MLLMs and observed significant variability in diagnostic accuracy across different diseases. While some models performed well in diagnosing conditions such as diabetic retinopathy and age-related macular degeneration, they struggled with others, including choroidal neovascularization and myopia, highlighting inconsistencies in performance and the need for further refinement. Our findings emphasize the importance of developing clinically relevant benchmarks to provide a more accurate assessment of MLLMs' capabilities. By refining these models and expanding their scope, we can enhance their potential to transform ophthalmic diagnosis and treatment.
中文: 多模态大语言模型在眼科应用中展现出潜力,但由于数据集有限且对不同疾病的诊断准确性存在差异,其在解读光学相干断层扫描图像时面临挑战,亟需建立临床相关基准以提升实用性。
English: Multimodal large language models show promise in ophthalmology but face challenges in accurately interpreting optical coherence tomography images due to limited datasets and inconsistent diagnostic performance across diseases, highlighting the need for clinically relevant benchmarks.

Authors:Tianai Huang, Lu Lu, Jiayuan Chen, Lihao Liu, Junjun He, Yuping Zhao, Wenchao Tang, Jie Xu
Title: TCM-3CEval: A Triaxial Benchmark for Assessing Responses from Large Language Models in Traditional Chinese Medicine
Abstract:
Large language models (LLMs) excel in various NLP tasks and modern medicine, but their evaluation in traditional Chinese medicine (TCM) is underexplored. To address this, we introduce TCM3CEval, a benchmark assessing LLMs in TCM across three dimensions: core knowledge mastery, classical text understanding, and clinical decision-making. We evaluate diverse models, including international (e.g., GPT-4o), Chinese (e.g., InternLM), and medical-specific (e.g., PLUSE). Results show a performance hierarchy: all models have limitations in specialized subdomains like Meridian & Acupoint theory and Various TCM Schools, revealing gaps between current capabilities and clinical needs. Models with Chinese linguistic and cultural priors perform better in classical text interpretation and clinical reasoning. TCM-3CEval sets a standard for AI evaluation in TCM, offering insights for optimizing LLMs in culturally grounded medical domains. The benchmark is available on Medbench's TCM track, aiming to assess LLMs' TCM capabilities in basic knowledge, classic texts, and clinical decision-making through multidimensional questions and real cases.
中文: 本研究推出TCM3CEval基准,通过中医核心知识、典籍理解和临床决策三维度评估大语言模型,发现模型在专业领域存在不足且中文文化背景模型表现更优,为中医药领域人工智能评估建立了标准。
English: This study introduces TCM3CEval, a benchmark evaluating large language models' performance in traditional Chinese medicine across knowledge mastery, text interpretation, and clinical reasoning, revealing performance gaps and cultural dependencies while establishing evaluation standards for AI in TCM.

Authors:Zhi Qin, Qianhui Gui, Mouxiao Bian, Rui Wang, Hong Ge, Dandan Yao, Ziying Sun, Yuan Zhao, Yu Zhang, Hui Shi, Dongdong Wang, Chenxin Song, Shenghong Ju, Lihao Liu, Junjun He, Jie Xu, Yuan-Cheng Wang
Title: Multimodal Human-AI Synergy for Medical Imaging Quality Control: A Hybrid Intelligence Framework with Adaptive Dataset Curation and Closed-Loop Evaluation
Abstract:
Medical imaging quality control (QC) is essential for accurate diagnosis, yet traditional QC methods remain labor-intensive and subjective. To address this challenge, in this study, we establish a standardized dataset and evaluation framework for medical imaging QC, systematically assessing large language models (LLMs) in image quality assessment and report standardization. Specifically, we first constructed and anonymized a dataset of 161 chest X-ray (CXR) radiographs and 219 CT reports for evaluation. Then, multiple LLMs, including Gemini 2.0-Flash, GPT-4o, and DeepSeek-R1, were evaluated based on recall, precision, and F1 score to detect technical errors and inconsistencies. Experimental results show that Gemini 2.0-Flash achieved a Macro F1 score of 90 in CXR tasks, demonstrating strong generalization but limited fine-grained performance. DeepSeek-R1 excelled in CT report auditing with a 62.23\% recall rate, outperforming other models. However, its distilled variants performed poorly, while InternLM2.5-7B-chat exhibited the highest additional discovery rate, indicating broader but less precise error detection. These findings highlight the potential of LLMs in medical imaging QC, with DeepSeek-R1 and Gemini 2.0-Flash demonstrating superior performance.
中文摘要:本研究建立了标准化数据集和评估框架来评估大语言模型在医学影像质控中的应用,发现Gemini 2.0-Flash在胸片任务中表现优异,而DeepSeek-R1在CT报告审核中表现最佳。
English Summary: This study establishes a standardized dataset and evaluation framework to assess large language models for medical imaging quality control, finding that Gemini 2.0-Flash excels in chest X-ray tasks while DeepSeek-R1 performs best in CT report auditing.

Authors:Nasla Saleem, Talukder Zaki Jubery, Aditya Balu, Yan Zhou, Yawei Li, Patrick S. Schnable, Adarsh Krishnamurthy, Baskar Ganapathysubramanian
Title: Accessing the Effect of Phyllotaxy and Planting Density on Light Use Efficiency in Field-Grown Maize using 3D Reconstructions
Abstract:
High-density planting is a widely adopted strategy to enhance maize productivity, yet it introduces challenges such as increased interplant competition and shading, which can limit light capture and overall yield potential. In response, some maize plants naturally reorient their canopies to optimize light capture, a process known as canopy reorientation. Understanding this adaptive response and its impact on light capture is crucial for maximizing agricultural yield potential. This study introduces an end-to-end framework that integrates realistic 3D reconstructions of field-grown maize with photosynthetically active radiation (PAR) modeling to assess the effects of phyllotaxy and planting density on light interception. In particular, using 3D point clouds derived from field data, virtual fields for a diverse set of maize genotypes were constructed and validated against field PAR measurements. Using this framework, we present detailed analyses of the impact of canopy orientations, plant and row spacings, and planting row directions on PAR interception throughout a typical growing season. Our findings highlight significant variations in light interception efficiency across different planting densities and canopy orientations. By elucidating the relationship between canopy architecture and light capture, this study offers valuable guidance for optimizing maize breeding and cultivation strategies across diverse agricultural settings.
中文摘要:本研究开发了一个结合三维玉米重建与光能建模的框架,分析种植密度和冠层方向对光能捕获的影响,为优化玉米栽培和育种策略提供重要指导。
English Summary: This study develops a framework combining 3D maize reconstructions with light modeling to analyze how planting density and canopy orientation affect light capture, providing insights for optimizing maize cultivation and breeding.

Authors:Jianqi Zhang, Jingyao Wang, Xingchen Shen, Wenwen Qiang
Title: Enhancing Time Series Forecasting via Logic-Inspired Regularization
Abstract:
Time series forecasting (TSF) plays a crucial role in many applications. Transformer-based methods are one of the mainstream techniques for TSF. Existing methods treat all token dependencies equally. However, we find that the effectiveness of token dependencies varies across different forecasting scenarios, and existing methods ignore these differences, which affects their performance. This raises two issues: (1) What are effective token dependencies? (2) How can we learn effective dependencies? From a logical perspective, we align Transformer-based TSF methods with the logical framework and define effective token dependencies as those that ensure the tokens as atomic formulas (Issue 1). We then align the learning process of Transformer methods with the process of obtaining atomic formulas in logic, which inspires us to design a method for learning these effective dependencies (Issue 2). Specifically, we propose Attention Logic Regularization (Attn-L-Reg), a plug-and-play method that guides the model to use fewer but more effective dependencies by making the attention map sparse, thereby ensuring the tokens as atomic formulas and improving prediction performance. Extensive experiments and theoretical analysis confirm the effectiveness of Attn-L-Reg.
Chinese: 针对Transformer时间序列预测方法忽视不同令牌依赖有效性的问题,我们提出注意力逻辑正则化(Attn-L-Reg),通过稀疏注意力图确保令牌作为原子公式,从而提升预测性能。
English: Transformer-based time series forecasting methods often overlook varying token dependency effectiveness, so we propose Attention Logic Regularization (Attn-L-Reg) to enhance prediction by ensuring tokens act as atomic formulas through sparse attention maps.

Authors:Honglin Li, Zhongyi Shui, Yunlong Zhang, Chenglu Zhu, Lin Yang
Title: PathVQ: Reforming Computational Pathology Foundation Model for Whole Slide Image Analysis via Vector Quantization
Abstract:
Computational pathology and whole-slide image (WSI) analysis are pivotal in cancer diagnosis and prognosis. However, the ultra-high resolution of WSIs presents significant modeling challenges. Recent advancements in pathology foundation models have improved performance, yet most approaches rely on [CLS] token representation of tile ViT as slide-level inputs (16x16 pixels is refereed as patch and 224x224 pixels as tile). This discards critical spatial details from patch tokens, limiting downstream WSI analysis tasks. We find that leveraging all spatial patch tokens benefits WSI analysis but incurs nearly 200x higher storage and training costs (e.g., 196 tokens in ViT$_{224}$). To address this, we introduce vector quantized (VQ) distillation on patch feature, which efficiently compresses spatial patch tokens using discrete indices and a decoder. Our method reduces token dimensionality from 1024 to 16, achieving a 64x compression rate while preserving reconstruction fidelity. Furthermore, we employ a multi-scale VQ (MSVQ) strategy, which not only enhances VQ reconstruction performance but also serves as a Self-supervised Learning (SSL) supervision for a seamless slide-level pretraining objective. Built upon the quantized patch features and supervision targets of tile via MSVQ, we develop a progressive convolutional module and slide-level SSL to extract representations with rich spatial-information for downstream WSI tasks. Extensive evaluations on multiple datasets demonstrate the effectiveness of our approach, achieving state-of-the-art performance in WSI analysis. Code will be available soon.
中文: 本研究引入向量量化蒸馏方法,通过压缩全切片图像中的空间补丁标记,在保留关键空间细节的同时大幅降低存储和训练成本,从而提升癌症诊断和预后的分析性能。
English: The study introduces a vector quantized distillation method that compresses spatial patch tokens in whole-slide image analysis, significantly reducing storage and training costs while preserving critical spatial details for improved cancer diagnosis and prognosis.

Authors:Koji Inoue, Yuki Okafuji, Jun Baba, Yoshiki Ohira, Katsuya Hyodo, Tatsuya Kawahara
Title: A Noise-Robust Turn-Taking System for Real-World Dialogue Robots: A Field Experiment
Abstract:
Turn-taking is a crucial aspect of human-robot interaction, directly influencing conversational fluidity and user engagement. While previous research has explored turn-taking models in controlled environments, their robustness in real-world settings remains underexplored. In this study, we propose a noise-robust voice activity projection (VAP) model, based on a Transformer architecture, to enhance real-time turn-taking in dialogue robots. To evaluate the effectiveness of the proposed system, we conducted a field experiment in a shopping mall, comparing the VAP system with a conventional cloud-based speech recognition system. Our analysis covered both subjective user evaluations and objective behavioral analysis. The results showed that the proposed system significantly reduced response latency, leading to a more natural conversation where both the robot and users responded faster. The subjective evaluations suggested that faster responses contribute to a better interaction experience.
中文: 本研究提出了一种基于Transformer架构的抗噪声语音活动预测模型,通过商场实地实验证明,该系统能显著降低响应延迟,提升对话机器人实时交互的自然流畅度。
English: This study introduces a noise-robust voice activity projection model using Transformer architecture to improve real-time turn-taking in dialogue robots, which significantly reduces response latency and enhances conversational naturalness based on field experiments.

Authors:Artin Saberpour Abadian, Yi-Chi Liao, Ata Otaran, Rishabh Dabral, Marie Muehlhaus, Christian Theobalt, Martin Schmitz, Jürgen Steimle
Title: 3HANDS Dataset: Learning from Humans for Generating Naturalistic Handovers with Supernumerary Robotic Limbs
Abstract:
Supernumerary robotic limbs (SRLs) are robotic structures integrated closely with the user's body, which augment human physical capabilities and necessitate seamless, naturalistic human-machine interaction. For effective assistance in physical tasks, enabling SRLs to hand over objects to humans is crucial. Yet, designing heuristic-based policies for robots is time-consuming, difficult to generalize across tasks, and results in less human-like motion. When trained with proper datasets, generative models are powerful alternatives for creating naturalistic handover motions. We introduce 3HANDS, a novel dataset of object handover interactions between a participant performing a daily activity and another participant enacting a hip-mounted SRL in a naturalistic manner. 3HANDS captures the unique characteristics of SRL interactions: operating in intimate personal space with asymmetric object origins, implicit motion synchronization, and the user's engagement in a primary task during the handover. To demonstrate the effectiveness of our dataset, we present three models: one that generates naturalistic handover trajectories, another that determines the appropriate handover endpoints, and a third that predicts the moment to initiate a handover. In a user study (N=10), we compare the handover interaction performed with our method compared to a baseline. The findings show that our method was perceived as significantly more natural, less physically demanding, and more comfortable.
中文: 3HANDS数据集支持生成模型为超限机器人肢体创建自然的交接动作,用户评价显示该方法比基线方法显著更自然、舒适且体力消耗更低。
English: The 3HANDS dataset enables generative models to create naturalistic handover motions for supernumerary robotic limbs, which users rated as significantly more natural, comfortable, and less physically demanding than baseline methods.

Authors:Xiang Zhang, Zhou Li, Kai Wan, Hua Sun, Mingyue Ji, Giuseppe Caire
Title: Fundamental Limits of Hierarchical Secure Aggregation with Cyclic User Association
Abstract:
Secure aggregation is motivated by federated learning (FL) where a cloud server aims to compute an averaged model (i.e., weights of deep neural networks) of the locally-trained models of numerous clients, while adhering to data security requirements. Hierarchical secure aggregation (HSA) extends this concept to a three-layer hierarchical network, where clustered users communicate with the server through an intermediate layer of relays. In HSA, beyond conventional server security, relay security is also enforced to ensure that the relays remain oblivious to the users' inputs (an abstraction of the local models in FL). Existing study on HSA assumes that each user is associated with only one relay, limiting opportunities for coding across inter-cluster users to achieve efficient communication and key generation. In this paper, we consider HSA with a cyclic association pattern where each user is connected to $B$ consecutive relays in a wrap-around manner. We propose an efficient aggregation scheme which includes a message design for the inputs inspired by gradient coding-a well-known technique for efficient communication in distributed computing-along with a highly non-trivial security key design. We also derive novel converse bounds on the minimum achievable communication and key rates using information-theoretic arguments.
中文摘要:本文提出了一种分层安全聚合方案,允许用户以循环方式连接多个中继,通过创新的梯度编码和密钥设计提升联邦学习中的通信效率与安全性,并建立了信息论下的性能极限新边界。
English Summary: This paper introduces a hierarchical secure aggregation scheme for federated learning that enables each user to connect with multiple relays cyclically, enhancing communication efficiency and security through innovative gradient coding and key designs, while establishing new information-theoretic bounds on performance.

Authors:Zi Wang, Shiyi Lan, Xinglong Sun, Nadine Chang, Zhenxin Li, Zhiding Yu, Jose M. Alvarez
Title: Enhancing Autonomous Driving Safety with Collision Scenario Integration
Abstract:
Autonomous vehicle safety is crucial for the successful deployment of self-driving cars. However, most existing planning methods rely heavily on imitation learning, which limits their ability to leverage collision data effectively. Moreover, collecting collision or near-collision data is inherently challenging, as it involves risks and raises ethical and practical concerns. In this paper, we propose SafeFusion, a training framework to learn from collision data. Instead of over-relying on imitation learning, SafeFusion integrates safety-oriented metrics during training to enable collision avoidance learning. In addition, to address the scarcity of collision data, we propose CollisionGen, a scalable data generation pipeline to generate diverse, high-quality scenarios using natural language prompts, generative models, and rule-based filtering. Experimental results show that our approach improves planning performance in collision-prone scenarios by 56\% over previous state-of-the-art planners while maintaining effectiveness in regular driving situations. Our work provides a scalable and effective solution for advancing the safety of autonomous driving systems.
中文: SafeFusion框架通过整合安全指标和利用CollisionGen生成碰撞数据,显著提升自动驾驶安全性,在高风险场景中的规划性能提高56%。
English: The SafeFusion framework enhances autonomous vehicle safety by integrating safety metrics and generating collision data via CollisionGen, improving planning performance by 56% in high-risk scenarios.

Authors:Shimao Zhang, Xiao Liu, Xin Zhang, Junxiao Liu, Zheheng Luo, Shujian Huang, Yeyun Gong
Title: Process-based Self-Rewarding Language Models
Abstract:
Large Language Models have demonstrated outstanding performance across various downstream tasks and have been widely applied in multiple scenarios. Human-annotated preference data is used for training to further improve LLMs' performance, which is constrained by the upper limit of human performance. Therefore, Self-Rewarding method has been proposed, where LLMs generate training data by rewarding their own outputs. However, the existing self-rewarding paradigm is not effective in mathematical reasoning scenarios and may even lead to a decline in performance. In this work, we propose the Process-based Self-Rewarding pipeline for language models, which introduces long-thought reasoning, step-wise LLM-as-a-Judge, and step-wise preference optimization within the self-rewarding paradigm. Our new paradigm successfully enhances the performance of LLMs on multiple mathematical reasoning benchmarks through iterative Process-based Self-Rewarding, demonstrating the immense potential of self-rewarding to achieve LLM reasoning that may surpass human capabilities.
中文: 本文提出基于过程的自奖励流程,通过长思维推理和分步评估的迭代自奖励机制,成功提升大语言模型在数学推理任务中的表现,显示出超越人类能力的巨大潜力。
English: This paper introduces a Process-based Self-Rewarding pipeline that enhances large language models' mathematical reasoning through iterative self-rewarding with long-thought reasoning and step-wise evaluation, demonstrating potential to surpass human capabilities.

Authors:Xuehui Dong, Kai Wan, Shuangyang Li, Robert Caiming Qiu, Giuseppe Caire
Title: Frequency-Space Channel Estimation and Spatial Equalization in Wideband Fluid Antenna System
Abstract:
The Fluid Antenna System (FAS) overcomes the spatial degree-of-freedom limitations of conventional static antenna arrays in wireless communications.This capability critically depends on acquiring full Channel State Information across all accessible ports. Existing studies focus exclusively on narrowband FAS, performing channel estimation solely in the spatial domain. This work proposes a channel estimation and spatial equalization framework for wideband FAS, revealing for the first time an inherent group-sparse structure in aperture-limited FAS channels. First, we establish a group-sparse recovery framework for space-frequency characteristics in FAS, formally characterizing leakage-induced sparsity degradation from limited aperture and bandwidth as a structured group-sparsity problem. By deriving dictionary-adapted group restricted isometry property, we prove tight recovery bounds for a convex $\ell_1/\ell_2$-mixed norm optimization formulation that preserves leakage-aware sparsity patterns. Second, we develop a descending correlation group orthogonal matching pursuit algorithm that systematically relaxes leakage constraints to reduce subcoherence. This approach enables FSC recovery with accelerated convergence and superior performance compared to conventional compressive sensing methods like OMP or GOMP. Third, we formulate spatial equalization as a mixed-integer linear programming problem, complement this with a greedy algorithm maintaining near-optimal performance. Simulation results demonstrate the proposed channel estimation algorithm effectively resolves energy misallocation and enables recovery of weak details, achieving superior recovery accuracy and convergence rate. The SE framework suppresses deep fading phenomena and largely reduces time consumption overhead while maintaining equivalent link reliability.
中文摘要:本文针对宽带流体天线系统提出了一种信道估计与空间均衡框架,利用固有的群稀疏结构实现卓越的恢复精度和加速收敛,同时有效抑制深度衰落现象。
English Summary: This paper introduces a channel estimation and spatial equalization framework for wideband fluid antenna systems, leveraging inherent group-sparse structures to achieve superior recovery accuracy and accelerated convergence while effectively suppressing deep fading.

Authors:Yitao Zhu, Yuan Yin, Jiaming Li, Mengjie Xu, Zihao Zhao, Honglin Xiong, Sheng Wang, Qian Wang
Title: Med-LEGO: Editing and Adapting toward Generalist Medical Image Diagnosis
Abstract:
The adoption of visual foundation models has become a common practice in computer-aided diagnosis (CAD). While these foundation models provide a viable solution for creating generalist medical AI, privacy concerns make it difficult to pre-train or continuously update such models across multiple domains and datasets, leading many studies to focus on specialist models. To address this challenge, we propose Med-LEGO, a training-free framework that enables the seamless integration or updating of a generalist CAD model by combining multiple specialist models, similar to assembling LEGO bricks. Med-LEGO enhances LoRA (low-rank adaptation) by incorporating singular value decomposition (SVD) to efficiently capture the domain expertise of each specialist model with minimal additional parameters. By combining these adapted weights through simple operations, Med-LEGO allows for the easy integration or modification of specific diagnostic capabilities without the need for original data or retraining. Finally, the combined model can be further adapted to new diagnostic tasks, making it a versatile generalist model. Our extensive experiments demonstrate that Med-LEGO outperforms existing methods in both cross-domain and in-domain medical tasks while using only 0.18% of full model parameters. These merged models show better convergence and generalization to new tasks, providing an effective path toward generalist medical AI.
中文摘要:Med-LEGO是一种无需训练的框架,通过改进的LoRA与SVD技术将多个专科医疗AI模型像乐高积木般组合,仅用0.18%参数即可构建多功能通用模型,在跨领域医疗任务中表现优异。
English Summary: Med-LEGO is a training-free framework that integrates multiple specialist medical AI models like LEGO bricks, using enhanced LoRA with SVD to create versatile generalist models with minimal parameters while outperforming existing methods in medical tasks.

Authors:Heming Xia, Cunxiao Du, Yongqi Li, Qian Liu, Wenjie Li
Title: Tutorial Proposal: Speculative Decoding for Efficient LLM Inference
Abstract:
This tutorial presents a comprehensive introduction to Speculative Decoding (SD), an advanced technique for LLM inference acceleration that has garnered significant research interest in recent years. SD is introduced as an innovative decoding paradigm to mitigate the high inference latency stemming from autoregressive decoding in LLMs. At each decoding step, SD efficiently drafts several future tokens and then verifies them in parallel. This approach, unlike traditional autoregressive decoding, facilitates the simultaneous decoding of multiple tokens per step, thereby achieving promising 2x-4x speedups in LLM inference while maintaining original distributions. This tutorial delves into the latest techniques in SD, including draft model architectures and verification strategies. Additionally, it explores the acceleration potential and future research directions in this promising field. We aim for this tutorial to elucidate the current research landscape and offer insights for researchers interested in Speculative Decoding, ultimately contributing to more efficient LLM inference.
中文: 本教程介绍了推测解码技术,该技术通过并行草拟和验证多个标记来加速大语言模型推理,在保持输出质量的同时实现2-4倍加速,并探讨了其最新进展和未来研究方向。
English: This tutorial introduces Speculative Decoding, a technique that accelerates LLM inference by drafting and verifying multiple tokens in parallel, achieving 2x-4x speedups while preserving output quality, and explores its latest advancements and future research directions.

Authors:Zhengyi Zhao, Shubo Zhang, Bin Liang, Binyang Li, Kam-Fai Wong
Title: WHERE and WHICH: Iterative Debate for Biomedical Synthetic Data Augmentation
Abstract:
In Biomedical Natural Language Processing (BioNLP) tasks, such as Relation Extraction, Named Entity Recognition, and Text Classification, the scarcity of high-quality data remains a significant challenge. This limitation poisons large language models to correctly understand relationships between biological entities, such as molecules and diseases, or drug interactions, and further results in potential misinterpretation of biomedical documents. To address this issue, current approaches generally adopt the Synthetic Data Augmentation method which involves similarity computation followed by word replacement, but counterfactual data are usually generated. As a result, these methods disrupt meaningful word sets or produce sentences with meanings that deviate substantially from the original context, rendering them ineffective in improving model performance. To this end, this paper proposes a biomedical-dedicated rationale-based synthetic data augmentation method. Beyond the naive lexicon similarity, specific bio-relation similarity is measured to hold the augmented instance having a strong correlation with bio-relation instead of simply increasing the diversity of augmented data. Moreover, a multi-agents-involved reflection mechanism helps the model iteratively distinguish different usage of similar entities to escape falling into the mis-replace trap. We evaluate our method on the BLURB and BigBIO benchmark, which includes 9 common datasets spanning four major BioNLP tasks. Our experimental results demonstrate consistent performance improvements across all tasks, highlighting the effectiveness of our approach in addressing the challenges associated with data scarcity and enhancing the overall performance of biomedical NLP models.
中文: 本文提出了一种基于原理的生物医学专用合成数据增强方法,通过生物关系相似性计算和多智能体反思机制生成与原始语境强相关的增强数据,在多项BioNLP任务中实现了持续的性能提升。
English: This paper introduces a rationale-based synthetic data augmentation method for biomedical NLP that uses bio-relation similarity and multi-agent reflection to generate meaningful augmented data, achieving consistent performance improvements across multiple BioNLP tasks.

Authors:Yucheng Shi, Wenhao Yu, Wenlin Yao, Wenhu Chen, Ninghao Liu
Title: Towards Trustworthy GUI Agents: A Survey
Abstract:
GUI agents, powered by large foundation models, can interact with digital interfaces, enabling various applications in web automation, mobile navigation, and software testing. However, their increasing autonomy has raised critical concerns about their security, privacy, and safety. This survey examines the trustworthiness of GUI agents in five critical dimensions: security vulnerabilities, reliability in dynamic environments, transparency and explainability, ethical considerations, and evaluation methodologies. We also identify major challenges such as vulnerability to adversarial attacks, cascading failure modes in sequential decision-making, and a lack of realistic evaluation benchmarks. These issues not only hinder real-world deployment but also call for comprehensive mitigation strategies beyond task success. As GUI agents become more widespread, establishing robust safety standards and responsible development practices is essential. This survey provides a foundation for advancing trustworthy GUI agents through systematic understanding and future research.
中文摘要:基于大模型的图形用户界面代理虽能自动化操作数字界面,但其在安全性、可靠性和伦理方面存在显著信任隐患,亟需建立严格的安全标准以保障实际应用。
English Summary: GUI agents, driven by large models, automate digital tasks but face significant trustworthiness challenges in security, reliability, and ethics, requiring robust safety standards for real-world deployment.

Authors:Zhengyi Zhao, Shubo Zhang, Yiming Du, Bin Liang, Baojun Wang, Zhongyang Li, Binyang Li, Kam-Fai Wong
Title: EventWeave: A Dynamic Framework for Capturing Core and Supporting Events in Dialogue Systems
Abstract:
Existing large language models (LLMs) have shown remarkable progress in dialogue systems. However, many approaches still overlook the fundamental role of events throughout multi-turn interactions, leading to \textbf{incomplete context tracking}. Without tracking these events, dialogue systems often lose coherence and miss subtle shifts in user intent, causing disjointed responses. To bridge this gap, we present \textbf{EventWeave}, an event-centric framework that identifies and updates both core and supporting events as the conversation unfolds. Specifically, we organize these events into a dynamic event graph, which represents the interplay between \textbf{core events} that shape the primary idea and \textbf{supporting events} that provide critical context during the whole dialogue. By leveraging this dynamic graph, EventWeave helps models focus on the most relevant events when generating responses, thus avoiding repeated visits of the entire dialogue history. Experimental results on two benchmark datasets show that EventWeave improves response quality and event relevance without fine-tuning.
中文:EventWeave提出了一种以事件为中心的框架,通过动态图结构追踪核心与辅助事件,无需微调即可提升对话连贯性和事件相关性。
English: EventWeave introduces an event-centric framework that dynamically tracks core and supporting events through a graph structure to enhance dialogue coherence and relevance without requiring fine-tuning.

Authors:Zhengyi Zhao, Shubo Zhang, Zezhong Wang, Bin Liang, Binyang Li, Kam-Fai Wong
Title: FReM: A Flexible Reasoning Mechanism for Balancing Quick and Slow Thinking in Long-Context Question Answering
Abstract:
Long-context question-answering (LCQA) systems have greatly benefited from the powerful reasoning capabilities of large language models (LLMs), which can be categorized into slow and quick reasoning modes. However, both modes have their limitations. Slow thinking generally leans to explore every possible reasoning path, which leads to heavy overthinking and wastes time. Quick thinking usually relies on pattern matching rather than truly understanding the query logic, which misses proper understanding. To address these issues, we propose FReM: Flexible Reasoning Mechanism, a method that adjusts reasoning depth according to the complexity of each question. Specifically, FReM leverages synthetic reference QA examples to provide an explicit chain of thought, enabling efficient handling of simple queries while allowing deeper reasoning for more complex ones. By doing so, FReM helps quick-thinking models move beyond superficial pattern matching and narrows the reasoning space for slow-thinking models to avoid unnecessary exploration. Experiments on seven QA datasets show that FReM improves reasoning accuracy and scalability, particularly for complex multihop questions, indicating its potential to advance LCQA methodologies.
Chinese: FReM是一种灵活推理机制,根据问题复杂度调整推理深度,利用合成参考示例优化快速与慢速推理模式,从而提升长上下文问答系统的准确性和可扩展性。
English: FReM is a flexible reasoning mechanism that adjusts the depth of reasoning based on question complexity, using synthetic reference examples to enhance both quick and slow reasoning modes, thereby improving accuracy and scalability in long-context question-answering systems.

Authors:Terry Yue Zhuo, Junda He, Jiamou Sun, Zhenchang Xing, David Lo, John Grundy, Xiaoning Du
Title: Identifying and Mitigating API Misuse in Large Language Models
Abstract:
API misuse in code generated by large language models (LLMs) represents a serious emerging challenge in software development. While LLMs have demonstrated impressive code generation capabilities, their interactions with complex library APIs remain highly prone to errors, potentially leading to software failures and security vulnerabilities. This paper presents the first comprehensive study of API misuse patterns in LLM-generated code, analyzing both method selection and parameter usage across Python and Java. Through extensive manual annotation of 3,892 method-level and 2,560 parameter-level misuses, we develop a novel taxonomy of four distinct API misuse types specific to LLMs, which significantly differ from traditional human-centric misuse patterns. Our evaluation of two widely used LLMs, StarCoder-7B (open-source) and Copilot (closed-source), reveals significant challenges in API usage, particularly in areas of hallucination and intent misalignment. We propose Dr.Fix, a novel LLM-based automatic program repair approach for API misuse based on the aforementioned taxonomy. Our method substantially improves repair accuracy for real-world API misuse, demonstrated by increases of up to 38.4 points in BLEU scores and 40 percentage points in exact match rates across different models and programming languages. This work provides crucial insights into the limitations of current LLMs in API usage and presents an effective solution for the automated repair of API misuse in LLM-generated code.
中文摘要:本研究揭示了大型语言模型生成代码中特有的API误用模式,并提出了Dr.Fix自动修复方法,在不同编程语言中显著提升了修复准确率。
English Summary: This study identifies unique API misuse patterns in LLM-generated code and introduces Dr.Fix, an automated repair method that significantly improves repair accuracy across programming languages.

Authors:Peiding Wang, Li Zhang, Fang Liu, Lin Shi, Minxiao Li, Bo Shen, An Fu
Title: CodeIF-Bench: Evaluating Instruction-Following Capabilities of Large Language Models in Interactive Code Generation
Abstract:
Large Language Models (LLMs) have demonstrated exceptional performance in code generation tasks and have become indispensable programming assistants for developers. However, existing code generation benchmarks primarily assess the functional correctness of code generated by LLMs in single-turn interactions. They offer limited insight into LLMs' abilities to generate code that strictly follows users' instructions in multi-turn interaction scenarios. In this paper, we introduce CodeIF-Bench, a benchmark for evaluating the instruction-following capabilities of LLMs in interactive code generation. Specifically, CodeIF-Bench incorporates nine types of verifiable instructions aligned with the real-world software development requirements, which can be independently and objectively validated through specified test cases, facilitating the evaluation of instruction-following capability in multi-turn interactions. In both \textit{Static Conversation} and \textit{Dynamic Conversation} settings, we evaluate the performance of 7 state-of-the-art LLMs and summarize the important factors influencing the instruction-following ability of LLMs in multi-turn interactions, as well as potential directions for improvement.
Chinese: CodeIF-Bench 是一个新的基准测试,旨在评估大语言模型在多轮代码生成中遵循指令的能力,弥补了现有基准仅关注单轮功能正确性的不足。
English: CodeIF-Bench is a new benchmark designed to evaluate how well Large Language Models follow instructions in multi-turn code generation, addressing the limitations of existing benchmarks that focus only on single-turn functional correctness.

Authors:Wencheng Han, Dongqian Guo, Xiao Chen, Pang Lyu, Yi Jin, Jianbing Shen
Title: Reducing CT Metal Artifacts by Learning Latent Space Alignment with Gemstone Spectral Imaging Data
Abstract:
Metal artifacts in CT slices have long posed challenges in medical diagnostics. These artifacts degrade image quality, resulting in suboptimal visualization and complicating the accurate interpretation of tissues adjacent to metal implants. To address these issues, we introduce the Latent Gemstone Spectral Imaging (GSI) Alignment Framework, which effectively reduces metal artifacts while avoiding the introduction of noise information. Our work is based on a key finding that even artifact-affected ordinary CT sequences contain sufficient information to discern detailed structures. The challenge lies in the inability to clearly represent this information. To address this issue, we developed an Alignment Framework that adjusts the representation of ordinary CT images to match GSI CT sequences. GSI is an advanced imaging technique using multiple energy levels to mitigate artifacts caused by metal implants. By aligning the representation to GSI data, we can effectively suppress metal artifacts while clearly revealing detailed structure, without introducing extraneous information into CT sequences. To facilitate the application, we propose a new dataset, Artifacts-GSI, captured from real patients with metal implants, and establish a new benchmark based on this dataset. Experimental results show that our method significantly reduces metal artifacts and greatly enhances the readability of CT slices. All our code and data are available at: https://um-lab.github.io/GSI-MAR/
中文摘要:潜在宝石光谱成像对齐框架通过将普通CT图像表征与先进GSI序列对齐,有效减少CT扫描中的金属伪影,显著提升图像清晰度且不引入额外噪声。
English Summary: The Latent Gemstone Spectral Imaging Alignment Framework effectively reduces metal artifacts in CT scans by aligning ordinary CT image representations with advanced GSI sequences, significantly improving image clarity without introducing noise.

Authors:Hao-Han Guo, Yao Hu, Fei-Yu Shen, Xu Tang, Yi-Chen Wu, Feng-Long Xie, Kun Xie
Title: FireRedTTS-1S: An Upgraded Streamable Foundation Text-to-Speech System
Abstract:
In this work, we upgrade FireRedTTS to a new version, FireRedTTS-1S, a high-quality streaming foundation text-to-speech system. FireRedTTS-1S achieves streaming speech generation via two steps: text-to-semantic decoding and semantic-to-acoustic decoding. In text-to-semantic decoding, a semantic-aware speech tokenizer converts the speech signal into semantic tokens, which can be synthesized from the text via a language model in an auto-regressive manner. Meanwhile, the semantic-to-acoustic decoding module simultaneously translates generated semantic tokens into the speech signal in a streaming way. We implement two approaches to achieve this module: 1) a chunk-wise streamable flow-matching approach, and 2) a multi-stream language model-based approach. They both present high-quality and streamable speech generation but differ in real-time factor (RTF) and latency. Specifically, flow-matching decoding can generate speech by chunks, presenting a lower RTF of 0.1 but a higher latency of 300ms. Instead, the multi-stream language model generates speech by frames in an autoregressive manner, presenting a higher RTF of 0.3 but a low latency of 150ms. In experiments on zero-shot voice cloning, the objective results validate FireRedTTS-1S as a high-quality foundation model with comparable intelligibility and speaker similarity over industrial baseline systems. Furthermore, the subjective score of FireRedTTS-1S highlights its impressive synthesis performance, achieving comparable quality to the ground-truth recordings. These results validate FireRedTTS-1S as a high-quality streaming foundation TTS system.
中文:FireRedTTS-1S是一种高质量流式文本转语音系统,通过文本到语义和语义到声学的双重解码实现实时语音生成,在零样本语音克隆中展现出与真实录音相媲美的合成性能。
English: FireRedTTS-1S is a high-quality streaming text-to-speech system that achieves real-time speech generation through text-to-semantic and semantic-to-acoustic decoding, demonstrating performance comparable to ground-truth recordings in voice cloning experiments.

Authors:Fanhu Zeng, Zhen Cheng, Fei Zhu, Xu-Yao Zhang
Title: Towards Efficient and General-Purpose Few-Shot Misclassification Detection for Vision-Language Models
Abstract:
Reliable prediction by classifiers is crucial for their deployment in high security and dynamically changing situations. However, modern neural networks often exhibit overconfidence for misclassified predictions, highlighting the need for confidence estimation to detect errors. Despite the achievements obtained by existing methods on small-scale datasets, they all require training from scratch and there are no efficient and effective misclassification detection (MisD) methods, hindering practical application towards large-scale and ever-changing datasets. In this paper, we pave the way to exploit vision language model (VLM) leveraging text information to establish an efficient and general-purpose misclassification detection framework. By harnessing the power of VLM, we construct FSMisD, a Few-Shot prompt learning framework for MisD to refrain from training from scratch and therefore improve tuning efficiency. To enhance misclassification detection ability, we use adaptive pseudo sample generation and a novel negative loss to mitigate the issue of overconfidence by pushing category prompts away from pseudo features. We conduct comprehensive experiments with prompt learning methods and validate the generalization ability across various datasets with domain shift. Significant and consistent improvement demonstrates the effectiveness, efficiency and generalizability of our approach.
中文: 本文提出FSMisD框架,利用视觉语言模型通过少量样本提示学习实现高效误分类检测,无需从头训练模型,采用自适应伪样本生成和新型负损失函数解决神经网络过度自信问题。
English: This paper introduces FSMisD, a few-shot prompt learning framework that leverages vision language models to efficiently detect misclassifications without requiring full retraining, addressing overconfidence in neural networks through adaptive pseudo sample generation and a novel negative loss function.

Authors:Dahyun Jung, Seungyoon Lee, Hyeonseok Moon, Chanjun Park, Heuiseok Lim
Title: FLEX: A Benchmark for Evaluating Robustness of Fairness in Large Language Models
Abstract:
Recent advancements in Large Language Models (LLMs) have significantly enhanced interactions between users and models. These advancements concurrently underscore the need for rigorous safety evaluations due to the manifestation of social biases, which can lead to harmful societal impacts. Despite these concerns, existing benchmarks may overlook the intrinsic weaknesses of LLMs, which can generate biased responses even with simple adversarial instructions. To address this critical gap, we introduce a new benchmark, Fairness Benchmark in LLM under Extreme Scenarios (FLEX), designed to test whether LLMs can sustain fairness even when exposed to prompts constructed to induce bias. To thoroughly evaluate the robustness of LLMs, we integrate prompts that amplify potential biases into the fairness assessment. Comparative experiments between FLEX and existing benchmarks demonstrate that traditional evaluations may underestimate the inherent risks in models. This highlights the need for more stringent LLM evaluation benchmarks to guarantee safety and fairness.
中文: 近期大语言模型的进展凸显了严格安全评估的必要性,为此开发了FLEX基准,通过极端偏见诱导提示测试模型公平性,证明传统评估可能低估了模型的内在风险。
English: Recent LLM advancements highlight the need for rigorous safety evaluations, leading to the creation of the FLEX benchmark to test fairness under extreme bias-inducing prompts, revealing that traditional assessments may underestimate model risks.

Authors:Haoqiang Lin, Haokun Wen, Xuemeng Song, Meng Liu, Yupeng Hu, Liqiang Nie
Title: Fine-grained Textual Inversion Network for Zero-Shot Composed Image Retrieval
Abstract:
Composed Image Retrieval (CIR) allows users to search target images with a multimodal query, comprising a reference image and a modification text that describes the user's modification demand over the reference image. Nevertheless, due to the expensive labor cost of training data annotation, recent researchers have shifted to the challenging task of zero-shot CIR (ZS-CIR), which targets fulfilling CIR without annotated triplets. The pioneer ZS-CIR studies focus on converting the CIR task into a standard text-to-image retrieval task by pre-training a textual inversion network that can map a given image into a single pseudo-word token. Despite their significant progress, their coarse-grained textual inversion may be insufficient to capture the full content of the image accurately. To overcome this issue, in this work, we propose a novel Fine-grained Textual Inversion Network for ZS-CIR, named FTI4CIR. In particular, FTI4CIR comprises two main components: fine-grained pseudo-word token mapping and tri-wise caption-based semantic regularization. The former maps the image into a subject-oriented pseudo-word token and several attribute-oriented pseudo-word tokens to comprehensively express the image in the textual form, while the latter works on jointly aligning the fine-grained pseudo-word tokens to the real-word token embedding space based on a BLIP-generated image caption template. Extensive experiments conducted on three benchmark datasets demonstrate the superiority of our proposed method.
中文: 本文提出FTI4CIR模型,通过细粒度文本反演将图像映射为多个伪词标记并进行语义对齐,在零样本组合图像检索任务中显著优于现有方法。
English: This paper introduces FTI4CIR, a fine-grained textual inversion network that enhances zero-shot composed image retrieval by mapping images into multiple pseudo-word tokens and aligning them semantically, outperforming existing methods on benchmark datasets.

Authors:Marco Garosi, Alessandro Conti, Gaowen Liu, Elisa Ricci, Massimiliano Mancini
Title: Compositional Caching for Training-free Open-vocabulary Attribute Detection
Abstract:
Attribute detection is crucial for many computer vision tasks, as it enables systems to describe properties such as color, texture, and material. Current approaches often rely on labor-intensive annotation processes which are inherently limited: objects can be described at an arbitrary level of detail (e.g., color vs. color shades), leading to ambiguities when the annotators are not instructed carefully. Furthermore, they operate within a predefined set of attributes, reducing scalability and adaptability to unforeseen downstream applications. We present Compositional Caching (ComCa), a training-free method for open-vocabulary attribute detection that overcomes these constraints. ComCa requires only the list of target attributes and objects as input, using them to populate an auxiliary cache of images by leveraging web-scale databases and Large Language Models to determine attribute-object compatibility. To account for the compositional nature of attributes, cache images receive soft attribute labels. Those are aggregated at inference time based on the similarity between the input and cache images, refining the predictions of underlying Vision-Language Models (VLMs). Importantly, our approach is model-agnostic, compatible with various VLMs. Experiments on public datasets demonstrate that ComCa significantly outperforms zero-shot and cache-based baselines, competing with recent training-based methods, proving that a carefully designed training-free approach can successfully address open-vocabulary attribute detection.
Chinese: 本文提出组合式缓存(ComCa)方法,这是一种无需训练即可实现开放词汇属性检测的技术,它通过利用网络数据和大语言模型来优化视觉语言模型的预测,有效克服了预定义属性集和人工标注的局限性,性能与基于训练的方法相当。
English: The paper introduces Compositional Caching (ComCa), a training-free method for open-vocabulary attribute detection that overcomes the limitations of predefined attribute sets and labor-intensive annotations by leveraging web data and large language models to refine vision-language model predictions, achieving competitive performance with training-based approaches.

Authors:Abdul Qayyum, Moona Mazher, Devran Ugurlu, Jose Alonso Solis Lemus, Cristobal Rodero, Steven A Niederer
Title: Foundation Model for Whole-Heart Segmentation: Leveraging Student-Teacher Learning in Multi-Modal Medical Imaging
Abstract:
Whole-heart segmentation from CT and MRI scans is crucial for cardiovascular disease analysis, yet existing methods struggle with modality-specific biases and the need for extensive labeled datasets. To address these challenges, we propose a foundation model for whole-heart segmentation using a self-supervised learning (SSL) framework based on a student-teacher architecture. Our model is pretrained on a large, unlabeled dataset of CT and MRI scans, leveraging the xLSTM backbone to capture long-range spatial dependencies and complex anatomical structures in 3D medical images. By incorporating multi-modal pretraining, our approach ensures strong generalization across both CT and MRI modalities, mitigating modality-specific variations and improving segmentation accuracy in diverse clinical settings. The use of large-scale unlabeled data significantly reduces the dependency on manual annotations, enabling robust performance even with limited labeled data. We further introduce an xLSTM-UNet-based architecture for downstream whole-heart segmentation tasks, demonstrating its effectiveness on few-label CT and MRI datasets. Our results validate the robustness and adaptability of the proposed model, highlighting its potential for advancing automated whole-heart segmentation in medical imaging.
中文摘要:本研究提出了一种基于xLSTM架构的自监督学习基础模型,通过多模态预训练减少对标注数据的依赖,显著提升了CT与MRI影像的全心脏分割跨模态泛化能力。
English Summary: This study introduces a foundation model for whole-heart segmentation using self-supervised learning with an xLSTM backbone, which reduces reliance on labeled data and improves cross-modality generalization between CT and MRI scans.

Authors:Deepayan Das, Davide Talon, Yiming Wang, Massimiliano Mancini, Elisa Ricci
Title: Training-Free Personalization via Retrieval and Reasoning on Fingerprints
Abstract:
Vision Language Models (VLMs) have lead to major improvements in multimodal reasoning, yet they still struggle to understand user-specific concepts. Existing personalization methods address this limitation but heavily rely on training procedures, that can be either costly or unpleasant to individual users. We depart from existing work, and for the first time explore the training-free setting in the context of personalization. We propose a novel method, Retrieval and Reasoning for Personalization (R2P), leveraging internal knowledge of VLMs. First, we leverage VLMs to extract the concept fingerprint, i.e., key attributes uniquely defining the concept within its semantic class. When a query arrives, the most similar fingerprints are retrieved and scored via chain-of-thought-reasoning. To reduce the risk of hallucinations, the scores are validated through cross-modal verification at the attribute level: in case of a discrepancy between the scores, R2P refines the concept association via pairwise multimodal matching, where the retrieved fingerprints and their images are directly compared with the query. We validate R2P on two publicly available benchmarks and a newly introduced dataset, Personal Concepts with Visual Ambiguity (PerVA), for concept identification highlighting challenges in visual ambiguity. R2P consistently outperforms state-of-the-art approaches on various downstream tasks across all benchmarks. Code will be available upon acceptance.
中文: 视觉语言模型在处理用户特定概念时仍存在困难,因此我们提出R2P,一种无需训练的方法,通过内部知识检索和推理来增强个性化,避免高成本训练。
English: Vision Language Models (VLMs) still struggle with user-specific concepts, so we propose R2P, a training-free method that uses internal knowledge retrieval and reasoning to enhance personalization without costly training.

Authors:Wen-Tse Chen, Minh Nguyen, Zhongyu Li, Guo Ning Sue, Koushil Sreenath
Title: Decentralized Navigation of a Cable-Towed Load using Quadrupedal Robot Team via MARL
Abstract:
This work addresses the challenge of enabling a team of quadrupedal robots to collaboratively tow a cable-connected load through cluttered and unstructured environments while avoiding obstacles. Leveraging cables allows the multi-robot system to navigate narrow spaces by maintaining slack when necessary. However, this introduces hybrid physical interactions due to alternating taut and slack states, with computational complexity that scales exponentially as the number of agents increases. To tackle these challenges, we developed a scalable and decentralized system capable of dynamically coordinating a variable number of quadrupedal robots while managing the hybrid physical interactions inherent in the load-towing task. At the core of this system is a novel multi-agent reinforcement learning (MARL)-based planner, designed for decentralized coordination. The MARL-based planner is trained using a centralized training with decentralized execution (CTDE) framework, enabling each robot to make decisions autonomously using only local (ego) observations. To accelerate learning and ensure effective collaboration across varying team sizes, we introduce a tailored training curriculum for MARL. Experimental results highlight the flexibility and scalability of the framework, demonstrating successful deployment with one to four robots in real-world scenarios and up to twelve robots in simulation. The decentralized planner maintains consistent inference times, regardless of the team size. Additionally, the proposed system demonstrates robustness to environment perturbations and adaptability to varying load weights. This work represents a step forward in achieving flexible and efficient multi-legged robotic collaboration in complex and real-world environments.
中文: 本研究开发了一种去中心化的多智能体强化学习系统,使四足机器人团队能够在杂乱环境中协作拖曳线缆连接的重物,有效处理混合物理交互,并在不同团队规模下保持稳定性能。
English: This study presents a decentralized multi-agent reinforcement learning system that enables teams of quadrupedal robots to collaboratively tow cable-connected loads through cluttered environments while managing hybrid physical interactions and maintaining consistent performance across varying team sizes.

Authors:Zhuoling Li, Hossein Rahmani, Qiuhong Ke, Jun Liu
Title: LongDiff: Training-Free Long Video Generation in One Go
Abstract:
Video diffusion models have recently achieved remarkable results in video generation. Despite their encouraging performance, most of these models are mainly designed and trained for short video generation, leading to challenges in maintaining temporal consistency and visual details in long video generation. In this paper, we propose LongDiff, a novel training-free method consisting of carefully designed components \ -- Position Mapping (PM) and Informative Frame Selection (IFS) \ -- to tackle two key challenges that hinder short-to-long video generation generalization: temporal position ambiguity and information dilution. Our LongDiff unlocks the potential of off-the-shelf video diffusion models to achieve high-quality long video generation in one go. Extensive experiments demonstrate the efficacy of our method.
中文: 视频扩散模型在短视频生成中表现出色,但在长视频生成中面临时间一致性和细节保持的挑战,为此提出的LongDiff无需训练即可有效提升现有模型一次性生成长视频的质量。
English: Video diffusion models excel in short video generation but struggle with temporal consistency and detail preservation in long videos, prompting the development of LongDiff, a training-free method that enhances their capability for high-quality, one-shot long video production.

Authors:Xiaofei Hui, Haoxuan Qu, Hossein Rahmani, Jun Liu
Title: An Image-like Diffusion Method for Human-Object Interaction Detection
Abstract:
Human-object interaction (HOI) detection often faces high levels of ambiguity and indeterminacy, as the same interaction can appear vastly different across different human-object pairs. Additionally, the indeterminacy can be further exacerbated by issues such as occlusions and cluttered backgrounds. To handle such a challenging task, in this work, we begin with a key observation: the output of HOI detection for each human-object pair can be recast as an image. Thus, inspired by the strong image generation capabilities of image diffusion models, we propose a new framework, HOI-IDiff. In HOI-IDiff, we tackle HOI detection from a novel perspective, using an Image-like Diffusion process to generate HOI detection outputs as images. Furthermore, recognizing that our recast images differ in certain properties from natural images, we enhance our framework with a customized HOI diffusion process and a slice patchification model architecture, which are specifically tailored to generate our recast ``HOI images''. Extensive experiments demonstrate the efficacy of our framework.
中文摘要:本文提出HOI-IDiff框架,通过将人-物交互检测重新定义为图像生成任务,采用定制化扩散过程和专门架构,有效解决了该任务中存在的模糊性和不确定性问题。
English Summary: This paper introduces HOI-IDiff, a novel framework that addresses the ambiguity and indeterminacy in human-object interaction detection by reformulating it as an image generation task using a customized diffusion process and specialized architecture.

Authors:Mahsa Khosravi, Zhanhong Jiang, Joshua R Waite, Sarah Jonesc, Hernan Torres, Arti Singh, Baskar Ganapathysubramanian, Asheesh Kumar Singh, Soumik Sarkar
Title: Optimizing Navigation And Chemical Application in Precision Agriculture With Deep Reinforcement Learning And Conditional Action Tree
Abstract:
This paper presents a novel reinforcement learning (RL)-based planning scheme for optimized robotic management of biotic stresses in precision agriculture. The framework employs a hierarchical decision-making structure with conditional action masking, where high-level actions direct the robot's exploration, while low-level actions optimize its navigation and efficient chemical spraying in affected areas. The key objectives of optimization include improving the coverage of infected areas with limited battery power and reducing chemical usage, thus preventing unnecessary spraying of healthy areas of the field. Our numerical experimental results demonstrate that the proposed method, Hierarchical Action Masking Proximal Policy Optimization (HAM-PPO), significantly outperforms baseline practices, such as LawnMower navigation + indiscriminate spraying (Carpet Spray), in terms of yield recovery and resource efficiency. HAM-PPO consistently achieves higher yield recovery percentages and lower chemical costs across a range of infection scenarios. The framework also exhibits robustness to observation noise and generalizability under diverse environmental conditions, adapting to varying infection ranges and spatial distribution patterns.
中文摘要:本文提出了一种名为HAM-PPO的分层强化学习框架,通过优化机器人导航和精准施药来提升受感染区域的覆盖效率,同时减少电池消耗和化学品使用,在各种环境条件下均展现出比传统方法更优异的产量恢复和资源利用效果。
English Summary: This paper introduces a hierarchical reinforcement learning framework called HAM-PPO that optimizes robotic crop management by improving infected area coverage while minimizing battery and chemical usage, demonstrating superior performance over traditional methods in yield recovery and resource efficiency across various conditions.

Authors:Lingfan Zhang, Chen Liu, Chengming Xu, Kai Hu, Donghao Luo, Chengjie Wang, Yanwei Fu, Yuan Yao
Title: When Preferences Diverge: Aligning Diffusion Models with Minority-Aware Adaptive DPO
Abstract:
In recent years, the field of image generation has witnessed significant advancements, particularly in fine-tuning methods that align models with universal human preferences. This paper explores the critical role of preference data in the training process of diffusion models, particularly in the context of Diffusion-DPO and its subsequent adaptations. We investigate the complexities surrounding universal human preferences in image generation, highlighting the subjective nature of these preferences and the challenges posed by minority samples in preference datasets. Through pilot experiments, we demonstrate the existence of minority samples and their detrimental effects on model performance. We propose Adaptive-DPO -- a novel approach that incorporates a minority-instance-aware metric into the DPO objective. This metric, which includes intra-annotator confidence and inter-annotator stability, distinguishes between majority and minority samples. We introduce an Adaptive-DPO loss function which improves the DPO loss in two ways: enhancing the model's learning of majority labels while mitigating the negative impact of minority samples. Our experiments demonstrate that this method effectively handles both synthetic minority data and real-world preference data, paving the way for more effective training methodologies in image generation tasks.
Chinese: 本文提出Adaptive-DPO方法,通过引入少数样本感知指标来区分并减轻少数偏好样本的负面影响,从而在图像生成中更好地对齐人类普遍偏好,提升扩散模型的训练效果。
English: This paper introduces Adaptive-DPO, a novel method that enhances diffusion model training by incorporating a minority-aware metric to distinguish and mitigate the negative impact of minority preference samples, thereby improving alignment with universal human preferences in image generation.

Authors:Kendong Liu, Zhiyu Zhu, Hui Liu, Junhui Hou
Title: Acc3D: Accelerating Single Image to 3D Diffusion Models via Edge Consistency Guided Score Distillation
Abstract:
We present Acc3D to tackle the challenge of accelerating the diffusion process to generate 3D models from single images. To derive high-quality reconstructions through few-step inferences, we emphasize the critical issue of regularizing the learning of score function in states of random noise. To this end, we propose edge consistency, i.e., consistent predictions across the high signal-to-noise ratio region, to enhance a pre-trained diffusion model, enabling a distillation-based refinement of the endpoint score function. Building on those distilled diffusion models, we propose an adversarial augmentation strategy to further enrich the generation detail and boost overall generation quality. The two modules complement each other, mutually reinforcing to elevate generative performance. Extensive experiments demonstrate that our Acc3D not only achieves over a $20\times$ increase in computational efficiency but also yields notable quality improvements, compared to the state-of-the-arts.
中文: Acc3D通过引入边缘一致性和对抗性增强技术,显著提升了从单张图像生成3D模型的速度与质量,在计算效率上实现超过20倍的提升,并优于现有最优方法。
English: Acc3D accelerates 3D model generation from single images by introducing edge consistency and adversarial augmentation, achieving over 20× computational efficiency and superior quality compared to existing methods.

Authors:Hongda Liu, Longguang Wang, Ye Zhang, Ziru Yu, Yulan Guo
Title: SaMam: Style-aware State Space Model for Arbitrary Image Style Transfer
Abstract:
Global effective receptive field plays a crucial role for image style transfer (ST) to obtain high-quality stylized results. However, existing ST backbones (e.g., CNNs and Transformers) suffer huge computational complexity to achieve global receptive fields. Recently, the State Space Model (SSM), especially the improved variant Mamba, has shown great potential for long-range dependency modeling with linear complexity, which offers a approach to resolve the above dilemma. In this paper, we develop a Mamba-based style transfer framework, termed SaMam. Specifically, a mamba encoder is designed to efficiently extract content and style information. In addition, a style-aware mamba decoder is developed to flexibly adapt to various styles. Moreover, to address the problems of local pixel forgetting, channel redundancy and spatial discontinuity of existing SSMs, we introduce both local enhancement and zigzag scan. Qualitative and quantitative results demonstrate that our SaMam outperforms state-of-the-art methods in terms of both accuracy and efficiency.
中文: SaMam框架采用基于Mamba的模型实现高效的全局感受野,通过局部增强和锯齿扫描解决现有方法的局限性,在图像风格迁移中展现出卓越的准确性和效率优势。
English: The SaMam framework utilizes a Mamba-based model to achieve efficient global receptive fields for high-quality image style transfer, incorporating local enhancement and zigzag scan to overcome limitations of existing methods while demonstrating superior accuracy and efficiency.

Authors:Yue Qiu, Yuqi Tong, Yu Zhang, Qixuan Liu, Jialun Pei, Shi Qiu, Pheng-Ann Heng, Chi-Wing Fu
Title: CvhSlicer 2.0: Immersive and Interactive Visualization of Chinese Visible Human Data in XR Environments
Abstract:
The study of human anatomy through advanced visualization techniques is crucial for medical research and education. In this work, we introduce CvhSlicer 2.0, an innovative XR system designed for immersive and interactive visualization of the Chinese Visible Human (CVH) dataset. Particularly, our proposed system operates entirely on a commercial XR headset, offering a range of visualization and interaction tools for dynamic 2D and 3D data exploration. By conducting comprehensive evaluations, our CvhSlicer 2.0 demonstrates strong capabilities in visualizing anatomical data, enhancing user engagement and improving educational effectiveness. A demo video is available at https://youtu.be/CfR72S_0N-4
中文: CvhSlicer 2.0 是一款创新的扩展现实系统,能在商用头显设备上实现对中国人可视化数据集的沉浸式可视化与交互操作,显著提升了人体解剖学研究的参与度和教学效果。
English: CvhSlicer 2.0 is an innovative XR system that provides immersive visualization and interactive tools for exploring the Chinese Visible Human dataset on commercial headsets, enhancing both user engagement and educational outcomes in anatomy studies.

Authors:Kushagra Gupta, Ross Allen, David Fridovich-Keil, Ufuk Topcu
Title: More Information is Not Always Better: Connections between Zero-Sum Local Nash Equilibria in Feedback and Open-Loop Information Patterns
Abstract:
Non-cooperative dynamic game theory provides a principled approach to modeling sequential decision-making among multiple noncommunicative agents. A key focus has been on finding Nash equilibria in two-agent zero-sum dynamic games under various information structures. A well-known result states that in linear-quadratic games, unique Nash equilibria under feedback and open-loop information structures yield identical trajectories. Motivated by two key perspectives -- (i) many real-world problems extend beyond linear-quadratic settings and lack unique equilibria, making only local Nash equilibria computable, and (ii) local open-loop Nash equilibria (OLNE) are easier to compute than local feedback Nash equilibria (FBNE) -- it is natural to ask whether a similar result holds for local equilibria in zero-sum games. To this end, we establish that for a broad class of zero-sum games with potentially nonconvex-nonconcave objectives and nonlinear dynamics: (i) the state/control trajectory of a local FBNE satisfies local OLNE first-order optimality conditions, and vice versa, (ii) a local FBNE trajectory satisfies local OLNE second-order necessary conditions, (iii) a local FBNE trajectory satisfying feedback sufficiency conditions also constitutes a local OLNE, and (iv) with additional hard constraints on agents' actuations, a local FBNE where strict complementarity holds also satisfies local OLNE first-order optimality conditions, and vice versa.
中文摘要:本文将线性二次博弈中反馈与开环纳什均衡的等价性推广至具有非线性动态的更广泛零和博弈,确立了二者轨迹与最优性条件相互对应的多种情形。
English Summary: This paper extends the equivalence between feedback and open-loop Nash equilibria from linear-quadratic games to broader zero-sum games with nonlinear dynamics, establishing conditions under which their trajectories and optimality conditions align.

Authors:Zineng Tang, Long Lian, Seun Eisape, XuDong Wang, Roei Herzig, Adam Yala, Alane Suhr, Trevor Darrell, David M. Chan
Title: TULIP: Towards Unified Language-Image Pretraining
Abstract:
Despite the recent success of image-text contrastive models like CLIP and SigLIP, these models often struggle with vision-centric tasks that demand high-fidelity image understanding, such as counting, depth estimation, and fine-grained object recognition. These models, by performing language alignment, tend to prioritize high-level semantics over visual understanding, weakening their image understanding. On the other hand, vision-focused models are great at processing visual information but struggle to understand language, limiting their flexibility for language-driven tasks. In this work, we introduce TULIP, an open-source, drop-in replacement for existing CLIP-like models. Our method leverages generative data augmentation, enhanced image-image and text-text contrastive learning, and image/text reconstruction regularization to learn fine-grained visual features while preserving global semantic alignment. Our approach, scaling to over 1B parameters, outperforms existing state-of-the-art (SOTA) models across multiple benchmarks, establishing a new SOTA zero-shot performance on ImageNet-1K, delivering up to a $2\times$ enhancement over SigLIP on RxRx1 in linear probing for few-shot classification, and improving vision-language models, achieving over $3\times$ higher scores than SigLIP on MMVP. Our code/checkpoints are available at https://tulip-berkeley.github.io
Chinese: TULIP 是一种开源模型,通过生成数据增强和对比学习提升视觉特征学习能力,在多项基准测试中,其零样本和少样本性能均优于现有模型如 CLIP 和 SigLIP。
English: TULIP is an open-source model that enhances visual feature learning through generative data augmentation and contrastive learning, achieving superior performance in zero-shot and few-shot tasks across multiple benchmarks compared to existing models like CLIP and SigLIP.

Authors:Shoubin Yu, Difan Liu, Ziqiao Ma, Yicong Hong, Yang Zhou, Hao Tan, Joyce Chai, Mohit Bansal
Title: VEGGIE: Instructional Editing and Reasoning of Video Concepts with Grounded Generation
Abstract:
Recent video diffusion models have enhanced video editing, but it remains challenging to handle instructional editing and diverse tasks (e.g., adding, removing, changing) within a unified framework. In this paper, we introduce VEGGIE, a Video Editor with Grounded Generation from Instructions, a simple end-to-end framework that unifies video concept editing, grounding, and reasoning based on diverse user instructions. Specifically, given a video and text query, VEGGIE first utilizes an MLLM to interpret user intentions in instructions and ground them to the video contexts, generating frame-specific grounded task queries for pixel-space responses. A diffusion model then renders these plans and generates edited videos that align with user intent. To support diverse tasks and complex instructions, we employ a curriculum learning strategy: first aligning the MLLM and video diffusion model with large-scale instructional image editing data, followed by end-to-end fine-tuning on high-quality multitask video data. Additionally, we introduce a novel data synthesis pipeline to generate paired instructional video editing data for model training. It transforms static image data into diverse, high-quality video editing samples by leveraging Image-to-Video models to inject dynamics. VEGGIE shows strong performance in instructional video editing with different editing skills, outperforming the best instructional baseline as a versatile model, while other models struggle with multi-tasking. VEGGIE also excels in video object grounding and reasoning segmentation, where other baselines fail. We further reveal how the multiple tasks help each other and highlight promising applications like zero-shot multimodal instructional and in-context video editing.
中文:VEGGIE是一个端到端的视频编辑框架,通过整合用户指令解析、视频定位和基于扩散模型的渲染技术,在统一系统中处理多样化编辑任务,在多任务处理场景中展现出卓越性能。
English: VEGGIE is an end-to-end video editing framework that integrates user instruction interpretation, video grounding, and diffusion-based rendering to handle diverse editing tasks within a unified system, demonstrating superior performance in multitasking scenarios.

Authors:Zongyun Zhang, Jiacheng Ruan, Xian Gao, Ting Liu, Yuzhuo Fu
Title: EIAD: Explainable Industrial Anomaly Detection Via Multi-Modal Large Language Models
Abstract:
Industrial Anomaly Detection (IAD) is critical to ensure product quality during manufacturing. Although existing zero-shot defect segmentation and detection methods have shown effectiveness, they cannot provide detailed descriptions of the defects. Furthermore, the application of large multi-modal models in IAD remains in its infancy, facing challenges in balancing question-answering (QA) performance and mask-based grounding capabilities, often owing to overfitting during the fine-tuning process. To address these challenges, we propose a novel approach that introduces a dedicated multi-modal defect localization module to decouple the dialog functionality from the core feature extraction. This decoupling is achieved through independent optimization objectives and tailored learning strategies. Additionally, we contribute to the first multi-modal industrial anomaly detection training dataset, named Defect Detection Question Answering (DDQA), encompassing a wide range of defect types and industrial scenarios. Unlike conventional datasets that rely on GPT-generated data, DDQA ensures authenticity and reliability and offers a robust foundation for model training. Experimental results demonstrate that our proposed method, Explainable Industrial Anomaly Detection Assistant (EIAD), achieves outstanding performance in defect detection and localization tasks. It not only significantly enhances accuracy but also improves interpretability. These advancements highlight the potential of EIAD for practical applications in industrial settings.
中文摘要:提出的可解释工业异常检测助手(EIAD)通过多模态缺陷定位模块将对话功能与特征提取解耦,采用独立优化目标和定制学习策略,利用真实可靠的DDQA数据集,在缺陷检测精度和可解释性方面均取得显著提升。
English Summary: The proposed Explainable Industrial Anomaly Detection Assistant (EIAD) introduces a novel multi-modal approach that decouples dialog functions from feature extraction through specialized optimization strategies, achieving superior defect detection accuracy and interpretability while utilizing the authentic DDQA dataset.

Authors:Moises Diaz, Miguel A. Ferrer, Juan M. Gil, Rafael Rodriguez, Peirong Zhang, Lianwen Jin
Title: Online Signature Verification based on the Lagrange formulation with 2D and 3D robotic models
Abstract:
Online Signature Verification commonly relies on function-based features, such as time-sampled horizontal and vertical coordinates, as well as the pressure exerted by the writer, obtained through a digitizer. Although inferring additional information about the writers arm pose, kinematics, and dynamics based on digitizer data can be useful, it constitutes a challenge. In this paper, we tackle this challenge by proposing a new set of features based on the dynamics of online signatures. These new features are inferred through a Lagrangian formulation, obtaining the sequences of generalized coordinates and torques for 2D and 3D robotic arm models. By combining kinematic and dynamic robotic features, our results demonstrate their significant effectiveness for online automatic signature verification and achieving state-of-the-art results when integrated into deep learning models.
Chinese: 本文提出了一种基于拉格朗日公式的新型动态特征,用于在线签名验证,结合深度学习模型中的运动学特征,实现了最先进的性能。
English: This paper introduces a novel set of dynamic features derived from a Lagrangian formulation for online signature verification, which, when combined with kinematic features in deep learning models, achieves state-of-the-art performance.

Authors:David Noever, Forrest McKee
Title: Dueling QR Codes: The Hyding of Dr. Jeckyl
Abstract:
The paper presents a novel technique for encoding dual messages within standard Quick Response (QR) codes through precise half-pixel module splitting. This work challenges fundamental assumptions about deterministic decoding in the ISO/IEC 18004:2015 standard while maintaining complete compatibility with existing QR infrastructure. The proposed two-dimensional barcode attack enables angle-dependent message selection while maintaining compatibility with unmodified QR readers and the 100 million US mobile users who use their phone's built-in scanners. Unlike previous approaches that rely on nested codes, watermarking, or error correction exploitation, our method achieves true one-to-many mapping by manipulating the physical sampling process built into the QR standard. By preserving critical function patterns while bifurcating data modules, we create automated codes that produce different but valid readings based on camera viewing angle. Experimental results demonstrate successful implementation across multiple use cases, including simple message text pairs, complex URLs (nsa.gov/nasa.gov), and security test patterns for malware and spam detectors (EICAR/GTUBE). Our technique achieves reliable dual-message decoding using standard QR readers at module scales of 9-11 pixels, with successful angle-dependent reading demonstrated across vertical, horizontal, and diagonal orientations. The method's success suggests potential applications beyond QR code phishing ('quishing') including two-factor authentication, anti-counterfeiting, and information density optimization. The half-pixel technique may offer future avenues for similar implementations in other 2D barcode formats such as Data Matrix and Aztec Code.
中文摘要:本文提出了一种通过精确的半像素模块分割在标准QR码中嵌入双重信息的新技术,能够在保持与现有QR基础设施完全兼容的同时,实现基于视角选择不同消息的功能。
English Summary: This paper introduces a novel technique for embedding dual messages in QR codes using half-pixel module splitting, enabling angle-dependent message selection while maintaining full compatibility with standard QR readers and infrastructure.

Authors:Xueying Jiang, Wenhao Li, Xiaoqin Zhang, Ling Shao, Shijian Lu
Title: Exploring 3D Reasoning-Driven Planning: From Implicit Human Intentions to Route-Aware Activity Planning
Abstract:
3D task planning has attracted increasing attention in human-robot interaction and embodied AI thanks to the recent advances in multimodal learning. However, most existing studies are facing two common challenges: 1) heavy reliance on explicit instructions with little reasoning on implicit user intention; 2) negligence of inter-step route planning on robot moves. We address the above challenges by proposing 3D Reasoning-Driven Planning, a novel 3D task that reasons the intended activities from implicit instructions and decomposes them into steps with inter-step routes and planning under the guidance of fine-grained 3D object shapes and locations from scene segmentation. We tackle the new 3D task from two perspectives. First, we construct ReasonPlan3D, a large-scale benchmark that covers diverse 3D scenes with rich implicit instructions and detailed annotations for multi-step task planning, inter-step route planning, and fine-grained segmentation. Second, we design a novel framework that introduces progressive plan generation with contextual consistency across multiple steps, as well as a scene graph that is updated dynamically for capturing critical objects and their spatial relations. Extensive experiments demonstrate the effectiveness of our benchmark and framework in reasoning activities from implicit human instructions, producing accurate stepwise task plans and seamlessly integrating route planning for multi-step moves. The dataset and code will be released.
Chinese: 提出的3D推理驱动规划通过解析隐含用户意图并整合步骤间路径规划,结合新构建的基准数据集和动态场景图框架,有效解决了现有方法在隐性推理与连续动作规划方面的不足。
English: The proposed 3D Reasoning-Driven Planning addresses the limitations of existing methods by reasoning implicit user intentions and incorporating inter-step route planning, supported by a new benchmark and framework that demonstrate superior performance in multi-step task execution.

Authors:Guoliang Xu, Jianqin Yin, Ren Zhang, Yonghao Dang, Feng Zhou, Bo Yu
Title: L2HCount:Generalizing Crowd Counting from Low to High Crowd Density via Density Simulation
Abstract:
Since COVID-19, crowd-counting tasks have gained wide applications. While supervised methods are reliable, annotation is more challenging in high-density scenes due to small head sizes and severe occlusion, whereas it's simpler in low-density scenes. Interestingly, can we train the model in low-density scenes and generalize it to high-density scenes? Therefore, we propose a low- to high-density generalization framework (L2HCount) that learns the pattern related to high-density scenes from low-density ones, enabling it to generalize well to high-density scenes. Specifically, we first introduce a High-Density Simulation Module and a Ground-Truth Generation Module to construct fake high-density images along with their corresponding ground-truth crowd annotations respectively by image-shifting technique, effectively simulating high-density crowd patterns. However, the simulated images have two issues: image blurring and loss of low-density image characteristics. Therefore, we second propose a Head Feature Enhancement Module to extract clear features in the simulated high-density scene. Third, we propose a Dual-Density Memory Encoding Module that uses two crowd memories to learn scene-specific patterns from low- and simulated high-density scenes, respectively. Extensive experiments on four challenging datasets have shown the promising performance of L2HCount.
中文摘要:L2HCount框架通过高密度场景模拟、特征增强和双密度记忆编码,实现了在低密度场景训练的模型向高密度人群计数任务的有效迁移。
English Summary: The L2HCount framework enables crowd-counting models trained on low-density scenes to generalize effectively to high-density scenes through high-density simulation, feature enhancement, and dual-density memory encoding.

Authors:Lingyi Wang, Wei Wu, Fuhui Zhou, Zhijin Qin, Qihui Wu
Title: Cross-Layer Security for Semantic Communications: Metrics and Optimization
Abstract:
Different from traditional secure communication that focuses on symbolic protection at the physical layer, semantic secure communication requires further attention to semantic-level task performance at the application layer. There is a research gap on how to comprehensively evaluate and optimize the security performance of semantic communication. In order to fill this gap, a unified semantic security metric, the cross-layer semantic secure rate (CLSSR), is defined to estimate cross-layer security requirements at both the physical layer and the application layer. Then, we formulate the maximization problem of the CLSSR with the mixed integer nonlinear programming (MINLP). We propose a hierarchical AI-native semantic secure communication network with a reinforcement learning (RL)-based semantic resource allocation scheme, aiming to ensure the cross-layer semantic security (CL-SS). Finally, we prove the convergence of our proposed intelligent resource allocation, and the simulation results demonstrate that our proposed CLSS method outperforms the traditional physical layer semantic security (PL-SS) method in terms of both task reliability and CLSSR.
中文摘要:本研究提出了跨层语义安全率(CLSSR)指标和基于强化学习的资源分配方案,以增强跨物理层和应用层的语义通信安全性,实验证明其在任务可靠性和安全指标上均优于传统方法。
English Summary: This study introduces a cross-layer semantic secure rate (CLSSR) metric and a reinforcement learning-based resource allocation scheme to enhance semantic communication security across physical and application layers, demonstrating superior performance over traditional methods in both task reliability and security metrics.

Authors:Ying Zang, Yuncan Gao, Jiangi Zhang, Yuangi Hu, Runlong Cao, Lanyun Zhu, Qi Zhu, Deyi Ji, Renjun Xu, Tianrun Chen
Title: Breaking the Box: Enhancing Remote Sensing Image Segmentation with Freehand Sketches
Abstract:
This work advances zero-shot interactive segmentation for remote sensing imagery through three key contributions. First, we propose a novel sketch-based prompting method, enabling users to intuitively outline objects, surpassing traditional point or box prompts. Second, we introduce LTL-Sensing, the first dataset pairing human sketches with remote sensing imagery, setting a benchmark for future research. Third, we present LTL-Net, a model featuring a multi-input prompting transport module tailored for freehand sketches. Extensive experiments show our approach significantly improves segmentation accuracy and robustness over state-of-the-art methods like SAM, fostering more intuitive human-AI collaboration in remote sensing analysis and enhancing its applications.
中文摘要:本研究通过提出基于草图的提示方法、首个结合手绘草图的遥感数据集LTL-Sensing及专用模型LTL-Net,显著提升了遥感图像零样本交互分割的准确性和鲁棒性,优于现有先进技术如SAM。
English Summary: This study enhances zero-shot interactive segmentation in remote sensing by introducing a sketch-based prompting method, the LTL-Sensing dataset with human sketches, and the LTL-Net model, which collectively improve segmentation accuracy and robustness over existing methods like SAM.

Authors:Walter Zimmer, Ross Greer, Daniel Lehmberg, Marc Pavel, Holger Caesar, Xingcheng Zhou, Ahmed Ghita, Mohan Trivedi, Rui Song, Hu Cao, Akshay Gopalkrishnan, Alois C. Knoll
Title: Towards Vision Zero: The TUM Traffic Accid3nD Dataset
Abstract:
Even though a significant amount of work has been done to increase the safety of transportation networks, accidents still occur regularly. They must be understood as unavoidable and sporadic outcomes of traffic networks. No public dataset contains 3D annotations of real-world accidents recorded from roadside camera and LiDAR sensors. We present the TUM Traffic Accid3nD (TUMTraf-Accid3nD) dataset, a collection of real-world highway accidents in different weather and lighting conditions. It contains vehicle crashes at high-speed driving with 2,634,233 labeled 2D bounding boxes, instance masks, and 3D bounding boxes with track IDs. In total, the dataset contains 111,945 labeled image and point cloud frames recorded from four roadside cameras and LiDARs at 25 Hz. The dataset contains six object classes and is provided in the OpenLABEL format. We propose an accident detection model that combines a rule-based approach with a learning-based one. Experiments and ablation studies on our dataset show the robustness of our proposed method. The dataset, model, and code are available on our website: https://accident-dataset.github.io.
中文: TUMTraf-Accid3D数据集首次公开了从路侧摄像头和激光雷达采集的真实高速公路事故三维标注数据,包含260多万个多条件标注样本,并提出经实验验证的混合事故检测模型。
English: The TUMTraf-Accid3D dataset introduces the first publicly available 3D-annotated collection of real-world highway accidents from roadside cameras and LiDARs, featuring over 2.6 million labels across diverse conditions, and proposes a hybrid accident detection model validated through experiments.

Authors:Zhengrong Yue, Shaobin Zhuang, Kunchang Li, Yanbo Ding, Yali Wang
Title: V-Stylist: Video Stylization via Collaboration and Reflection of MLLM Agents
Abstract:
Despite the recent advancement in video stylization, most existing methods struggle to render any video with complex transitions, based on an open style description of user query. To fill this gap, we introduce a generic multi-agent system for video stylization, V-Stylist, by a novel collaboration and reflection paradigm of multi-modal large language models. Specifically, our V-Stylist is a systematical workflow with three key roles: (1) Video Parser decomposes the input video into a number of shots and generates their text prompts of key shot content. Via a concise video-to-shot prompting paradigm, it allows our V-Stylist to effectively handle videos with complex transitions. (2) Style Parser identifies the style in the user query and progressively search the matched style model from a style tree. Via a robust tree-of-thought searching paradigm, it allows our V-Stylist to precisely specify vague style preference in the open user query. (3) Style Artist leverages the matched model to render all the video shots into the required style. Via a novel multi-round self-reflection paradigm, it allows our V-Stylist to adaptively adjust detail control, according to the style requirement. With such a distinct design of mimicking human professionals, our V-Stylist achieves a major breakthrough over the primary challenges for effective and automatic video stylization. Moreover,we further construct a new benchmark Text-driven Video Stylization Benchmark (TVSBench), which fills the gap to assess stylization of complex videos on open user queries. Extensive experiments show that, V-Stylist achieves the state-of-the-art, e.g.,V-Stylist surpasses FRESCO and ControlVideo by 6.05% and 4.51% respectively in overall average metrics, marking a significant advance in video stylization.
中文: V-Stylist通过多模态大语言模型的协作与反思机制,提出一种多智能体系统,能有效处理复杂过渡视频并精准解析用户开放查询中的风格偏好,实现了视频风格化的重大突破。
English: V-Stylist introduces a multi-agent system using multimodal large language models to effectively stylize videos with complex transitions and interpret open-ended user queries, achieving state-of-the-art performance.

Authors:Abhishek Moitra, Arkapravo Ghosh, Shrey Agarwal, Aporva Amarnath, Karthik Swaminathan, Priyadarshini Panda
Title: MEADOW: Memory-efficient Dataflow and Data Packing for Low Power Edge LLMs
Abstract:
The computational and memory challenges of large language models (LLMs) have sparked several optimization approaches towards their efficient implementation. While prior LLM-targeted quantization, and prior works on sparse acceleration have significantly mitigated the memory and computation bottleneck, they do so assuming high power platforms such as GPUs and server-class FPGAs with large off-chip memory bandwidths and employ a generalized matrix multiplication (GEMM) execution of all the layers in the decoder. In such a GEMM-based execution, data is fetched from an off-chip memory, computed and stored back. However, at reduced off-chip memory capacities, as is the case with low-power edge devices, this implementation strategy significantly increases the attention computation latency owing to the repeated storage and fetch of large intermediate tokens to and from the off-chip memory. Moreover, fetching the weight matrices from a bandwidth constrained memory further aggravates the memory bottleneck problem. To this end, we introduce MEADOW, a framework that significantly reduces the off-chip memory access for LLMs with a novel token-parallel head-sequential (TPHS) dataflow. Additionally, MEADOW applies weight packing that performs loss-less decomposition of large weight matrices to their unique elements thereby, reducing the enormous weight fetch latency. MEADOW demonstrates 1.5x and 2.5x lower decode and prefill latency, respectively, compared to a GEMM-based LLM implementation on the low power Xilinx ZCU102 FPGA platform that consumes less than 10W. Additionally, MEADOW achieves an end-to-end latency improvement of over 40%, compared to prior LLM optimization works.
中文: MEADOW框架通过创新的令牌并行头顺序数据流和权重打包技术,显著减少大型语言模型的片外存储器访问,在低功耗设备上相比传统GEMM实现实现了显著的延迟改善。
English: MEADOW is a framework that reduces off-chip memory access for LLMs through a token-parallel head-sequential dataflow and weight packing, achieving significant latency improvements on low-power devices compared to traditional GEMM-based implementations.

Authors:Matteo Farina, Massimiliano Mancini, Giovanni Iacca, Elisa Ricci
Title: Rethinking Few-Shot Adaptation of Vision-Language Models in Two Stages
Abstract:
An old-school recipe for training a classifier is to (i) learn a good feature extractor and (ii) optimize a linear layer atop. When only a handful of samples are available per category, as in Few-Shot Adaptation (FSA), data are insufficient to fit a large number of parameters, rendering the above impractical. This is especially true with large pre-trained Vision-Language Models (VLMs), which motivated successful research at the intersection of Parameter-Efficient Fine-tuning (PEFT) and FSA. In this work, we start by analyzing the learning dynamics of PEFT techniques when trained on few-shot data from only a subset of categories, referred to as the ``base'' classes. We show that such dynamics naturally splits into two distinct phases: (i) task-level feature extraction and (ii) specialization to the available concepts. To accommodate this dynamic, we then depart from prompt- or adapter-based methods and tackle FSA differently. Specifically, given a fixed computational budget, we split it to (i) learn a task-specific feature extractor via PEFT and (ii) train a linear classifier on top. We call this scheme Two-Stage Few-Shot Adaptation (2SFS). Differently from established methods, our scheme enables a novel form of selective inference at a category level, i.e., at test time, only novel categories are embedded by the adapted text encoder, while embeddings of base categories are available within the classifier. Results with fixed hyperparameters across two settings, three backbones, and eleven datasets, show that 2SFS matches or surpasses the state-of-the-art, while established methods degrade significantly across settings.
Chinese: 本文提出两阶段少样本适应方法(2SFS),通过分离特征提取和分类器训练来改进大型视觉语言模型的少样本学习,在多个数据集和设置中实现了最先进的性能。
English: This paper introduces Two-Stage Few-Shot Adaptation (2SFS), a method that separates feature extraction and classifier training to improve few-shot learning with large vision-language models, achieving state-of-the-art results across multiple datasets and settings.

Authors:Shaofeng Liang, Runwei Guan, Wangwang Lian, Daizong Liu, Xiaolou Sun, Dongming Wu, Yutao Yue, Weiping Ding, Hui Xiong
Title: Cognitive Disentanglement for Referring Multi-Object Tracking
Abstract:
As a significant application of multi-source information fusion in intelligent transportation perception systems, Referring Multi-Object Tracking (RMOT) involves localizing and tracking specific objects in video sequences based on language references. However, existing RMOT approaches often treat language descriptions as holistic embeddings and struggle to effectively integrate the rich semantic information contained in language expressions with visual features. This limitation is especially apparent in complex scenes requiring comprehensive understanding of both static object attributes and spatial motion information. In this paper, we propose a Cognitive Disentanglement for Referring Multi-Object Tracking (CDRMT) framework that addresses these challenges. It adapts the "what" and "where" pathways from the human visual processing system to RMOT tasks. Specifically, our framework first establishes cross-modal connections while preserving modality-specific characteristics. It then disentangles language descriptions and hierarchically injects them into object queries, refining object understanding from coarse to fine-grained semantic levels. Finally, we reconstruct language representations based on visual features, ensuring that tracked objects faithfully reflect the referring expression. Extensive experiments on different benchmark datasets demonstrate that CDRMT achieves substantial improvements over state-of-the-art methods, with average gains of 6.0% in HOTA score on Refer-KITTI and 3.2% on Refer-KITTI-V2. Our approach advances the state-of-the-art in RMOT while simultaneously providing new insights into multi-source information fusion.
中文:本文提出的CDRMT框架通过解构语言描述为分层语义组件并与视觉特征融合,显著提升了指代多目标追踪性能,在多个基准数据集上实现了突破性进展。
English: The proposed CDRMT framework enhances Referring Multi-Object Tracking by disentangling language descriptions into hierarchical semantic components and integrating them with visual features, achieving significant performance improvements on benchmark datasets.

Authors:Ke Wang, Lei He, Kun Liu, Yan Deng, Wenning Wei, Sheng Zhao
Title: Exploring the Potential of Large Multimodal Models as Effective Alternatives for Pronunciation Assessment
Abstract:
Large Multimodal Models (LMMs) have demonstrated exceptional performance across a wide range of domains. This paper explores their potential in pronunciation assessment tasks, with a particular focus on evaluating the capabilities of the Generative Pre-trained Transformer (GPT) model, specifically GPT-4o. Our study investigates its ability to process speech and audio for pronunciation assessment across multiple levels of granularity and dimensions, with an emphasis on feedback generation and scoring. For our experiments, we use the publicly available Speechocean762 dataset. The evaluation focuses on two key aspects: multi-level scoring and the practicality of the generated feedback. Scoring results are compared against the manual scores provided in the Speechocean762 dataset, while feedback quality is assessed using Large Language Models (LLMs). The findings highlight the effectiveness of integrating LMMs with traditional methods for pronunciation assessment, offering insights into the model's strengths and identifying areas for further improvement.
中文: 本研究通过使用Speechocean762数据集证明,大型多模态模型(特别是GPT-4o)在发音评估中能有效实现多维度评分和反馈生成,与传统方法结合展现出良好应用前景。
English: This study demonstrates that Large Multimodal Models, particularly GPT-4o, effectively assess pronunciation through multi-level scoring and feedback generation when evaluated on the Speechocean762 dataset, showing promise when combined with traditional methods.

Authors:Xiangyu Yin, Yi Qi, Jinwei Hu, Zhen Chen, Yi Dong, Xingyu Zhao, Xiaowei Huang, Wenjie Ruan
Title: TAIJI: Textual Anchoring for Immunizing Jailbreak Images in Vision Language Models
Abstract:
Vision Language Models (VLMs) have demonstrated impressive inference capabilities, but remain vulnerable to jailbreak attacks that can induce harmful or unethical responses. Existing defence methods are predominantly white-box approaches that require access to model parameters and extensive modifications, making them costly and impractical for many real-world scenarios. Although some black-box defences have been proposed, they often impose input constraints or require multiple queries, limiting their effectiveness in safety-critical tasks such as autonomous driving. To address these challenges, we propose a novel black-box defence framework called \textbf{T}extual \textbf{A}nchoring for \textbf{I}mmunizing \textbf{J}ailbreak \textbf{I}mages (\textbf{TAIJI}). TAIJI leverages key phrase-based textual anchoring to enhance the model's ability to assess and mitigate the harmful content embedded within both visual and textual prompts. Unlike existing methods, TAIJI operates effectively with a single query during inference, while preserving the VLM's performance on benign tasks. Extensive experiments demonstrate that TAIJI significantly enhances the safety and reliability of VLMs, providing a practical and efficient solution for real-world deployment.
中文摘要:提出的TAIJI框架通过基于关键短语的文本锚定技术,为视觉语言模型提供了一种新颖的黑盒防御方法,能够有效检测和缓解恶意内容,仅需单次查询即可运行,同时保持模型在正常任务上的性能。
English Summary: The proposed TAIJI framework provides a novel black-box defense against jailbreak attacks in Vision Language Models by using textual anchoring to detect and mitigate harmful content, requiring only a single query while maintaining performance on benign tasks.

Authors:Lukas Aichberger, Alasdair Paren, Yarin Gal, Philip Torr, Adel Bibi
Title: Attacking Multimodal OS Agents with Malicious Image Patches
Abstract:
Recent advances in operating system (OS) agents enable vision-language models to interact directly with the graphical user interface of an OS. These multimodal OS agents autonomously perform computer-based tasks in response to a single prompt via application programming interfaces (APIs). Such APIs typically support low-level operations, including mouse clicks, keyboard inputs, and screenshot captures. We introduce a novel attack vector: malicious image patches (MIPs) that have been adversarially perturbed so that, when captured in a screenshot, they cause an OS agent to perform harmful actions by exploiting specific APIs. For instance, MIPs embedded in desktop backgrounds or shared on social media can redirect an agent to a malicious website, enabling further exploitation. These MIPs generalise across different user requests and screen layouts, and remain effective for multiple OS agents. The existence of such attacks highlights critical security vulnerabilities in OS agents, which should be carefully addressed before their widespread adoption.
中文: 新型操作系统代理通过视觉语言模型和API执行任务,但易受恶意图像补丁攻击,这些补丁能利用API触发有害行为,如重定向至恶意网站,凸显了在广泛采用前必须解决的安全漏洞。
English: Recent OS agents that use vision-language models to perform computer tasks via APIs are vulnerable to malicious image patches (MIPs), which can exploit APIs to trigger harmful actions like redirecting to malicious websites, highlighting critical security risks before widespread adoption.

Authors:Chen Chen, Rui Qian, Wenze Hu, Tsu-Jui Fu, Jialing Tong, Xinze Wang, Lezhi Li, Bowen Zhang, Alex Schwing, Wei Liu, Yinfei Yang
Title: DiT-Air: Revisiting the Efficiency of Diffusion Model Architecture Design in Text to Image Generation
Abstract:
In this work, we empirically study Diffusion Transformers (DiTs) for text-to-image generation, focusing on architectural choices, text-conditioning strategies, and training protocols. We evaluate a range of DiT-based architectures--including PixArt-style and MMDiT variants--and compare them with a standard DiT variant which directly processes concatenated text and noise inputs. Surprisingly, our findings reveal that the performance of standard DiT is comparable with those specialized models, while demonstrating superior parameter-efficiency, especially when scaled up. Leveraging the layer-wise parameter sharing strategy, we achieve a further reduction of 66% in model size compared to an MMDiT architecture, with minimal performance impact. Building on an in-depth analysis of critical components such as text encoders and Variational Auto-Encoders (VAEs), we introduce DiT-Air and DiT-Air-Lite. With supervised and reward fine-tuning, DiT-Air achieves state-of-the-art performance on GenEval and T2I CompBench, while DiT-Air-Lite remains highly competitive, surpassing most existing models despite its compact size.
中文摘要:本研究发现标准扩散变换器在文本生成图像任务中与专业模型性能相当且参数效率更高,并通过架构优化和训练策略推出的DiT-Air模型实现了最先进的性能表现。
English Summary: This study finds that standard Diffusion Transformers (DiTs) match the performance of specialized text-to-image models while being more parameter-efficient, and introduces DiT-Air models that achieve state-of-the-art results through optimized architecture and training.

Authors:Liming Wu, Wenbing Huang, Rui Jiao, Jianxing Huang, Liwei Liu, Yipeng Zhou, Hao Sun, Yang Liu, Fuchun Sun, Yuxiang Ren, Jirong Wen
Title: Siamese Foundation Models for Crystal Structure Prediction
Abstract:
Crystal Structure Prediction (CSP), which aims to generate stable crystal structures from compositions, represents a critical pathway for discovering novel materials. While structure prediction tasks in other domains, such as proteins, have seen remarkable progress, CSP remains a relatively underexplored area due to the more complex geometries inherent in crystal structures. In this paper, we propose Siamese foundation models specifically designed to address CSP. Our pretrain-finetune framework, named DAO, comprises two complementary foundation models: DAO-G for structure generation and DAO-P for energy prediction. Experiments on CSP benchmarks (MP-20 and MPTS-52) demonstrate that our DAO-G significantly surpasses state-of-the-art (SOTA) methods across all metrics. Extensive ablation studies further confirm that DAO-G excels in generating diverse polymorphic structures, and the dataset relaxation and energy guidance provided by DAO-P are essential for enhancing DAO-G's performance. When applied to three real-world superconductors ($\text{CsV}_3\text{Sb}_5$, $ \text{Zr}_{16}\text{Rh}_8\text{O}_4$ and $\text{Zr}_{16}\text{Pd}_8\text{O}_4$) that are known to be challenging to analyze, our foundation models achieve accurate critical temperature predictions and structure generations. For instance, on $\text{CsV}_3\text{Sb}_5$, DAO-G generates a structure close to the experimental one with an RMSE of 0.0085; DAO-P predicts the $T_c$ value with high accuracy (2.26 K vs. the ground-truth value of 2.30 K). In contrast, conventional DFT calculators like Quantum Espresso only successfully derive the structure of the first superconductor within an acceptable time, while the RMSE is nearly 8 times larger, and the computation speed is more than 1000 times slower. These compelling results collectively highlight the potential of our approach for advancing materials science research and development.
Chinese: 本文提出的孪生基础模型DAO框架在晶体结构预测领域表现卓越,通过结构生成与能量预测的双重优化,在基准测试和实际超导材料案例中均显著超越现有方法,展现出推动材料科学发展的巨大潜力。
English: This paper introduces DAO, a Siamese foundation model framework for crystal structure prediction (CSP) that significantly outperforms existing methods in generating accurate structures and predicting properties, as demonstrated on benchmark datasets and challenging superconductors.

Authors:Yunxiao Wang, Meng Liu, Wenqi Liu, Xuemeng Song, Bin Wen, Fan Yang, Tingting Gao, Di Zhang, Guorui Zhou, Liqiang Nie
Title: TIME: Temporal-Sensitive Multi-Dimensional Instruction Tuning and Robust Benchmarking for Video-LLMs
Abstract:
Video large language models have achieved remarkable performance in tasks such as video question answering, however, their temporal understanding remains suboptimal. To address this limitation, we curate a dedicated instruction fine-tuning dataset that focuses on enhancing temporal comprehension across five key dimensions. In order to reduce reliance on costly temporal annotations, we introduce a multi-task prompt fine-tuning approach that seamlessly integrates temporal-sensitive tasks into existing instruction datasets without requiring additional annotations. Furthermore, we develop a novel benchmark for temporal-sensitive video understanding that not only fills the gaps in dimension coverage left by existing benchmarks but also rigorously filters out potential shortcuts, ensuring a more accurate evaluation. Extensive experimental results demonstrate that our approach significantly enhances the temporal understanding of video-LLMs while avoiding reliance on shortcuts.
中文: 我们通过引入专门的指令数据集和多任务提示微调方法,显著提升了视频大语言模型的时间理解能力,并通过消除评估捷径的新基准验证了其有效性。
English: Our method significantly improves video-LLMs' temporal understanding by introducing a specialized instruction dataset and multi-task prompt fine-tuning, validated through a novel benchmark that eliminates evaluation shortcuts.

Authors:Lutfi Eren Erdogan, Nicholas Lee, Sehoon Kim, Suhong Moon, Hiroki Furuta, Gopala Anumanchipalli, Kurt Keutzer, Amir Gholami
Title: Plan-and-Act: Improving Planning of Agents for Long-Horizon Tasks
Abstract:
Large language models (LLMs) have shown remarkable advancements in enabling language agents to tackle simple tasks. However, applying them for complex, multi-step, long-horizon tasks remains a challenge. Recent work have found success by separating high-level planning from low-level execution, which enables the model to effectively balance high-level planning objectives and low-level execution details. However, generating accurate plans remains difficult since LLMs are not inherently trained for this task. To address this, we propose Plan-and-Act, a novel framework that incorporates explicit planning into LLM-based agents and introduces a scalable method to enhance plan generation through a novel synthetic data generation method. Plan-and-Act consists of a Planner model which generates structured, high-level plans to achieve user goals, and an Executor model that translates these plans into environment-specific actions. To train the Planner effectively, we introduce a synthetic data generation method that annotates ground-truth trajectories with feasible plans, augmented with diverse and extensive examples to enhance generalization. We evaluate Plan-and-Act using web navigation as a representative long-horizon planning environment, demonstrating a state-of-the-art 57.58% success rate on the WebArena-Lite benchmark as well as a text-only state-of-the-art 81.36% success rate on WebVoyager.
中文: 大语言模型在处理复杂多步骤任务时存在局限,因此Plan-and-Act框架通过规划器生成结构化高层计划、执行器转化为具体操作,并采用合成数据训练方法,在网络导航任务中实现了最先进的成功率。
English: Large language models struggle with complex multi-step tasks, so the Plan-and-Act framework introduces a Planner for structured high-level planning and an Executor for action implementation, achieving state-of-the-art success rates in web navigation benchmarks through synthetic data training.

Authors:Yaorui Shi, Jiaqi Yang, Changhao Nai, Sihang Li, Junfeng Fang, Xiang Wang, Zhiyuan Liu, Yang Zhang
Title: Language-Enhanced Representation Learning for Single-Cell Transcriptomics
Abstract:
Single-cell RNA sequencing (scRNA-seq) offers detailed insights into cellular heterogeneity. Recent advancements leverage single-cell large language models (scLLMs) for effective representation learning. These models focus exclusively on transcriptomic data, neglecting complementary biological knowledge from textual descriptions. To overcome this limitation, we propose scMMGPT, a novel multimodal framework designed for language-enhanced representation learning in single-cell transcriptomics. Unlike existing methods, scMMGPT employs robust cell representation extraction, preserving quantitative gene expression data, and introduces an innovative two-stage pre-training strategy combining discriminative precision with generative flexibility. Extensive experiments demonstrate that scMMGPT significantly outperforms unimodal and multimodal baselines across key downstream tasks, including cell annotation and clustering, and exhibits superior generalization in out-of-distribution scenarios.
中文: scMMGPT框架通过融合单细胞转录组数据与文本知识,采用两阶段预训练策略,在细胞注释和聚类等任务中显著优于现有方法,并展现出优异的泛化能力。
English: The proposed scMMGPT framework integrates multimodal learning with single-cell transcriptomics, combining quantitative gene expression data and textual knowledge through a two-stage pre-training strategy to achieve superior performance in cell annotation and clustering tasks.

Authors:Zhoutong Ye, Mingze Sun, Huan-ang Gao, Chun Yu, Yuanchun Shi
Title: MOAT: Evaluating LMMs for Capability Integration and Instruction Grounding
Abstract:
Large multimodal models (LMMs) have demonstrated significant potential as generalists in vision-language (VL) tasks. However, there remains a significant gap between state-of-the-art LMMs and human performance when it comes to complex tasks that require a combination of fundamental VL capabilities, as well as tasks involving the grounding of complex instructions. To thoroughly investigate the human-LMM gap and its underlying causes, we propose MOAT, a diverse benchmark with complex real-world VL tasks that are challenging for LMMs. Specifically, the tasks in MOAT require LMMs to engage in generalist problem solving by integrating fundamental VL capabilities such as reading text, counting, understanding spatial relations, grounding textual and visual instructions, etc. All these abilities fit into a taxonomy proposed by us that contains 10 fundamental VL capabilities, enabling MOAT to provide a fine-grained view of LMMs' strengths and weaknesses. Besides, MOAT is the first benchmark to explicitly evaluate LMMs' ability to ground complex text and visual instructions, which is essential to many real-world applications. We evaluate over 20 proprietary and open source LMMs, as well as humans, on MOAT, and found that humans achieved 82.7% accuracy while the best performing LMM (OpenAI o1) achieved only 38.8%. To guide future model development, we analyze common trends in our results and discuss the underlying causes of observed performance gaps between LMMs and humans, focusing on which VL capability forms the bottleneck in complex tasks, whether test time scaling improves performance on MOAT, and how tiling harms LMMs' capability to count. Code and data are available at https://cambrian-yzt.github.io/MOAT.
大型多模态模型在视觉语言任务中展现出潜力,但在需要整合多种基础能力的复杂任务上仍与人类表现存在显著差距,MOAT基准测试显示人类准确率达82.7%而最佳模型仅为38.8%。
Large multimodal models show promise in vision-language tasks but still lag significantly behind human performance on complex tasks requiring integrated capabilities, as demonstrated by the MOAT benchmark where humans achieved 82.7% accuracy versus the top model's 38.8%.

Authors:Luozheng Qin, Zhiyu Tan, Mengping Yang, Xiaomeng Yang, Hao Li
Title: Cockatiel: Ensembling Synthetic and Human Preferenced Training for Detailed Video Caption
Abstract:
Video Detailed Captioning (VDC) is a crucial task for vision-language bridging, enabling fine-grained descriptions of complex video content. In this paper, we first comprehensively benchmark current state-of-the-art approaches and systematically identified two critical limitations: biased capability towards specific captioning aspect and misalignment with human preferences. To address these deficiencies, we propose Cockatiel, a novel three-stage training pipeline that ensembles synthetic and human-aligned training for improving VDC performance. In the first stage, we derive a scorer from a meticulously annotated dataset to select synthetic captions high-performing on certain fine-grained video-caption alignment and human-preferred while disregarding others. Then, we train Cockatiel-13B, using this curated dataset to infuse it with assembled model strengths and human preferences. Finally, we further distill Cockatiel-8B from Cockatiel-13B for the ease of usage. Extensive quantitative and qualitative experiments reflect the effectiveness of our method, as we not only set new state-of-the-art performance on VDCSCORE in a dimension-balanced way but also surpass leading alternatives on human preference by a large margin as depicted by the human evaluation results.
中文摘要:本文提出Cockatiel三阶段训练框架,通过整合合成训练与人类偏好对齐来改进视频精细描述任务,有效解决了现有模型的能力偏差和人类偏好失准问题,在维度平衡性能和人类评估方面均实现了最优表现。
English Summary: This paper introduces Cockatiel, a three-stage training pipeline that enhances Video Detailed Captioning by combining synthetic and human-aligned training to overcome model biases and misalignment with human preferences, achieving state-of-the-art results in both balanced performance and human evaluation.

Authors:Xian Gao, Zongyun Zhang, Ting Liu, Yuzhuo Fu
Title: GoAI: Enhancing AI Students' Learning Paths and Idea Generation via Graph of AI Ideas
Abstract:
With the rapid advancement of artificial intelligence technology, AI students are confronted with a significant "information-to-innovation" gap: they must navigate through the rapidly expanding body of literature, trace the development of a specific research field, and synthesize various techniques into feasible innovative concepts. An additional critical step for students is to identify the necessary prerequisite knowledge and learning paths. Although many approaches based on large language models (LLMs) can summarize the content of papers and trace the development of a field through citations, these methods often overlook the prerequisite knowledge involved in the papers and the rich semantic information embedded in the citation relationships between papers. Such information reveals how methods are interrelated, built upon, extended, or challenged. To address these limitations, we propose GoAI, a tool for constructing educational knowledge graphs from AI research papers that leverages these graphs to plan personalized learning paths and support creative ideation. The nodes in the knowledge graph we have built include papers and the prerequisite knowledge, such as concepts, skills, and tools, that they involve; the edges record the semantic information of citations. When a student queries a specific paper, a beam search-based path search method can trace the current development trends of the field from the queried paper and plan a learning path toward cutting-edge objectives. The integrated Idea Studio guides students to clarify problem statements, compare alternative designs, and provide formative feedback on novelty, clarity, feasibility, and alignment with learning objectives.
中文: GoAI通过构建AI研究论文的教育知识图谱,整合先验知识与引用语义,利用路径搜索方法规划个性化学习路径,并通过创意工坊指导学生实现从知识掌握到创新构思的跨越。
English: GoAI addresses the AI "information-to-innovation" gap by constructing educational knowledge graphs from research papers to map prerequisite knowledge and citation semantics, enabling personalized learning path planning and creative ideation support through its Idea Studio.

Authors:Xian Gao, Jiacheng Ruan, Zongyun Zhang, Jingsheng Gao, Ting Liu, Yuzhuo Fu
Title: ReviewAgents: Bridging the Gap Between Human and AI-Generated Paper Reviews
Abstract:
Academic paper review is a critical yet time-consuming task within the research community. With the increasing volume of academic publications, automating the review process has become a significant challenge. The primary issue lies in generating comprehensive, accurate, and reasoning-consistent review comments that align with human reviewers' judgments. In this paper, we address this challenge by proposing ReviewAgents, a framework that leverages large language models (LLMs) to generate academic paper reviews. We first introduce a novel dataset, Review-CoT, consisting of 142k review comments, designed for training LLM agents. This dataset emulates the structured reasoning process of human reviewers-summarizing the paper, referencing relevant works, identifying strengths and weaknesses, and generating a review conclusion. Building upon this, we train LLM reviewer agents capable of structured reasoning using a relevant-paper-aware training method. Furthermore, we construct ReviewAgents, a multi-role, multi-LLM agent review framework, to enhance the review comment generation process. Additionally, we propose ReviewBench, a benchmark for evaluating the review comments generated by LLMs. Our experimental results on ReviewBench demonstrate that while existing LLMs exhibit a certain degree of potential for automating the review process, there remains a gap when compared to human-generated reviews. Moreover, our ReviewAgents framework further narrows this gap, outperforming advanced LLMs in generating review comments.
Chinese: 本文提出ReviewAgents框架,利用大语言模型自动生成学术论文评审意见,并开发了相关数据集和基准测试,证明其缩小了与人工评审的差距。
English: This paper introduces ReviewAgents, a framework using large language models to automate academic paper reviews, along with a novel dataset and benchmark, showing it narrows the performance gap with human reviewers.

Authors:Runling Long, Yunlong Wang, Jia Wan, Xiang Deng, Xinting Zhu, Weili Guan, Antoni B. Chan, Liqiang Nie
Title: Embodied Crowd Counting
Abstract:
Occlusion is one of the fundamental challenges in crowd counting. In the community, various data-driven approaches have been developed to address this issue, yet their effectiveness is limited. This is mainly because most existing crowd counting datasets on which the methods are trained are based on passive cameras, restricting their ability to fully sense the environment. Recently, embodied navigation methods have shown significant potential in precise object detection in interactive scenes. These methods incorporate active camera settings, holding promise in addressing the fundamental issues in crowd counting. However, most existing methods are designed for indoor navigation, showing unknown performance in analyzing complex object distribution in large scale scenes, such as crowds. Besides, most existing embodied navigation datasets are indoor scenes with limited scale and object quantity, preventing them from being introduced into dense crowd analysis. Based on this, a novel task, Embodied Crowd Counting (ECC), is proposed. We first build up an interactive simulator, Embodied Crowd Counting Dataset (ECCD), which enables large scale scenes and large object quantity. A prior probability distribution that approximates realistic crowd distribution is introduced to generate crowds. Then, a zero-shot navigation method (ZECC) is proposed. This method contains a MLLM driven coarse-to-fine navigation mechanism, enabling active Z-axis exploration, and a normal-line-based crowd distribution analysis method for fine counting. Experimental results against baselines show that the proposed method achieves the best trade-off between counting accuracy and navigation cost.
中文摘要:提出的具身人群计数(ECC)框架通过构建支持大规模场景的交互模拟器和采用主动相机控制的零样本导航方法,在计数精度与导航成本之间实现了最佳平衡。
English Summary: The proposed Embodied Crowd Counting (ECC) framework introduces an interactive simulator with large-scale scenes and a zero-shot navigation method that achieves optimal balance between counting accuracy and navigation efficiency through active camera control.

Authors:Hesen Chen, Junyan Wang, Zhiyu Tan, Hao Li
Title: SARA: Structural and Adversarial Representation Alignment for Training-efficient Diffusion Models
Abstract:
Modern diffusion models encounter a fundamental trade-off between training efficiency and generation quality. While existing representation alignment methods, such as REPA, accelerate convergence through patch-wise alignment, they often fail to capture structural relationships within visual representations and ensure global distribution consistency between pretrained encoders and denoising networks. To address these limitations, we introduce SARA, a hierarchical alignment framework that enforces multi-level representation constraints: (1) patch-wise alignment to preserve local semantic details, (2) autocorrelation matrix alignment to maintain structural consistency within representations, and (3) adversarial distribution alignment to mitigate global representation discrepancies. Unlike previous approaches, SARA explicitly models both intra-representation correlations via self-similarity matrices and inter-distribution coherence via adversarial alignment, enabling comprehensive alignment across local and global scales. Experiments on ImageNet-256 show that SARA achieves an FID of 1.36 while converging twice as fast as REPA, surpassing recent state-of-the-art image generation methods. This work establishes a systematic paradigm for optimizing diffusion training through hierarchical representation alignment.
中文摘要:SARA提出了一种分层对齐框架,通过实施多层级表征约束来加速扩散模型训练并提升图像质量,在ImageNet-256上以更快收敛速度实现了优于现有方法的性能。
English Summary: SARA introduces a hierarchical alignment framework that accelerates diffusion model training while improving image quality by enforcing multi-level representation constraints, achieving superior results on ImageNet-256 with faster convergence than existing methods.

Authors:Xinrui Li, Jianlong Wu, Xinchuan Huang, Chong Chen, Weili Guan, Xian-Sheng Hua, Liqiang Nie
Title: MegaSR: Mining Customized Semantics and Expressive Guidance for Image Super-Resolution
Abstract:
Pioneering text-to-image (T2I) diffusion models have ushered in a new era of real-world image super-resolution (Real-ISR), significantly enhancing the visual perception of reconstructed images. However, existing methods typically integrate uniform abstract textual semantics across all blocks, overlooking the distinct semantic requirements at different depths and the fine-grained, concrete semantics inherently present in the images themselves. Moreover, relying solely on a single type of guidance further disrupts the consistency of reconstruction. To address these issues, we propose MegaSR, a novel framework that mines customized block-wise semantics and expressive guidance for diffusion-based ISR. Compared to uniform textual semantics, MegaSR enables flexible adaptation to multi-granularity semantic awareness by dynamically incorporating image attributes at each block. Furthermore, we experimentally identify HED edge maps, depth maps, and segmentation maps as the most expressive guidance, and propose a multi-stage aggregation strategy to modulate them into the T2I models. Extensive experiments demonstrate the superiority of MegaSR in terms of semantic richness and structural consistency.
中文:MegaSR提出了一种新颖框架,通过定制块级语义并融合多粒度指导来提升图像超分辨率效果,在语义丰富性和结构一致性方面优于现有方法。
English: MegaSR introduces a novel framework that customizes block-wise semantics and integrates multi-granularity guidance to enhance image super-resolution, outperforming existing methods in semantic richness and structural consistency.

Authors:Hao Zhang, Fuhui Zhou, Hongyang Du, Qihui Wu, Chau Yuen
Title: Revolution of Wireless Signal Recognition for 6G: Recent Advances, Challenges and Future Directions
Abstract:
Wireless signal recognition (WSR) is a crucial technique for intelligent communications and spectrum sharing in the next six-generation (6G) wireless communication networks. It can be utilized to enhance network performance and efficiency, improve quality of service (QoS), and improve network security and reliability. Additionally, WSR can be applied for military applications such as signal interception, signal race, and signal abduction. In the past decades, great efforts have been made for the research of WSR. Earlier works mainly focus on model-based methods, including likelihood-based (LB) and feature-based (FB) methods, which have taken the leading position for many years. With the emergence of artificial intelligence (AI), intelligent methods including machine learning-based (ML-based) and deep learning-based (DL-based) methods have been developed to extract the features of the received signals and perform the classification. In this work, we provide a comprehensive review of WSR from the view of applications, main tasks, recent advances, datasets and evaluation metrics, challenges, and future directions. Specifically, intelligent WSR methods are introduced from the perspective of model, data, learning and implementation. Moreover, we analyze the challenges for WSR from the view of complex, dynamic, and open 6G wireless environments and discuss the future directions for WSR. This survey is expected to provide a comprehensive overview of the state-of-the-art WSR techniques and inspire new research directions for WSR in 6G networks.
无线信号识别是6G网络中的关键技术,通过人工智能方法提升性能和安全性,本文全面综述了其进展、挑战和未来方向。
Wireless signal recognition is a key technology for 6G networks, enhancing performance and security through AI-driven methods, as reviewed in this comprehensive survey of advances, challenges, and future directions.

Authors:Joey Wilson, Marcelino Almeida, Sachit Mahajan, Martin Labrie, Maani Ghaffari, Omid Ghasemalizadeh, Min Sun, Cheng-Hao Kuo, Arnab Sen
Title: POp-GS: Next Best View in 3D-Gaussian Splatting with P-Optimality
Abstract:
In this paper, we present a novel algorithm for quantifying uncertainty and information gained within 3D Gaussian Splatting (3D-GS) through P-Optimality. While 3D-GS has proven to be a useful world model with high-quality rasterizations, it does not natively quantify uncertainty or information, posing a challenge for real-world applications such as 3D-GS SLAM. We propose to quantify information gain in 3D-GS by reformulating the problem through the lens of optimal experimental design, which is a classical solution widely used in literature. By restructuring information quantification of 3D-GS through optimal experimental design, we arrive at multiple solutions, of which T-Optimality and D-Optimality perform the best quantitatively and qualitatively as measured on two popular datasets. Additionally, we propose a block diagonal covariance approximation which provides a measure of correlation at the expense of a greater computation cost.
中文: 本文提出了一种利用P-最优性量化3D高斯溅射中不确定性和信息增益的新算法,其中T-最优性和D-最优性在两个数据集上表现最佳,同时引入块对角协方差近似方法以更高计算成本衡量相关性。
English: This paper introduces a novel algorithm using P-Optimality to quantify uncertainty and information gain in 3D Gaussian Splatting, with T-Optimality and D-Optimality showing the best performance on two datasets, alongside a block diagonal covariance approximation for correlation measurement at higher computational cost.

Authors:Jinguang Wang, Jingyu Wang, Haifeng Sun, Tingting Yang, Zirui Zhuang, Wanyi Ning, Yuexi Yin, Qi Qi, Jianxin Liao
Title: MergeQuant: Accurate 4-bit Static Quantization of Large Language Models by Channel-wise Calibration
Abstract:
Quantization has been widely used to compress and accelerate inference of large language models (LLMs). Existing methods focus on exploring the per-token dynamic calibration to ensure both inference acceleration and model accuracy under 4-bit quantization. However, in autoregressive generation inference of long sequences, the overhead of repeated dynamic quantization and dequantization steps becomes considerably expensive. In this work, we propose MergeQuant, an accurate and efficient per-channel static quantization framework. MergeQuant integrates the per-channel quantization steps with the corresponding scalings and linear mappings through a Quantization Step Migration (QSM) method, thereby eliminating the quantization overheads before and after matrix multiplication. Furthermore, in view of the significant differences between the different channel ranges, we propose dimensional reconstruction and adaptive clipping to address the non-uniformity of quantization scale factors and redistribute the channel variations to the subsequent modules to balance the parameter distribution under QSM. Within the static quantization setting of W4A4, MergeQuant reduces the accuracy gap on zero-shot tasks compared to FP16 baseline to 1.3 points on Llama-2-70B model. On Llama-2-7B model, MergeQuant achieves up to 1.77x speedup in decoding, and up to 2.06x speedup in end-to-end compared to FP16 baseline.
中文: MergeQuant是一种高效的逐通道静态量化框架,通过量化步骤迁移方法将量化与线性映射结合,消除了自回归生成中的量化开销,在大语言模型中实现了显著加速且精度损失极小。
English: MergeQuant is an efficient per-channel static quantization framework that eliminates quantization overheads in autoregressive generation by integrating quantization steps with linear mappings, achieving significant speedup and minimal accuracy loss in large language models.

Authors:Xingyi Yang, Constantin Venhoff, Ashkan Khakzar, Christian Schroeder de Witt, Puneet K. Dokania, Adel Bibi, Philip Torr
Title: Mixture of Experts Made Intrinsically Interpretable
Abstract:
Neurons in large language models often exhibit \emph{polysemanticity}, simultaneously encoding multiple unrelated concepts and obscuring interpretability. Instead of relying on post-hoc methods, we present \textbf{MoE-X}, a Mixture-of-Experts (MoE) language model designed to be \emph{intrinsically} interpretable. Our approach is motivated by the observation that, in language models, wider networks with sparse activations are more likely to capture interpretable factors. However, directly training such large sparse networks is computationally prohibitive. MoE architectures offer a scalable alternative by activating only a subset of experts for any given input, inherently aligning with interpretability objectives. In MoE-X, we establish this connection by rewriting the MoE layer as an equivalent sparse, large MLP. This approach enables efficient scaling of the hidden size while maintaining sparsity. To further enhance interpretability, we enforce sparse activation within each expert and redesign the routing mechanism to prioritize experts with the highest activation sparsity. These designs ensure that only the most salient features are routed and processed by the experts. We evaluate MoE-X on chess and natural language tasks, showing that it achieves performance comparable to dense models while significantly improving interpretability. MoE-X achieves a perplexity better than GPT-2, with interpretability surpassing even sparse autoencoder (SAE)-based approaches.
中文摘要:MoE-X是一种专为实现内在可解释性而设计的专家混合语言模型,通过稀疏激活和专家路由机制,在保持与GPT-2等密集模型相当性能的同时,显著提升了可解释性,超越了现有方法。
English Summary: MoE-X is a Mixture-of-Experts language model designed for intrinsic interpretability by leveraging sparse activations and expert routing mechanisms, achieving performance comparable to dense models like GPT-2 while significantly enhancing interpretability beyond existing methods.

Authors:Paul Mangold, Alain Durmus, Aymeric Dieuleveut, Eric Moulines
Title: Scaffold with Stochastic Gradients: New Analysis with Linear Speed-Up
Abstract:
This paper proposes a novel analysis for the Scaffold algorithm, a popular method for dealing with data heterogeneity in federated learning. While its convergence in deterministic settings--where local control variates mitigate client drift--is well established, the impact of stochastic gradient updates on its performance is less understood. To address this problem, we first show that its global parameters and control variates define a Markov chain that converges to a stationary distribution in the Wasserstein distance. Leveraging this result, we prove that Scaffold achieves linear speed-up in the number of clients up to higher-order terms in the step size. Nevertheless, our analysis reveals that Scaffold retains a higher-order bias, similar to FedAvg, that does not decrease as the number of clients increases. This highlights opportunities for developing improved stochastic federated learning algorithms
中文: 本文分析了联邦学习中的Scaffold算法,证明其在客户端数量上实现线性加速,但保留了与FedAvg类似的高阶偏差,揭示了改进随机算法的机遇。
English: This paper analyzes the Scaffold algorithm in federated learning, showing it achieves linear speed-up with client numbers but retains a higher-order bias similar to FedAvg, revealing opportunities for improved stochastic algorithms.

Authors:Jiahui Zhang, Fangneng Zhan, Ling Shao, Shijian Lu
Title: SOGS: Second-Order Anchor for Advanced 3D Gaussian Splatting
Abstract:
Anchor-based 3D Gaussian splatting (3D-GS) exploits anchor features in 3D Gaussian prediction, which has achieved impressive 3D rendering quality with reduced Gaussian redundancy. On the other hand, it often encounters the dilemma among anchor features, model size, and rendering quality - large anchor features lead to large 3D models and high-quality rendering whereas reducing anchor features degrades Gaussian attribute prediction which leads to clear artifacts in the rendered textures and geometries. We design SOGS, an anchor-based 3D-GS technique that introduces second-order anchors to achieve superior rendering quality and reduced anchor features and model size simultaneously. Specifically, SOGS incorporates covariance-based second-order statistics and correlation across feature dimensions to augment features within each anchor, compensating for the reduced feature size and improving rendering quality effectively. In addition, it introduces a selective gradient loss to enhance the optimization of scene textures and scene geometries, leading to high-quality rendering with small anchor features. Extensive experiments over multiple widely adopted benchmarks show that SOGS achieves superior rendering quality in novel view synthesis with clearly reduced model size.
中文: SOGS技术通过引入二阶锚点和选择性梯度损失,利用协方差统计增强特征表达并优化场景细节,在减小模型规模的同时实现了更高质量的3D渲染效果。
English: SOGS introduces second-order anchors and selective gradient loss to enhance 3D Gaussian splatting, achieving superior rendering quality with reduced model size by compensating for smaller anchor features through covariance-based statistics and improved optimization.

Authors:Shuhao Liao, Xuxin Lv, Yuhong Cao, Jeric Lew, Wenjun Wu, Guillaume Sartoretti
Title: HELM: Human-Preferred Exploration with Language Models
Abstract:
In autonomous exploration tasks, robots are required to explore and map unknown environments while efficiently planning in dynamic and uncertain conditions. Given the significant variability of environments, human operators often have specific preference requirements for exploration, such as prioritizing certain areas or optimizing for different aspects of efficiency. However, existing methods struggle to accommodate these human preferences adaptively, often requiring extensive parameter tuning or network retraining. With the recent advancements in Large Language Models (LLMs), which have been widely applied to text-based planning and complex reasoning, their potential for enhancing autonomous exploration is becoming increasingly promising. Motivated by this, we propose an LLM-based human-preferred exploration framework that seamlessly integrates a mobile robot system with LLMs. By leveraging the reasoning and adaptability of LLMs, our approach enables intuitive and flexible preference control through natural language while maintaining a task success rate comparable to state-of-the-art traditional methods. Experimental results demonstrate that our framework effectively bridges the gap between human intent and policy preference in autonomous exploration, offering a more user-friendly and adaptable solution for real-world robotic applications.
中文: 本文提出了一种基于大语言模型的框架,通过自然语言将人类偏好融入自主机器人探索中,实现了灵活控制,同时保持了与传统方法相当的高任务成功率。
English: This paper introduces an LLM-based framework that integrates human preferences through natural language for autonomous robot exploration, enabling flexible control while maintaining high task success rates comparable to traditional methods.

Authors:Stylianos Zindros, Christos Chronis, Panagiotis Radoglou-Grammatikis, Vasileios Argyriou, Panagiotis Sarigiannidis, Iraklis Varlamis, Georgios Th. Papadopoulos
Title: Public space security management using digital twin technologies
Abstract:
As the security of public spaces remains a critical issue in today's world, Digital Twin technologies have emerged in recent years as a promising solution for detecting and predicting potential future threats. The applied methodology leverages a Digital Twin of a metro station in Athens, Greece, using the FlexSim simulation software. The model encompasses points of interest and passenger flows, and sets their corresponding parameters. These elements influence and allow the model to provide reasonable predictions on the security management of the station under various scenarios. Experimental tests are conducted with different configurations of surveillance cameras and optimizations of camera angles to evaluate the effectiveness of the space surveillance setup. The results show that the strategic positioning of surveillance cameras and the adjustment of their angles significantly improves the detection of suspicious behaviors and with the use of the DT it is possible to evaluate different scenarios and find the optimal camera setup for each case. In summary, this study highlights the value of Digital Twins in real-time simulation and data-driven security management. The proposed approach contributes to the ongoing development of smart security solutions for public spaces and provides an innovative framework for threat detection and prevention.
中文: 通过使用FlexSim软件对雅典地铁站进行数字孪生模拟,研究表明优化监控摄像头布局和角度可显著提升可疑行为检测能力,为公共空间安全管理提供创新解决方案。
English: Digital Twin technology, demonstrated through a simulated Athens metro station using FlexSim, effectively optimizes surveillance camera placement and angles to enhance threat detection and security management in public spaces.

Authors:Xiang Liu, Zhaoxiang Liu, Huan Hu, Zezhou Chen, Kohou Wang, Kai Wang, Shiguo Lian
Title: A Multimodal Benchmark Dataset and Model for Crop Disease Diagnosis
Abstract:
While conversational generative AI has shown considerable potential in enhancing decision-making for agricultural professionals, its exploration has predominantly been anchored in text-based interactions. The evolution of multimodal conversational AI, leveraging vast amounts of image-text data from diverse sources, marks a significant stride forward. However, the application of such advanced vision-language models in the agricultural domain, particularly for crop disease diagnosis, remains underexplored. In this work, we present the crop disease domain multimodal (CDDM) dataset, a pioneering resource designed to advance the field of agricultural research through the application of multimodal learning techniques. The dataset comprises 137,000 images of various crop diseases, accompanied by 1 million question-answer pairs that span a broad spectrum of agricultural knowledge, from disease identification to management practices. By integrating visual and textual data, CDDM facilitates the development of sophisticated question-answering systems capable of providing precise, useful advice to farmers and agricultural professionals. We demonstrate the utility of the dataset by finetuning state-of-the-art multimodal models, showcasing significant improvements in crop disease diagnosis. Specifically, we employed a novel finetuning strategy that utilizes low-rank adaptation (LoRA) to finetune the visual encoder, adapter and language model simultaneously. Our contributions include not only the dataset but also a finetuning strategy and a benchmark to stimulate further research in agricultural technology, aiming to bridge the gap between advanced AI techniques and practical agricultural applications. The dataset is available at https: //github.com/UnicomAI/UnicomBenchmark/tree/main/CDDMBench.
中文摘要:本研究推出开创性的CDDM多模态数据集,包含13.7万张作物病害图像和100万组问答对,通过LoRA等先进AI微调技术显著提升了作物病害诊断能力。
English Summary: This study introduces the CDDM dataset, a pioneering multimodal resource with 137,000 crop disease images and 1 million Q&A pairs, demonstrating enhanced disease diagnosis through advanced AI techniques like LoRA fine-tuning.

Authors:Xian Gao, Jiacheng Ruan, Jingsheng Gao, Mingye Xie, Zongyun Zhang, Ting Liu, Yuzhuo Fu
Title: From Motion Signals to Insights: A Unified Framework for Student Behavior Analysis and Feedback in Physical Education Classes
Abstract:
Analyzing student behavior in educational scenarios is crucial for enhancing teaching quality and student engagement. Existing AI-based models often rely on classroom video footage to identify and analyze student behavior. While these video-based methods can partially capture and analyze student actions, they struggle to accurately track each student's actions in physical education classes, which take place in outdoor, open spaces with diverse activities, and are challenging to generalize to the specialized technical movements involved in these settings. Furthermore, current methods typically lack the ability to integrate specialized pedagogical knowledge, limiting their ability to provide in-depth insights into student behavior and offer feedback for optimizing instructional design. To address these limitations, we propose a unified end-to-end framework that leverages human activity recognition technologies based on motion signals, combined with advanced large language models, to conduct more detailed analyses and feedback of student behavior in physical education classes. Our framework begins with the teacher's instructional designs and the motion signals from students during physical education sessions, ultimately generating automated reports with teaching insights and suggestions for improving both learning and class instructions. This solution provides a motion signal-based approach for analyzing student behavior and optimizing instructional design tailored to physical education classes. Experimental results demonstrate that our framework can accurately identify student behaviors and produce meaningful pedagogical insights.
中文摘要:本文提出了一种结合运动信号活动识别与大语言模型的统一框架,用于分析体育课学生行为,通过生成自动化教学洞察和教学建议,克服了视频分析在户外体育场景中的局限性。
English Summary: This paper introduces a unified framework that combines motion-based activity recognition with large language models to analyze student behavior in physical education, overcoming limitations of video-based methods by generating automated teaching insights and instructional suggestions.

Authors:Yanjun Chen, Yirong Sun, Xinghao Chen, Jian Wang, Xiaoyu Shen, Wenjie Li, Wei Zhang
Title: Integrating Chain-of-Thought for Multimodal Alignment: A Study on 3D Vision-Language Learning
Abstract:
Chain-of-Thought (CoT) reasoning has proven effective in natural language tasks but remains underexplored in multimodal alignment. This study investigates its integration into 3D vision-language learning by embedding structured reasoning into alignment training. We introduce the 3D-CoT Benchmark, a dataset with hierarchical CoT annotations covering shape recognition, functional inference, and causal reasoning. Through controlled experiments, we compare CoT-structured and standard textual annotations across large reasoning models (LRMs) and large language models (LLMs). Our evaluation employs a dual-layer framework assessing both intermediate reasoning and final inference quality. Extensive experiments demonstrate that CoT significantly improves 3D semantic grounding, with LRMs leveraging CoT more effectively than LLMs. Furthermore, we highlight that annotation structure influences performance-explicit reasoning markers aid LLMs, while unmarked CoT better aligns with LRM inference patterns. Our analyses suggest that CoT is crucial for enhancing multimodal reasoning, with implications beyond 3D tasks. The dataset will be publicly available at https://huggingface.co/datasets/Battam/3D-CoT
中文: 本研究通过引入3D-CoT基准将思维链推理融入三维视觉语言对齐,证明结构化标注能显著提升多模态推理能力,其中大型推理模型比语言模型更能有效利用这种增强机制。
English: This study introduces the 3D-CoT Benchmark to integrate Chain-of-Thought reasoning into 3D vision-language alignment, demonstrating that structured annotations significantly enhance multimodal reasoning, particularly benefiting large reasoning models more effectively than language models.

Authors:Yubin Wang, Xinyang Jiang, De Cheng, Xiangqian Zhao, Zilong Wang, Dongsheng Li, Cairong Zhao
Title: Exploring Interpretability for Visual Prompt Tuning with Hierarchical Concepts
Abstract:
Visual prompt tuning offers significant advantages for adapting pre-trained visual foundation models to specific tasks. However, current research provides limited insight into the interpretability of this approach, which is essential for enhancing AI reliability and enabling AI-driven knowledge discovery. In this paper, rather than learning abstract prompt embeddings, we propose the first framework, named Interpretable Visual Prompt Tuning (IVPT), to explore interpretability for visual prompts, by introducing hierarchical concept prototypes. Specifically, visual prompts are linked to human-understandable semantic concepts, represented as a set of category-agnostic prototypes, each corresponding to a specific region of the image. Then, IVPT aggregates features from these regions to generate interpretable prompts, which are structured hierarchically to explain visual prompts at different granularities. Comprehensive qualitative and quantitative evaluations on fine-grained classification benchmarks show its superior interpretability and performance over conventional visual prompt tuning methods and existing interpretable methods.
Chinese: 提出的可解释视觉提示调优(IVPT)框架通过将提示与对应图像区域的层次化概念原型相关联,显著提升了视觉提示的可解释性,在评估中展现出优于现有方法的性能和清晰度。
English: The proposed Interpretable Visual Prompt Tuning (IVPT) framework enhances visual prompt interpretability by linking prompts to hierarchical concept prototypes corresponding to image regions, demonstrating superior performance and clarity over existing methods in evaluations.

Authors:Xinjie Liu, Cyrus Neary, Kushagra Gupta, Christian Ellis, Ufuk Topcu, David Fridovich-Keil
Title: Multi-Fidelity Policy Gradient Algorithms
Abstract:
Many reinforcement learning (RL) algorithms require large amounts of data, prohibiting their use in applications where frequent interactions with operational systems are infeasible, or high-fidelity simulations are expensive or unavailable. Meanwhile, low-fidelity simulators--such as reduced-order models, heuristic reward functions, or generative world models--can cheaply provide useful data for RL training, even if they are too coarse for direct sim-to-real transfer. We propose multi-fidelity policy gradients (MFPGs), an RL framework that mixes a small amount of data from the target environment with a large volume of low-fidelity simulation data to form unbiased, reduced-variance estimators (control variates) for on-policy policy gradients. We instantiate the framework by developing multi-fidelity variants of two policy gradient algorithms: REINFORCE and proximal policy optimization. Experimental results across a suite of simulated robotics benchmark problems demonstrate that when target-environment samples are limited, MFPG achieves up to 3.9x higher reward and improves training stability when compared to baselines that only use high-fidelity data. Moreover, even when the baselines are given more high-fidelity samples--up to 10x as many interactions with the target environment--MFPG continues to match or outperform them. Finally, we observe that MFPG is capable of training effective policies even when the low-fidelity environment is drastically different from the target environment. MFPG thus not only offers a novel paradigm for efficient sim-to-real transfer but also provides a principled approach to managing the trade-off between policy performance and data collection costs.
中文: 多保真度策略梯度(MFPG)框架通过将少量高保真度数据与大量低保真度模拟数据相结合,构建了无偏且方差缩减的估计器,在加速策略收敛的同时,对不同动态差异保持了卓越的鲁棒性。
English: The proposed multi-fidelity policy gradients (MFPG) framework efficiently combines limited high-fidelity data with abundant low-fidelity simulation data to create an unbiased, variance-reduced estimator that accelerates policy convergence while maintaining robustness across varying dynamics gaps.

Authors:Xinjie Liu, Cyrus Neary, Kushagra Gupta, Wesley A. Suttle, Christian Ellis, Ufuk Topcu, David Fridovich-Keil
Title: A Multi-Fidelity Control Variate Approach for Policy Gradient Estimation
Abstract:
Many reinforcement learning (RL) algorithms are impractical for deployment in operational systems or for training with computationally expensive high-fidelity simulations, as they require large amounts of data. Meanwhile, low-fidelity simulators -- such as reduced-order models, heuristic rewards, or generative world models -- can cheaply provide useful data for RL training, even if they are too coarse for zero-shot transfer. We propose multi-fidelity policy gradients (MFPGs), an RL framework that mixes a small amount of data from the target environment with a control variate formed from a large volume of low-fidelity simulation data to construct an unbiased, variance-reduced estimator for on-policy policy gradients. We instantiate the framework with a multi-fidelity variant of the classical REINFORCE algorithm. We show that under standard assumptions, the MFPG estimator guarantees asymptotic convergence of REINFORCE to locally optimal policies in the target environment, and achieves faster finite-sample convergence rates compared to training with high-fidelity data alone. Empirically, we evaluate the MFPG algorithm across a suite of simulated robotics benchmark tasks with limited high-fidelity data but abundant off-dynamics, low-fidelity data. With mild-moderate dynamics gaps, MFPG reliably improves the median performance over a high-fidelity-only baseline, matching the performance of leading multi-fidelity baselines despite its simplicity and minimal tuning overhead. Under large dynamics gaps, MFPG demonstrates the strongest robustness among the evaluated multi-fidelity approaches. An additional experiment shows that MFPG can remain effective even under low-fidelity reward misspecification. Thus, MFPG not only offers a novel paradigm for efficient sim-to-real transfer but also provides a principled approach to managing the trade-off between policy performance and data collection costs.
中文: 多保真度策略梯度(MFPG)框架通过将少量高保真度数据与大量低保真度模拟数据相结合,构建了无偏且方差缩减的估计器,在加速策略收敛的同时,对不同动态差异保持了卓越的鲁棒性。
English: The proposed multi-fidelity policy gradients (MFPG) framework efficiently combines limited high-fidelity data with abundant low-fidelity simulation data to create an unbiased, variance-reduced estimator that accelerates policy convergence while maintaining robustness across varying dynamics gaps.

Authors:Zheng Zhou, Zhe Li, Bo Yu, Lina Hu, Liang Dong, Zijian Yang, Xiaoli Liu, Ning Xu, Ziwei Wang, Yonghao Dang, Jianqin Yin
Title: GaussianCAD: Robust Self-Supervised CAD Reconstruction from Three Orthographic Views Using 3D Gaussian Splatting
Abstract:
The automatic reconstruction of 3D computer-aided design (CAD) models from CAD sketches has recently gained significant attention in the computer vision community. Most existing methods, however, rely on vector CAD sketches and 3D ground truth for supervision, which are often difficult to be obtained in industrial applications and are sensitive to noise inputs. We propose viewing CAD reconstruction as a specific instance of sparse-view 3D reconstruction to overcome these limitations. While this reformulation offers a promising perspective, existing 3D reconstruction methods typically require natural images and corresponding camera poses as inputs, which introduces two major significant challenges: (1) modality discrepancy between CAD sketches and natural images, and (2) difficulty of accurate camera pose estimation for CAD sketches. To solve these issues, we first transform the CAD sketches into representations resembling natural images and extract corresponding masks. Next, we manually calculate the camera poses for the orthographic views to ensure accurate alignment within the 3D coordinate system. Finally, we employ a customized sparse-view 3D reconstruction method to achieve high-quality reconstructions from aligned orthographic views. By leveraging raster CAD sketches for self-supervision, our approach eliminates the reliance on vector CAD sketches and 3D ground truth. Experiments on the Sub-Fusion360 dataset demonstrate that our proposed method significantly outperforms previous approaches in CAD reconstruction performance and exhibits strong robustness to noisy inputs.
中文: 本研究将CAD模型重构重新定义为稀疏视图三维重建任务,通过将栅格草图转换为类自然图像表示并手动计算相机姿态,消除了对矢量草图和三维真值的依赖,实现了卓越的重建性能和噪声鲁棒性。
English: This study reframes CAD model reconstruction as a sparse-view 3D reconstruction task, converting raster sketches into natural image-like representations with manually calculated camera poses to eliminate dependency on vector sketches and 3D ground truth, achieving superior performance and noise robustness.

Authors:Feng Jiang, Zhiyu Lin, Fan Bu, Yuhao Du, Benyou Wang, Haizhou Li
Title: S2S-Arena, Evaluating Speech2Speech Protocols on Instruction Following with Paralinguistic Information
Abstract:
The rapid development of large language models (LLMs) has brought significant attention to speech models, particularly recent progress in speech2speech protocols supporting speech input and output. However, the existing benchmarks adopt automatic text-based evaluators for evaluating the instruction following ability of these models lack consideration for paralinguistic information in both speech understanding and generation. To address these issues, we introduce S2S-Arena, a novel arena-style S2S benchmark that evaluates instruction-following capabilities with paralinguistic information in both speech-in and speech-out across real-world tasks. We design 154 samples that fused TTS and live recordings in four domains with 21 tasks and manually evaluate existing popular speech models in an arena-style manner. The experimental results show that: (1) in addition to the superior performance of GPT-4o, the speech model of cascaded ASR, LLM, and TTS outperforms the jointly trained model after text-speech alignment in speech2speech protocols; (2) considering paralinguistic information, the knowledgeability of the speech model mainly depends on the LLM backbone, and the multilingual support of that is limited by the speech module; (3) excellent speech models can already understand the paralinguistic information in speech input, but generating appropriate audio with paralinguistic information is still a challenge.
中文摘要:该研究推出S2S-Arena新基准,评估语音模型在现实任务中处理副语言信息的能力,发现模型虽能理解此类信息,但生成合适音频仍具挑战性,且级联系统优于联合训练模型。
English Summary: The study introduces S2S-Arena, a new benchmark evaluating speech models' ability to handle paralinguistic information in real-world tasks, revealing that while models can understand such cues, generating appropriate audio remains challenging and cascaded systems outperform joint models.

Authors:Benyamin Jamialahmadi, Parsa Kavehzadeh, Mehdi Rezagholizadeh, Parsa Farinneya, Hossein Rajabzadeh, Aref Jafari, Boxing Chen, Marzieh S. Tahaei
Title: Balcony: A Lightweight Approach to Dynamic Inference of Generative Language Models
Abstract:
Deploying large language models (LLMs) in real-world applications is often hindered by strict computational and latency constraints. While dynamic inference offers the flexibility to adjust model behavior based on varying resource budgets, existing methods are frequently limited by hardware inefficiencies or performance degradation. In this paper, we introduce Balcony, a simple yet highly effective framework for depth-based dynamic inference. By freezing the pretrained LLM and inserting additional transformer layers at selected exit points, Balcony maintains the full model's performance while enabling real-time adaptation to different computational budgets. These additional layers are trained using a straightforward self-distillation loss, aligning the sub-model outputs with those of the full model. This approach requires significantly fewer training tokens and tunable parameters, drastically reducing computational costs compared to prior methods. When applied to the LLaMA3-8B model, using only 0.2% of the original pretraining data, Balcony achieves minimal performance degradation while enabling significant speedups. Remarkably, we show that Balcony outperforms state-of-the-art methods such as Flextron and Layerskip as well as other leading compression techniques on multiple models and at various scales, across a variety of benchmarks.
中文:Balcony是一种轻量级框架,通过在预训练大语言模型的退出点插入可训练层,使其能够灵活适应不同计算资源,以极低的性能损失实现高效推理,并显著优于现有先进方法。
English: Balcony is a lightweight framework that enhances large language models' adaptability to computational constraints by inserting trainable layers at exit points, achieving high efficiency with minimal performance loss and outperforming existing methods.

Authors:Mengdi Wang, Efe Bozkir, Enkelejda Kasneci
Title: Iris Style Transfer: Enhancing Iris Recognition with Style Features and Privacy Preservation through Neural Style Transfer
Abstract:
Iris texture is widely regarded as a gold standard biometric modality for authentication and identification. The demand for robust iris recognition methods, coupled with growing security and privacy concerns regarding iris attacks, has escalated recently. Inspired by neural style transfer, an advanced technique that leverages neural networks to separate content and style features, we hypothesize that iris texture's style features provide a reliable foundation for recognition and are more resilient to variations like rotation and perspective shifts than traditional approaches. Our experimental results support this hypothesis, showing a significantly higher classification accuracy compared to conventional features. Further, we propose using neural style transfer to obfuscate the identifiable iris style features, ensuring the protection of sensitive biometric information while maintaining the utility of eye images for tasks like eye segmentation and gaze estimation. This work opens new avenues for iris-oriented, secure, and privacy-aware biometric systems.
中文: 该研究通过神经风格迁移提取虹膜纹理的风格特征,不仅提升了识别精度和对变化的鲁棒性,还能在保护敏感生物信息隐私的同时,维持图像在眼部分割和视线估计等任务中的实用性。
English: The study demonstrates that using neural style transfer to extract style features from iris textures enhances recognition accuracy and resilience to variations, while also enabling privacy protection by obfuscating sensitive biometric data without compromising utility for other tasks.

Authors:Zhanhong Jiang, Md Zahid Hasan, Aditya Balu, Joshua R. Waite, Genyi Huang, Soumik Sarkar
Title: FUSE: First-Order and Second-Order Unified SynthEsis in Stochastic Optimization
Abstract:
Stochastic optimization methods have actively been playing a critical role in modern machine learning algorithms to deliver decent performance. While numerous works have proposed and developed diverse approaches, first-order and second-order methods are in entirely different situations. The former is significantly pivotal and dominating in emerging deep learning but only leads convergence to a stationary point. However, second-order methods are less popular due to their computational intensity in large-dimensional problems. This paper presents a novel method that leverages both the first-order and second-order methods in a unified algorithmic framework, termed FUSE, from which a practical version (PV) is derived accordingly. FUSE-PV stands as a simple yet efficient optimization method involving a switch-over between first and second orders. Additionally, we develop different criteria that determine when to switch. FUSE-PV has provably shown a smaller computational complexity than SGD and Adam. To validate our proposed scheme, we present an ablation study on several simple test functions and show a comparison with baselines for benchmark datasets.
中文: 本文提出FUSE-PV优化方法,通过智能切换一阶和二阶技术,在保持实用性能的同时,相比SGD和Adam展现出更优的计算效率。
English: This paper introduces FUSE-PV, a novel optimization method that intelligently switches between first-order and second-order techniques, demonstrating superior computational efficiency compared to SGD and Adam while maintaining practical performance.

Authors:Wenqiao Li, Yao Gu, Xintao Chen, Xiaohao Xu, Ming Hu, Xiaonan Huang, Yingna Wu
Title: Towards Visual Discrimination and Reasoning of Real-World Physical Dynamics: Physics-Grounded Anomaly Detection
Abstract:
Humans detect real-world object anomalies by perceiving, interacting, and reasoning based on object-conditioned physical knowledge. The long-term goal of Industrial Anomaly Detection (IAD) is to enable machines to autonomously replicate this skill. However, current IAD algorithms are largely developed and tested on static, semantically simple datasets, which diverge from real-world scenarios where physical understanding and reasoning are essential. To bridge this gap, we introduce the Physics Anomaly Detection (Phys-AD) dataset, the first large-scale, real-world, physics-grounded video dataset for industrial anomaly detection. Collected using a real robot arm and motor, Phys-AD provides a diverse set of dynamic, semantically rich scenarios. The dataset includes more than 6400 videos across 22 real-world object categories, interacting with robot arms and motors, and exhibits 47 types of anomalies. Anomaly detection in Phys-AD requires visual reasoning, combining both physical knowledge and video content to determine object abnormality. We benchmark state-of-the-art anomaly detection methods under three settings: unsupervised AD, weakly-supervised AD, and video-understanding AD, highlighting their limitations in handling physics-grounded anomalies. Additionally, we introduce the Physics Anomaly Explanation (PAEval) metric, designed to assess the ability of visual-language foundation models to not only detect anomalies but also provide accurate explanations for their underlying physical causes. Our project is available at https://guyao2023.github.io/Phys-AD/.
Chinese Summary: 人类通过感知、交互和基于物体条件的物理知识推理来检测现实世界中的物体异常,而当前工业异常检测方法因依赖静态数据集难以应对真实场景,为此我们推出了Phys-AD数据集,通过基于物理的视频数据填补这一空白。
English Summary: Humans detect object anomalies through perception, interaction, and physical reasoning, while current industrial anomaly detection methods fall short in real-world scenarios due to their reliance on static datasets, prompting the introduction of the Phys-AD dataset to bridge this gap with physics-grounded video data.

Authors:Raula Gaikovina Kula, Christoph Treude
Title: The Shift from Writing to Pruning Software: A Bonsai-Inspired IDE for Reshaping AI Generated Code
Abstract:
The rise of AI-driven coding assistants signals a fundamental shift in how software is built. While AI coding assistants have been integrated into existing Integrated Development Environments (IDEs), their full potential remains largely untapped. A key challenge is that these AI assistants can suffer from hallucinations, leading developers down decision paths that the AI should not dictate, sometimes even without the users awareness or consent. Moreover, current static-file IDEs lack the mechanisms to address critical issues such as tracking the provenance of AI-generated code and integrating version control in a way that aligns with the dynamic nature of AI-assisted development. As a result, developers are left without the necessary tools to manage, refine, and validate AI generated code systematically, making it difficult to ensure correctness, maintainability, and trust in the development process. Existing IDEs treat AI-generated code as static text, offering limited support for managing its evolution, refinement, or multiple alternative paths. Drawing inspiration from the ancient art of Japanese Bonsai gardening focused on balance, structure, and deliberate pruning: we propose a new approach to IDEs, where AI is allowed to generate in its true, unconstrained form, free from traditional file structures. This approach fosters a more fluid and interactive method for code evolution. We introduce the concept of a Bonsai-inspired IDE, structured as a graph of generated code snippets and multiple code paths, enabling developers to reshape AI generated code to suit their needs. Our vision calls for a shift away from a static file based model toward a dynamic, evolving system that allows for continuous refinement of generated code, with the IDE evolving alongside AI powered modifications rather than merely serving as a place to write and edit code.
中文: AI驱动的编程助手正在改变软件开发方式,但面临幻觉和版本控制缺失等挑战,因此提出一种受盆景艺术启发的动态图形结构IDE,以实现流畅的代码演进与优化。
English: AI-driven coding assistants are transforming software development but face challenges like hallucinations and lack of version control, prompting a Bonsai-inspired IDE that uses a dynamic graph structure for fluid code evolution and refinement.

Authors:Raula Gaikovina Kula, Brittany Anne Reid, Christoph Treude
Title: Open Source at a Crossroads: The Future of Licensing Driven by Monetization
Abstract:
The widespread adoption of open source libraries and frameworks can be attributed to their licensing. Open Source Software Licenses (OSS licenses) ensure that software can be sold or distributed as part of aggregate programs from various sources without requiring a royalty or fee. The quality of such code rivals that of commercial software, with open source libraries forming large parts of the supply chain for critical commercial systems in industry. Despite this, most open source projects rely on volunteer contributions, and unpaid library maintainers face significant pressure to sustain their projects. One potential solution for these projects is to change their licensing to ensure that maintainers are compensated accordingly for their work. In this paper, we explore the potential of licensing to help alleviate funding issues, with a review of three different cases where OSS licenses were modified to allow for monetization. In addition, we explore licensing concerns related to the emergence of the use of artificial intelligence (AI) in software development. We argue that open source is at a crossroads, with a growing need to redefine its licensing models and support communities and critical software. We identify specific research opportunities and conclude with a research agenda comprising a series of research questions to guide future studies in this area.
中文: 开源软件的广泛采用得益于其允许免费分发和高质量代码的许可协议,然而依赖志愿者的项目面临可持续性挑战,可通过修订许可模式确保合理报酬并适应人工智能技术发展来解决。
English: Open source software's broad adoption stems from its licensing that allows free distribution and high-quality code, yet volunteer-dependent projects face sustainability challenges which could be addressed through revised licensing models for fair compensation and adaptation to emerging AI technologies.

Authors:Artemis Stefanidou, Panagiotis Radoglou-Grammatikis, Vasileios Argyriou, Panagiotis Sarigiannidis, Iraklis Varlamis, Georgios Th. Papadopoulos
Title: State of play and future directions in industrial computer vision AI standards
Abstract:
The recent tremendous advancements in the areas of Artificial Intelligence (AI) and Deep Learning (DL) have also resulted into corresponding remarkable progress in the field of Computer Vision (CV), showcasing robust technological solutions in a wide range of application sectors of high industrial interest (e.g., healthcare, autonomous driving, automation, etc.). Despite the outstanding performance of CV systems in specific domains, their development and exploitation at industrial-scale necessitates, among other, the addressing of requirements related to the reliability, transparency, trustworthiness, security, safety, and robustness of the developed AI models. The latter raises the imperative need for the development of efficient, comprehensive and widely-adopted industrial standards. In this context, this study investigates the current state of play regarding the development of industrial computer vision AI standards, emphasizing on critical aspects, like model interpretability, data quality, and regulatory compliance. In particular, a systematic analysis of launched and currently developing CV standards, proposed by the main international standardization bodies (e.g. ISO/IEC, IEEE, DIN, etc.) is performed. The latter is complemented by a comprehensive discussion on the current challenges and future directions observed in this regularization endeavor.
中文: 人工智能和深度学习的飞速发展极大推动了计算机视觉的进步,但其工业级应用需解决可靠性、透明度及安全性等问题,亟需建立涵盖模型可解释性与数据质量的标准化体系。
English: Recent AI and deep learning advances have significantly boosted computer vision, yet industrial-scale deployment demands addressing reliability, transparency, and security, necessitating standardized frameworks for model interpretability and data quality.

Authors:Changhe Chen, Xiaohao Xu, Xiangdong Wang, Xiaonan Huang
Title: Large Language Models as Natural Selector for Embodied Soft Robot Design
Abstract:
Designing soft robots is a complex and iterative process that demands cross-disciplinary expertise in materials science, mechanics, and control, often relying on intuition and extensive experimentation. While Large Language Models (LLMs) have demonstrated impressive reasoning abilities, their capacity to learn and apply embodied design principles--crucial for creating functional robotic systems--remains largely unexplored. This paper introduces RoboCrafter-QA, a novel benchmark to evaluate whether LLMs can learn representations of soft robot designs that effectively bridge the gap between high-level task descriptions and low-level morphological and material choices. RoboCrafter-QA leverages the EvoGym simulator to generate a diverse set of soft robot design challenges, spanning robotic locomotion, manipulation, and balancing tasks. Our experiments with state-of-the-art multi-modal LLMs reveal that while these models exhibit promising capabilities in learning design representations, they struggle with fine-grained distinctions between designs with subtle performance differences. We further demonstrate the practical utility of LLMs for robot design initialization. Our code and benchmark will be available to encourage the community to foster this exciting research direction.
中文摘要:本文提出RoboCrafter-QA基准,用于评估大语言模型学习软体机器人设计原理的能力,发现其虽有潜力但难以区分细微的设计差异。
English Summary: This paper introduces RoboCrafter-QA, a benchmark to assess if Large Language Models can learn soft robot design principles, finding they show promise but struggle with fine-grained design distinctions.

Authors:Keyu Duan, Yiran Zhao, Zhili Feng, Jinjie Ni, Tianyu Pang, Qian Liu, Tianle Cai, Longxu Dou, Kenji Kawaguchi, Anirudh Goyal, J. Zico Kolter, Michael Qizhe Shieh
Title: Unnatural Languages Are Not Bugs but Features for LLMs
Abstract:
Large Language Models (LLMs) have been observed to process non-human-readable text sequences, such as jailbreak prompts, often viewed as a bug for aligned LLMs. In this work, we present a systematic investigation challenging this perception, demonstrating that unnatural languages - strings that appear incomprehensible to humans but maintain semantic meanings for LLMs - contain latent features usable by models. Notably, unnatural languages possess latent features that can be generalized across different models and tasks during inference. Furthermore, models fine-tuned on unnatural versions of instruction datasets perform on-par with those trained on natural language, achieving 49.71 win rates in Length-controlled AlpacaEval 2.0 in average across various base models. In addition, through comprehensive analysis, we demonstrate that LLMs process unnatural languages by filtering noise and inferring contextual meaning from filtered words.
中文摘要:本研究重新定义了大语言模型对非自然语言的处理能力,证明这些人类难以理解的文本对模型具有语义价值,通过去噪和上下文推断机制,能在多种任务中达到与自然语言训练相当的性能表现。
English Summary: This study reframes the processing of unnatural languages by Large Language Models (LLMs) from a vulnerability to a functional capability, showing these human-incomprehensible texts retain semantic meaning for models and can achieve performance comparable to natural language training across diverse tasks.

Authors:Yiming Wang, Pei Zhang, Siyuan Huang, Baosong Yang, Zhuosheng Zhang, Fei Huang, Rui Wang
Title: Sampling-Efficient Test-Time Scaling: Self-Estimating the Best-of-N Sampling in Early Decoding
Abstract:
Test-time scaling improves large language model performance by adding extra compute during decoding. Best-of-N (BoN) sampling serves as a common scaling technique, broadening the search space for finding better solutions from the model distribution. However, traditional BoN requires N full generations, leading to high GPU memory overhead and time latency. Moreover, some methods depend on reward models, adding computational cost and limiting domain generalization. In this paper, we propose Self-Truncation Best-of-N (ST-BoN), a novel decoding method that avoids fully generating all samplings and eliminates the need for reward models. ST-BoN introduces early sampling consistency to estimate the most promising sample, truncating suboptimal ones to free memory and accelerate inference. This pushes the sampling-efficient test-time scaling. Compared to traditional BoN, ST-BoN can reduce dynamic GPU memory overhead by over 90% and time latency by 50%, while achieving comparable or even better performance across reasoning and open-ended domains.
中文摘要:提出的自截断最佳N采样(ST-BoN)方法通过在解码过程中提前截断次优样本,在无需奖励模型的情况下,将动态GPU内存开销降低90%以上、时间延迟减少50%,同时实现与传统最佳N采样相当甚至更优的性能表现。
English Summary: The proposed Self-Truncation Best-of-N (ST-BoN) method significantly reduces GPU memory usage and time latency by early truncation of suboptimal samples during decoding, achieving comparable or superior performance to traditional Best-of-N sampling without requiring reward models.

Authors:Yuhan Jing, Jingyu Wang, Lei Zhang, Haifeng Sun, Bo He, Zirui Zhuang, Chengsen Wang, Qi Qi, Jianxin Liao
Title: OIPR: Evaluation for Time-series Anomaly Detection Inspired by Operator Interest
Abstract:
With the growing adoption of time-series anomaly detection (TAD) technology, numerous studies have employed deep learning-based detectors for analyzing time-series data in the fields of Internet services, industrial systems, and sensors. The selection and optimization of anomaly detectors strongly rely on the availability of an effective performance evaluation method for TAD. Since anomalies in time-series data often manifest as a sequence of points, conventional metrics that solely consider the detection of individual point are inadequate. Existing evaluation methods for TAD typically employ point-based or event-based metrics to capture the temporal context. However, point-based metrics tend to overestimate detectors that excel only in detecting long anomalies, while event-based metrics are susceptible to being misled by fragmented detection results. To address these limitations, we propose OIPR, a novel set of TAD evaluation metrics. It models the process of operators receiving detector alarms and handling faults, utilizing area under the operator interest curve to evaluate the performance of TAD algorithms. Furthermore, we build a special scenario dataset to compare the characteristics of different evaluation methods. Through experiments conducted on the special scenario dataset and five real-world datasets, we demonstrate the remarkable performance of OIPR in extreme and complex scenarios. It achieves a balance between point and event perspectives, overcoming their primary limitations and offering applicability to broader situations.
中文摘要:OIPR评估指标通过模拟操作员接收警报和处理故障的过程,利用操作员兴趣曲线下面积来平衡点基和事件基评估方法的局限性,为时间序列异常检测提供更全面的性能评估。
English Summary: The OIPR evaluation metric is introduced to overcome the limitations of traditional point-based and event-based time-series anomaly detection assessments by modeling operator alarm handling processes and utilizing the area under the operator interest curve for balanced performance evaluation.

Authors:Kota Nakamura, Koki Kawabata, Shungo Tanaka, Yasuko Matsubara, Yasushi Sakurai
Title: CyberCScope: Mining Skewed Tensor Streams and Online Anomaly Detection in Cybersecurity Systems
Abstract:
Cybersecurity systems are continuously producing a huge number of time-stamped events in the form of high-order tensors, such as {count; time, port, flow duration, packet size, . . . }, and so how can we detect anomalies/intrusions in real time? How can we identify multiple types of intrusions and capture their characteristic behaviors? The tensor data consists of categorical and continuous attributes and the data distributions of continuous attributes typically exhibit skew. These data properties require handling skewed infinite and finite dimensional spaces simultaneously. In this paper, we propose a novel streaming method, namely CyberCScope. The method effectively decomposes incoming tensors into major trends while explicitly distinguishing between categorical and skewed continuous attributes. To our knowledge, it is the first to compute hybrid skewed infinite and finite dimensional decomposition. Based on this decomposition, it streamingly finds distinct time-evolving patterns, enabling the detection of multiple types of anomalies. Extensive experiments on large-scale real datasets demonstrate that CyberCScope detects various intrusions with higher accuracy than state-of-the-art baselines while providing meaningful summaries for the intrusions that occur in practice.
中文摘要:CyberCScope是一种新型流式处理方法,能有效区分分类属性和偏斜连续属性来分解网络安全张量数据,从而以高精度实时检测多种入侵类型。
English Summary: CyberCScope is a novel streaming method that effectively decomposes cybersecurity tensor data by distinguishing categorical and skewed continuous attributes, enabling real-time detection of multiple intrusion types with high accuracy.

Authors:Yiran Zhao, Chaoqun Liu, Yue Deng, Jiahao Ying, Mahani Aljunied, Zhaodonghui Li, Lidong Bing, Hou Pong Chan, Yu Rong, Deli Zhao, Wenxuan Zhang
Title: Babel: Open Multilingual Large Language Models Serving Over 90% of Global Speakers
Abstract:
Large language models (LLMs) have revolutionized natural language processing (NLP), yet open-source multilingual LLMs remain scarce, with existing models often limited in language coverage. Such models typically prioritize well-resourced languages, while widely spoken but under-resourced languages are often overlooked. To address this disparity, we introduce $\texttt{Babel}$, an open multilingual LLM that covers the top 25 languages by number of speakers, supports over 90% of the global population, and includes many languages neglected by other open multilingual LLMs. Unlike traditional continue pretraining approaches, Babel expands its parameter count through a layer extension technique that elevates Babel's performance ceiling. We introduce two variants: $\texttt{Babel-9B}$, designed for efficient inference and fine-tuning, and $\texttt{Babel-83B}$, which sets a new standard for open multilingual LLMs. Extensive evaluations on multilingual tasks demonstrate its superior performance compared to open LLMs of comparable size. In addition, using open-source supervised fine-tuning datasets, Babel achieves remarkable performance, with Babel-9B-Chat leading among 10B-sized LLMs and Babel-83B-Chat setting a new standard for multilingual tasks, reaching the same level of commercial models.
中文摘要:Babel是一种创新的开源多语言大模型,覆盖25种使用广泛的语言(包括资源匮乏语言),通过分层扩展技术推出两种可扩展变体,在多语言任务中表现出卓越性能,达到商业模型水平。
English Summary: Babel is an innovative open multilingual large language model that expands language coverage to 25 widely spoken languages, including under-resourced ones, and introduces two scalable variants demonstrating superior performance in multilingual tasks, rivaling commercial models.

Authors:Miao Peng, Nuo Chen, Zongrui Suo, Jia Li
Title: Rewarding Graph Reasoning Process makes LLMs more Generalized Reasoners
Abstract:
Despite significant advancements in Large Language Models (LLMs), developing advanced reasoning capabilities in LLMs remains a key challenge. Process Reward Models (PRMs) have demonstrated exceptional promise in enhancing reasoning by providing step-wise feedback, particularly in the context of mathematical reasoning. However, their application to broader reasoning domains remains understudied, largely due to the high costs associated with manually creating step-level supervision. In this work, we explore the potential of PRMs in graph reasoning problems - a domain that demands sophisticated multi-step reasoning and offers opportunities for automated step-level data generation using established graph algorithms. We introduce GraphSILO, the largest dataset for graph reasoning problems with fine-grained step-wise labels, built using automated Task-oriented Trajectories and Monte Carlo Tree Search (MCTS) to generate detailed reasoning steps with step-wise labels. Building upon this dataset, we train GraphPRM, the first PRM designed for graph reasoning problems, and evaluate its effectiveness in two key settings: inference-time scaling and reinforcement learning via Direct Preference Optimization (DPO). Experimental results show that GraphPRM significantly improves LLM performance across 13 graph reasoning tasks, delivering a 9% gain for Qwen2.5-7B and demonstrating transferability to new graph reasoning datasets and new reasoning domains like mathematical problem-solving. Notably, GraphPRM enhances LLM performance on GSM8K and Math500, underscoring the cross-domain applicability of graph-based reasoning rewards. Our findings highlight the potential of PRMs in advancing reasoning across diverse domains, paving the way for more versatile and effective LLMs.
中文摘要:过程奖励模型在增强大语言模型推理能力方面展现出巨大潜力,GraphPRM在图形推理任务中的显著性能提升及其向数学解题等领域的可迁移性,证明了该方法的跨领域适用性。
English Summary: Process Reward Models (PRMs) show strong potential in improving reasoning across diverse domains, as demonstrated by GraphPRM's significant performance gains in graph reasoning and transferability to mathematical problem-solving.

Authors:Christoph Treude, Raula Gaikovina Kula
Title: Interacting with AI Reasoning Models: Harnessing "Thoughts" for AI-Driven Software Engineering
Abstract:
Recent advances in AI reasoning models provide unprecedented transparency into their decision-making processes, transforming them from traditional black-box systems into models that articulate step-by-step chains of thought rather than producing opaque outputs. This shift has the potential to improve software quality, explainability, and trust in AI-augmented development. However, software engineers rarely have the time or cognitive bandwidth to analyze, verify, and interpret every AI-generated thought in detail. Without an effective interface, this transparency could become a burden rather than a benefit. In this paper, we propose a vision for structuring the interaction between AI reasoning models and software engineers to maximize trust, efficiency, and decision-making power. We argue that simply exposing AI's reasoning is not enough -- software engineers need tools and frameworks that selectively highlight critical insights, filter out noise, and facilitate rapid validation of key assumptions. To illustrate this challenge, we present motivating examples in which AI reasoning models state their assumptions when deciding which external library to use and produce divergent reasoning paths and recommendations about security vulnerabilities, highlighting the need for an interface that prioritizes actionable insights while managing uncertainty and resolving conflicts. We then outline a research roadmap for integrating automated summarization, assumption validation, and multi-model conflict resolution into software engineering workflows. Achieving this vision will unlock the full potential of AI reasoning models to enable software engineers to make faster, more informed decisions without being overwhelmed by unnecessary detail.
Chinese: 人工智能推理模型现在能提供清晰的逐步决策过程,但若缺乏有效界面来突出关键见解并过滤干扰信息,这种透明度反而可能使软件工程师不堪重负而非提升工作效率。
English: AI reasoning models now offer clear step-by-step decision processes, but without effective interfaces to highlight key insights and filter noise, this transparency risks overwhelming software engineers instead of enhancing their work.

Authors:Rui Lu, Yang Yue, Andrew Zhao, Simon Du, Gao Huang
Title: Towards Understanding the Benefit of Multitask Representation Learning in Decision Process
Abstract:
Multitask Representation Learning (MRL) has emerged as a prevalent technique to improve sample efficiency in Reinforcement Learning (RL). Empirical studies have found that training agents on multiple tasks simultaneously within online and transfer learning environments can greatly improve efficiency. Despite its popularity, a comprehensive theoretical framework that elucidates its operational efficacy remains incomplete. Prior analyses have predominantly assumed that agents either possess a pre-known representation function or utilize functions from a linear class, where both are impractical. The complexity of real-world applications typically requires the use of sophisticated, non-linear functions such as neural networks as representation function, which are not pre-existing but must be learned. Our work tries to fill the gap by extending the analysis to \textit{unknown non-linear} representations, giving a comprehensive analysis for its mechanism in online and transfer learning setting. We consider the setting that an agent simultaneously playing $M$ contextual bandits (or MDPs), developing a shared representation function $ϕ$ from a non-linear function class $Φ$ using our novel Generalized Functional Upper Confidence Bound algorithm (GFUCB). We formally prove that this approach yields a regret upper bound that outperforms the lower bound associated with learning $M$ separate tasks, marking the first demonstration of MRL's efficacy in a general function class. This framework also explains the contribution of representations to transfer learning when faced with new, yet related tasks, and identifies key conditions for successful transfer. Empirical experiments further corroborate our theoretical findings.
Chinese: 本研究将多任务表示学习的理论分析扩展到未知非线性表示,通过新算法和实证验证,证明了其在在线和迁移学习中的有效性。
English: This study extends the theoretical analysis of Multitask Representation Learning to unknown non-linear representations, demonstrating its efficacy in online and transfer learning through a novel algorithm and empirical validation.

Authors:Vedant Khandelwal, Kaushik Roy, Valerie Lookingbill, Ritvik Garimella, Harshul Surana, Heather Heckman, Amit Sheth
Title: NeuroLit Navigator: A Neurosymbolic Approach to Scholarly Article Searches for Systematic Reviews
Abstract:
The introduction of Large Language Models (LLMs) has significantly impacted various fields, including education, for example, by enabling the creation of personalized learning materials. However, their use in Systematic Reviews (SRs) reveals limitations such as restricted access to specialized vocabularies, lack of domain-specific reasoning, and a tendency to generate inaccurate information. Existing SR tools often rely on traditional NLP methods and fail to address these issues adequately. To overcome these challenges, we developed the ``NeuroLit Navigator,'' a system that combines domain-specific LLMs with structured knowledge sources like Medical Subject Headings (MeSH) and the Unified Medical Language System (UMLS). This integration enhances query formulation, expands search vocabularies, and deepens search scopes, enabling more precise searches. Deployed in multiple universities and tested by over a dozen librarians, the NeuroLit Navigator has reduced the time required for initial literature searches by 90\%. Despite this efficiency, the initial set of articles retrieved can vary in relevance and quality. Nonetheless, the system has greatly improved the reproducibility of search results, demonstrating its potential to support librarians in the SR process.
大型语言模型在教育等领域影响显著,但在系统综述中存在词汇受限、推理不足等局限;为此开发的NeuroLit Navigator结合领域专用模型与结构化知识,提升了搜索精度和可重复性,将初始文献搜索时间减少90%,尽管初步检索结果的相关性可能有所波动。
Large Language Models (LLMs) have advanced fields like education but face limitations in Systematic Reviews, leading to the development of the NeuroLit Navigator, which integrates domain-specific LLMs with structured knowledge to enhance search precision, reduce search time by 90%, and improve reproducibility despite variability in initial article relevance.

Authors:Rana Muhammad Shahroz Khan, Dongwen Tang, Pingzhi Li, Kai Wang, Tianlong Chen
Title: ORAL: Prompting Your Large-Scale LoRAs via Conditional Recurrent Diffusion
Abstract:
Parameter generation has emerged as a novel paradigm for neural network development, offering an alternative to traditional neural network training by synthesizing high-quality model weights directly. In the context of Low-Rank Adaptation (LoRA) for evolving ($\textit{i.e.}$, constantly updated) large language models (LLMs), this approach promises efficient adaptation without costly retraining. However, existing methods face critical limitations in simultaneously achieving scalability and controllability. In this paper, we introduce $\texttt{ORAL}$, a novel $\textbf{conditional recurrent diffusion}$ framework that addresses these challenges. $\texttt{ORAL}$ incorporates a novel conditioning mechanism that integrates model architecture and textual task specifications, enabling the generation of task-specific LoRA parameters that can seamlessly transfer across evolving foundation models. Our approach successfully scales to billions-of-parameter LLMs and maintains controllability. Through extensive experiments across seven language tasks, four vision tasks, and three multimodal tasks using five pre-trained LLMs, we demonstrate that $\texttt{ORAL}$ generates high-quality LoRA parameters that achieve comparable or superior performance to vanilla trained counterparts.
中文: ORAL提出了一种条件循环扩散框架,可为持续演进的大语言模型生成可扩展且可控的LoRA参数,在多种任务中达到与传统训练相当或更优的性能。
English: ORAL introduces a conditional recurrent diffusion framework that generates scalable and controllable LoRA parameters for evolving large language models, achieving performance comparable to traditional training across diverse tasks.

Authors:Minghan Wang, Ye Bai, Yuxia Wang, Thuy-Trang Vu, Ehsan Shareghi, Gholamreza Haffari
Title: SpeechDialogueFactory: Generating High-Quality Speech Dialogue Data to Accelerate Your Speech-LLM Development
Abstract:
High-quality speech dialogue datasets are crucial for Speech-LLM development, yet existing acquisition methods face significant limitations. Human recordings incur high costs and privacy concerns, while synthetic approaches often lack conversational authenticity. To address these challenges, we introduce \textsc{SpeechDialogueFactory}, a production-ready framework for generating natural speech dialogues efficiently. Our solution employs a comprehensive pipeline including metadata generation, dialogue scripting, paralinguistic-enriched utterance simulation, and natural speech synthesis with voice cloning. Additionally, the system provides an interactive UI for detailed sample inspection and a high-throughput batch synthesis mode. Evaluations show that dialogues generated by our system achieve a quality comparable to human recordings while significantly reducing production costs. We release our work as an open-source toolkit, alongside example datasets available in English and Chinese, empowering researchers and developers in Speech-LLM research and development.
Chinese Summary: SpeechDialogueFactory框架通过完整的生成流程高效合成自然语音对话,其质量可与人工录音相媲美并大幅降低生产成本,已作为开源工具包发布以支持语音大模型的研发。
English Summary: The SpeechDialogueFactory framework efficiently generates natural speech dialogues through a comprehensive pipeline that achieves quality comparable to human recordings while significantly reducing production costs, released as an open-source toolkit for Speech-LLM development.

Authors:Jiangnan Li, Thuy-Trang Vu, Christian Herold, Amirhossein Tebbifakhr, Shahram Khadivi, Gholamreza Haffari
Title: CONGRAD:Conflicting Gradient Filtering for Multilingual Preference Alignment
Abstract:
Naive joint training of large language models (LLMs) for multilingual preference alignment can suffer from negative interference. This is a known issue in multilingual training, where conflicting objectives degrade overall performance. However, the impact of this phenomenon in the context of multilingual preference alignment remains largely underexplored. To address this issue, we propose CONGRAD, a scalable and effective filtering method that selects high-quality preference samples with minimal gradient conflicts across languages. Our method leverages gradient surgery to retain samples aligned with an aggregated multilingual update direction. Additionally, we incorporate a sublinear gradient compression strategy that reduces memory overhead during gradient accumulation. We integrate CONGRAD into self-rewarding framework and evaluate on LLaMA3-8B and Gemma2-2B across 10 languages. Results show that CONGRAD consistently outperforms strong baselines in both seen and unseen languages, with minimal alignment tax.
中文:CONGRAD方法通过筛选梯度冲突最小的优质样本,有效缓解了多语言偏好对齐中的负干扰问题,在多种语言上展现出卓越性能且对齐代价极低。
English: The proposed CONGRAD method effectively mitigates negative interference in multilingual preference alignment for LLMs by filtering high-quality samples with minimal gradient conflicts, demonstrating superior performance across multiple languages with minimal alignment tax.

Authors:Stephen Meisenbacher, Chaeeun Joy Lee, Florian Matthes
Title: Spend Your Budget Wisely: Towards an Intelligent Distribution of the Privacy Budget in Differentially Private Text Rewriting
Abstract:
The task of $\textit{Differentially Private Text Rewriting}$ is a class of text privatization techniques in which (sensitive) input textual documents are $\textit{rewritten}$ under Differential Privacy (DP) guarantees. The motivation behind such methods is to hide both explicit and implicit identifiers that could be contained in text, while still retaining the semantic meaning of the original text, thus preserving utility. Recent years have seen an uptick in research output in this field, offering a diverse array of word-, sentence-, and document-level DP rewriting methods. Common to these methods is the selection of a privacy budget (i.e., the $\varepsilon$ parameter), which governs the degree to which a text is privatized. One major limitation of previous works, stemming directly from the unique structure of language itself, is the lack of consideration of $\textit{where}$ the privacy budget should be allocated, as not all aspects of language, and therefore text, are equally sensitive or personal. In this work, we are the first to address this shortcoming, asking the question of how a given privacy budget can be intelligently and sensibly distributed amongst a target document. We construct and evaluate a toolkit of linguistics- and NLP-based methods used to allocate a privacy budget to constituent tokens in a text document. In a series of privacy and utility experiments, we empirically demonstrate that given the same privacy budget, intelligent distribution leads to higher privacy levels and more positive trade-offs than a naive distribution of $\varepsilon$. Our work highlights the intricacies of text privatization with DP, and furthermore, it calls for further work on finding more efficient ways to maximize the privatization benefits offered by DP in text rewriting.
中文: 差分隐私文本重写旨在通过智能分配隐私预算到文本的不同部分,在差分隐私保护下实现文本私有化,相比简单分配,这种方法能提高隐私水平和效用平衡。
English: Differential Private Text Rewriting aims to privatize text under differential privacy by intelligently allocating the privacy budget to different parts of the document, which enhances privacy and utility compared to naive distribution.

Authors:Hadrien Reynaud, Alberto Gomez, Paul Leeson, Qingjie Meng, Bernhard Kainz
Title: EchoFlow: A Foundation Model for Cardiac Ultrasound Image and Video Generation
Abstract:
Advances in deep learning have significantly enhanced medical image analysis, yet the availability of large-scale medical datasets remains constrained by patient privacy concerns. We present EchoFlow, a novel framework designed to generate high-quality, privacy-preserving synthetic echocardiogram images and videos. EchoFlow comprises four key components: an adversarial variational autoencoder for defining an efficient latent representation of cardiac ultrasound images, a latent image flow matching model for generating accurate latent echocardiogram images, a latent re-identification model to ensure privacy by filtering images anatomically, and a latent video flow matching model for animating latent images into realistic echocardiogram videos conditioned on ejection fraction. We rigorously evaluate our synthetic datasets on the clinically relevant task of ejection fraction regression and demonstrate, for the first time, that downstream models trained exclusively on EchoFlow-generated synthetic datasets achieve performance parity with models trained on real datasets. We release our models and synthetic datasets, enabling broader, privacy-compliant research in medical ultrasound imaging at https://huggingface.co/spaces/HReynaud/EchoFlow.
Chinese: EchoFlow提出了一种创新框架,用于生成高质量、保护隐私的合成超声心动图图像和视频,使基于该合成数据训练的下游模型能达到与使用真实数据集相当的性能表现。
English: EchoFlow introduces a novel framework for generating high-quality, privacy-preserving synthetic echocardiogram images and videos, enabling downstream models trained on this synthetic data to achieve performance comparable to those using real datasets.

Authors:Binh Thien Nguyen, Masahiro Yasuda, Daiki Takeuchi, Daisuke Niizumi, Yasunori Ohishi, Noboru Harada
Title: Baseline Systems and Evaluation Metrics for Spatial Semantic Segmentation of Sound Scenes
Abstract:
Immersive communication has made significant advancements, especially with the release of the codec for Immersive Voice and Audio Services. Aiming at its further realization, the DCASE 2025 Challenge has recently introduced a task for spatial semantic segmentation of sound scenes (S5), which focuses on detecting and separating sound events in spatial sound scenes. In this paper, we explore methods for addressing the S5 task. Specifically, we present baseline S5 systems that combine audio tagging (AT) and label-queried source separation (LSS) models. We investigate two LSS approaches based on the ResUNet architecture: a) extracting a single source for each detected event and b) querying multiple sources concurrently. Since each separated source in S5 is identified by its sound event class label, we propose new class-aware metrics to evaluate both the sound sources and labels simultaneously. Experimental results on first-order ambisonics spatial audio demonstrate the effectiveness of the proposed systems and confirm the efficacy of the metrics.
中文:DCASE 2025挑战赛提出声音场景空间语义分割任务S5,通过结合音频标记与标签查询源分离模型的基线系统,在环境立体声数据上采用新型类别感知指标进行评估验证。
English: The DCASE 2025 Challenge introduces the S5 task for spatial semantic segmentation of sound scenes, where baseline systems combining audio tagging and label-queried source separation models are proposed and evaluated using novel class-aware metrics on ambisonics audio.

Authors:Kailin Li, Puhao Li, Tengyu Liu, Yuyang Li, Siyuan Huang
Title: ManipTrans: Efficient Dexterous Bimanual Manipulation Transfer via Residual Learning
Abstract:
Human hands play a central role in interacting, motivating increasing research in dexterous robotic manipulation. Data-driven embodied AI algorithms demand precise, large-scale, human-like manipulation sequences, which are challenging to obtain with conventional reinforcement learning or real-world teleoperation. To address this, we introduce ManipTrans, a novel two-stage method for efficiently transferring human bimanual skills to dexterous robotic hands in simulation. ManipTrans first pre-trains a generalist trajectory imitator to mimic hand motion, then fine-tunes a specific residual module under interaction constraints, enabling efficient learning and accurate execution of complex bimanual tasks. Experiments show that ManipTrans surpasses state-of-the-art methods in success rate, fidelity, and efficiency. Leveraging ManipTrans, we transfer multiple hand-object datasets to robotic hands, creating DexManipNet, a large-scale dataset featuring previously unexplored tasks like pen capping and bottle unscrewing. DexManipNet comprises 3.3K episodes of robotic manipulation and is easily extensible, facilitating further policy training for dexterous hands and enabling real-world deployments.
Chinese: 本研究提出了ManipTrans,一种在仿真中将人类双手技能高效迁移至灵巧机械手的两阶段方法,该方法性能优于现有技术,并构建了大规模灵巧操作数据集DexManipNet。
English: The study introduces ManipTrans, a two-stage method for transferring human bimanual skills to robotic hands in simulation, which outperforms existing techniques and enables the creation of DexManipNet, a large-scale dataset for dexterous manipulation tasks.

Authors:Jizhou Han, Chenhao Ding, Yuhang He, Songlin Dong, Qiang Wang, Xinyuan Gao, Yihong Gong
Title: Learn by Reasoning: Analogical Weight Generation for Few-Shot Class-Incremental Learning
Abstract:
Few-shot class-incremental Learning (FSCIL) enables models to learn new classes from limited data while retaining performance on previously learned classes. Traditional FSCIL methods often require fine-tuning parameters with limited new class data and suffer from a separation between learning new classes and utilizing old knowledge. Inspired by the analogical learning mechanisms of the human brain, we propose a novel analogical generative method. Our approach includes the Brain-Inspired Analogical Generator (BiAG), which derives new class weights from existing classes without parameter fine-tuning during incremental stages. BiAG consists of three components: Weight Self-Attention Module (WSA), Weight & Prototype Analogical Attention Module (WPAA), and Semantic Conversion Module (SCM). SCM uses Neural Collapse theory for semantic conversion, WSA supplements new class weights, and WPAA computes analogies to generate new class weights. Experiments on miniImageNet, CUB-200, and CIFAR-100 datasets demonstrate that our method achieves higher final and average accuracy compared to SOTA methods.
中文: 本文提出一种受大脑启发的类比生成方法,用于少样本类增量学习,无需微调即可生成新类权重,在多个基准数据集上相比现有最优方法实现了更高的准确率。
English: This paper introduces a brain-inspired analogical generative method for few-shot class-incremental learning, which generates new class weights without fine-tuning and achieves superior accuracy on benchmark datasets compared to state-of-the-art methods.

Authors:Minzhao Liu, Ruslan Shaydulin, Pradeep Niroula, Matthew DeCross, Shih-Han Hung, Wen Yu Kon, Enrique Cervero-Martín, Kaushik Chakraborty, Omar Amer, Scott Aaronson, Atithi Acharya, Yuri Alexeev, K. Jordan Berg, Shouvanik Chakrabarti, Florian J. Curchod, Joan M. Dreiling, Neal Erickson, Cameron Foltz, Michael Foss-Feig, David Hayes, Travis S. Humble, Niraj Kumar, Jeffrey Larson, Danylo Lykov, Michael Mills, Steven A. Moses, Brian Neyenhuis, Shaltiel Eloul, Peter Siegfried, James Walker, Charles Lim, Marco Pistoia
Title: Certified randomness using a trapped-ion quantum processor
Abstract:
While quantum computers have the potential to perform a wide range of practically important tasks beyond the capabilities of classical computers, realizing this potential remains a challenge. One such task is to use an untrusted remote device to generate random bits that can be certified to contain a certain amount of entropy. Certified randomness has many applications but is fundamentally impossible to achieve solely by classical computation. In this work, we demonstrate the generation of certifiably random bits using the 56-qubit Quantinuum H2-1 trapped-ion quantum computer accessed over the internet. Our protocol leverages the classical hardness of recent random circuit sampling demonstrations: a client generates quantum "challenge" circuits using a small randomness seed, sends them to an untrusted quantum server to execute, and verifies the server's results. We analyze the security of our protocol against a restricted class of realistic near-term adversaries. Using classical verification with measured combined sustained performance of $1.1\times10^{18}$ floating-point operations per second across multiple supercomputers, we certify $71,313$ bits of entropy under this restricted adversary and additional assumptions. Our results demonstrate a step towards the practical applicability of today's quantum computers.
中文: 本研究通过基于互联网的协议,利用56量子位离子阱量子计算机生成经认证的随机比特,通过经典计算验证服务器结果,认证了71,313比特的熵,推动了量子计算机的实际应用。
English: This study demonstrates the generation of certified random bits using a 56-qubit trapped-ion quantum computer via an internet-based protocol that verifies server results against classical computations, certifying 71,313 bits of entropy and advancing practical quantum applications.

Authors:Yuxuan Chen, Jiawen Li, Jiali Hu, Xitong Ling, Tian Guan, Anjia Han, Yonghong He
Title: Cross-Modal Prototype Allocation: Unsupervised Slide Representation Learning via Patch-Text Contrast in Computational Pathology
Abstract:
With the rapid advancement of pathology foundation models (FMs), the representation learning of whole slide images (WSIs) attracts increasing attention. Existing studies develop high-quality patch feature extractors and employ carefully designed aggregation schemes to derive slide-level representations. However, mainstream weakly supervised slide representation learning methods, primarily based on multiple instance learning (MIL), are tailored to specific downstream tasks, which limits their generalizability. To address this issue, some studies explore unsupervised slide representation learning. However, these approaches focus solely on the visual modality of patches, neglecting the rich semantic information embedded in textual data. In this work, we propose ProAlign, a cross-modal unsupervised slide representation learning framework. Specifically, we leverage a large language model (LLM) to generate descriptive text for the prototype types present in a WSI, introducing patch-text contrast to construct initial prototype embeddings. Furthermore, we propose a parameter-free attention aggregation strategy that utilizes the similarity between patches and these prototypes to form unsupervised slide embeddings applicable to a wide range of downstream tasks. Extensive experiments on four public datasets show that ProAlign outperforms existing unsupervised frameworks and achieves performance comparable to some weakly supervised models.
中文:ProAlign提出了一种跨模态无监督框架,通过大语言模型生成文本描述并结合图像-文本对比学习构建通用病理切片表征,在多个数据集上超越现有无监督方法并达到部分弱监督模型的性能水平。
English: ProAlign introduces a cross-modal unsupervised framework that leverages LLM-generated text descriptions and patch-text contrast to create versatile slide representations, outperforming existing unsupervised methods and matching some weakly supervised models' performance across multiple datasets.

Authors:Hongyu Liu, Xuan Wang, Ziyu Wan, Yue Ma, Jingye Chen, Yanbo Fan, Yujun Shen, Yibing Song, Qifeng Chen
Title: AvatarArtist: Open-Domain 4D Avatarization
Abstract:
This work focuses on open-domain 4D avatarization, with the purpose of creating a 4D avatar from a portrait image in an arbitrary style. We select parametric triplanes as the intermediate 4D representation and propose a practical training paradigm that takes advantage of both generative adversarial networks (GANs) and diffusion models. Our design stems from the observation that 4D GANs excel at bridging images and triplanes without supervision yet usually face challenges in handling diverse data distributions. A robust 2D diffusion prior emerges as the solution, assisting the GAN in transferring its expertise across various domains. The synergy between these experts permits the construction of a multi-domain image-triplane dataset, which drives the development of a general 4D avatar creator. Extensive experiments suggest that our model, AvatarArtist, is capable of producing high-quality 4D avatars with strong robustness to various source image domains. The code, the data, and the models will be made publicly available to facilitate future studies.
中文:本研究提出AvatarArtist方法,通过结合生成对抗网络和扩散模型,能够从任意风格的肖像图像生成高质量4D虚拟形象,展现出强大的跨域适应能力。
English: This research introduces AvatarArtist, a method for creating 4D avatars from portrait images by combining GANs and diffusion models to handle diverse styles effectively.

Authors:Omar Amer, Shouvanik Chakrabarti, Kaushik Chakraborty, Shaltiel Eloul, Niraj Kumar, Charles Lim, Minzhao Liu, Pradeep Niroula, Yash Satsangi, Ruslan Shaydulin, Marco Pistoia
Title: Applications of Certified Randomness
Abstract:
Certified randomness can be generated with untrusted remote quantum computers using multiple known protocols, one of which has been recently realized experimentally. Unlike the randomness sources accessible on today's classical computers, the output of these protocols can be certified to be random under certain computational hardness assumptions, with no trust required in the hardware generating the randomness. In this perspective, we explore real-world applications for which the use of certified randomness protocols may lead to improved security and fairness. We identify promising applications in areas including cryptography, differential privacy, financial markets, and blockchain. Through this initial exploration, we hope to shed light on potential applications of certified randomness.
中文: 利用不可信量子计算机的认证随机性协议可在无需信任硬件的情况下确保安全与公平,在密码学、隐私保护、金融和区块链等领域具有广阔应用前景。
English: Certified randomness protocols using untrusted quantum computers can ensure secure and fair outcomes without hardware trust, with promising applications in cryptography, privacy, finance, and blockchain.

Authors:Yuyao Ge, Shenghua Liu, Yiwei Wang, Lingrui Mei, Lizhe Chen, Baolong Bi, Xueqi Cheng
Title: Innate Reasoning is Not Enough: In-Context Learning Enhances Reasoning Large Language Models with Less Overthinking
Abstract:
Recent advances in Large Language Models (LLMs) have introduced Reasoning Large Language Models (RLLMs), which employ extended thinking processes with reflection and self-correction capabilities, demonstrating the effectiveness of test-time scaling. RLLMs exhibit innate Chain-of-Thought (CoT) reasoning capability obtained from training, leading to a natural question: "Is CoT prompting, a popular In-Context Learning (ICL) method for chat LLMs, necessary to enhance the reasoning capability of RLLMs?" In this work, we present the first comprehensive analysis of the impacts of Zero-shot CoT and Few-shot CoT on RLLMs across mathematical reasoning tasks. We examine models ranging from 1.5B to 32B parameters, finding that contrary to concerns, CoT prompting significantly enhances RLLMs' performance in most scenarios. Our results reveal distinct patterns: large-capacity models show minimal improvement on simple tasks but substantial gains on complex problems, while smaller models exhibit the opposite behavior. Further analysis demonstrates that CoT prompting effectively controls the distribution of the numbers of thinking tokens and reasoning steps, reducing excessive reflections by approximately 90% in some cases. Moreover, attention logits analysis reveals the RLLMs' overfitting to reflection-related words, which is mitigated by external CoT guidance. Notably, our experiments indicate that for RLLMs, one-shot CoT consistently yields superior performance compared to Few-shot CoT approaches. Our findings provide important insights for optimizing RLLMs' performance through appropriate prompting strategies.
This study reveals that Chain-of-Thought prompting significantly enhances Reasoning Large Language Models' performance across mathematical tasks, with one-shot CoT proving most effective by optimizing thinking token distribution and mitigating overfitting to reflection patterns.
English Summary:

Authors:Yuchao Gu, Weijia Mao, Mike Zheng Shou
Title: Long-Context Autoregressive Video Modeling with Next-Frame Prediction
Abstract:
Long-context video modeling is essential for enabling generative models to function as world simulators, as they must maintain temporal coherence over extended time spans. However, most existing models are trained on short clips, limiting their ability to capture long-range dependencies, even with test-time extrapolation. While training directly on long videos is a natural solution, the rapid growth of vision tokens makes it computationally prohibitive. To support exploring efficient long-context video modeling, we first establish a strong autoregressive baseline called Frame AutoRegressive (FAR). FAR models temporal dependencies between continuous frames, converges faster than video diffusion transformers, and outperforms token-level autoregressive models. Based on this baseline, we observe context redundancy in video autoregression. Nearby frames are critical for maintaining temporal consistency, whereas distant frames primarily serve as context memory. To eliminate this redundancy, we propose the long short-term context modeling using asymmetric patchify kernels, which apply large kernels to distant frames to reduce redundant tokens, and standard kernels to local frames to preserve fine-grained detail. This significantly reduces the training cost of long videos. Our method achieves state-of-the-art results on both short and long video generation, providing an effective baseline for long-context autoregressive video modeling.
中文: 为解决长上下文视频建模的计算难题,我们提出了帧自回归(FAR)基线方法,采用非对称分块核技术,对远距离帧使用大核减少冗余,对邻近帧保留标准核以维持细节,从而在视频生成中取得领先性能。
English: To address the computational challenges of long-context video modeling, we introduce Frame AutoRegressive (FAR), a baseline that leverages asymmetric patchify kernels to reduce redundancy by applying large kernels to distant frames and standard ones to nearby frames, achieving state-of-the-art results in video generation.

Authors:Guangsheng Ou, Mingwei Liu, Yuxuan Chen, Xueying Du, Shengbo Wang, Zekai Zhang, Xin Peng, Zibin Zheng
Title: Enhancing LLM-based Code Translation in Repository Context via Triple Knowledge-Augmented
Abstract:
Large language models (LLMs) have behaved well in function-level code translation without repository-level context. However, the performance of LLMs in repository-level context code translation remains suboptimal due to complex dependencies and context, hindering their adoption in industrial settings. In this work, we propose a novel LLM-based code translation technique K-Trans, which leverages triple knowledge augmentation to enhance LLM's translation quality under repository context in real-world software development. First, K-Trans constructs a translation knowledge base by extracting relevant information from target-language codebases, the repository being translated, and prior translation results. Second, for each function to be translated, K-Trans retrieves relevant triple knowledge, including target-language code samples, dependency usage examples, and successful translation function pairs, serving as references to enhance LLM for translation. Third, K-Trans constructs a knowledge-augmented translation prompt using the retrieved triple knowledge and employs LLMs to generate the translated code while preserving repository context. It further leverages LLMs for self-debugging, enhancing translation correctness. The experiments show that K-Trans substantially outperforms the baseline adapted from previous work by 19.4%/40.2% relative improvement in pass@1 and 0.138 in CodeBLEU. It is important to note that the results also demonstrate that each knowledge significantly contributes to K-Trans's effectiveness in handling repository-level context code translation, with dependency usage examples making the most notable contribution. Moreover, as the self-evolution process progresses, the knowledge base continuously enhances the LLM's performance across various aspects of the repository-level code translation.
中文: K-Trans通过从目标语言代码库和先前翻译结果构建知识库,利用三重知识增强和自调试机制,显著提升了大型语言模型在仓库级代码翻译中的性能表现。
English: K-Trans enhances LLM-based code translation by constructing a knowledge base from target-language codebases and prior translations, then using triple knowledge augmentation and self-debugging to significantly improve performance in repository-level contexts.

Authors:Qiang Wang, Yuhang He, SongLin Dong, Xiang Song, Jizhou Han, Haoyu Luo, Yihong Gong
Title: DualCP: Rehearsal-Free Domain-Incremental Learning via Dual-Level Concept Prototype
Abstract:
Domain-Incremental Learning (DIL) enables vision models to adapt to changing conditions in real-world environments while maintaining the knowledge acquired from previous domains. Given privacy concerns and training time, Rehearsal-Free DIL (RFDIL) is more practical. Inspired by the incremental cognitive process of the human brain, we design Dual-level Concept Prototypes (DualCP) for each class to address the conflict between learning new knowledge and retaining old knowledge in RFDIL. To construct DualCP, we propose a Concept Prototype Generator (CPG) that generates both coarse-grained and fine-grained prototypes for each class. Additionally, we introduce a Coarse-to-Fine calibrator (C2F) to align image features with DualCP. Finally, we propose a Dual Dot-Regression (DDR) loss function to optimize our C2F module. Extensive experiments on the DomainNet, CDDB, and CORe50 datasets demonstrate the effectiveness of our method.
中文: 本文提出DualCP方法,通过双层级概念原型和粗到细校准器解决无回放领域增量学习中的新旧知识冲突问题,在多个数据集上验证了其有效性。
English: This paper introduces DualCP, a method using dual-level concept prototypes and a coarse-to-fine calibrator to address knowledge retention and acquisition conflicts in rehearsal-free domain-incremental learning, demonstrating effectiveness across multiple datasets.

Authors:Zefeng Zhang, Hengzhu Tang, Jiawei Sheng, Zhenyu Zhang, Yiming Ren, Zhenyang Li, Dawei Yin, Duohe Ma, Tingwen Liu
Title: Debiasing Multimodal Large Language Models via Noise-Aware Preference Optimization
Abstract:
Multimodal Large Language Models excel in various tasks, yet often struggle with modality bias, where the model tends to rely heavily on a single modality and overlook critical information in other modalities, which leads to incorrect focus and generating irrelevant responses. In this paper, we propose using the paradigm of preference optimization to solve the modality bias problem, including RLAIFVBias, a debiased preference optimization dataset, and a Noise Aware Preference Optimization algorithm. Specifically, we first construct the dataset by introducing perturbations to reduce the informational content of certain modalities, compelling the model to rely on a specific modality when generating negative responses. To address the inevitable noise in automatically constructed data, we combine the noise robust Mean Absolute Error with the Binary Cross Entropy in Direct Preference Optimization by a negative Box Cox transformation, and dynamically adjust the algorithm noise robustness based on the evaluated noise levels in the data. Extensive experiments validate our approach, demonstrating not only its effectiveness in mitigating modality bias but also its significant role in minimizing hallucinations.
中文摘要:本文通过提出包含去偏优化数据集和噪声感知优化算法的偏好优化框架,有效解决了多模态大语言模型中的模态偏见问题,该方法通过模态扰动和噪声鲁棒性处理,显著减少了模型偏见和幻觉现象。
English Summary: This paper tackles modality bias in Multimodal Large Language Models by introducing a preference optimization framework featuring a debiased dataset and a noise-aware algorithm, which effectively reduces bias and hallucinations through strategic data perturbation and robust noise handling.

Authors:Codefuse, Ling Team, :, Wenting Cai, Yuchen Cao, Chaoyu Chen, Chen Chen, Siba Chen, Qing Cui, Peng Di, Junpeng Fang, Zi Gong, Ting Guo, Zhengyu He, Yang Huang, Cong Li, Jianguo Li, Zheng Li, Shijie Lian, BingChang Liu, Songshan Luo, Shuo Mao, Min Shen, Jian Wu, Jiaolong Yang, Wenjie Yang, Tong Ye, Hang Yu, Wei Zhang, Zhenduo Zhang, Hailin Zhao, Xunjin Zheng, Jun Zhou
Title: Every Sample Matters: Leveraging Mixture-of-Experts and High-Quality Data for Efficient and Accurate Code LLM
Abstract:
Recent advancements in code large language models (LLMs) have demonstrated remarkable capabilities in code generation and understanding. It is still challenging to build a code LLM with comprehensive performance yet ultimate efficiency. Many attempts have been released in the open source community to break the trade-off between performance and efficiency, such as the Qwen Coder series and the DeepSeek Coder series. This paper introduces yet another attempt in this area, namely Ling-Coder-Lite. We leverage the efficient Mixture-of-Experts (MoE) architecture along with a set of high-quality data curation methods (especially those based on program analytics) to build an efficient yet powerful code LLM. Ling-Coder-Lite exhibits on-par performance on 12 representative coding benchmarks compared to state-of-the-art models of similar size, such as Qwen2.5-Coder-7B and DeepSeek-Coder-V2-Lite, while offering competitive latency and throughput. In practice, we achieve a 50\% reduction in deployment resources compared to the similar-sized dense model without performance loss. To facilitate further research and development in this area, we open-source our models as well as a substantial portion of high-quality data for the annealing and post-training stages. The models and data can be accessed at~\url{https://huggingface.co/inclusionAI/Ling-Coder-lite}.
中文: 本文介绍了Ling-Coder-Lite代码大模型,它采用混合专家架构和高质量数据筛选方法,在保持与同类模型相当性能的同时,将部署资源减少50%,并具备出色的运行效率。
English: This paper introduces Ling-Coder-Lite, a code LLM that uses Mixture-of-Experts architecture and high-quality data curation to achieve performance comparable to similar-sized models while reducing deployment resources by 50% and maintaining competitive efficiency.

Authors:Chi Zhang, Chengjian Feng, Feng Yan, Qiming Zhang, Mingjin Zhang, Yujie Zhong, Jing Zhang, Lin Ma
Title: InstructVEdit: A Holistic Approach for Instructional Video Editing
Abstract:
Video editing according to instructions is a highly challenging task due to the difficulty in collecting large-scale, high-quality edited video pair data. This scarcity not only limits the availability of training data but also hinders the systematic exploration of model architectures and training strategies. While prior work has improved specific aspects of video editing (e.g., synthesizing a video dataset using image editing techniques or decomposed video editing training), a holistic framework addressing the above challenges remains underexplored. In this study, we introduce InstructVEdit, a full-cycle instructional video editing approach that: (1) establishes a reliable dataset curation workflow to initialize training, (2) incorporates two model architectural improvements to enhance edit quality while preserving temporal consistency, and (3) proposes an iterative refinement strategy leveraging real-world data to enhance generalization and minimize train-test discrepancies. Extensive experiments show that InstructVEdit achieves state-of-the-art performance in instruction-based video editing, demonstrating robust adaptability to diverse real-world scenarios. Project page: https://o937-blip.github.io/InstructVEdit.
中文: 本研究提出的InstructVEdit框架通过构建高质量数据集、改进模型架构和迭代优化策略,有效解决了教学视频编辑中的数据稀缺问题,在保持时序一致性的同时实现了最优性能,并展现出强大的实际应用适应性。
English: This study introduces InstructVEdit, a comprehensive framework that addresses data scarcity in instructional video editing through a curated dataset, enhanced model architecture, and iterative refinement, achieving state-of-the-art performance and robust real-world adaptability.

Authors:Tianyu Zhang, Fan Wan, Haoran Duan, Kevin W. Tong, Jingjing Deng, Yang Long
Title: FMDConv: Fast Multi-Attention Dynamic Convolution via Speed-Accuracy Trade-off
Abstract:
Spatial convolution is fundamental in constructing deep Convolutional Neural Networks (CNNs) for visual recognition. While dynamic convolution enhances model accuracy by adaptively combining static kernels, it incurs significant computational overhead, limiting its deployment in resource-constrained environments such as federated edge computing. To address this, we propose Fast Multi-Attention Dynamic Convolution (FMDConv), which integrates input attention, temperature-degraded kernel attention, and output attention to optimize the speed-accuracy trade-off. FMDConv achieves a better balance between accuracy and efficiency by selectively enhancing feature extraction with lower complexity. Furthermore, we introduce two novel quantitative metrics, the Inverse Efficiency Score and Rate-Correct Score, to systematically evaluate this trade-off. Extensive experiments on CIFAR-10, CIFAR-100, and ImageNet demonstrate that FMDConv reduces the computational cost by up to 49.8\% on ResNet-18 and 42.2\% on ResNet-50 compared to prior multi-attention dynamic convolution methods while maintaining competitive accuracy. These advantages make FMDConv highly suitable for real-world, resource-constrained applications.
中文: FMDConv通过集成多注意力机制优化动态卷积,在ResNet-18上实现高达49.8%的计算成本降低,同时保持竞争力的准确率,特别适合资源受限的实际应用场景。
English: FMDConv introduces a multi-attention mechanism to optimize dynamic convolution, achieving up to 49.8% computational savings on ResNet-18 while maintaining competitive accuracy for resource-constrained applications.

Authors:Cathy Mengying Fang, Auren R. Liu, Valdemar Danry, Eunhae Lee, Samantha W. T. Chan, Pat Pataranutaporn, Pattie Maes, Jason Phang, Michael Lampe, Lama Ahmad, Sandhini Agarwal
Title: How AI and Human Behaviors Shape Psychosocial Effects of Chatbot Use: A Longitudinal Randomized Controlled Study
Abstract:
AI chatbots, especially those with voice capabilities, have become increasingly human-like, with more users seeking emotional support and companionship from them. Concerns are rising about how such interactions might impact users' loneliness and socialization with real people. We conducted a four-week randomized, controlled, IRB-approved experiment (n=981, >300K messages) to investigate how AI chatbot interaction modes (text, neutral voice, and engaging voice) and conversation types (open-ended, non-personal, and personal) influence psychosocial outcomes such as loneliness, social interaction with real people, emotional dependence on AI and problematic AI usage. Results showed that while voice-based chatbots initially appeared beneficial in mitigating loneliness and dependence compared with text-based chatbots, these advantages diminished at high usage levels, especially with a neutral-voice chatbot. Conversation type also shaped outcomes: personal topics slightly increased loneliness but tended to lower emotional dependence compared with open-ended conversations, whereas non-personal topics were associated with greater dependence among heavy users. Overall, higher daily usage - across all modalities and conversation types - correlated with higher loneliness, dependence, and problematic use, and lower socialization. Exploratory analyses revealed that those with stronger emotional attachment tendencies and higher trust in the AI chatbot tended to experience greater loneliness and emotional dependence, respectively. These findings underscore the complex interplay between chatbot design choices (e.g., voice expressiveness) and user behaviors (e.g., conversation content, usage frequency). We highlight the need for further research on whether chatbots' ability to manage emotional content without fostering dependence or replacing human relationships benefits overall well-being.
中文: 尽管不同互动模式和对话类型未产生显著影响,但自愿增加使用聊天机器人的参与者普遍出现更差的心理社会结果,且用户对AI的信任和社交吸引力等特质与更高的情感依赖和问题性使用相关。
English: Despite varied interaction modes and conversation types showing no significant effects, increased voluntary chatbot usage consistently worsened psychosocial outcomes, with user traits like trust and social attraction correlating with higher emotional dependence and problematic use.

Authors:Cathy Mengying Fang, Auren R. Liu, Valdemar Danry, Eunhae Lee, Samantha W. T. Chan, Pat Pataranutaporn, Pattie Maes, Jason Phang, Michael Lampe, Lama Ahmad, Sandhini Agarwal
Title: How AI and Human Behaviors Shape Psychosocial Effects of Extended Chatbot Use: A Longitudinal Randomized Controlled Study
Abstract:
As people increasingly seek emotional support and companionship from AI chatbots, understanding how such interactions impact mental well-being becomes critical. We conducted a four-week randomized controlled experiment (n=981, >300k messages) to investigate how interaction modes (text, neutral voice, and engaging voice) and conversation types (open-ended, non-personal, and personal) influence four psychosocial outcomes: loneliness, social interaction with real people, emotional dependence on AI, and problematic AI usage. No significant effects were detected from experimental conditions, despite conversation analyses revealing differences in AI and human behavioral patterns across the conditions. Instead, participants who voluntarily used the chatbot more, regardless of assigned condition, showed consistently worse outcomes. Individuals' characteristics, such as higher trust and social attraction towards the AI chatbot, are associated with higher emotional dependence and problematic use. These findings raise deeper questions about how artificial companions may reshape the ways people seek, sustain, and substitute human connections.
中文: 尽管不同互动模式和对话类型未产生显著影响,但自愿增加使用聊天机器人的参与者普遍出现更差的心理社会结果,且用户对AI的信任和社交吸引力等特质与更高的情感依赖和问题性使用相关。
English: Despite varied interaction modes and conversation types showing no significant effects, increased voluntary chatbot usage consistently worsened psychosocial outcomes, with user traits like trust and social attraction correlating with higher emotional dependence and problematic use.

Authors:Vittorio Pippi, Fabio Quattrini, Silvia Cascianelli, Alessio Tonioni, Rita Cucchiara
Title: Zero-Shot Styled Text Image Generation, but Make It Autoregressive
Abstract:
Styled Handwritten Text Generation (HTG) has recently received attention from the computer vision and document analysis communities, which have developed several solutions, either GAN- or diffusion-based, that achieved promising results. Nonetheless, these strategies fail to generalize to novel styles and have technical constraints, particularly in terms of maximum output length and training efficiency. To overcome these limitations, in this work, we propose a novel framework for text image generation, dubbed Emuru. Our approach leverages a powerful text image representation model (a variational autoencoder) combined with an autoregressive Transformer. Our approach enables the generation of styled text images conditioned on textual content and style examples, such as specific fonts or handwriting styles. We train our model solely on a diverse, synthetic dataset of English text rendered in over 100,000 typewritten and calligraphy fonts, which gives it the capability to reproduce unseen styles (both fonts and users' handwriting) in zero-shot. To the best of our knowledge, Emuru is the first autoregressive model for HTG, and the first designed specifically for generalization to novel styles. Moreover, our model generates images without background artifacts, which are easier to use for downstream applications. Extensive evaluation on both typewritten and handwritten, any-length text image generation scenarios demonstrates the effectiveness of our approach.
中文:Emuru框架通过变分自编码器与自回归Transformer的结合,解决了风格化手写文本生成的局限性,实现了对未见风格的零样本泛化,并能生成任意长度且无背景伪影的文本图像。
English: The proposed Emuru framework overcomes limitations in styled handwritten text generation by combining a variational autoencoder with an autoregressive Transformer, enabling zero-shot generalization to novel styles and artifact-free image generation for any text length.

Authors:Yuang Feng, Shuyong Gao, Fuzhen Yan, Yicheng Song, Lingyi Hong, Junjie Hu, Wenqiang Zhang
Title: Scoring, Remember, and Reference: Catching Camouflaged Objects in Videos
Abstract:
Video Camouflaged Object Detection (VCOD) aims to segment objects whose appearances closely resemble their surroundings, posing a challenging and emerging task. Existing vision models often struggle in such scenarios due to the indistinguishable appearance of camouflaged objects and the insufficient exploitation of dynamic information in videos. To address these challenges, we propose an end-to-end VCOD framework inspired by human memory-recognition, which leverages historical video information by integrating memory reference frames for camouflaged sequence processing. Specifically, we design a dual-purpose decoder that simultaneously generates predicted masks and scores, enabling reference frame selection based on scores while introducing auxiliary supervision to enhance feature extraction.Furthermore, this study introduces a novel reference-guided multilevel asymmetric attention mechanism, effectively integrating long-term reference information with short-term motion cues for comprehensive feature extraction. By combining these modules, we develop the Scoring, Remember, and Reference (SRR) framework, which efficiently extracts information to locate targets and employs memory guidance to improve subsequent processing. With its optimized module design and effective utilization of video data, our model achieves significant performance improvements, surpassing existing approaches by 10% on benchmark datasets while requiring fewer parameters (54M) and only a single pass through the video. The code will be made publicly available.
中文: 提出的视频伪装目标检测SRR框架采用记忆引导处理和新型注意力机制,有效整合历史视频信息,以更少参数和单次处理实现比现有方法10%的性能提升。
English: The proposed SRR framework for Video Camouflaged Object Detection utilizes memory-guided processing and a novel attention mechanism to effectively integrate historical video information, achieving a 10% performance improvement over existing methods with fewer parameters and single-pass processing.

Authors:Yinhan Zhang, Yue Ma, Bingyuan Wang, Qifeng Chen, Zeyu Wang
Title: Follow-Your-Color: Multi-Instance Sketch Colorization
Abstract:
We present Follow-Your-Color, a diffusion-based framework for multi-instance sketch colorization. The production of multi-instance 2D line art colorization adheres to an industry-standard workflow, which consists of three crucial stages: the design of line art characters, the coloring of individual objects, and the refinement process. The artists are required to repeat the process of coloring each instance one by one, which is inaccurate and inefficient. Meanwhile, current generative methods fail to solve this task due to the challenge of multi-instance pair data collection. To tackle these challenges, we incorporate three technical designs to ensure precise character detail transcription and achieve multi-instance sketch colorization in a single forward pass. Specifically, we first propose the self-play training strategy to address the lack of training data. Then we introduce an instance guider to feed the color of the instance. To achieve accurate color matching, we present fine-grained color matching with edge loss to enhance visual quality. Equipped with the proposed modules, Follow-Your-Color enables automatically transforming sketches into vividly-colored images with accurate consistency and multi-instance control. Experiments on our collected datasets show that our model outperforms existing methods regarding chromatic precision. Specifically, our model critically automates the colorization process with zero manual adjustments, so novice users can produce stylistically consistent artwork by providing reference instances and the original line art. Our code and additional details are available at https://yinhan-zhang.github.io/color.
中文: Follow-Your-Color 是一种基于扩散的框架,通过自博弈训练和实例引导实现多实例线稿的自动上色,确保色彩精确匹配且无需手动调整。
English: Follow-Your-Color is a diffusion-based framework that automates multi-instance sketch colorization in a single pass, using self-play training and instance guidance to ensure precise color matching and eliminate manual adjustments.

Authors:Qiyu Kang, Xuhao Li, Kai Zhao, Wenjun Cui, Yanan Zhao, Weihua Deng, Wee Peng Tay
Title: Efficient Training of Neural Fractional-Order Differential Equation via Adjoint Backpropagation
Abstract:
Fractional-order differential equations (FDEs) enhance traditional differential equations by extending the order of differential operators from integers to real numbers, offering greater flexibility in modeling complex dynamical systems with nonlocal characteristics. Recent progress at the intersection of FDEs and deep learning has catalyzed a new wave of innovative models, demonstrating the potential to address challenges such as graph representation learning. However, training neural FDEs has primarily relied on direct differentiation through forward-pass operations in FDE numerical solvers, leading to increased memory usage and computational complexity, particularly in large-scale applications. To address these challenges, we propose a scalable adjoint backpropagation method for training neural FDEs by solving an augmented FDE backward in time, which substantially reduces memory requirements. This approach provides a practical neural FDE toolbox and holds considerable promise for diverse applications. We demonstrate the effectiveness of our method in several tasks, achieving performance comparable to baseline models while significantly reducing computational overhead.
中文摘要:提出的可扩展伴随反向传播方法通过求解反向增强方程来训练神经分数阶微分方程,在保持与基线模型相当性能的同时显著降低了内存需求。
English Summary: The proposed scalable adjoint backpropagation method trains neural fractional-order differential equations by solving backward augmented equations, substantially reducing memory usage while maintaining performance comparable to baseline models.

Authors:Yiran Qin, Li Kang, Xiufeng Song, Zhenfei Yin, Xiaohong Liu, Xihui Liu, Ruimao Zhang, Lei Bai
Title: RoboFactory: Exploring Embodied Agent Collaboration with Compositional Constraints
Abstract:
Designing effective embodied multi-agent systems is critical for solving complex real-world tasks across domains. Due to the complexity of multi-agent embodied systems, existing methods fail to automatically generate safe and efficient training data for such systems. To this end, we propose the concept of compositional constraints for embodied multi-agent systems, addressing the challenges arising from collaboration among embodied agents. We design various interfaces tailored to different types of constraints, enabling seamless interaction with the physical world. Leveraging compositional constraints and specifically designed interfaces, we develop an automated data collection framework for embodied multi-agent systems and introduce the first benchmark for embodied multi-agent manipulation, RoboFactory. Based on RoboFactory benchmark, we adapt and evaluate the method of imitation learning and analyzed its performance in different difficulty agent tasks. Furthermore, we explore the architectures and training strategies for multi-agent imitation learning, aiming to build safe and efficient embodied multi-agent systems.
中文: 本文提出组合约束和专用接口,为具身多智能体系统实现自动化数据采集,创建了RoboFactory基准,并通过模仿学习方法评估以提升系统的安全性和效率。
English: This paper introduces compositional constraints and specialized interfaces to automate data collection for embodied multi-agent systems, establishing the RoboFactory benchmark and evaluating imitation learning methods to enhance system safety and efficiency.

Authors:Shuqi Lu, Haowei Lin, Lin Yao, Zhifeng Gao, Xiaohong Ji, Yitao Liang, Weinan E, Linfeng Zhang, Guolin Ke
Title: Unified Cross-Scale 3D Generation and Understanding via Autoregressive Modeling
Abstract:
3D structure modeling is essential across scales, enabling applications from fluid simulation and 3D reconstruction to protein folding and molecular docking. Yet, despite shared 3D spatial patterns, current approaches remain fragmented, with models narrowly specialized for specific domains and unable to generalize across tasks or scales. We propose Uni-3DAR, a unified autoregressive framework for cross-scale 3D generation and understanding. At its core is a coarse-to-fine tokenizer based on octree data structures, which compresses diverse 3D structures into compact 1D token sequences. We further propose a two-level subtree compression strategy, which reduces the octree token sequence by up to 8x. To address the challenge of dynamically varying token positions introduced by compression, we introduce a masked next-token prediction strategy that ensures accurate positional modeling, significantly boosting model performance. Extensive experiments across multiple 3D generation and understanding tasks, including small molecules, proteins, polymers, crystals, and macroscopic 3D objects, validate its effectiveness and versatility. Notably, Uni-3DAR surpasses previous state-of-the-art diffusion models by a substantial margin, achieving up to 256\% relative improvement while delivering inference speeds up to 21.8x faster.
中文: Uni-3DAR提出了一种统一的自动回归框架,通过基于八叉树的粗细粒度标记器和压缩策略,实现了跨尺度的三维生成与理解,在多种任务中大幅超越现有方法并显著提升推理速度。
English: Uni-3DAR introduces a unified autoregressive framework that uses a coarse-to-fine octree tokenizer and a novel compression strategy to enable cross-scale 3D generation and understanding, achieving significant performance improvements and faster inference speeds across diverse tasks.

Authors:Ayberk Acar, Mariana Smith, Lidia Al-Zogbi, Tanner Watts, Fangjie Li, Hao Li, Nural Yilmaz, Paul Maria Scheikl, Jesse F. d'Almeida, Susheela Sharma, Lauren Branscombe, Tayfun Efe Ertop, Robert J. Webster, Ipek Oguz, Alan Kuntz, Axel Krieger, Jie Ying Wu
Title: From Monocular Vision to Autonomous Action: Guiding Tumor Resection via 3D Reconstruction
Abstract:
Surgical automation requires precise guidance and understanding of the scene. Current methods in the literature rely on bulky depth cameras to create maps of the anatomy, however this does not translate well to space-limited clinical applications. Monocular cameras are small and allow minimally invasive surgeries in tight spaces but additional processing is required to generate 3D scene understanding. We propose a 3D mapping pipeline that uses only RGB images to create segmented point clouds of the target anatomy. To ensure the most precise reconstruction, we compare different structure from motion algorithms' performance on mapping the central airway obstructions, and test the pipeline on a downstream task of tumor resection. In several metrics, including post-procedure tissue model evaluation, our pipeline performs comparably to RGB-D cameras and, in some cases, even surpasses their performance. These promising results demonstrate that automation guidance can be achieved in minimally invasive procedures with monocular cameras. This study is a step toward the complete autonomy of surgical robots.
Chinese: 本研究提出一种仅使用RGB图像生成目标解剖结构分割点云的3D映射流程,在肿瘤切除等任务中性能媲美甚至超越RGB-D相机,为手术机器人完全自主化迈出了重要一步。
English: The study introduces a 3D mapping pipeline using only RGB images to generate segmented point clouds for surgical automation, achieving comparable or superior performance to RGB-D cameras in tasks like tumor resection and advancing toward autonomous surgical robots.

Authors:Wenjun Cui, Qiyu Kang, Xuhao Li, Kai Zhao, Wee Peng Tay, Weihua Deng, Yidong Li
Title: Neural Variable-Order Fractional Differential Equation Networks
Abstract:
Neural differential equation models have garnered significant attention in recent years for their effectiveness in machine learning applications.Among these, fractional differential equations (FDEs) have emerged as a promising tool due to their ability to capture memory-dependent dynamics, which are often challenging to model with traditional integer-order approaches.While existing models have primarily focused on constant-order fractional derivatives, variable-order fractional operators offer a more flexible and expressive framework for modeling complex memory patterns. In this work, we introduce the Neural Variable-Order Fractional Differential Equation network (NvoFDE), a novel neural network framework that integrates variable-order fractional derivatives with learnable neural networks.Our framework allows for the modeling of adaptive derivative orders dependent on hidden features, capturing more complex feature-updating dynamics and providing enhanced flexibility. We conduct extensive experiments across multiple graph datasets to validate the effectiveness of our approach.Our results demonstrate that NvoFDE outperforms traditional constant-order fractional and integer models across a range of tasks, showcasing its superior adaptability and performance.
中文: 神经变阶分数微分方程(NvoFDE)网络提出了一种创新框架,将变阶分数导数与神经网络相结合,能够自适应建模复杂记忆动态,在图数据任务中展现出优于常阶分数模型和整数模型的性能。
English: The Neural Variable-Order Fractional Differential Equation (NvoFDE) network introduces a novel framework that integrates variable-order fractional derivatives with neural networks, enabling adaptive modeling of complex memory dynamics and demonstrating superior performance over constant-order and integer models in graph-based tasks.

Authors:Haiyang Yu, Siyang Yi, Ke Niu, Minghan Zhuo, Bin Li
Title: UMIT: Unifying Medical Imaging Tasks via Vision-Language Models
Abstract:
With the rapid advancement of deep learning, particularly in the field of medical image analysis, an increasing number of Vision-Language Models (VLMs) are being widely applied to solve complex health and biomedical challenges. However, existing research has primarily focused on specific tasks or single modalities, which limits their applicability and generalization across diverse medical scenarios. To address this challenge, we propose UMIT, a unified multi-modal, multi-task VLM designed specifically for medical imaging tasks. UMIT is able to solve various tasks, including visual question answering, disease detection, and medical report generation. In addition, it is applicable to multiple imaging modalities (e.g., X-ray, CT and PET), covering a wide range of applications from basic diagnostics to complex lesion analysis. Moreover, UMIT supports both English and Chinese, expanding its applicability globally and ensuring accessibility to healthcare services in different linguistic contexts. To enhance the model's adaptability and task-handling capability, we design a unique two-stage training strategy and fine-tune UMIT with designed instruction templates. Through extensive empirical evaluation, UMIT outperforms previous methods in five tasks across multiple datasets. The performance of UMIT indicates that it can significantly enhance diagnostic accuracy and workflow efficiency, thus providing effective solutions for medical imaging applications.
中文: UMIT是一种专为医学成像设计的统一多模态多任务视觉语言模型,通过支持多种成像模态和语言,解决了特定任务模型的局限性,在诊断和工作流效率方面展现出卓越性能。
English: UMIT is a unified multi-modal, multi-task vision-language model designed for medical imaging that addresses limitations of task-specific models by supporting diverse applications across multiple imaging modalities and languages, demonstrating superior performance in diagnostics and workflow efficiency.

Authors:Le Ma, Ziyu Meng, Tengyu Liu, Yuhan Li, Ran Song, Wei Zhang, Siyuan Huang
Title: StyleLoco: Generative Adversarial Distillation for Natural Humanoid Robot Locomotion
Abstract:
Humanoid robots are anticipated to acquire a wide range of locomotion capabilities while ensuring natural movement across varying speeds and terrains. Existing methods encounter a fundamental dilemma in learning humanoid locomotion: reinforcement learning with handcrafted rewards can achieve agile locomotion but produces unnatural gaits, while Generative Adversarial Imitation Learning (GAIL) with motion capture data yields natural movements but suffers from unstable training processes and restricted agility. Integrating these approaches proves challenging due to the inherent heterogeneity between expert policies and human motion datasets. To address this, we introduce StyleLoco, a novel two-stage framework that bridges this gap through a Generative Adversarial Distillation (GAD) process. Our framework begins by training a teacher policy using reinforcement learning to achieve agile and dynamic locomotion. It then employs a multi-discriminator architecture, where distinct discriminators concurrently extract skills from both the teacher policy and motion capture data. This approach effectively combines the agility of reinforcement learning with the natural fluidity of human-like movements while mitigating the instability issues commonly associated with adversarial training. Through extensive simulation and real-world experiments, we demonstrate that StyleLoco enables humanoid robots to perform diverse locomotion tasks with the precision of expertly trained policies and the natural aesthetics of human motion, successfully transferring styles across different movement types while maintaining stable locomotion across a broad spectrum of command inputs.
中文: StyleLoco是一种新颖的两阶段框架,通过生成对抗蒸馏过程将强化学习的敏捷性与运动捕捉数据的自然流畅性相结合,使人形机器人能够执行多样化、稳定且类人的运动任务。
English: StyleLoco is a novel two-stage framework that combines reinforcement learning for agility and motion capture data for natural movement through Generative Adversarial Distillation, enabling humanoid robots to perform diverse, stable, and human-like locomotion tasks.

Authors:Ziyao Wang, Yexiao He, Zheyu Shen, Yu Li, Guoheng Sun, Myungjin Lee, Ang Li
Title: Prada: Black-Box LLM Adaptation with Private Data on Resource-Constrained Devices
Abstract:
In recent years, Large Language Models (LLMs) have demonstrated remarkable abilities in various natural language processing tasks. However, adapting these models to specialized domains using private datasets stored on resource-constrained edge devices, such as smartphones and personal computers, remains challenging due to significant privacy concerns and limited computational resources. Existing model adaptation methods either compromise data privacy by requiring data transmission or jeopardize model privacy by exposing proprietary LLM parameters. To address these challenges, we propose Prada, a novel privacy-preserving and efficient black-box LLM adaptation system using private on-device datasets. Prada employs a lightweight proxy model fine-tuned with Low-Rank Adaptation (LoRA) locally on user devices. During inference, Prada leverages the logits offset, i.e., difference in outputs between the base and adapted proxy models, to iteratively refine outputs from a remote black-box LLM. This offset-based adaptation approach preserves both data privacy and model privacy, as there is no need to share sensitive data or proprietary model parameters. Furthermore, we incorporate speculative decoding to further speed up the inference process of Prada, making the system practically deployable on bandwidth-constrained edge devices, enabling a more practical deployment of Prada. Extensive experiments on various downstream tasks demonstrate that Prada achieves performance comparable to centralized fine-tuning methods while significantly reducing computational overhead by up to 60% and communication costs by up to 80%.
中文: Prada是一种保护隐私的系统,通过本地代理模型和输出偏移优化,在边缘设备上实现高效的黑盒大语言模型适配,其性能媲美集中式方法,同时显著降低了计算和通信开销。
English: Prada is a privacy-preserving system that enables efficient black-box LLM adaptation on edge devices by using local proxy models and logits offset refinement, achieving performance comparable to centralized methods while reducing computational and communication costs significantly.

Authors:NVIDIA, :, Johan Bjorck, Fernando Castañeda, Nikita Cherniadev, Xingye Da, Runyu Ding, Linxi "Jim" Fan, Yu Fang, Dieter Fox, Fengyuan Hu, Spencer Huang, Joel Jang, Zhenyu Jiang, Jan Kautz, Kaushil Kundalia, Lawrence Lao, Zhiqi Li, Zongyu Lin, Kevin Lin, Guilin Liu, Edith Llontop, Loic Magne, Ajay Mandlekar, Avnish Narayan, Soroush Nasiriany, Scott Reed, You Liang Tan, Guanzhi Wang, Zu Wang, Jing Wang, Qi Wang, Jiannan Xiang, Yuqi Xie, Yinzhen Xu, Zhenjia Xu, Seonghyeon Ye, Zhiding Yu, Ao Zhang, Hao Zhang, Yizhou Zhao, Ruijie Zheng, Yuke Zhu
Title: GR00T N1: An Open Foundation Model for Generalist Humanoid Robots
Abstract:
General-purpose robots need a versatile body and an intelligent mind. Recent advancements in humanoid robots have shown great promise as a hardware platform for building generalist autonomy in the human world. A robot foundation model, trained on massive and diverse data sources, is essential for enabling the robots to reason about novel situations, robustly handle real-world variability, and rapidly learn new tasks. To this end, we introduce GR00T N1, an open foundation model for humanoid robots. GR00T N1 is a Vision-Language-Action (VLA) model with a dual-system architecture. The vision-language module (System 2) interprets the environment through vision and language instructions. The subsequent diffusion transformer module (System 1) generates fluid motor actions in real time. Both modules are tightly coupled and jointly trained end-to-end. We train GR00T N1 with a heterogeneous mixture of real-robot trajectories, human videos, and synthetically generated datasets. We show that our generalist robot model GR00T N1 outperforms the state-of-the-art imitation learning baselines on standard simulation benchmarks across multiple robot embodiments. Furthermore, we deploy our model on the Fourier GR-1 humanoid robot for language-conditioned bimanual manipulation tasks, achieving strong performance with high data efficiency.
中文: GR00T N1 是一款面向人形机器人的开放基础模型,采用双系统架构,将视觉语言理解与实时运动生成紧密结合,通过多样化数据集训练,在仿真测试和现实操作任务中均表现出卓越性能。
English: GR00T N1 is an open foundation model for humanoid robots, featuring a dual-system architecture that integrates vision-language interpretation with real-time motor action generation, trained on diverse datasets to excel in simulation benchmarks and real-world manipulation tasks.

Authors:Selim Jerad, Anej Svete, Jiaoda Li, Ryan Cotterell
Title: Unique Hard Attention: A Tale of Two Sides
Abstract:
Understanding the expressive power of transformers has recently attracted attention, as it offers insights into their abilities and limitations. Many studies analyze unique hard attention transformers, where attention selects a single position that maximizes the attention scores. When multiple positions achieve the maximum score, either the rightmost or the leftmost of those is chosen. In this paper, we highlight the importance of this seeming triviality. Recently, finite-precision transformers with both leftmost- and rightmost-hard attention were shown to be equivalent to Linear Temporal Logic (LTL). We show that this no longer holds with only leftmost-hard attention -- in that case, they correspond to a \emph{strictly weaker} fragment of LTL. Furthermore, we show that models with leftmost-hard attention are equivalent to \emph{soft} attention, suggesting they may better approximate real-world transformers than right-attention models. These findings refine the landscape of transformer expressivity and underscore the role of attention directionality.
中文: 本文表明,采用最左硬注意力的变换器在表达能力上弱于最右硬注意力模型,仅对应线性时序逻辑的一个片段且与软注意力等价,从而细化了变换器表达能力的研究并强调了注意力方向性的作用。
English: This paper demonstrates that transformers with leftmost-hard attention are strictly weaker than those with rightmost-hard attention, corresponding to a fragment of Linear Temporal Logic and being equivalent to soft attention, thereby refining the understanding of transformer expressivity and the role of attention directionality.

Authors:Jiang Qin, Senmao Li, Alexandra Gomez-Villa, Shiqi Yang, Yaxing Wang, Kai Wang, Joost van de Weijer
Title: Free-Lunch Color-Texture Disentanglement for Stylized Image Generation
Abstract:
Recent advances in Text-to-Image (T2I) diffusion models have transformed image generation, enabling significant progress in stylized generation using only a few style reference images. However, current diffusion-based methods struggle with fine-grained style customization due to challenges in controlling multiple style attributes, such as color and texture. This paper introduces the first tuning-free approach to achieve free-lunch color-texture disentanglement in stylized T2I generation, addressing the need for independently controlled style elements for the Disentangled Stylized Image Generation (DisIG) problem. Our approach leverages the Image-Prompt Additivity property in the CLIP image embedding space to develop techniques for separating and extracting Color-Texture Embeddings (CTE) from individual color and texture reference images. To ensure that the color palette of the generated image aligns closely with the color reference, we apply a whitening and coloring transformation to enhance color consistency. Additionally, to prevent texture loss due to the signal-leak bias inherent in diffusion training, we introduce a noise term that preserves textural fidelity during the Regularized Whitening and Coloring Transformation (RegWCT). Through these methods, our Style Attributes Disentanglement approach (SADis) delivers a more precise and customizable solution for stylized image generation. Experiments on images from the WikiArt and StyleDrop datasets demonstrate that, both qualitatively and quantitatively, SADis surpasses state-of-the-art stylization methods in the DisIG task.Code will be released at https://deepffff.github.io/sadis.github.io/.
中文: 本文提出无需调优的SADis方法,通过利用CLIP嵌入空间和正则化变换实现风格图像生成中的色彩-纹理解耦,在保持色彩一致性和纹理保真度方面优于现有最先进方法。
English: This paper presents SADis, a tuning-free method that achieves color-texture disentanglement in stylized text-to-image generation by leveraging CLIP embeddings and regularized transformations to independently control style attributes, outperforming existing approaches.

Authors:Kai Guo, Harry Shomer, Shenglai Zeng, Haoyu Han, Yu Wang, Jiliang Tang
Title: Empowering GraphRAG with Knowledge Filtering and Integration
Abstract:
In recent years, large language models (LLMs) have revolutionized the field of natural language processing. However, they often suffer from knowledge gaps and hallucinations. Graph retrieval-augmented generation (GraphRAG) enhances LLM reasoning by integrating structured knowledge from external graphs. However, we identify two key challenges that plague GraphRAG:(1) Retrieving noisy and irrelevant information can degrade performance and (2)Excessive reliance on external knowledge suppresses the model's intrinsic reasoning. To address these issues, we propose GraphRAG-FI (Filtering and Integration), consisting of GraphRAG-Filtering and GraphRAG-Integration. GraphRAG-Filtering employs a two-stage filtering mechanism to refine retrieved information. GraphRAG-Integration employs a logits-based selection strategy to balance external knowledge from GraphRAG with the LLM's intrinsic reasoning,reducing over-reliance on retrievals. Experiments on knowledge graph QA tasks demonstrate that GraphRAG-FI significantly improves reasoning performance across multiple backbone models, establishing a more reliable and effective GraphRAG framework.
中文摘要:GraphRAG-FI通过过滤图检索中的无关信息和平衡外部知识与模型内在推理,有效提升大语言模型在知识图谱问答任务中的性能表现。
English Summary: GraphRAG-FI enhances large language models by filtering irrelevant graph-retrieved information and balancing external knowledge with the model's intrinsic reasoning, significantly improving performance on knowledge graph QA tasks.

Authors:Lan Chen, Qi Mao, Yuchao Gu, Mike Zheng Shou
Title: Edit Transfer: Learning Image Editing via Vision In-Context Relations
Abstract:
We introduce a new setting, Edit Transfer, where a model learns a transformation from just a single source-target example and applies it to a new query image. While text-based methods excel at semantic manipulations through textual prompts, they often struggle with precise geometric details (e.g., poses and viewpoint changes). Reference-based editing, on the other hand, typically focuses on style or appearance and fails at non-rigid transformations. By explicitly learning the editing transformation from a source-target pair, Edit Transfer mitigates the limitations of both text-only and appearance-centric references. Drawing inspiration from in-context learning in large language models, we propose a visual relation in-context learning paradigm, building upon a DiT-based text-to-image model. We arrange the edited example and the query image into a unified four-panel composite, then apply lightweight LoRA fine-tuning to capture complex spatial transformations from minimal examples. Despite using only 42 training samples, Edit Transfer substantially outperforms state-of-the-art TIE and RIE methods on diverse non-rigid scenarios, demonstrating the effectiveness of few-shot visual relation learning.
Chinese: Edit Transfer 提出了一种视觉上下文学习方法,通过单个源-目标示例学习复杂的空间变换,在少量训练数据下显著超越了现有方法在非刚性图像编辑中的表现。
English: Edit Transfer introduces a visual in-context learning approach that learns complex spatial transformations from a single source-target example, significantly outperforming existing methods in non-rigid image editing with minimal training data.

Authors:Matteo Esposito, Xiaozhou Li, Sergio Moreschini, Noman Ahmad, Tomas Cerny, Karthik Vaidhyanathan, Valentina Lenarduzzi, Davide Taibi
Title: Generative AI for Software Architecture. Applications, Challenges, and Future Directions
Abstract:
Context: Generative Artificial Intelligence (GenAI) is transforming much of software development, yet its application in software architecture is still in its infancy, and no prior study has systematically addressed the topic. Aim: We aim to systematically synthesize the use, rationale, contexts, usability, and future challenges of GenAI in software architecture. Method: We performed a multivocal literature review (MLR), analyzing peer-reviewed and gray literature, identifying current practices, models, adoption contexts, and reported challenges, extracting themes via open coding. Results: Our review identified significant adoption of GenAI for architectural decision support and architectural reconstruction. OpenAI GPT models are predominantly applied, and there is consistent use of techniques such as few-shot prompting and retrieved-augmented generation (RAG). GenAI has been applied mostly to initial stages of the Software Development Life Cycle (SDLC), such as Requirements-to-Architecture and Architecture-to-Code. Monolithic and microservice architectures were the dominant targets. However, rigorous testing of GenAI outputs was typically missing from the studies. Among the most frequent challenges are model precision, hallucinations, ethical aspects, privacy issues, lack of architecture-specific datasets, and the absence of sound evaluation frameworks. Conclusions: GenAI shows significant potential in software design, but several challenges remain on its path to greater adoption. Research efforts should target designing general evaluation methodologies, handling ethics and precision, increasing transparency and explainability, and promoting architecture-specific datasets and benchmarks to bridge the gap between theoretical possibilities and practical use.
中文: 生成式AI在软件架构中主要用于决策支持和重构,多应用于开发早期阶段,但仍面临模型精度和伦理等挑战,需进一步研究解决。
English: Generative AI is increasingly used in software architecture for decision support and reconstruction, primarily in early development stages, yet faces challenges like model precision and ethical concerns that require further research.

Authors:Zi Li, Jianpeng Zhang, Tai Ma, Tony C. W. Mok, Yan-Jie Zhou, Zeli Chen, Xianghua Ye, Le Lu, Dakai Jin
Title: UniReg: Foundation Model for Controllable Medical Image Registration
Abstract:
Learning-based medical image registration has achieved performance parity with conventional methods while demonstrating a substantial advantage in computational efficiency. However, learning-based registration approaches lack generalizability across diverse clinical scenarios, requiring the laborious development of multiple isolated networks for specific registration tasks, e.g., inter-/intra-subject registration or organ-specific alignment. % To overcome this limitation, we propose \textbf{UniReg}, the first interactive foundation model for medical image registration, which combines the precision advantages of task-specific learning methods with the generalization of traditional optimization methods. Our key innovation is a unified framework for diverse registration scenarios, achieved through a conditional deformation field estimation within a unified registration model. This is realized through a dynamic learning paradigm that explicitly encodes: (1) anatomical structure priors, (2) registration type constraints (inter/intra-subject), and (3) instance-specific features, enabling the generation of scenario-optimal deformation fields. % Through comprehensive experiments encompassing $90$ anatomical structures at different body regions, our UniReg model demonstrates comparable performance with contemporary state-of-the-art methodologies while achieving ~50\% reduction in required training iterations relative to the conventional learning-based paradigm. This optimization contributes to a significant reduction in computational resources, such as training time. Code and model will be available.
Chinese Summary: UniReg是首个用于医学图像配准的交互式基础模型,它将任务特定方法的精度优势与传统优化方法的泛化能力相结合,在保持与先进方法相当性能的同时,将所需训练迭代次数减少约50%。
English Summary: UniReg is an interactive foundation model for medical image registration that combines the precision of task-specific methods with the generalization of traditional approaches, achieving comparable performance to state-of-the-art methods while reducing training iterations by 50%.

Authors:Yuzheng Hu, Fan Wu, Ruicheng Xian, Yuhang Liu, Lydia Zakynthinou, Pritish Kamath, Chiyuan Zhang, David Forsyth
Title: Empirical Privacy Variance
Abstract:
We propose the notion of empirical privacy variance and study it in the context of differentially private fine-tuning of language models. Specifically, we show that models calibrated to the same $(\varepsilon, δ)$-DP guarantee using DP-SGD with different hyperparameter configurations can exhibit significant variations in empirical privacy, which we quantify through the lens of memorization. We investigate the generality of this phenomenon across multiple dimensions and discuss why it is surprising and relevant. Through regression analysis, we examine how individual and composite hyperparameters influence empirical privacy. The results reveal a no-free-lunch trade-off: existing practices of hyperparameter tuning in DP-SGD, which focus on optimizing utility under a fixed privacy budget, often come at the expense of empirical privacy. To address this, we propose refined heuristics for hyperparameter selection that explicitly account for empirical privacy, showing that they are both precise and practically useful. Finally, we take preliminary steps to understand empirical privacy variance. We propose two hypotheses, identify limitations in existing techniques like privacy auditing, and outline open questions for future research.
中文: 我们提出经验隐私方差的概念,揭示具有相同差分隐私保证的模型会因DP-SGD超参数配置不同而产生实际隐私差异,并提出了新的超参数选择启发式方法来解决这一权衡问题。
English: We introduce empirical privacy variance to reveal that differentially private models with identical formal guarantees can exhibit significant differences in actual privacy due to hyperparameter choices in DP-SGD, proposing new selection heuristics to address this trade-off.

Authors:Xiangyu Yin, Jiaxu Liu, Zhen Chen, Jinwei Hu, Yi Dong, Xiaowei Huang, Wenjie Ruan
Title: CeTAD: Towards Certified Toxicity-Aware Distance in Vision Language Models
Abstract:
Recent advances in large vision-language models (VLMs) have demonstrated remarkable success across a wide range of visual understanding tasks. However, the robustness of these models against jailbreak attacks remains an open challenge. In this work, we propose a universal certified defence framework to safeguard VLMs rigorously against potential visual jailbreak attacks. First, we proposed a novel distance metric to quantify semantic discrepancies between malicious and intended responses, capturing subtle differences often overlooked by conventional cosine similarity-based measures. Then, we devise a regressed certification approach that employs randomized smoothing to provide formal robustness guarantees against both adversarial and structural perturbations, even under black-box settings. Complementing this, our feature-space defence introduces noise distributions (e.g., Gaussian, Laplacian) into the latent embeddings to safeguard against both pixel-level and structure-level perturbations. Our results highlight the potential of a formally grounded, integrated strategy toward building more resilient and trustworthy VLMs.
Chinese: 本研究提出了一种通用认证防御框架,通过创新的语义距离度量和结合特征空间噪声注入的回归认证方法,有效提升大型视觉语言模型抵御视觉越狱攻击的能力。
English: This study introduces a universal certified defense framework that enhances the robustness of large vision-language models against visual jailbreak attacks through a novel semantic distance metric and a regressed certification approach with feature-space noise injection.

Authors:Shuqi Lu, Xiaohong Ji, Bohang Zhang, Lin Yao, Siyuan Liu, Zhifeng Gao, Linfeng Zhang, Guolin Ke
Title: Beyond Atoms: Enhancing Molecular Pretrained Representations with 3D Space Modeling
Abstract:
Molecular pretrained representations (MPR) has emerged as a powerful approach for addressing the challenge of limited supervised data in applications such as drug discovery and material design. While early MPR methods relied on 1D sequences and 2D graphs, recent advancements have incorporated 3D conformational information to capture rich atomic interactions. However, these prior models treat molecules merely as discrete atom sets, overlooking the space surrounding them. We argue from a physical perspective that only modeling these discrete points is insufficient. We first present a simple yet insightful observation: naively adding randomly sampled virtual points beyond atoms can surprisingly enhance MPR performance. In light of this, we propose a principled framework that incorporates the entire 3D space spanned by molecules. We implement the framework via a novel Transformer-based architecture, dubbed SpaceFormer, with three key components: (1) grid-based space discretization; (2) grid sampling/merging; and (3) efficient 3D positional encoding. Extensive experiments show that SpaceFormer significantly outperforms previous 3D MPR models across various downstream tasks with limited data, validating the benefit of leveraging the additional 3D space beyond atoms in MPR models.
中文:SpaceFormer提出了一种基于Transformer的创新框架,通过整合分子的完整三维空间并采用网格离散化和高效位置编码技术,显著超越了现有模型,有效提升了分子预训练表征的性能。
English: SpaceFormer introduces a novel Transformer-based framework that incorporates the entire 3D molecular space, significantly outperforming prior models by leveraging grid discretization and efficient positional encoding to enhance molecular pretrained representations.

Authors:Héctor Laria, Alexandra Gomez-Villa, Jiang Qin, Muhammad Atif Butt, Bogdan Raducanu, Javier Vazquez-Corral, Joost van de Weijer, Kai Wang
Title: Leveraging Semantic Attribute Binding for Free-Lunch Color Control in Diffusion Models
Abstract:
Recent advances in text-to-image (T2I) diffusion models have enabled remarkable control over various attributes, yet precise color specification remains a fundamental challenge. Existing approaches, such as ColorPeel, rely on model personalization, requiring additional optimization and limiting flexibility in specifying arbitrary colors. In this work, we introduce ColorWave, a novel training-free approach that achieves exact RGB-level color control in diffusion models without fine-tuning. By systematically analyzing the cross-attention mechanisms within IP-Adapter, we uncover an implicit binding between textual color descriptors and reference image features. Leveraging this insight, our method rewires these bindings to enforce precise color attribution while preserving the generative capabilities of pretrained models. Our approach maintains generation quality and diversity, outperforming prior methods in accuracy and applicability across diverse object categories. Through extensive evaluations, we demonstrate that ColorWave establishes a new paradigm for structured, color-consistent diffusion-based image synthesis.
中文摘要:ColorWave是一种无需训练的新方法,通过重构交叉注意力机制在扩散模型中实现精确的RGB色彩控制,无需微调即可超越现有方法的准确性和适用性。
English Summary: ColorWave is a training-free method that enables precise RGB color control in diffusion models by rewiring cross-attention mechanisms, outperforming previous approaches in accuracy and versatility without requiring fine-tuning.

Authors:Stephen Meisenbacher, Alexandra Klymenko, Alexander Karpp, Florian Matthes
Title: Investigating User Perspectives on Differentially Private Text Privatization
Abstract:
Recent literature has seen a considerable uptick in $\textit{Differentially Private Natural Language Processing}$ (DP NLP). This includes DP text privatization, where potentially sensitive input texts are transformed under DP to achieve privatized output texts that ideally mask sensitive information $\textit{and}$ maintain original semantics. Despite continued work to address the open challenges in DP text privatization, there remains a scarcity of work addressing user perceptions of this technology, a crucial aspect which serves as the final barrier to practical adoption. In this work, we conduct a survey study with 721 laypersons around the globe, investigating how the factors of $\textit{scenario}$, $\textit{data sensitivity}$, $\textit{mechanism type}$, and $\textit{reason for data collection}$ impact user preferences for text privatization. We learn that while all these factors play a role in influencing privacy decisions, users are highly sensitive to the utility and coherence of the private output texts. Our findings highlight the socio-technical factors that must be considered in the study of DP NLP, opening the door to further user-based investigations going forward.
Chinese: 近期差分隐私自然语言处理研究显著增多,但用户对该技术的认知仍待探索;一项针对全球721名参与者的调查显示,尽管场景和数据敏感性等因素影响隐私偏好,用户最关注的是隐私化文本输出的实用性和连贯性。
English: Recent research in differentially private natural language processing has increased, yet user perceptions of this technology remain underexplored; a global survey of 721 participants reveals that while factors like scenario and data sensitivity influence privacy preferences, users prioritize the utility and coherence of privatized text outputs.

Authors:Haoyu Zhang, Qiaohui Chu, Meng Liu, Yunxiao Wang, Bin Wen, Fan Yang, Tingting Gao, Di Zhang, Yaowei Wang, Liqiang Nie
Title: Exo2Ego: Exocentric Knowledge Guided MLLM for Egocentric Video Understanding
Abstract:
AI personal assistants, deployed through robots or wearables, require embodied understanding to collaborate effectively with humans. Current Multimodal Large Language Models (MLLMs) primarily focus on third-person (exocentric) vision, overlooking the unique aspects of first-person (egocentric) videos. Additionally, high acquisition costs limit data size, impairing MLLM performance. To address these challenges, we propose learning the mapping between exocentric and egocentric domains, leveraging the extensive exocentric knowledge within existing MLLMs to enhance egocentric video understanding. To this end, we introduce Ego-ExoClip, a pre-training dataset comprising 1.1M synchronized ego-exo clip-text pairs derived from Ego-Exo4D. Our approach features a progressive training pipeline with three stages: Teacher Self-Preparation, Teacher-Student Guidance, and Student Self-Practice. Additionally, we propose an instruction-tuning data EgoIT from multiple sources to strengthen the model's instruction-following capabilities, along with the EgoBench benchmark comprising eight different tasks for thorough evaluation. Extensive experiments across diverse egocentric tasks reveal that existing MLLMs perform inadequately in egocentric video understanding, while our model significantly outperforms these leading models.
Chinese: 为提升AI个人助理的具身理解能力,我们提出通过Ego-ExoClip数据集和渐进式训练流程将外中心视觉知识迁移至自我中心视觉的方法,在多项自我中心任务中显著优于现有模型。
English: To enhance AI personal assistants' embodied understanding, we propose a method that transfers knowledge from exocentric to egocentric vision using our Ego-ExoClip dataset and progressive training pipeline, significantly outperforming existing models in egocentric tasks.

Authors:Xin Peng, Chong Wang, Mingwei Liu, Yiling Lou, Yijian Wu
Title: Code Digital Twin: Empowering LLMs with Tacit Knowledge for Complex Software Maintenance
Abstract:
While large language models (LLMs) have demonstrated promise in software engineering tasks like code completion and generation, their support for the maintenance of complex software systems remains limited. These models often struggle with understanding the tacit knowledge embedded in systems, such as responsibility allocation and collaboration across different modules. To address this gap, we introduce the concept and framework of \textbf{Code Digital Twin}, a conceptual representation of tacit knowledge that captures the concepts, functionalities, and design rationales behind code elements, co-evolving with the software. A code digital twin is constructed using a methodology that combines knowledge extraction from both structured and unstructured sources--such as source code, documentation, and change histories--leveraging LLMs, static analysis tools, and human expertise. This framework can empower LLMs for software maintenance tasks such as issue localization and repository-level code generation by providing tacit knowledge as contexts. Based on the proposed methodology, we explore the key challenges and opportunities involved in the continuous construction and refinement of code digital twin.
中文: 大语言模型在维护复杂软件系统时因难以理解隐性知识而受限,为此提出代码数字孪生框架,通过结合自动化提取与人工输入来捕捉此类知识,从而提升问题定位和代码生成等维护任务的能力。
English: Large language models face limitations in maintaining complex software systems due to their inability to grasp tacit knowledge, prompting the introduction of a Code Digital Twin framework that captures such knowledge through a blend of automated extraction and human input to enhance maintenance tasks like issue localization and code generation.

Authors:Michael-Andrei Panaitescu-Liess, Pankayaraj Pathmanathan, Yigitcan Kaya, Zora Che, Bang An, Sicheng Zhu, Aakriti Agrawal, Furong Huang
Title: PoisonedParrot: Subtle Data Poisoning Attacks to Elicit Copyright-Infringing Content from Large Language Models
Abstract:
As the capabilities of large language models (LLMs) continue to expand, their usage has become increasingly prevalent. However, as reflected in numerous ongoing lawsuits regarding LLM-generated content, addressing copyright infringement remains a significant challenge. In this paper, we introduce PoisonedParrot: the first stealthy data poisoning attack that induces an LLM to generate copyrighted content even when the model has not been directly trained on the specific copyrighted material. PoisonedParrot integrates small fragments of copyrighted text into the poison samples using an off-the-shelf LLM. Despite its simplicity, evaluated in a wide range of experiments, PoisonedParrot is surprisingly effective at priming the model to generate copyrighted content with no discernible side effects. Moreover, we discover that existing defenses are largely ineffective against our attack. Finally, we make the first attempt at mitigating copyright-infringement poisoning attacks by proposing a defense: ParrotTrap. We encourage the community to explore this emerging threat model further.
中文: 本文提出PoisonedParrot,这是一种隐蔽的数据投毒攻击,能在大型语言模型未直接训练特定版权材料的情况下诱导其生成侵权内容,而现有防御方法对此攻击基本无效。
English: This paper introduces PoisonedParrot, a stealthy data poisoning attack that induces large language models to generate copyrighted content without direct training on such material, while existing defenses prove ineffective against it.

Authors:Zhiyu He, Saverio Bolognani, Florian Dörfler, Michael Muehlebach
Title: Decision-Dependent Stochastic Optimization: The Role of Distribution Dynamics
Abstract:
Distribution shifts have long been regarded as troublesome external forces that a decision-maker should either counteract or conform to. An intriguing feedback phenomenon termed decision dependence arises when the deployed decision affects the environment and alters the data-generating distribution. In the realm of performative prediction, this is encoded by distribution maps parameterized by decisions due to strategic behaviors. In contrast, we formalize an endogenous distribution shift as a feedback process featuring nonlinear dynamics that couple the evolving distribution with the decision. Stochastic optimization in this dynamic regime provides a fertile ground to examine the various roles played by dynamics in the composite problem structure. To this end, we develop an online algorithm that achieves optimal decision-making by both adapting to and shaping the dynamic distribution. Throughout the paper, we adopt a distributional perspective and demonstrate how this view facilitates characterizations of distribution dynamics and the optimality and generalization performance of the proposed algorithm. We showcase the theoretical results in an opinion dynamics context, where an opportunistic party maximizes the affinity of a dynamic polarized population, and in a recommender system scenario, featuring performance optimization with discrete distributions in the probability simplex.
中文摘要:本文将内生分布偏移定义为决策与数据分布之间的动态反馈过程,提出一种在线算法,既能适应又能塑造这些不断演变的分布,从而实现最优决策。
English Summary: The paper introduces endogenous distribution shifts as a dynamic feedback process between decisions and data distributions, proposing an online algorithm that optimally adapts to and shapes these evolving distributions.

Authors:Satyabrata Jana, Lawqueen Kanesh, Madhumita Kundu, Daniel Lokshtanov, Saket Saurabh
Title: A Quadratic Vertex Kernel and a Subexponential Algorithm for Subset-FAST
Abstract:
In the Subset Feedback Arc Set in Tournaments, Subset-FAST problem we are given as input a tournament $T$ with a vertex set $V(T)$ and an arc set $A(T)$, along with a terminal set $S \subseteq V(T)$, and an integer $ k$. The objective is to determine whether there exists a set $ F \subseteq A(T) $ with $|F| \leq k$ such that the resulting graph $T-F $ contains no cycle that includes any vertex of $S$. When $S=V(T)$ this is the classic Feedback Arc Set in Tournaments (FAST) problem. We obtain the first polynomial kernel for this problem parameterized by the solution size. More precisely, we obtain an algorithm that, given an input instance $(T, S, k)$, produces an equivalent instance $(T',S',k')$ with $k'\leq k$ and $V(T')=O(k^2)$. It was known that FAST admits a simple quadratic vertex kernel and a non-trivial linear vertex kernel. However, no such kernel was previously known for Subset-FAST. Our kernel employs variants of the most well-known reduction rules for FAST and introduces two new reduction rules to identify irrelevant vertices. As a result of our kernelization, we also obtain the first sub-exponential time FPT algorithm for Subset-FAST.
Chinese: 该研究首次为锦标赛中的子集反馈弧集问题提供了多项式核,实现了O(k²)的核规模,并通过引入新的归约规则来识别无关顶点。
English: The study presents the first polynomial kernel for the Subset Feedback Arc Set in Tournaments problem, achieving a kernel size of O(k²) and introducing novel reduction rules to eliminate irrelevant vertices.

Authors:Hanyu Zhou, Haonan Wang, Haoyue Liu, Yuxing Duan, Yi Chang, Luxin Yan
Title: Bridge Frame and Event: Common Spatiotemporal Fusion for High-Dynamic Scene Optical Flow
Abstract:
High-dynamic scene optical flow is a challenging task, which suffers spatial blur and temporal discontinuous motion due to large displacement in frame imaging, thus deteriorating the spatiotemporal feature of optical flow. Typically, existing methods mainly introduce event camera to directly fuse the spatiotemporal features between the two modalities. However, this direct fusion is ineffective, since there exists a large gap due to the heterogeneous data representation between frame and event modalities. To address this issue, we explore a common-latent space as an intermediate bridge to mitigate the modality gap. In this work, we propose a novel common spatiotemporal fusion between frame and event modalities for high-dynamic scene optical flow, including visual boundary localization and motion correlation fusion. Specifically, in visual boundary localization, we figure out that frame and event share the similar spatiotemporal gradients, whose similarity distribution is consistent with the extracted boundary distribution. This motivates us to design the common spatiotemporal gradient to constrain the reference boundary localization. In motion correlation fusion, we discover that the frame-based motion possesses spatially dense but temporally discontinuous correlation, while the event-based motion has spatially sparse but temporally continuous correlation. This inspires us to use the reference boundary to guide the complementary motion knowledge fusion between the two modalities. Moreover, common spatiotemporal fusion can not only relieve the cross-modal feature discrepancy, but also make the fusion process interpretable for dense and continuous optical flow. Extensive experiments have been performed to verify the superiority of the proposed method.
中文摘要:本文提出了一种新颖的时空融合方法,通过视觉边界定位和运动相关性融合来有效弥合帧与事件数据之间的模态差异,实现高动态场景光流的精确估计。
English Summary: This paper introduces a novel common spatiotemporal fusion method using visual boundary localization and motion correlation to effectively bridge the modality gap between frame and event data for high-dynamic scene optical flow estimation.

Authors:Tingyang Chen, Cong Fu, Kun Wang, Xiangyu Ke, Yunjun Gao, Wenchao Zhou, Yabo Ni, Anxiang Zeng
Title: Maximum Inner Product is Query-Scaled Nearest Neighbor
Abstract:
Maximum Inner Product Search (MIPS) for high-dimensional vectors is pivotal across databases, information retrieval, and artificial intelligence. Existing methods either reduce MIPS to Nearest Neighbor Search (NNS) while suffering from harmful vector space transformations, or attempt to tackle MIPS directly but struggle to mitigate redundant computations due to the absence of the triangle inequality. This paper presents a novel theoretical framework that equates MIPS with NNS without requiring space transformation, thereby allowing us to leverage advanced graph-based indices for NNS and efficient edge pruning strategies, significantly reducing unnecessary computations. Despite a strong baseline set by our theoretical analysis, we identify and address two persistent challenges to further refine our method: the introduction of the Proximity Graph with Spherical Pathway (PSP), designed to mitigate the issue of MIPS solutions clustering around large-norm vectors, and the implementation of Adaptive Early Termination (AET), which efficiently curtails the excessive exploration once an accuracy bottleneck is reached. Extensive experiments reveal the superiority of our method over existing state-of-the-art techniques in search efficiency, scalability, and practical applicability. Compared with state-of-the-art graph based methods, it achieves an average 35% speed-up in query processing and a 3x reduction in index size. Notably, our approach has been validated and deployed in the search engines of Shopee, a well-known online shopping platform. Our code and an industrial-scale dataset for offline evaluation will also be released to address the absence of e-commerce data in public benchmarks.
本文提出了一种新颖框架,通过图索引和剪枝策略将最大内积搜索直接转化为最近邻搜索,在显著提升搜索效率和减小索引规模的同时,已在电商平台的实际应用中验证了其优越性。
This paper introduces a novel framework that directly equates Maximum Inner Product Search with Nearest Neighbor Search using graph-based indices and specialized pruning strategies, achieving significant improvements in search efficiency and index size while being validated in real-world e-commerce applications.

Authors:Yuzheng Wang, Zhaoyu Chen, Dingkang Yang, Yuanhang Wang, Lizhe Qi
Title: MMARD: Improving the Min-Max Optimization Process in Adversarial Robustness Distillation
Abstract:
Adversarial Robustness Distillation (ARD) is a promising task to boost the robustness of small-capacity models with the guidance of the pre-trained robust teacher. The ARD can be summarized as a min-max optimization process, i.e., synthesizing adversarial examples (inner) & training the student (outer). Although competitive robustness performance, existing ARD methods still have issues. In the inner process, the synthetic training examples are far from the teacher's decision boundary leading to important robust information missing. In the outer process, the student model is decoupled from learning natural and robust scenarios, leading to the robustness saturation, i.e., student performance is highly susceptible to customized teacher selection. To tackle these issues, this paper proposes a general Min-Max optimization Adversarial Robustness Distillation (MMARD) method. For the inner process, we introduce the teacher's robust predictions, which drive the training examples closer to the teacher's decision boundary to explore more robust knowledge. For the outer process, we propose a structured information modeling method based on triangular relationships to measure the mutual information of the model in natural and robust scenarios and enhance the model's ability to understand multi-scenario mapping relationships. Experiments show our MMARD achieves state-of-the-art performance on multiple benchmarks. Besides, MMARD is plug-and-play and convenient to combine with existing methods.
中文: 本文提出MMARD方法,通过生成接近教师模型决策边界的对抗样本,并利用三角互信息建模多场景映射关系,有效提升学生模型的对抗鲁棒性,在多个基准测试中达到最优性能。
English: This paper introduces MMARD, a novel adversarial robustness distillation method that enhances student models by generating boundary-proximal adversarial examples and modeling multi-scenario relationships through triangular mutual information, achieving state-of-the-art performance across benchmarks.

Authors:Zhangchi Qiu, Linhao Luo, Zicheng Zhao, Shirui Pan, Alan Wee-Chung Liew
Title: Graph Retrieval-Augmented LLM for Conversational Recommendation Systems
Abstract:
Conversational Recommender Systems (CRSs) have emerged as a transformative paradigm for offering personalized recommendations through natural language dialogue. However, they face challenges with knowledge sparsity, as users often provide brief, incomplete preference statements. While recent methods have integrated external knowledge sources to mitigate this, they still struggle with semantic understanding and complex preference reasoning. Recent Large Language Models (LLMs) demonstrate promising capabilities in natural language understanding and reasoning, showing significant potential for CRSs. Nevertheless, due to the lack of domain knowledge, existing LLM-based CRSs either produce hallucinated recommendations or demand expensive domain-specific training, which largely limits their applicability. In this work, we present G-CRS (Graph Retrieval-Augmented Large Language Model for Conversational Recommender Systems), a novel training-free framework that combines graph retrieval-augmented generation and in-context learning to enhance LLMs' recommendation capabilities. Specifically, G-CRS employs a two-stage retrieve-and-recommend architecture, where a GNN-based graph reasoner first identifies candidate items, followed by Personalized PageRank exploration to jointly discover potential items and similar user interactions. These retrieved contexts are then transformed into structured prompts for LLM reasoning, enabling contextually grounded recommendations without task-specific training. Extensive experiments on two public datasets show that G-CRS achieves superior recommendation performance compared to existing methods without requiring task-specific training.
中文摘要:G-CRS是一个免训练框架,通过结合图检索增强生成与上下文学习来提升对话推荐系统性能,无需领域特定训练即可实现优越的推荐效果。
English Summary: G-CRS is a training-free framework that enhances conversational recommender systems by combining graph retrieval-augmented generation with in-context learning, achieving superior performance without domain-specific training.

Authors:Kibum Kim, Sein Kim, Hongseok Kang, Jiwan Kim, Heewoong Noh, Yeonjun In, Kanghoon Yoon, Jinoh Oh, Chanyoung Park
Title: Image is All You Need: Towards Efficient and Effective Large Language Model-Based Recommender Systems
Abstract:
Large Language Models (LLMs) have recently emerged as a powerful backbone for recommender systems. Existing LLM-based recommender systems take two different approaches for representing items in natural language, i.e., Attribute-based Representation and Description-based Representation. In this work, we aim to address the trade-off between efficiency and effectiveness that these two approaches encounter, when representing items consumed by users. Based on our interesting observation that there is a significant information overlap between images and descriptions associated with items, we propose a novel method, Image is all you need for LLM-based Recommender system (I-LLMRec). Our main idea is to leverage images as an alternative to lengthy textual descriptions for representing items, aiming at reducing token usage while preserving the rich semantic information of item descriptions. Through extensive experiments, we demonstrate that I-LLMRec outperforms existing methods in both efficiency and effectiveness by leveraging images. Moreover, a further appeal of I-LLMRec is its ability to reduce sensitivity to noise in descriptions, leading to more robust recommendations.
中文: 本研究提出I-LLMRec方法,通过使用图像替代冗长文本描述来表示基于大语言模型的推荐系统中的物品,在提升效率与效果的同时降低了对噪声描述的敏感性。
English: This study introduces I-LLMRec, a novel method that utilizes images instead of lengthy text descriptions to represent items in LLM-based recommender systems, enhancing both efficiency and effectiveness while reducing sensitivity to noisy descriptions.

Authors:Jian Shen, Huai Yu, Ji Wu, Wen Yang, Gui-Song Xia
Title: LiDAR-enhanced 3D Gaussian Splatting Mapping
Abstract:
This paper introduces LiGSM, a novel LiDAR-enhanced 3D Gaussian Splatting (3DGS) mapping framework that improves the accuracy and robustness of 3D scene mapping by integrating LiDAR data. LiGSM constructs joint loss from images and LiDAR point clouds to estimate the poses and optimize their extrinsic parameters, enabling dynamic adaptation to variations in sensor alignment. Furthermore, it leverages LiDAR point clouds to initialize 3DGS, providing a denser and more reliable starting points compared to sparse SfM points. In scene rendering, the framework augments standard image-based supervision with depth maps generated from LiDAR projections, ensuring an accurate scene representation in both geometry and photometry. Experiments on public and self-collected datasets demonstrate that LiGSM outperforms comparative methods in pose tracking and scene rendering.
中文: LiGSM是一种激光雷达增强的3D高斯溅射框架,通过融合激光雷达数据、联合损失优化和深度增强渲染,显著提升了三维场景映射的精度与鲁棒性,在姿态跟踪和场景渲染方面优于现有方法。
English: LiGSM is a LiDAR-enhanced 3D Gaussian Splatting framework that integrates LiDAR data to improve mapping accuracy and robustness through joint loss optimization and depth-augmented rendering, outperforming existing methods in pose tracking and scene rendering.

Authors:Johanna P. Müller, Robert Wright, Thomas G. Day, Lorenzo Venturini, Samuel F. Budd, Hadrien Reynaud, Joseph V. Hajnal, Reza Razavi, Bernhard Kainz
Title: L-FUSION: Laplacian Fetal Ultrasound Segmentation & Uncertainty Estimation
Abstract:
Accurate analysis of prenatal ultrasound (US) is essential for early detection of developmental anomalies. However, operator dependency and technical limitations (e.g. intrinsic artefacts and effects, setting errors) can complicate image interpretation and the assessment of diagnostic uncertainty. We present L-FUSION (Laplacian Fetal US Segmentation with Integrated FoundatiON models), a framework that integrates uncertainty quantification through unsupervised, normative learning and large-scale foundation models for robust segmentation of fetal structures in normal and pathological scans. We propose to utilise the aleatoric logit distributions of Stochastic Segmentation Networks and Laplace approximations with fast Hessian estimations to estimate epistemic uncertainty only from the segmentation head. This enables us to achieve reliable abnormality quantification for instant diagnostic feedback. Combined with an integrated Dropout component, L-FUSION enables reliable differentiation of lesions from normal fetal anatomy with enhanced uncertainty maps and segmentation counterfactuals in US imaging. It improves epistemic and aleatoric uncertainty interpretation and removes the need for manual disease-labelling. Evaluations across multiple datasets show that L-FUSION achieves superior segmentation accuracy and consistent uncertainty quantification, supporting on-site decision-making and offering a scalable solution for advancing fetal ultrasound analysis in clinical settings.
中文: L-FUSION框架通过整合不确定性量化与基础模型,实现了胎儿超声图像的鲁棒分割,无需手动疾病标注即可提供可靠的异常检测和增强的诊断反馈。
English: L-FUSION is a framework that integrates uncertainty quantification and foundation models to achieve robust segmentation of fetal ultrasound images, enabling reliable abnormality detection and enhanced diagnostic feedback without manual disease labeling.

Authors:Riccardo De Monte, Davide Dalle Pezze, Gian Antonio Susto
Title: Teach YOLO to Remember: A Self-Distillation Approach for Continual Object Detection
Abstract:
Real-time object detectors like YOLO achieve exceptional performance when trained on large datasets for multiple epochs. However, in real-world scenarios where data arrives incrementally, neural networks suffer from catastrophic forgetting, leading to a loss of previously learned knowledge. To address this, prior research has explored strategies for Class Incremental Learning (CIL) in Continual Learning for Object Detection (CLOD), with most approaches focusing on two-stage object detectors. However, existing work suggests that Learning without Forgetting (LwF) may be ineffective for one-stage anchor-free detectors like YOLO due to noisy regression outputs, which risk transferring corrupted knowledge. In this work, we introduce YOLO LwF, a self-distillation approach tailored for YOLO-based continual object detection. We demonstrate that when coupled with a replay memory, YOLO LwF significantly mitigates forgetting. Compared to previous approaches, it achieves state-of-the-art performance, improving mAP by +2.1% and +2.9% on the VOC and COCO benchmarks, respectively.
Chinese: YOLO LwF是一种专为YOLO持续目标检测设计的自蒸馏方法,结合回放记忆能显著缓解灾难性遗忘,并在VOC和COCO基准测试中实现mAP显著提升,达到最先进性能。
English: YOLO LwF, a self-distillation method designed for YOLO-based continual object detection, effectively reduces catastrophic forgetting when combined with replay memory and achieves state-of-the-art results with notable mAP improvements on VOC and COCO benchmarks.

Authors:Vittorio Pippi, Matthieu Guillaumin, Silvia Cascianelli, Rita Cucchiara, Maximilian Jaritz, Loris Bazzani
Title: ToFu: Visual Tokens Reduction via Fusion for Multi-modal, Multi-patch, Multi-image Task
Abstract:
Large Multimodal Models (LMMs) are powerful tools that are capable of reasoning and understanding multimodal information beyond text and language. Despite their entrenched impact, the development of LMMs is hindered by the higher computational requirements compared to their unimodal counterparts. One of the main causes of this is the large amount of tokens needed to encode the visual input, which is especially evident for multi-image multimodal tasks. Recent approaches to reduce visual tokens depend on the visual encoder architecture, require fine-tuning the LLM to maintain the performance, and only consider single-image scenarios. To address these limitations, we propose ToFu, a visual encoder-agnostic, training-free Token Fusion strategy that combines redundant visual tokens of LMMs for high-resolution, multi-image, tasks. The core intuition behind our method is straightforward yet effective: preserve distinctive tokens while combining similar ones. We achieve this by sequentially examining visual tokens and deciding whether to merge them with others or keep them as separate entities. We validate our approach on the well-established LLaVA-Interleave Bench, which covers challenging multi-image tasks. In addition, we push to the extreme our method by testing it on a newly-created benchmark, ComPairs, focused on multi-image comparisons where a larger amount of images and visual tokens are inputted to the LMMs. Our extensive analysis, considering several LMM architectures, demonstrates the benefits of our approach both in terms of efficiency and performance gain.
中文摘要:ToFu是一种无需训练的令牌融合策略,能有效减少大型多模态模型中冗余的视觉令牌,在保持性能的同时提升高分辨率多图像任务的效率。
English Summary: ToFu is a training-free token fusion strategy that efficiently reduces redundant visual tokens in Large Multimodal Models for high-resolution multi-image tasks while maintaining performance.

Authors:Majed Luay, Siamak Layeghy, Seyedehfaezeh Hosseininoorbin, Mohanad Sarhan, Nour Moustafa, Marius Portmann
Title: Temporal Analysis of NetFlow Datasets for Network Intrusion Detection Systems
Abstract:
This paper investigates the temporal analysis of NetFlow datasets for machine learning (ML)-based network intrusion detection systems (NIDS). Although many previous studies have highlighted the critical role of temporal features, such as inter-packet arrival time and flow length/duration, in NIDS, the currently available NetFlow datasets for NIDS lack these temporal features. This study addresses this gap by creating and making publicly available a set of NetFlow datasets that incorporate these temporal features [1]. With these temporal features, we provide a comprehensive temporal analysis of NetFlow datasets by examining the distribution of various features over time and presenting time-series representations of NetFlow features. This temporal analysis has not been previously provided in the existing literature. We also borrowed an idea from signal processing, time frequency analysis, and tested it to see how different the time frequency signal presentations (TFSPs) are for various attacks. The results indicate that many attacks have unique patterns, which could help ML models to identify them more easily.
中文: 本研究针对现有网络入侵检测的NetFlow数据集缺乏时序特征的问题,创建了包含这些特征的公开数据集,并通过首次提供的时序分析和时频信号表征揭示了不同攻击的独特模式,有助于提升机器学习模型的识别能力。
English: This study addresses the lack of temporal features in existing NetFlow datasets for network intrusion detection by creating publicly available datasets with such features, conducting a novel temporal analysis that reveals unique attack patterns through time-frequency representations to enhance machine learning model performance.

Authors:Haiyang Yu, Jinghui Lu, Yanjie Wang, Yang Li, Han Wang, Can Huang, Bin Li
Title: EVE: Towards End-to-End Video Subtitle Extraction with Vision-Language Models
Abstract:
The advent of Large Vision-Language Models (LVLMs) has advanced the video-based tasks, such as video captioning and video understanding. Some previous research indicates that taking texts in videos as input can further improve the performance of video understanding. As a type of indispensable information in short videos or movies, subtitles can assist LVLMs to better understand videos. Most existing methods for video subtitle extraction are based on a multi-stage framework, handling each frame independently. They can hardly exploit the temporal information of videos. Although some LVLMs exhibit the robust OCR capability, predicting accurate timestamps for subtitle texts is still challenging. In this paper, we propose an End-to-end Video Subtitle Extraction method, called EVE, which consists of three modules: a vision encoder, an adapter module, and a large language model. To effectively compress the visual tokens from the vision encoder, we propose a novel adapter InterleavedVT to interleave two modalities. It contains a visual compressor and a textual region compressor. The proposed InterleavedVT exploits both the merits of average pooling and Q-Former in token compression. Taking the temporal information of videos into account, we introduce a sliding-window mechanism in the textual region compressor. To benchmark the video subtitle extraction task, we propose a large dataset ViSa including 2.5M videos. Extensive experiments on ViSa demonstrate that the proposed EVE can outperform existing open-sourced tools and LVLMs.
中文摘要:大型视觉语言模型通过整合视频字幕提升视频理解能力,本文提出的端到端EVE方法采用交错视觉文本适配器和滑动窗口机制,在新型ViSa数据集上超越了现有方法。
English Summary: Large Vision-Language Models (LVLMs) benefit video understanding by incorporating subtitle text, and the proposed end-to-end EVE method with InterleavedVT adapter and sliding-window mechanism outperforms existing approaches on the new ViSa dataset.

Authors:Ryozo Masukawa, Sanggeon Yun, Sungheon Jeong, Wenjun Huang, Yang Ni, Ian Bryant, Nathaniel D. Bastian, Mohsen Imani
Title: PacketCLIP: Multi-Modal Embedding of Network Traffic and Language for Cybersecurity Reasoning
Abstract:
Traffic classification is vital for cybersecurity, yet encrypted traffic poses significant challenges. We present PacketCLIP, a multi-modal framework combining packet data with natural language semantics through contrastive pretraining and hierarchical Graph Neural Network (GNN) reasoning. PacketCLIP integrates semantic reasoning with efficient classification, enabling robust detection of anomalies in encrypted network flows. By aligning textual descriptions with packet behaviors, it offers enhanced interpretability, scalability, and practical applicability across diverse security scenarios. PacketCLIP achieves a 95% mean AUC, outperforms baselines by 11.6%, and reduces model size by 92%, making it ideal for real-time anomaly detection. By bridging advanced machine learning techniques and practical cybersecurity needs, PacketCLIP provides a foundation for scalable, efficient, and interpretable solutions to tackle encrypted traffic classification and network intrusion detection challenges in resource-constrained environments.
中文: PacketCLIP是一个多模态框架,通过对比预训练和分层图神经网络推理将数据包数据与自然语言语义相结合,在加密流量分类中以95%的平均AUC、11.6%的性能提升和92%的模型压缩率实现了高效的异常检测,为资源受限环境提供可扩展、可解释的网络安全解决方案。
English: PacketCLIP is a multi-modal framework that enhances encrypted traffic classification by integrating packet data with natural language semantics through contrastive pretraining and hierarchical GNN reasoning, achieving superior anomaly detection with 95% mean AUC, 11.6% performance improvement, and 92% model size reduction for scalable, interpretable cybersecurity applications.

Authors:Manuel Barusco, Lorenzo D'Antoni, Davide Dalle Pezze, Francesco Borsatti, Gian Antonio Susto
Title: Memory Efficient Continual Learning for Edge-Based Visual Anomaly Detection
Abstract:
Visual Anomaly Detection (VAD) is a critical task in computer vision with numerous real-world applications. However, deploying these models on edge devices presents significant challenges, such as constrained computational and memory resources. Additionally, dynamic data distributions in real-world settings necessitate continuous model adaptation, further complicating deployment under limited resources. To address these challenges, we present a novel investigation into the problem of Continual Learning for Visual Anomaly Detection (CLAD) on edge devices. We evaluate the STFPM approach, given its low memory footprint on edge devices, which demonstrates good performance when combined with the Replay approach. Furthermore, we propose to study the behavior of a recently proposed approach, PaSTe, specifically designed for the edge but not yet explored in the Continual Learning context. Our results show that PaSTe is not only a lighter version of STPFM, but it also achieves superior anomaly detection performance, improving the f1 pixel performance by 10% with the Replay technique. In particular, the structure of PaSTe allows us to test it using a series of Compressed Replay techniques, reducing memory overhead by a maximum of 91.5% compared to the traditional Replay for STFPM. Our study proves the feasibility of deploying VAD models that adapt and learn incrementally on CLAD scenarios on resource-constrained edge devices.
中文: 本研究探索了边缘设备上的持续学习视觉异常检测(CLAD),证明PaSTe方法结合压缩回放技术不仅将f1像素性能提升10%,还比传统方法最高减少91.5%的内存开销,实现了在资源受限设备上的高效部署。
English: This study investigates Continual Learning for Visual Anomaly Detection (CLAD) on edge devices, demonstrating that the PaSTe method combined with Compressed Replay achieves superior performance with a 10% improvement in f1 pixel score and reduces memory overhead by up to 91.5% compared to existing approaches.

Authors:Heng Zhou, Hejia Geng, Xiangyuan Xue, Li Kang, Yiran Qin, Zhiyong Wang, Zhenfei Yin, Lei Bai
Title: ReSo: A Reward-driven Self-organizing LLM-based Multi-Agent System for Reasoning Tasks
Abstract:
Multi-agent systems (MAS) have emerged as a promising approach for enhancing the reasoning capabilities of large language models in complex problem-solving; however, current MAS frameworks suffer from poor flexibility and scalability with underdeveloped optimization strategies. To address these challenges, we propose ReSo, which integrates task graph generation with a reward-driven two-stage agent selection process centered on our Collaborative Reward Model that provides fine-grained reward signals to optimize MAS cooperation. We also introduce an automated data synthesis framework for generating MAS benchmarks without any human annotations. Experimental results show that ReSo matches or outperforms existing methods, achieving 33.7 percent accuracy on Math-MAS and 32.3 percent accuracy on SciBench-MAS, where other approaches completely fail.
中文摘要:ReSo通过结合任务图生成与奖励驱动的智能体选择流程及自动化数据合成框架,显著提升了多智能体系统的性能,在现有方法完全失败的基准测试中取得领先成果。
English Summary: ReSo enhances multi-agent systems by combining task graph generation with a reward-driven agent selection process and an automated data synthesis framework, achieving superior performance on benchmarks where other methods fail.

Authors:Wentao Chen, Lizhe Zhang, Li Zhong, Letian Peng, Zilong Wang, Jingbo Shang
Title: Memorize or Generalize? Evaluating LLM Code Generation with Evolved Questions
Abstract:
Large Language Models (LLMs) are known to exhibit a memorization phenomenon in code generation: instead of truly understanding the underlying principles of a programming problem, they tend to memorize the original prompt and its solution together in the training. Consequently, when facing variants of the original problem, their answers very likely resemble the memorized solutions and fail to generalize. In this paper, we investigate this phenomenon by designing three evolution strategies to create variants: mutation, paraphrasing, and code-rewriting. By comparing the performance and AST similarity of the LLM-generated codes before and after these three evolutions, we develop a memorization score that positively correlates with the level of memorization. As expected, as supervised fine-tuning goes on, the memorization score rises before overfitting, suggesting more severe memorization. We demonstrate that common mitigation approaches, such as prompt translation and using evolved variants as data augmentation in supervised learning and reinforcement learning, either compromise the performance or fail to alleviate the memorization issue. Therefore, memorization remains a significant challenge in LLM code generation, highlighting the need for a more effective solution.
中文: 大型语言模型在复制训练数据却无法完成语义相似任务时表现出有害记忆行为,新提出的记忆风险指数能有效衡量该现象,研究表明模型扩展会减少记忆而监督微调会加剧记忆。
English: Large language models exhibit harmful memorization when they replicate training data but fail semantically similar tasks, which the proposed Memorization Risk Index effectively measures, revealing that model scaling reduces memorization while supervised fine-tuning increases it.

Authors:Lizhe Zhang, Wentao Chen, Li Zhong, Letian Peng, Zilong Wang, Jingbo Shang
Title: Memorize or Generalize? Evaluating LLM Code Generation with Code Rewriting
Abstract:
Large language models (LLMs) have recently demonstrated exceptional code generation capabilities. However, there is a growing debate whether LLMs are mostly doing memorization (i.e., replicating or reusing large parts of their training data) versus generalization (i.e., beyond training data). Existing evaluations largely proxy memorization with surface/structural similarity, thereby conflating benign reuse of repeated code with harmful recall and neglecting task correctness under semantic variation. We define harmful memorization behaviorally as failure at high similarity and introduce a semantic perturbation code rewriting, which rewrites a semantically different answer at a similar difficulty level for a given coding task, then reverse-engineers a novel coding task. We further propose Memorization Risk Index (MRI), a normalized score that combines two signals: (i) how similar the model's answer for the rewritten task is to the original ground-truth solution, and (ii) how much performance drops from the original task to its rewritten counterpart. MRI is high only when both conditions hold -- when the model outputs similar code but fails the perturbed task -- thereby capturing harmful memorization rather than benign reuse of repeated code. Empirical evaluations on code generation benchmarks MBPP+ and BigCodeBench reveal that (1) memorization does not increase with larger models and in many cases alleviates as they scale; (2) supervised fine-tuning (SFT) improves accuracy while introduces memorization; (3) reinforcement learning with proximal policy optimization (PPO) achieves a more balanced trade-off between memorization and generalization.
中文: 大型语言模型在复制训练数据却无法完成语义相似任务时表现出有害记忆行为,新提出的记忆风险指数能有效衡量该现象,研究表明模型扩展会减少记忆而监督微调会加剧记忆。
English: Large language models exhibit harmful memorization when they replicate training data but fail semantically similar tasks, which the proposed Memorization Risk Index effectively measures, revealing that model scaling reduces memorization while supervised fine-tuning increases it.

Authors:Alexander Doudkin, Pat Pataranutaporn, Pattie Maes
Title: AI persuading AI vs AI persuading Humans: LLMs' Differential Effectiveness in Promoting Pro-Environmental Behavior
Abstract:
Pro-environmental behavior (PEB) is vital to combat climate change, yet turning awareness into intention and action remains elusive. We explore large language models (LLMs) as tools to promote PEB, comparing their impact across 3,200 participants: real humans (n=1,200), simulated humans based on actual participant data (n=1,200), and fully synthetic personas (n=1,200). All three participant groups faced personalized or standard chatbots, or static statements, employing four persuasion strategies (moral foundations, future self-continuity, action orientation, or "freestyle" chosen by the LLM). Results reveal a "synthetic persuasion paradox": synthetic and simulated agents significantly affect their post-intervention PEB stance, while human responses barely shift. Simulated participants better approximate human trends but still overestimate effects. This disconnect underscores LLM's potential for pre-evaluating PEB interventions but warns of its limits in predicting real-world behavior. We call for refined synthetic modeling and sustained and extended human trials to align conversational AI's promise with tangible sustainability outcomes.
中文摘要:大型语言模型在模拟测试中显示出评估环保行为干预措施的潜力,但会高估其实际效果,因此需要改进合成建模并加强人类试验。
English Summary: Large language models show promise for evaluating pro-environmental behavior interventions through simulated testing but overestimate their real-world impact, highlighting the need for improved synthetic modeling and human trials.

Authors:Yaxuan Kong, Yiyuan Yang, Yoontae Hwang, Wenjie Du, Stefan Zohren, Zhangyang Wang, Ming Jin, Qingsong Wen
Title: Time-MQA: Time Series Multi-Task Question Answering with Context Enhancement
Abstract:
Time series data are foundational in finance, healthcare, and energy domains. However, most existing methods and datasets remain focused on a narrow spectrum of tasks, such as forecasting or anomaly detection. To bridge this gap, we introduce Time Series Multi-Task Question Answering (Time-MQA), a unified framework that enables natural language queries across multiple time series tasks - numerical analytical tasks and open-ended question answering with reasoning. Central to Time-MQA is the TSQA dataset, a large-scale dataset containing $\sim$200k question-answer pairs derived from diverse time series spanning environment, traffic, etc. This comprehensive resource covers various time series lengths and promotes robust model development. We further demonstrate how continually pre-training large language models (Mistral 7B, Llama-3 8B, and Qwen-2.5 7B) on the TSQA dataset enhanced time series reasoning capabilities, moving beyond mere numeric tasks and enabling more advanced and intuitive interactions with temporal data. The complete TSQA dataset, models, user study questionnaires for evaluation, and other related materials have been open-sourced.
中文摘要:Time-MQA框架通过大规模TSQA数据集和增强的大语言模型,实现了跨多种时间序列任务的自然语言查询统一方法,显著提升了时序数据的推理分析能力。
English Summary: The Time-MQA framework introduces a unified approach for natural language queries across multiple time series tasks, supported by the large-scale TSQA dataset and enhanced large language models to improve time series reasoning capabilities.

Authors:Mufan Qiu, Xinyu Hu, Fengwei Zhan, Sukwon Yun, Jie Peng, Ruichen Zhang, Bhavya Kailkhura, Jiekun Yang, Tianlong Chen
Title: GRNFormer: A Biologically-Guided Framework for Integrating Gene Regulatory Networks into RNA Foundation Models
Abstract:
Foundation models for single-cell RNA sequencing (scRNA-seq) have shown promising capabilities in capturing gene expression patterns. However, current approaches face critical limitations: they ignore biological prior knowledge encoded in gene regulatory relationships and fail to leverage multi-omics signals that could provide complementary regulatory insights. In this paper, we propose GRNFormer, a new framework that systematically integrates multi-scale Gene Regulatory Networks (GRNs) inferred from multi-omics data into RNA foundation model training. Our framework introduces two key innovations. First, we introduce a pipeline for constructing hierarchical GRNs that capture regulatory relationships at both cell-type-specific and cell-specific resolutions. Second, we design a structure-aware integration framework that addresses the information asymmetry in GRNs through two technical advances: (1) A graph topological adapter using multi-head cross-attention to weight regulatory relationships dynamically, and (2) a novel edge perturbation strategy that perturb GRNs with biologically-informed co-expression links to augment graph neural network training. Comprehensive experiments have been conducted on three representative downstream tasks across multiple model architectures to demonstrate the effectiveness of GRNFormer. It achieves consistent improvements over state-of-the-art (SoTA) baselines: $3.6\%$ increase in drug response prediction correlation, $9.6\%$ improvement in single-cell drug classification AUC, and $1.1\%$ average gain in gene perturbation prediction accuracy.
中文: 本研究提出的GRNFormer框架通过整合多组学数据推断的多尺度基因调控网络,采用结构感知集成方法和分层网络构建,在多个下游任务中实现了最先进的性能表现。
English: The proposed GRNFormer framework integrates multi-scale gene regulatory networks inferred from multi-omics data into RNA foundation models, achieving state-of-the-art performance across multiple downstream tasks through structure-aware integration and hierarchical network construction.

Authors:Simon Welker, Matthew Le, Ricky T. Q. Chen, Wei-Ning Hsu, Timo Gerkmann, Alexander Richard, Yi-Chiao Wu
Title: FlowDec: A flow-based full-band general audio codec with high perceptual quality
Abstract:
We propose FlowDec, a neural full-band audio codec for general audio sampled at 48 kHz that combines non-adversarial codec training with a stochastic postfilter based on a novel conditional flow matching method. Compared to the prior work ScoreDec which is based on score matching, we generalize from speech to general audio and move from 24 kbit/s to as low as 4 kbit/s, while improving output quality and reducing the required postfilter DNN evaluations from 60 to 6 without any fine-tuning or distillation techniques. We provide theoretical insights and geometric intuitions for our approach in comparison to ScoreDec as well as another recent work that uses flow matching, and conduct ablation studies on our proposed components. We show that FlowDec is a competitive alternative to the recent GAN-dominated stream of neural codecs, achieving FAD scores better than those of the established GAN-based codec DAC and listening test scores that are on par, and producing qualitatively more natural reconstructions for speech and harmonic structures in music.
Chinese: FlowDec是一种基于条件流匹配方法的神经网络全频带音频编解码器,通过非对抗训练在低至4 kbit/s的比特率下实现高质量音频压缩,其重建自然度和效率均优于基于GAN的编解码器。
English: FlowDec is a neural full-band audio codec that uses non-adversarial training and a novel conditional flow matching method, achieving high-quality audio compression at rates as low as 4 kbit/s while outperforming GAN-based codecs in naturalness and efficiency.

Authors:Shizhan Liu, Hao Zheng, Hang Yu, Jianguo Li
Title: ACCORD: Alleviating Concept Coupling through Dependence Regularization for Text-to-Image Diffusion Personalization
Abstract:
Image personalization has garnered attention for its ability to customize Text-to-Image generation using only a few reference images. However, a key challenge in image personalization is the issue of conceptual coupling, where the limited number of reference images leads the model to form unwanted associations between the personalization target and other concepts. Current methods attempt to tackle this issue indirectly, leading to a suboptimal balance between text control and personalization fidelity. In this paper, we take a direct approach to the concept coupling problem through statistical analysis, revealing that it stems from two distinct sources of dependence discrepancies. We therefore propose two complementary plug-and-play loss functions: Denoising Decouple Loss and Prior Decouple loss, each designed to minimize one type of dependence discrepancy. Extensive experiments demonstrate that our approach achieves a superior trade-off between text control and personalization fidelity.
Chinese: 本文通过识别两种依赖差异来源并引入两个即插即用的损失函数,直接解决了图像个性化中的概念耦合问题,实现了文本控制与个性化保真度之间的更优平衡。
English: This paper directly addresses the issue of conceptual coupling in image personalization by identifying two sources of dependence discrepancies and introducing two plug-and-play loss functions that achieve a superior balance between text control and personalization fidelity.

Authors:Xitong Ling, Yifeng Ping, Jiawen Li, Jing Peng, Yuxuan Chen, Minxi Ouyang, Yizhi Wang, Yonghong He, Tian Guan, Xiaoping Liu, Lianghui Zhu
Title: Multimodal Distillation-Driven Ensemble Learning for Long-Tailed Histopathology Whole Slide Images Analysis
Abstract:
Multiple Instance Learning (MIL) plays a significant role in computational pathology, enabling weakly supervised analysis of Whole Slide Image (WSI) datasets. The field of WSI analysis is confronted with a severe long-tailed distribution problem, which significantly impacts the performance of classifiers. Long-tailed distributions lead to class imbalance, where some classes have sparse samples while others are abundant, making it difficult for classifiers to accurately identify minority class samples. To address this issue, we propose an ensemble learning method based on MIL, which employs expert decoders with shared aggregators and consistency constraints to learn diverse distributions and reduce the impact of class imbalance on classifier performance. Moreover, we introduce a multimodal distillation framework that leverages text encoders pre-trained on pathology-text pairs to distill knowledge and guide the MIL aggregator in capturing stronger semantic features relevant to class information. To ensure flexibility, we use learnable prompts to guide the distillation process of the pre-trained text encoder, avoiding limitations imposed by specific prompts. Our method, MDE-MIL, integrates multiple expert branches focusing on specific data distributions to address long-tailed issues. Consistency control ensures generalization across classes. Multimodal distillation enhances feature extraction. Experiments on Camelyon+-LT and PANDA-LT datasets show it outperforms state-of-the-art methods.
中文摘要:本文提出MDE-MIL方法,通过集成学习框架结合多专家解码器和基于病理-文本对的多模态知识蒸馏技术,有效解决全切片图像分析中的长尾分布问题,在多个基准数据集上展现出优越性能。
English Summary: This paper proposes MDE-MIL, a Multiple Instance Learning method that tackles long-tailed distribution challenges in Whole Slide Image analysis through ensemble learning with expert decoders and multimodal knowledge distillation from pathology-text pairs, demonstrating superior performance on benchmark datasets.

Authors:Songlin Dong, Yuhang He, Zhengdong Zhou, Haoyu Luo, Xing Wei, Alex C. Kot, Yihong Gong
Title: Class-Independent Increment: An Efficient Approach for Multi-label Class-Incremental Learning
Abstract:
Current research on class-incremental learning primarily focuses on single-label classification tasks. However, real-world applications often involve multi-label scenarios, such as image retrieval and medical imaging. Therefore, this paper focuses on the challenging yet practical multi-label class-incremental learning (MLCIL) problem. In addition to the challenge of catastrophic forgetting, MLCIL encounters issues related to feature confusion, encompassing inter-session and intra-feature confusion. To address these problems, we propose a novel MLCIL approach called class-independent increment (CLIN). Specifically, in contrast to existing methods that extract image-level features, we propose a class-independent incremental network (CINet) to extract multiple class-level embeddings for multi-label samples. It learns and preserves the knowledge of different classes by constructing class-specific tokens. On this basis, we develop two novel loss functions, optimizing the learning of class-specific tokens and class-level embeddings, respectively. These losses aim to distinguish between new and old classes, further alleviating the problem of feature confusion. Extensive experiments on MS-COCO and PASCAL VOC datasets demonstrate the effectiveness of our method for improving recognition performance and mitigating forgetting on various MLCIL tasks.
中文:本文提出了一种名为CLIN的新型多标签类增量学习方法,通过类别特定标记和定制损失函数来缓解灾难性遗忘和特征混淆问题,在多个基准数据集上验证了其优越性能。
English: This paper introduces a novel multi-label class-incremental learning approach called CLIN, which utilizes class-specific tokens and tailored loss functions to mitigate catastrophic forgetting and feature confusion, demonstrating superior performance on benchmark datasets.

Authors:Yexiao He, Ziyao Wang, Yuning Zhang, Tingting Dan, Tianlong Chen, Guorong Wu, Ang Li
Title: NeuroSymAD: A Neuro-Symbolic Framework for Interpretable Alzheimer's Disease Diagnosis
Abstract:
Alzheimer's disease (AD) diagnosis is complex, requiring the integration of imaging and clinical data for accurate assessment. While deep learning has shown promise in brain MRI analysis, it often functions as a black box, limiting interpretability and lacking mechanisms to effectively integrate critical clinical data such as biomarkers, medical history, and demographic information. To bridge this gap, we propose NeuroSymAD, a neuro-symbolic framework that synergizes neural networks with symbolic reasoning. A neural network percepts brain MRI scans, while a large language model (LLM) distills medical rules to guide a symbolic system in reasoning over biomarkers and medical history. This structured integration enhances both diagnostic accuracy and explainability. Experiments on the ADNI dataset demonstrate that NeuroSymAD outperforms state-of-the-art methods by up to 2.91% in accuracy and 3.43% in F1-score while providing transparent and interpretable diagnosis.
中文: NeuroSymAD是一种神经符号框架,通过融合神经网络分析脑部MRI与大型语言模型指导的符号推理,在ADNI数据集上提升了阿尔茨海默病诊断的准确性和可解释性。
English: NeuroSymAD is a neuro-symbolic framework that combines neural networks for MRI analysis with symbolic reasoning guided by an LLM, improving Alzheimer's disease diagnostic accuracy and explainability on the ADNI dataset.

Authors:Fan Wan, Yuchen Li, Xueqi Qiu, Rui Sun, Leyuan Zhang, Xingyu Miao, Tianyu Zhang, Haoran Duan, Yang Long
Title: Asynchronous Personalized Federated Learning through Global Memorization
Abstract:
The proliferation of Internet of Things devices and advances in communication technology have unleashed an explosion of personal data, amplifying privacy concerns amid stringent regulations like GDPR and CCPA. Federated Learning offers a privacy preserving solution by enabling collaborative model training across decentralized devices without centralizing sensitive data. However, statistical heterogeneity from non-independent and identically distributed datasets and system heterogeneity due to client dropouts particularly those with monopolistic classes severely degrade the global model's performance. To address these challenges, we propose the Asynchronous Personalized Federated Learning framework, which empowers clients to develop personalized models using a server side semantic generator. This generator, trained via data free knowledge transfer under global model supervision, enhances client data diversity by producing both seen and unseen samples, the latter enabled by Zero-Shot Learning to mitigate dropout-induced data loss. To counter the risks of synthetic data impairing training, we introduce a decoupled model interpolation method, ensuring robust personalization. Extensive experiments demonstrate that AP FL significantly outperforms state of the art FL methods in tackling non-IID distributions and client dropouts, achieving superior accuracy and resilience across diverse real-world scenarios.
中文摘要:异步个性化联邦学习框架通过服务器端语义生成器和解耦模型插值方法,有效应对联邦学习中的统计异质性和系统异质性问题,增强客户端数据多样性并确保个性化模型的鲁棒性,在非独立同分布数据和客户端掉线情况下显著优于现有方法。
English Summary: The Asynchronous Personalized Federated Learning framework addresses statistical and system heterogeneity in federated learning by using a server-side semantic generator and decoupled model interpolation to enhance client data diversity and ensure robust personalization, achieving superior performance in non-IID and client dropout scenarios.

Authors:Jingcheng Hu, Yinmin Zhang, Qi Han, Daxin Jiang, Xiangyu Zhang, Heung-Yeung Shum
Title: Open-Reasoner-Zero: An Open Source Approach to Scaling Up Reinforcement Learning on the Base Model
Abstract:
We introduce Open-Reasoner-Zero, the first open source implementation of large-scale reasoning-oriented RL training on the base model focusing on scalability, simplicity and accessibility. Through extensive experiments, we demonstrate that a minimalist approach, vanilla PPO with GAE ($λ=1$, $γ=1$) and straightforward rule-based rewards, without any KL regularization, is sufficient to scale up both benchmark performance and response length, replicating the scaling phenomenon observed in DeepSeek-R1-Zero. Using the same base model, Qwen2.5-32B base, as DeepSeek-R1-Zero-Qwen-32B, our implementation achieves superior performance across AIME2024, MATH500, and GPQA Diamond, while demonstrating remarkable efficiency, requiring only 1/10 of the training steps compared to the DeepSeek-R1-Zero pipeline. Moreover, our analysis not only covers training dynamics and ablation for critical design choices, but also quantitatively shows how the learned critic in Reasoner-Zero training effectively identifies and devalues repetitive response patterns, yielding more robust advantage estimations and enhancing training stability. Embracing the principles of open-source, we release our source code, training data, and various model weights, fostering reproducibility and encouraging further exploration of the properties of related models.
中文: Open-Reasoner-Zero 是一个开源的推理导向强化学习框架,通过极简方法在多项基准测试中实现卓越性能,其训练效率比同类系统提高十倍。
English: Open-Reasoner-Zero is an open-source reasoning-focused RL training framework that achieves superior benchmark performance and efficiency with minimalist methods, requiring only one-tenth of the training steps of comparable systems.

Authors:Zhiyuan Zhou, Pranav Atreya, You Liang Tan, Karl Pertsch, Sergey Levine
Title: AutoEval: Autonomous Evaluation of Generalist Robot Manipulation Policies in the Real World
Abstract:
Scalable and reproducible policy evaluation has been a long-standing challenge in robot learning. Evaluations are critical to assess progress and build better policies, but evaluation in the real world, especially at a scale that would provide statistically reliable results, is costly in terms of human time and hard to obtain. Evaluation of increasingly generalist robot policies requires an increasingly diverse repertoire of evaluation environments, making the evaluation bottleneck even more pronounced. To make real-world evaluation of robotic policies more practical, we propose AutoEval, a system to autonomously evaluate generalist robot policies around the clock with minimal human intervention. Users interact with AutoEval by submitting evaluation jobs to the AutoEval queue, much like how software jobs are submitted with a cluster scheduling system, and AutoEval will schedule the policies for evaluation within a framework supplying automatic success detection and automatic scene resets. We show that AutoEval can nearly fully eliminate human involvement in the evaluation process, permitting around the clock evaluations, and the evaluation results correspond closely to ground truth evaluations conducted by hand. To facilitate the evaluation of generalist policies in the robotics community, we provide public access to multiple AutoEval scenes in the popular BridgeData robot setup with WidowX robot arms. In the future, we hope that AutoEval scenes can be set up across institutions to form a diverse and distributed evaluation network.
中文: AutoEval系统旨在以最少的人工干预自主评估通用机器人策略,实现持续、可扩展的测试,其结果与人工评估高度一致。
English: AutoEval is a system designed to autonomously evaluate generalist robot policies with minimal human intervention, enabling continuous, scalable testing and closely matching manual evaluation results.

Authors:Alexis Guichemerre, Soufiane Belharbi, Mohammadhadi Shateri, Luke McCaffrey, Eric Granger
Title: PixelCAM: Pixel Class Activation Mapping for Histology Image Classification and ROI Localization
Abstract:
Weakly supervised object localization (WSOL) methods allow training models to classify images and localize ROIs. WSOL only requires low-cost image-class annotations yet provides a visually interpretable classifier. Standard WSOL methods rely on class activation mapping (CAM) methods to produce spatial localization maps according to a single- or two-step strategy. While both strategies have made significant progress, they still face several limitations with histology images. Single-step methods can easily result in under- or over-activation due to the limited visual ROI saliency in histology images and scarce localization cues. They also face the well-known issue of asynchronous convergence between classification and localization tasks. The two-step approach is sub-optimal because it is constrained to a frozen classifier, limiting the capacity for localization. Moreover, these methods also struggle when applied to out-of-distribution (OOD) datasets. In this paper, a multi-task approach for WSOL is introduced for simultaneous training of both tasks to address the asynchronous convergence problem. In particular, localization is performed in the pixel-feature space of an image encoder that is shared with classification. This allows learning discriminant features and accurate delineation of foreground/background regions to support ROI localization and image classification. We propose PixelCAM, a cost-effective foreground/background pixel-wise classifier in the pixel-feature space that allows for spatial object localization. Using partial-cross entropy, PixelCAM is trained using pixel pseudo-labels collected from a pretrained WSOL model. Both image and pixel-wise classifiers are trained simultaneously using standard gradient descent. In addition, our pixel classifier can easily be integrated into CNN- and transformer-based architectures without any modifications.
中文: 本文提出PixelCAM多任务弱监督目标定位方法,通过在共享特征空间中同步训练图像分类和像素级定位,有效解决了组织学图像中分类与定位任务收敛不同步的问题。
English: This paper introduces PixelCAM, a multi-task weakly supervised object localization method that simultaneously trains image classification and pixel-level localization in shared feature space to overcome asynchronous convergence issues in histology images.

Authors:Yufei Wang, Lanqing Guo, Zhihao Li, Jiaxing Huang, Pichao Wang, Bihan Wen, Jian Wang
Title: Training-Free Text-Guided Image Editing with Visual Autoregressive Model
Abstract:
Text-guided image editing is an essential task that enables users to modify images through natural language descriptions. Recent advances in diffusion models and rectified flows have significantly improved editing quality, primarily relying on inversion techniques to extract structured noise from input images. However, inaccuracies in inversion can propagate errors, leading to unintended modifications and compromising fidelity. Moreover, even with perfect inversion, the entanglement between textual prompts and image features often results in global changes when only local edits are intended. To address these challenges, we propose a novel text-guided image editing framework based on VAR (Visual AutoRegressive modeling), which eliminates the need for explicit inversion while ensuring precise and controlled modifications. Our method introduces a caching mechanism that stores token indices and probability distributions from the original image, capturing the relationship between the source prompt and the image. Using this cache, we design an adaptive fine-grained masking strategy that dynamically identifies and constrains modifications to relevant regions, preventing unintended changes. A token reassembling approach further refines the editing process, enhancing diversity, fidelity, and control. Our framework operates in a training-free manner and achieves high-fidelity editing with faster inference speeds, processing a 1K resolution image in as fast as 1.2 seconds. Extensive experiments demonstrate that our method achieves performance comparable to, or even surpassing, existing diffusion- and rectified flow-based approaches in both quantitative metrics and visual quality. The code will be released.
中文: 本文提出了一种基于视觉自回归建模(VAR)的新型文本引导图像编辑框架,通过缓存机制和自适应掩码策略消除反转依赖,在保持扩散模型相当性能的同时实现高保真局部编辑和更快推理速度。
English: This paper introduces a novel text-guided image editing framework based on Visual AutoRegressive modeling (VAR) that eliminates inversion dependencies through a caching mechanism and adaptive masking, achieving high-fidelity localized edits with faster inference speeds comparable to diffusion-based methods.

Authors:Yu Zhou, Dian Zheng, Qijie Mo, Renjie Lu, Kun-Yu Lin, Wei-Shi Zheng
Title: Decoupled Distillation to Erase: A General Unlearning Method for Any Class-centric Tasks
Abstract:
In this work, we present DEcoupLEd Distillation To Erase (DELETE), a general and strong unlearning method for any class-centric tasks. To derive this, we first propose a theoretical framework to analyze the general form of unlearning loss and decompose it into forgetting and retention terms. Through the theoretical framework, we point out that a class of previous methods could be mainly formulated as a loss that implicitly optimizes the forgetting term while lacking supervision for the retention term, disturbing the distribution of pre-trained model and struggling to adequately preserve knowledge of the remaining classes. To address it, we refine the retention term using "dark knowledge" and propose a mask distillation unlearning method. By applying a mask to separate forgetting logits from retention logits, our approach optimizes both the forgetting and refined retention components simultaneously, retaining knowledge of the remaining classes while ensuring thorough forgetting of the target class. Without access to the remaining data or intervention (i.e., used in some works), we achieve state-of-the-art performance across various benchmarks. What's more, DELETE is a general solution that can be applied to various downstream tasks, including face recognition, backdoor defense, and semantic segmentation with great performance.
中文: DELETE是一种新颖的遗忘方法,通过掩码蒸馏技术有效擦除目标类别知识的同时保留剩余类别信息,无需依赖剩余数据即可在多种任务中实现最先进的性能。
English: DELETE is a novel unlearning method that effectively erases target classes while preserving knowledge of remaining classes through mask distillation, achieving state-of-the-art performance across multiple tasks without requiring access to remaining data.

Authors:Jingzheng Li, Xianglong Liu, Shikui Wei, Zhijun Chen, Bing Li, Qing Guo, Xianqi Yang, Yanjun Pu, Jiakai Wang
Title: Towards Benchmarking and Assessing the Safety and Robustness of Autonomous Driving on Safety-critical Scenarios
Abstract:
Autonomous driving has made significant progress in both academia and industry, including performance improvements in perception task and the development of end-to-end autonomous driving systems. However, the safety and robustness assessment of autonomous driving has not received sufficient attention. Current evaluations of autonomous driving are typically conducted in natural driving scenarios. However, many accidents often occur in edge cases, also known as safety-critical scenarios. These safety-critical scenarios are difficult to collect, and there is currently no clear definition of what constitutes a safety-critical scenario. In this work, we explore the safety and robustness of autonomous driving in safety-critical scenarios. First, we provide a definition of safety-critical scenarios, including static traffic scenarios such as adversarial attack scenarios and natural distribution shifts, as well as dynamic traffic scenarios such as accident scenarios. Then, we develop an autonomous driving safety testing platform to comprehensively evaluate autonomous driving systems, encompassing not only the assessment of perception modules but also system-level evaluations. Our work systematically constructs a safety verification process for autonomous driving, providing technical support for the industry to establish standardized test framework and reduce risks in real-world road deployment.
中文摘要:本研究通过界定安全关键场景并构建自动驾驶安全测试平台,系统评估感知模块和整体系统性能,为行业建立标准化测试框架和降低实际道路部署风险提供技术支持。
English Summary: This study addresses the safety and robustness of autonomous driving by defining safety-critical scenarios and developing a comprehensive testing platform to systematically evaluate both perception modules and overall system performance.

Authors:Bohao Xing, Kaishen Yuan, Zitong Yu, Xin Liu, Heikki Kälviäinen
Title: AU-TTT: Vision Test-Time Training model for Facial Action Unit Detection
Abstract:
Facial Action Units (AUs) detection is a cornerstone of objective facial expression analysis and a critical focus in affective computing. Despite its importance, AU detection faces significant challenges, such as the high cost of AU annotation and the limited availability of datasets. These constraints often lead to overfitting in existing methods, resulting in substantial performance degradation when applied across diverse datasets. Addressing these issues is essential for improving the reliability and generalizability of AU detection methods. Moreover, many current approaches leverage Transformers for their effectiveness in long-context modeling, but they are hindered by the quadratic complexity of self-attention. Recently, Test-Time Training (TTT) layers have emerged as a promising solution for long-sequence modeling. Additionally, TTT applies self-supervised learning for iterative updates during both training and inference, offering a potential pathway to mitigate the generalization challenges inherent in AU detection tasks. In this paper, we propose a novel vision backbone tailored for AU detection, incorporating bidirectional TTT blocks, named AU-TTT. Our approach introduces TTT Linear to the AU detection task and optimizes image scanning mechanisms for enhanced performance. Additionally, we design an AU-specific Region of Interest (RoI) scanning mechanism to capture fine-grained facial features critical for AU detection. Experimental results demonstrate that our method achieves competitive performance in both within-domain and cross-domain scenarios.
中文摘要:本文提出AU-TTT这一新型视觉骨干网络,通过集成双向测试时训练模块和针对面部动作单元检测的特定区域扫描机制,有效解决泛化问题,并在多种数据集上实现优越性能。
English Summary: This paper introduces AU-TTT, a novel vision backbone for Facial Action Units (AU) detection that integrates bidirectional Test-Time Training blocks and an AU-specific Region of Interest scanning mechanism to address generalization challenges and achieve competitive performance across diverse datasets.

Authors:Cong Zhang, Yisheng Yang, Shilong Mu, Chuqiao Lyu, Shoujie Li, Xinyue Chai, Wenbo Ding
Title: VET: A Visual-Electronic Tactile System for Immersive Human-Machine Interaction
Abstract:
In the pursuit of deeper immersion in human-machine interaction, achieving higher-dimensional tactile input and output on a single interface has become a key research focus. This study introduces the Visual-Electronic Tactile (VET) System, which builds upon vision-based tactile sensors (VBTS) and integrates electrical stimulation feedback to enable bidirectional tactile communication. We propose and implement a system framework that seamlessly integrates an electrical stimulation film with VBTS using a screen-printing preparation process, eliminating interference from traditional methods. While VBTS captures multi-dimensional input through visuotactile signals, electrical stimulation feedback directly stimulates neural pathways, preventing interference with visuotactile information. The potential of the VET system is demonstrated through experiments on finger electrical stimulation sensitivity zones, as well as applications in interactive gaming and robotic arm teleoperation. This system paves the way for new advancements in bidirectional tactile interaction and its broader applications.
中文: 视觉电子触觉系统结合视觉触觉传感器与电刺激反馈,通过丝网印刷工艺实现无干扰的双向触觉交互,并在互动应用和机器人遥操作中展现出广阔前景。
English: The Visual-Electronic Tactile (VET) System integrates vision-based tactile sensors with electrical stimulation feedback to enable bidirectional tactile communication, eliminating interference through a screen-printing process and demonstrating potential in interactive applications.

Authors:Leander Girrbach, Stephan Alaniz, Genevieve Smith, Zeynep Akata
Title: A Large Scale Analysis of Gender Biases in Text-to-Image Generative Models
Abstract:
With the increasing use of image generation technology, understanding its social biases, including gender bias, is essential. This paper presents the first large-scale study on gender bias in text-to-image (T2I) models, focusing on everyday situations. While previous research has examined biases in occupations, we extend this analysis to gender associations in daily activities, objects, and contexts. We create a dataset of 3,217 gender-neutral prompts and generate 200 images per prompt from five leading T2I models. We automatically detect the perceived gender of people in the generated images and filter out images with no person or multiple people of different genders, leaving 2,293,295 images. To enable a broad analysis of gender bias in T2I models, we group prompts into semantically similar concepts and calculate the proportion of male- and female-gendered images for each prompt. Our analysis shows that T2I models reinforce traditional gender roles, reflect common gender stereotypes in household roles, and underrepresent women in financial related activities. Women are predominantly portrayed in care- and human-centered scenarios, and men in technical or physical labor scenarios.
中文摘要:这项大规模研究表明,文本到图像模型强化了传统性别刻板印象,女性主要被描绘为照顾者角色,而男性则多出现在技术或体力劳动场景中。
English Summary: This large-scale study reveals that text-to-image models reinforce traditional gender stereotypes, with women predominantly depicted in caregiving roles and men in technical or physical labor scenarios.

Authors:Leander Girrbach, Stephan Alaniz, Genevieve Smith, Zeynep Akata
Title: A Large Scale Analysis of Gender Biases in Text-to-Image Generative Models
Abstract:
With the increasing use of image generation technology, understanding its social biases, including gender bias, is essential. This paper presents a large-scale study on gender bias in text-to-image (T2I) models, focusing on everyday situations. While previous research has examined biases in occupations, we extend this analysis to gender associations in daily activities, objects, and contexts. We create a dataset of 3,217 gender-neutral prompts and generate 200 images over 5 prompt variations per prompt from five leading T2I models. We automatically detect the perceived gender of people in the generated images and filter out images with no person or multiple people of different genders, leaving 2,293,295 images. To enable a broad analysis of gender bias in T2I models, we group prompts into semantically similar concepts and calculate the proportion of male- and female-gendered images for each prompt. Our analysis shows that T2I models reinforce traditional gender roles and reflect common gender stereotypes in household roles. Women are predominantly portrayed in care and human-centered scenarios, and men in technical or physical labor scenarios.
中文摘要:这项大规模研究表明,文本到图像模型强化了传统性别刻板印象,女性主要被描绘为照顾者角色,而男性则多出现在技术或体力劳动场景中。
English Summary: This large-scale study reveals that text-to-image models reinforce traditional gender stereotypes, with women predominantly depicted in caregiving roles and men in technical or physical labor scenarios.

Authors:Yichen Li, Yulun Wu, Jinyang Liu, Zhihan Jiang, Zhuangbin Chen, Guangba Yu, Michael R. Lyu
Title: COCA: Generative Root Cause Analysis for Distributed Systems with Code Knowledge
Abstract:
Runtime failures are commonplace in modern distributed systems. When such issues arise, users often turn to platforms such as Github or JIRA to report them and request assistance. Automatically identifying the root cause of these failures is critical for ensuring high reliability and availability. However, prevailing automatic root cause analysis (RCA) approaches rely significantly on comprehensive runtime monitoring data, which is often not fully available in issue platforms. Recent methods leverage large language models (LLMs) to analyze issue reports, but their effectiveness is limited by incomplete or ambiguous user-provided information. To obtain more accurate and comprehensive RCA results, the core idea of this work is to extract additional diagnostic clues from code to supplement data-limited issue reports. Specifically, we propose COCA, a code knowledge enhanced root cause analysis approach for issue reports. Based on the data within issue reports, COCA intelligently extracts relevant code snippets and reconstructs execution paths, providing a comprehensive execution context for further RCA. Subsequently, COCA constructs a prompt combining historical issue reports along with profiled code knowledge, enabling the LLMs to generate detailed root cause summaries and localize responsible components. Our evaluation on datasets from five real-world distributed systems demonstrates that COCA significantly outperforms existing methods, achieving a 28.3% improvement in root cause localization and a 22.0% improvement in root cause summarization. Furthermore, COCA's performance consistency across various LLMs underscores its robust generalizability.
中文摘要:COCA通过将代码知识与问题报告相结合,显著提升了分布式系统中故障根因定位和总结的准确性。
English Summary: COCA enhances root cause analysis by integrating code knowledge with issue reports, significantly improving localization and summarization accuracy across distributed systems.

Authors:Haijie Yang, Zhenyu Zhang, Hao Tang, Jianjun Qian, Jian Yang
Title: Follow Your Motion: A Generic Temporal Consistency Portrait Editing Framework with Trajectory Guidance
Abstract:
Pre-trained conditional diffusion models have demonstrated remarkable potential in image editing. However, they often face challenges with temporal consistency, particularly in the talking head domain, where continuous changes in facial expressions intensify the level of difficulty. These issues stem from the independent editing of individual images and the inherent loss of temporal continuity during the editing process. In this paper, we introduce Follow Your Motion (FYM), a generic framework for maintaining temporal consistency in portrait editing. Specifically, given portrait images rendered by a pre-trained 3D Gaussian Splatting model, we first develop a diffusion model that intuitively and inherently learns motion trajectory changes at different scales and pixel coordinates, from the first frame to each subsequent frame. This approach ensures that temporally inconsistent edited avatars inherit the motion information from the rendered avatars. Secondly, to maintain fine-grained expression temporal consistency in talking head editing, we propose a dynamic re-weighted attention mechanism. This mechanism assigns higher weight coefficients to landmark points in space and dynamically updates these weights based on landmark loss, achieving more consistent and refined facial expressions. Extensive experiments demonstrate that our method outperforms existing approaches in terms of temporal consistency and can be used to optimize and compensate for temporally inconsistent outputs in a range of applications, such as text-driven editing, relighting, and various other applications.
中文: 本文提出的Follow Your Motion (FYM)框架通过从首帧学习运动轨迹并采用动态加权注意力机制,有效解决了人像编辑中的时间一致性问题,尤其在保持精细面部表情连贯性方面表现优异。
English: The paper introduces Follow Your Motion (FYM), a framework that enhances temporal consistency in portrait editing by learning motion trajectories and employing a dynamic attention mechanism to maintain refined facial expressions across frames.

Authors:Hamed Babaei Giglou, Jennifer D'Souza, Oliver Karras, Sören Auer
Title: OntoAligner: A Comprehensive Modular and Robust Python Toolkit for Ontology Alignment
Abstract:
Ontology Alignment (OA) is fundamental for achieving semantic interoperability across diverse knowledge systems. We present OntoAligner, a comprehensive, modular, and robust Python toolkit for ontology alignment, designed to address current limitations with existing tools faced by practitioners. Existing tools are limited in scalability, modularity, and ease of integration with recent AI advances. OntoAligner provides a flexible architecture integrating existing lightweight OA techniques such as fuzzy matching but goes beyond by supporting contemporary methods with retrieval-augmented generation and large language models for OA. The framework prioritizes extensibility, enabling researchers to integrate custom alignment algorithms and datasets. This paper details the design principles, architecture, and implementation of the OntoAligner, demonstrating its utility through benchmarks on standard OA tasks. Our evaluation highlights OntoAligner's ability to handle large-scale ontologies efficiently with few lines of code while delivering high alignment quality. By making OntoAligner open-source, we aim to provide a resource that fosters innovation and collaboration within the OA community, empowering researchers and practitioners with a toolkit for reproducible OA research and real-world applications.
中文: OntoAligner是一个模块化的Python工具包,它通过结合传统方法和先进AI技术来改进本体对齐,为研究和实际应用提供了可扩展性和高性能。
English: OntoAligner is a modular Python toolkit that enhances ontology alignment by integrating traditional methods with advanced AI, offering scalability and high performance for both research and practical applications.

Authors:Patrice Bechard, Chao Wang, Amirhossein Abaskohi, Juan Rodriguez, Christopher Pal, David Vazquez, Spandana Gella, Sai Rajeswar, Perouz Taslakian
Title: StarFlow: Generating Structured Workflow Outputs From Sketch Images
Abstract:
Workflows are a fundamental component of automation in enterprise platforms, enabling the orchestration of tasks, data processing, and system integrations. Despite being widely used, building workflows can be complex, often requiring manual configuration through low-code platforms or visual programming tools. To simplify this process, we explore the use of generative foundation models, particularly vision-language models (VLMs), to automatically generate structured workflows from visual inputs. Translating hand-drawn sketches or computer-generated diagrams into executable workflows is challenging due to the ambiguity of free-form drawings, variations in diagram styles, and the difficulty of inferring execution logic from visual elements. To address this, we introduce StarFlow, a framework for generating structured workflow outputs from sketches using vision-language models. We curate a diverse dataset of workflow diagrams -- including synthetic, manually annotated, and real-world samples -- to enable robust training and evaluation. We finetune and benchmark multiple vision-language models, conducting a series of ablation studies to analyze the strengths and limitations of our approach. Our results show that finetuning significantly enhances structured workflow generation, outperforming large vision-language models on this task.
中文:本研究提出StarFlow框架,利用微调的视觉语言模型将手绘或数字图表自动转化为结构化工作流,通过构建多样化数据集和消融实验,有效解决了自由绘图歧义性等问题,在该任务上显著优于现有大型模型。
English: This research introduces StarFlow, a framework that leverages fine-tuned vision-language models to automatically convert hand-drawn or digital diagrams into structured workflows, overcoming challenges like visual ambiguity and outperforming existing models through curated datasets and ablation studies.

Authors:Zeyad Alghamdi, Tharindu Kumarage, Garima Agrawal, Mansooreh Karami, Ibrahim Almuteb, Huan Liu
Title: RedditESS: A Mental Health Social Support Interaction Dataset -- Understanding Effective Social Support to Refine AI-Driven Support Tools
Abstract:
Effective mental health support is crucial for alleviating psychological distress. While large language model (LLM)-based assistants have shown promise in mental health interventions, existing research often defines "effective" support primarily in terms of empathetic acknowledgments, overlooking other essential dimensions such as informational guidance, community validation, and tangible coping strategies. To address this limitation and better understand what constitutes effective support, we introduce RedditESS, a novel real-world dataset derived from Reddit posts, including supportive comments and original posters' follow-up responses. Grounded in established social science theories, we develop an ensemble labeling mechanism to annotate supportive comments as effective or not and perform qualitative assessments to ensure the reliability of the annotations. Additionally, we demonstrate the practical utility of RedditESS by using it to guide LLM alignment toward generating more context-sensitive and genuinely helpful supportive responses. By broadening the understanding of effective support, our study paves the way for advanced AI-driven mental health interventions.
中文摘要:本研究通过引入RedditESS真实数据集,将有效心理支持的定义从共情扩展到信息指导和应对策略等维度,并展示了该数据集在改进人工智能心理干预方面的实际应用价值。
English Summary: This study introduces RedditESS, a real-world dataset developed to expand the definition of effective mental health support beyond empathy to include informational guidance and coping strategies, and demonstrates its use in improving AI-driven interventions.

Authors:Souradip Chakraborty, Sujay Bhatt, Udari Madhushani Sehwag, Soumya Suvra Ghosal, Jiahao Qiu, Mengdi Wang, Dinesh Manocha, Furong Huang, Alec Koppel, Sumitra Ganesh
Title: Collab: Controlled Decoding using Mixture of Agents for LLM Alignment
Abstract:
Alignment of Large Language models (LLMs) is crucial for safe and trustworthy deployment in applications. Reinforcement learning from human feedback (RLHF) has emerged as an effective technique to align LLMs to human preferences and broader utilities, but it requires updating billions of model parameters, which is computationally expensive. Controlled Decoding, by contrast, provides a mechanism for aligning a model at inference time without retraining. However, single-agent decoding approaches often struggle to adapt to diverse tasks due to the complexity and variability inherent in these tasks. To strengthen the test-time performance w.r.t the target task, we propose a mixture of agent-based decoding strategies leveraging the existing off-the-shelf aligned LLM policies. Treating each prior policy as an agent in the spirit of mixture of agent collaboration, we develop a decoding method that allows for inference-time alignment through a token-level selection strategy among multiple agents. For each token, the most suitable LLM is dynamically chosen from a pool of models based on a long-term utility metric. This policy-switching mechanism ensures optimal model selection at each step, enabling efficient collaboration and alignment among LLMs during decoding. Theoretical analysis of our proposed algorithm establishes optimal performance with respect to the target task represented via a target reward for the given off-the-shelf models. We conduct comprehensive empirical evaluations with open-source aligned models on diverse tasks and preferences, which demonstrates the merits of this approach over single-agent decoding baselines. Notably, Collab surpasses the current SoTA decoding strategy, achieving an improvement of up to 1.56x in average reward and 71.89% in GPT-4 based win-tie rate.
Chinese: 本文提出一种协作解码方法,在推理时通过多智能体混合策略动态选择最适合的预对齐大模型生成每个词元,相比单智能体方法实现了更优的性能表现。
English: This paper introduces a collaborative decoding method that dynamically selects the most suitable pre-aligned LLM for each token during inference, achieving superior performance over single-agent approaches through a mixture-of-agents strategy.

Authors:Kota Dohi, Tomoya Nishida, Harsh Purohit, Takashi Endo, Yohei Kawaguchi
Title: Retrieving Time-Series Differences Using Natural Language Queries
Abstract:
Effectively searching time-series data is essential for system analysis; however, traditional methods often require domain expertise to define search criteria. Recent advancements have enabled natural language-based search, but these methods struggle to handle differences between time-series data. To address this limitation, we propose a natural language query-based approach for retrieving pairs of time-series data based on differences specified in the query. Specifically, we define six key characteristics of differences, construct a corresponding dataset, and develop a contrastive learning-based model to align differences between time-series data with query texts. Experimental results demonstrate that our model achieves an overall mAP score of 0.994 in retrieving time-series pairs.
Chinese: 本研究提出了一种基于自然语言查询的方法,通过定义六个差异特征并采用对比学习模型来检索时间序列数据对,实验结果显示其整体mAP得分高达0.994。
English: This study introduces a natural language query-based method for retrieving time-series data pairs by defining six difference characteristics and employing a contrastive learning model, achieving a high mAP score of 0.994 in experiments.

Authors:Yueying Gao, Dongliang Chang, Bingyao Yu, Haotian Qin, Muxi Diao, Lei Chen, Kongming Liang, Zhanyu Ma
Title: Towards Generalizable Forgery Detection and Reasoning
Abstract:
Accurate and interpretable detection of AI-generated images is essential for mitigating risks associated with AI misuse. However, the substantial domain gap among generative models makes it challenging to develop a generalizable forgery detection model. Moreover, since every pixel in an AI-generated image is synthesized, traditional saliency-based forgery explanation methods are not well suited for this task. To address these challenges, we formulate detection and explanation as a unified Forgery Detection and Reasoning task (FDR-Task), leveraging Multi-Modal Large Language Models (MLLMs) to provide accurate detection through reliable reasoning over forgery attributes. To facilitate this task, we introduce the Multi-Modal Forgery Reasoning dataset (MMFR-Dataset), a large-scale dataset containing 120K images across 10 generative models, with 378K reasoning annotations on forgery attributes, enabling comprehensive evaluation of the FDR-Task. Furthermore, we propose FakeReasoning, a forgery detection and reasoning framework with three key components: 1) a dual-branch visual encoder that integrates CLIP and DINO to capture both high-level semantics and low-level artifacts; 2) a Forgery-Aware Feature Fusion Module that leverages DINO's attention maps and cross-attention mechanisms to guide MLLMs toward forgery-related clues; 3) a Classification Probability Mapper that couples language modeling and forgery detection, enhancing overall performance. Experiments across multiple generative models demonstrate that FakeReasoning not only achieves robust generalization but also outperforms state-of-the-art methods on both detection and reasoning tasks.
中文: 本研究提出了统一的伪造检测与推理任务(FDR-Task)及FakeReasoning框架,通过多模态大语言模型和新型双分支视觉编码器,在多种生成模型的AI图像检测与解释中实现了强大泛化能力和卓越性能。
English: This study introduces the unified Forgery Detection and Reasoning task (FDR-Task) and the FakeReasoning framework, leveraging Multi-Modal Large Language Models and a novel dual-branch visual encoder to achieve robust generalization and superior performance in detecting and explaining AI-generated images across diverse generative models.

Authors:Yuyin Chen, Yida Wang, Xueyang Zhang, Kun Zhan, Peng Jia, Yifei Zhan, Xianpeng Lang
Title: StyledStreets: Multi-style Street Simulator with Spatial and Temporal Consistency
Abstract:
Urban scene reconstruction requires modeling both static infrastructure and dynamic elements while supporting diverse environmental conditions. We present \textbf{StyledStreets}, a multi-style street simulator that achieves instruction-driven scene editing with guaranteed spatial and temporal consistency. Building on a state-of-the-art Gaussian Splatting framework for street scenarios enhanced by our proposed pose optimization and multi-view training, our method enables photorealistic style transfers across seasons, weather conditions, and camera setups through three key innovations: First, a hybrid embedding scheme disentangles persistent scene geometry from transient style attributes, allowing realistic environmental edits while preserving structural integrity. Second, uncertainty-aware rendering mitigates supervision noise from diffusion priors, enabling robust training across extreme style variations. Third, a unified parametric model prevents geometric drift through regularized updates, maintaining multi-view consistency across seven vehicle-mounted cameras. Our framework preserves the original scene's motion patterns and geometric relationships. Qualitative results demonstrate plausible transitions between diverse conditions (snow, sandstorm, night), while quantitative evaluations show state-of-the-art geometric accuracy under style transfers. The approach establishes new capabilities for urban simulation, with applications in autonomous vehicle testing and augmented reality systems requiring reliable environmental consistency. Codes will be publicly available upon publication.
中文摘要:StyledStreets是一种多风格街道模拟器,通过混合嵌入、不确定性感知渲染和统一参数化建模,实现具有时空一致性的逼真指令驱动场景编辑。
English Summary: StyledStreets is a multi-style street simulator that enables photorealistic, instruction-driven scene editing with guaranteed spatial-temporal consistency through hybrid embedding, uncertainty-aware rendering, and unified parametric modeling.

Authors:Zhenghan Yu, Xinyu Hu, Xiaojun Wan
Title: CFunModel: A "Funny" Language Model Capable of Chinese Humor Generation and Processing
Abstract:
Humor plays a significant role in daily language communication. With the rapid development of large language models (LLMs), natural language processing has made significant strides in understanding and generating various genres of texts. However, most LLMs exhibit poor performance in generating and processing Chinese humor. In this study, we introduce a comprehensive Chinese humor-related dataset, the Chinese Fun Set (CFunSet). This dataset aggregates existing Chinese humor datasets and includes over 20,000 jokes collected from Tieba-JokeBar, a Chinese online platform known for joke sharing. The resulting corpus comprises more than 160,000 entries. Leveraging CFunSet, we developed the Chinese Fun Model (CFunModel), the first large language model designed to handle various Chinese humor-related tasks including Crosstalk Response Selection, Humor Recognition, Joke Generation, etc. Experimental results demonstrate that CFunModel outperforms popular large language models in these tasks. Our CFunSet is available at https://huggingface.co/datasets/ZhenghanYU/CFunSet and CFunModel is available at https://huggingface.co/ZhenghanYU/CFunModel. A demostration video of our work is available at https://youtu.be/MOsISOJ66Ms.
Chinese: 本研究推出了包含逾16万条目的中文幽默数据集CFunSet,并开发了首个专用于中文幽默任务的大语言模型CFunModel,该模型在幽默识别与生成等任务中表现优于现有模型。
English: This study introduces CFunSet, a comprehensive Chinese humor dataset with over 160,000 entries, and CFunModel, the first large language model tailored for Chinese humor tasks, which outperforms existing models in humor recognition and generation.

Authors:Rongyu Zhang, Menghang Dong, Yuan Zhang, Liang Heng, Xiaowei Chi, Gaole Dai, Li Du, Yuan Du, Shanghang Zhang
Title: MoLe-VLA: Dynamic Layer-skipping Vision Language Action Model via Mixture-of-Layers for Efficient Robot Manipulation
Abstract:
Multimodal Large Language Models (MLLMs) excel in understanding complex language and visual data, enabling generalist robotic systems to interpret instructions and perform embodied tasks. Nevertheless, their real-world deployment is hindered by substantial computational and storage demands. Recent insights into the homogeneous patterns in the LLM layer have inspired sparsification techniques to address these challenges, such as early exit and token pruning. However, these methods often neglect the critical role of the final layers that encode the semantic information most relevant to downstream robotic tasks. Aligning with the recent breakthrough of the Shallow Brain Hypothesis (SBH) in neuroscience and the mixture of experts in model sparsification, we conceptualize each LLM layer as an expert and propose a Mixture-of-Layers Vision-Language-Action model (MoLe-VLA, or simply MoLe) architecture for dynamic LLM layer activation. We introduce a Spatial-Temporal Aware Router (STAR) for MoLe to selectively activate only parts of the layers based on the robot's current state, mimicking the brain's distinct signal pathways specialized for cognition and causal reasoning. Additionally, to compensate for the cognitive ability of LLMs lost in MoLe, we devise a Cognition Self-Knowledge Distillation (CogKD) framework. CogKD enhances the understanding of task demands and improves the generation of task-relevant action sequences by leveraging cognitive features. Extensive experiments conducted in both RLBench simulation and real-world environments demonstrate the superiority of MoLe-VLA in both efficiency and performance. Specifically, MoLe-VLA achieves an 8% improvement in the mean success rate across ten tasks while reducing computational costs by up to x5.6 compared to standard LLMs.
中文摘要:提出的MoLe-VLA模型通过时空感知路由器和认知自蒸馏机制动态激活部分网络层,在降低计算成本的同时提升了机器人任务执行成功率。
English Summary: The proposed MoLe-VLA model dynamically activates select LLM layers using a spatial-temporal router and knowledge distillation, achieving higher task success rates with significantly reduced computational costs compared to standard models.

Authors:Zhihan Jiang, Junjie Huang, Zhuangbin Chen, Yichen Li, Guangba Yu, Cong Feng, Yongqiang Yang, Zengyin Yang, Michael R. Lyu
Title: L4: Diagnosing Large-scale LLM Training Failures via Automated Log Analysis
Abstract:
As Large Language Models (LLMs) show their capabilities across various applications, training customized LLMs has become essential for modern enterprises. However, due to the complexity of LLM training, which requires massive computational resources and extensive training time, failures are inevitable during the training process. These failures result in considerable waste of resource and time, highlighting the critical need for effective and efficient failure diagnosis to reduce the cost of LLM training. In this paper, we present the first empirical study on the failure reports of 428 LLM training failures in our production Platform-X between May 2023 and April 2024. Our study reveals that hardware and user faults are the predominant root causes, and current diagnosis processes rely heavily on training logs. Unfortunately, existing log-based diagnostic methods fall short in handling LLM training logs. Considering the unique features of LLM training, we identify three distinct patterns of LLM training logs: cross-job, spatial, and temporal patterns. We then introduce our Log-based Large-scale LLM training failure diagnosis framework, L4, which can automatically extract failure-indicating information (i.e., log events, nodes, stages, and iterations) from extensive training logs, thereby reducing manual effort and facilitating failure recovery. Experimental results on real-world datasets show that L4 outperforms existing approaches in identifying failure-indicating logs and localizing faulty nodes. Furthermore, L4 has been applied in Platform-X and demonstrated its effectiveness in enabling accurate and efficient failure diagnosis.
中文: 大语言模型训练故障主要由硬件和用户错误引起,需要高效的诊断方法,因此开发了L4框架,通过自动化日志分析来提升故障识别与恢复能力。
English: Large Language Model training failures, primarily caused by hardware and user errors, necessitate efficient diagnosis methods, leading to the development of the L4 framework that automates log analysis to enhance failure identification and recovery.

Authors:Zhiyao Ren, Yibing Zhan, Baosheng Yu, Dacheng Tao
Title: Reverse Prompt: Cracking the Recipe Inside Text-to-Image Generation
Abstract:
Text-to-image generation has become increasingly popular, but achieving the desired images often requires extensive prompt engineering. In this paper, we explore how to decode textual prompts from reference images, a process we refer to as image reverse prompt engineering. This technique enables us to gain insights from reference images, understand the creative processes of great artists, and generate impressive new images. To address this challenge, we propose a method known as automatic reverse prompt optimization (ARPO). Specifically, our method refines an initial prompt into a high-quality prompt through an iteratively imitative gradient prompt optimization process: 1) generating a recreated image from the current prompt to instantiate its guidance capability; 2) producing textual gradients, which are candidate prompts intended to reduce the difference between the recreated image and the reference image; 3) updating the current prompt with textual gradients using a greedy search method to maximize the CLIP similarity between prompt and reference image. We compare ARPO with several baseline methods, including handcrafted techniques, gradient-based prompt tuning methods, image captioning, and data-driven selection method. Both quantitative and qualitative results demonstrate that our ARPO converges quickly to generate high-quality reverse prompts. More importantly, we can easily create novel images with diverse styles and content by directly editing these reverse prompts. Code will be made publicly available.
中文: 本文提出自动逆向提示优化方法(ARPO),通过迭代式梯度提示优化从参考图像解码文本描述,能高效生成高质量图像并支持创意编辑。
English: This paper introduces Automatic Reverse Prompt Optimization (ARPO), a method that iteratively refines prompts to decode textual descriptions from reference images, enabling high-quality image generation and creative editing.

Authors:Kelaiti Xiao, Liang Yang, Paerhati Tulajiang, Hongfei Lin
Title: VisualQuest: A Diverse Image Dataset for Evaluating Visual Recognition in LLMs
Abstract:
This paper introduces VisualQuest, a novel image dataset designed to assess the ability of large language models (LLMs) to interpret non-traditional, stylized imagery. Unlike conventional photographic benchmarks, VisualQuest challenges models with images that incorporate abstract, symbolic, and metaphorical elements, requiring the integration of domain-specific knowledge and advanced reasoning. The dataset was meticulously curated through multiple stages of filtering, annotation, and standardization to ensure high quality and diversity. Our evaluations using several state-of-the-art multimodal LLMs reveal significant performance variations that underscore the importance of both factual background knowledge and inferential capabilities in visual recognition tasks. VisualQuest thus provides a robust and comprehensive benchmark for advancing research in multimodal reasoning and model architecture design.
中文摘要:VisualQuest是一个新颖的图像数据集,旨在通过抽象化、符号化的视觉内容评估大型语言模型的多模态推理能力,强调领域知识和逻辑推断在视觉识别任务中的核心作用。
English Summary: VisualQuest is a novel dataset challenging large language models to interpret stylized and symbolic imagery, highlighting the critical role of domain knowledge and reasoning in multimodal AI performance.

Authors:Juncen Guo, Xiaoguang Zhu, Liangyu Teng, Hao Yang, Jing Liu, Yang Liu, Liang Song
Title: Adaptive Weighted Parameter Fusion with CLIP for Class-Incremental Learning
Abstract:
Class-incremental Learning (CIL) enables the model to incrementally absorb knowledge from new classes and build a generic classifier across all previously encountered classes. When the model optimizes with new classes, the knowledge of previous classes is inevitably erased, leading to catastrophic forgetting. Addressing this challenge requires making a trade-off between retaining old knowledge and accommodating new information. However, this balancing process often requires sacrificing some information, which can lead to a partial loss in the model's ability to discriminate between classes. To tackle this issue, we design the adaptive weighted parameter fusion with Contrastive Language-Image Pre-training (CLIP), which not only takes into account the variability of the data distribution of different tasks, but also retains all the effective information of the parameter matrix to the greatest extent. In addition, we introduce a balance factor that can balance the data distribution alignment and distinguishability of adjacent tasks. Experimental results on several traditional benchmarks validate the superiority of the proposed method.
Chinese: 类增量学习在吸收新类知识时易发生灾难性遗忘,而我们采用自适应加权参数融合与CLIP的方法,在权衡旧知识保留与新信息整合的同时,最大程度保持了参数矩阵的有效信息。
English: Class-incremental learning faces catastrophic forgetting when integrating new classes, but our method using adaptive weighted parameter fusion with CLIP effectively balances old knowledge retention and new information integration while preserving parameter matrix information.

Authors:Xiaoqing Zhang, Hanfeng Shi, Xiangyu Li, Haili Ye, Tao Xu, Na Li, Yan Hu, Fan Lv, Jiangfan Chen, Jiang Liu
Title: Adaptive Wavelet Filters as Practical Texture Feature Amplifiers for Parkinson's Disease Screening in OCT
Abstract:
Parkinson's disease (PD) is a prevalent neurodegenerative disorder globally. The eye's retina is an extension of the brain and has great potential in PD screening. Recent studies have suggested that texture features extracted from retinal layers can be adopted as biomarkers for PD diagnosis under optical coherence tomography (OCT) images. Frequency domain learning techniques can enhance the feature representations of deep neural networks (DNNs) by decomposing frequency components involving rich texture features. Additionally, previous works have not exploited texture features for automated PD screening in OCT. Motivated by the above analysis, we propose a novel Adaptive Wavelet Filter (AWF) that serves as the Practical Texture Feature Amplifier to fully leverage the merits of texture features to boost the PD screening performance of DNNs with the aid of frequency domain learning. Specifically, AWF first enhances texture feature representation diversities via channel mixer, then emphasizes informative texture feature representations with the well-designed adaptive wavelet filtering token mixer. By combining the AWFs with the DNN stem, AWFNet is constructed for automated PD screening. Additionally, we introduce a novel Balanced Confidence (BC) Loss by mining the potential of sample-wise predicted probabilities of all classes and class frequency prior, to further boost the PD screening performance and trustworthiness of AWFNet. The extensive experiments manifest the superiority of our AWFNet and BC over state-of-the-art methods in terms of PD screening performance and trustworthiness.
中文摘要:本研究提出AWFNet模型,通过自适应小波滤波器和平衡置信度损失增强深度神经网络,利用视网膜OCT图像的纹理特征提升帕金森病筛查性能和可信度。
English Summary: This study introduces AWFNet, a deep neural network enhanced by an Adaptive Wavelet Filter and Balanced Confidence Loss, which improves Parkinson's disease screening performance and trustworthiness by amplifying texture features from retinal OCT images.

Authors:Renpu Liu, Peng Wang, Donghao Li, Cong Shen, Jing Yang
Title: A Shared Low-Rank Adaptation Approach to Personalized RLHF
Abstract:
Reinforcement Learning from Human Feedback (RLHF) has emerged as a pivotal technique for aligning artificial intelligence systems with human values, achieving remarkable success in fine-tuning large language models. However, existing RLHF frameworks often assume that human preferences are relatively homogeneous and can be captured by a single, unified reward model. This assumption overlooks the inherent diversity and heterogeneity across individuals, limiting the adaptability of RLHF to personalized scenarios and risking misalignments that can diminish user satisfaction and trust in AI systems. In this paper, we address these challenges by introducing Low-Rank Adaptation (LoRA) into the personalized RLHF framework. We apply LoRA in the the aggregated parameter space of all personalized reward functions, thereby enabling efficient learning of personalized reward models from potentially limited local datasets. Our approach exploits potential shared structures among the local ground-truth reward models while allowing for individual adaptation, without relying on restrictive assumptions about shared representations as in prior works. We further establish sample complexity guarantees for our method. Theoretical analysis demonstrates the effectiveness of the proposed approach in capturing both shared and individual-specific structures within heterogeneous human preferences, addressing the dual challenge of personalization requirements and practical data constraints. Experimental results on real-world datasets corroborate the efficiency of our algorithm in the personalized RLHF setting.
中文摘要:本文通过引入低秩自适应技术,在个性化人类反馈强化学习中实现了从有限数据中高效学习多样化奖励模型,同时捕捉人类偏好中的共享结构和个体差异。
English Summary: This paper introduces Low-Rank Adaptation (LoRA) to enhance personalized Reinforcement Learning from Human Feedback by efficiently learning diverse reward models from limited data while capturing both shared and individual preference structures.

Authors:Zhongyu Yang, Jun Chen, Dannong Xu, Junjie Fei, Xiaoqian Shen, Liangbing Zhao, Chun-Mei Feng, Mohamed Elhoseiny
Title: WikiAutoGen: Towards Multi-Modal Wikipedia-Style Article Generation
Abstract:
Knowledge discovery and collection are intelligence-intensive tasks that traditionally require significant human effort to ensure high-quality outputs. Recent research has explored multi-agent frameworks for automating Wikipedia-style article generation by retrieving and synthesizing information from the internet. However, these methods primarily focus on text-only generation, overlooking the importance of multimodal content in enhancing informativeness and engagement. In this work, we introduce WikiAutoGen, a novel system for automated multimodal Wikipedia-style article generation. Unlike prior approaches, WikiAutoGen retrieves and integrates relevant images alongside text, enriching both the depth and visual appeal of generated content. To further improve factual accuracy and comprehensiveness, we propose a multi-perspective self-reflection mechanism, which critically assesses retrieved content from diverse viewpoints to enhance reliability, breadth, and coherence, etc. Additionally, we introduce WikiSeek, a benchmark comprising Wikipedia articles with topics paired with both textual and image-based representations, designed to evaluate multimodal knowledge generation on more challenging topics. Experimental results show that WikiAutoGen outperforms previous methods by 8%-29% on our WikiSeek benchmark, producing more accurate, coherent, and visually enriched Wikipedia-style articles. Our code and examples are available at https://wikiautogen.github.io/
中文:WikiAutoGen提出了一种新颖的多模态维基百科式文章自动生成系统,通过整合相关图像与文本并采用多视角自反思机制来提升事实准确性和连贯性,在WikiSeek基准测试中比先前方法性能提升8%-29%。
English: WikiAutoGen introduces a novel automated system for generating multimodal Wikipedia-style articles by integrating relevant images and text, enhanced with a multi-perspective self-reflection mechanism to improve factual accuracy and coherence, outperforming previous methods by 8%-29% on the new WikiSeek benchmark.

Authors:Yang Liu, Hongjin Wang, Zepu Wang, Xiaoguang Zhu, Jing Liu, Peng Sun, Rui Tang, Jianwei Du, Victor C. M. Leung, Liang Song
Title: CRCL: Causal Representation Consistency Learning for Anomaly Detection in Surveillance Videos
Abstract:
Video Anomaly Detection (VAD) remains a fundamental yet formidable task in the video understanding community, with promising applications in areas such as information forensics and public safety protection. Due to the rarity and diversity of anomalies, existing methods only use easily collected regular events to model the inherent normality of normal spatial-temporal patterns in an unsupervised manner. Previous studies have shown that existing unsupervised VAD models are incapable of label-independent data offsets (e.g., scene changes) in real-world scenarios and may fail to respond to light anomalies due to the overgeneralization of deep neural networks. Inspired by causality learning, we argue that there exist causal factors that can adequately generalize the prototypical patterns of regular events and present significant deviations when anomalous instances occur. In this regard, we propose Causal Representation Consistency Learning (CRCL) to implicitly mine potential scene-robust causal variable in unsupervised video normality learning. Specifically, building on the structural causal models, we propose scene-debiasing learning and causality-inspired normality learning to strip away entangled scene bias in deep representations and learn causal video normality, respectively. Extensive experiments on benchmarks validate the superiority of our method over conventional deep representation learning. Moreover, ablation studies and extension validation show that the CRCL can cope with label-independent biases in multi-scene settings and maintain stable performance with only limited training data available.
中文: 视频异常检测面临现实场景变化和轻微异常识别的挑战,因此提出因果表示一致性学习(CRCL),通过解耦场景偏差并利用因果因素来稳健建模正常视频模式,从而在多场景和有限数据条件下实现优越性能。
English: Video Anomaly Detection faces challenges in handling real-world scene variations and subtle anomalies, leading to the proposal of Causal Representation Consistency Learning (CRCL) to robustly model normal video patterns by disentangling scene biases and leveraging causal factors for improved performance across diverse scenarios.

Authors:Bing Cao, Baoshuo Cai, Changqing Zhang, Qinghua Hu
Title: Dig2DIG: Dig into Diffusion Information Gains for Image Fusion
Abstract:
Image fusion integrates complementary information from multi-source images to generate more informative results. Recently, the diffusion model, which demonstrates unprecedented generative potential, has been explored in image fusion. However, these approaches typically incorporate predefined multimodal guidance into diffusion, failing to capture the dynamically changing significance of each modality, while lacking theoretical guarantees. To address this issue, we reveal a significant spatio-temporal imbalance in image denoising; specifically, the diffusion model produces dynamic information gains in different image regions with denoising steps. Based on this observation, we Dig into the Diffusion Information Gains (Dig2DIG) and theoretically derive a diffusion-based dynamic image fusion framework that provably reduces the upper bound of the generalization error. Accordingly, we introduce diffusion information gains (DIG) to quantify the information contribution of each modality at different denoising steps, thereby providing dynamic guidance during the fusion process. Extensive experiments on multiple fusion scenarios confirm that our method outperforms existing diffusion-based approaches in terms of both fusion quality and inference efficiency.
中文: 提出的Dig2DIG框架通过量化扩散去噪过程中各模态的动态信息增益,构建了具有理论保证的图像融合方法,在多种场景中实现了更优的融合质量与推理效率。
English: The proposed Dig2DIG framework dynamically quantifies modality-specific information gains during diffusion denoising, establishing a theoretically guaranteed fusion method that achieves superior performance across multiple scenarios.

Authors:Tianyu Chen, Xingcheng Fu, Yisen Gao, Haodong Qian, Yuecen Wei, Kun Yan, Haoyi Zhou, Jianxin Li
Title: Galaxy Walker: Geometry-aware VLMs For Galaxy-scale Understanding
Abstract:
Modern vision-language models (VLMs) develop patch embedding and convolution backbone within vector space, especially Euclidean ones, at the very founding. When expanding VLMs to a galaxy scale for understanding astronomical phenomena, the integration of spherical space for planetary orbits and hyperbolic spaces for black holes raises two formidable challenges. a) The current pre-training model is confined to Euclidean space rather than a comprehensive geometric embedding. b) The predominant architecture lacks suitable backbones for anisotropic physical geometries. In this paper, we introduced Galaxy-Walker, a geometry-aware VLM, for the universe-level vision understanding tasks. We proposed the geometry prompt that generates geometry tokens by random walks across diverse spaces on a multi-scale physical graph, along with a geometry adapter that compresses and reshapes the space anisotropy in a mixture-of-experts manner. Extensive experiments demonstrate the effectiveness of our approach, with Galaxy-Walker achieving state-of-the-art performance in both galaxy property estimation ($R^2$ scores up to $0.91$) and morphology classification tasks (up to $+0.17$ F1 improvement in challenging features), significantly outperforming both domain-specific models and general-purpose VLMs.
Chinese: 本文提出Galaxy-Walker几何感知视觉语言模型,通过在多尺度物理图上采用几何提示和适配器解决天文应用中欧几里得空间的局限性,在星系属性估计和形态分类任务中取得了最先进的性能。
English: This paper introduces Galaxy-Walker, a geometry-aware vision-language model that addresses the limitations of Euclidean space in astronomical applications by incorporating geometry prompts and adapters for multi-scale physical graphs, achieving state-of-the-art results in galaxy property estimation and morphology classification.

Authors:Jiacheng Yao, Wei Shi, Wei Xu, Zhaohui Yang, A. Lee Swindlehurst, Dusit Niyato
Title: Byzantine-Resilient Over-the-Air Federated Learning under Zero-Trust Architecture
Abstract:
Over-the-air computation (AirComp) has emerged as an essential approach for enabling communication-efficient federated learning (FL) over wireless networks. Nonetheless, the inherent analog transmission mechanism in AirComp-based FL (AirFL) intensifies challenges posed by potential Byzantine attacks. In this paper, we propose a novel Byzantine-robust FL paradigm for over-the-air transmissions, referred to as federated learning with secure adaptive clustering (FedSAC). FedSAC aims to protect a portion of the devices from attacks through zero trust architecture (ZTA) based Byzantine identification and adaptive device clustering. By conducting a one-step convergence analysis, we theoretically characterize the convergence behavior with different device clustering mechanisms and uneven aggregation weighting factors for each device. Building upon our analytical results, we formulate a joint optimization problem for the clustering and weighting factors in each communication round. To facilitate the targeted optimization, we propose a dynamic Byzantine identification method using historical reputation based on ZTA. Furthermore, we introduce a sequential clustering method, transforming the joint optimization into a weighting optimization problem without sacrificing the optimality. To optimize the weighting, we capitalize on the penalty convex-concave procedure (P-CCP) to obtain a stationary solution. Numerical results substantiate the superiority of the proposed FedSAC over existing methods in terms of both test accuracy and convergence rate.
中文摘要:本文提出FedSAC框架,通过零信任架构的拜占庭攻击识别与自适应设备聚类技术,有效提升无线联邦学习系统的安全性和收敛性能。
English Summary: The paper introduces FedSAC, a Byzantine-robust federated learning framework for wireless networks that uses zero trust architecture for attack identification and adaptive clustering to enhance security and performance.

Authors:Li Liu, Shuzhou Sun, Shuaifeng Zhi, Fan Shi, Zhen Liu, Janne Heikkilä, Yongxiang Liu
Title: A Causal Adjustment Module for Debiasing Scene Graph Generation
Abstract:
While recent debiasing methods for Scene Graph Generation (SGG) have shown impressive performance, these efforts often attribute model bias solely to the long-tail distribution of relationships, overlooking the more profound causes stemming from skewed object and object pair distributions. In this paper, we employ causal inference techniques to model the causality among these observed skewed distributions. Our insight lies in the ability of causal inference to capture the unobservable causal effects between complex distributions, which is crucial for tracing the roots of model bias. Specifically, we introduce the Mediator-based Causal Chain Model (MCCM), which, in addition to modeling causality among objects, object pairs, and relationships, incorporates mediator variables, i.e., cooccurrence distribution, for complementing the causality. Following this, we propose the Causal Adjustment Module (CAModule) to estimate the modeled causal structure, using variables from MCCM as inputs to produce a set of adjustment factors aimed at correcting biased model predictions. Moreover, our method enables the composition of zero-shot relationships, thereby enhancing the model's ability to recognize such relationships. Experiments conducted across various SGG backbones and popular benchmarks demonstrate that CAModule achieves state-of-the-art mean recall rates, with significant improvements also observed on the challenging zero-shot recall rate metric.
中文: 针对场景图生成中的模型偏差问题,本文通过因果推理建模对象和对象对分布的偏斜,提出基于中介变量的因果链模型和因果调整模块,有效纠正预测偏差并增强零样本关系识别能力,实验证明该方法在多个基准上取得了领先的召回率表现。
English: Recent debiasing methods in Scene Graph Generation often overlook the root causes of model bias beyond long-tail relationships, so this paper uses causal inference to model skewed distributions and introduces a Mediator-based Causal Chain Model with a Causal Adjustment Module to correct biases and improve zero-shot relationship recognition, achieving state-of-the-art results in experiments.

Authors:Wenxuan Zhu, Bing Li, Cheng Zheng, Jinjie Mai, Jun Chen, Letian Jiang, Abdullah Hamdi, Sara Rojas Martinez, Chia-Wen Lin, Mohamed Elhoseiny, Bernard Ghanem
Title: 4D-Bench: Benchmarking Multi-modal Large Language Models for 4D Object Understanding
Abstract:
Multimodal Large Language Models (MLLMs) have demonstrated impressive 2D image/video understanding capabilities. However, there are no publicly standardized benchmarks to assess the abilities of MLLMs in understanding the 4D objects (3D objects with temporal evolution over time). In this paper, we introduce 4D-Bench, the first benchmark to evaluate the capabilities of MLLMs in 4D object understanding, featuring tasks in 4D object Question Answering (4D object QA) and 4D object captioning. 4D-Bench provides 4D objects with diverse categories, high-quality annotations, and tasks necessitating multi-view spatial-temporal understanding, different from existing 2D image/video-based benchmarks. With 4D-Bench, we evaluate a wide range of open-source and closed-source MLLMs. The results from the 4D object captioning experiment indicate that MLLMs generally exhibit weaker temporal understanding compared to their appearance understanding, notably, while open-source models approach closed-source performance in appearance understanding, they show larger performance gaps in temporal understanding. 4D object QA yields surprising findings: even with simple single-object videos, MLLMs perform poorly, with state-of-the-art GPT-4o achieving only 63\% accuracy compared to the human baseline of 91\%. These findings highlight a substantial gap in 4D object understanding and the need for further advancements in MLLMs.
中文: 本文提出了首个评估多模态大语言模型对四维物体理解能力的基准4D-Bench,发现现有模型在时序理解方面存在明显不足,即使是GPT-4o等先进模型在四维物体问答任务中的表现也远低于人类水平。
English: This paper introduces 4D-Bench, the first benchmark to evaluate Multimodal Large Language Models' understanding of 4D objects, revealing significant gaps in temporal comprehension and showing that even advanced models like GPT-4o substantially underperform humans in 4D object question answering.

Authors:Houqiang Zhong, Shaocheng Shen, Ke Cai, Zhenglong Wu, Jiangchao Yao, Yuan Cheng, Xuefei Li, Xiaoyun Zhang, Li Song, Qiang Hu
Title: Serial Low-rank Adaptation of Vision Transformer
Abstract:
Fine-tuning large pre-trained vision foundation models in a parameter-efficient manner is critical for downstream vision tasks, considering the practical constraints of computational and storage costs. Low-rank adaptation (LoRA) is a well-established technique in this domain, achieving impressive efficiency by reducing the parameter space to a low-rank form. However, developing more advanced low-rank adaptation methods to reduce parameters and memory requirements remains a significant challenge in resource-constrained application scenarios. In this study, we consider on top of the commonly used vision transformer and propose Serial LoRA, a novel LoRA variant that introduces a shared low-rank matrix serially composite with the attention mechanism. Such a design extracts the underlying commonality of parameters in adaptation, significantly reducing redundancy. Notably, Serial LoRA uses only 1/4 parameters of LoRA but achieves comparable performance in most cases. We conduct extensive experiments on a range of vision foundation models with the transformer structure, and the results confirm consistent superiority of our method.
中文:Serial LoRA是一种新型视觉Transformer参数高效微调方法,通过将共享低秩矩阵与注意力机制串行组合,在保持性能相当的同时将参数减少至原方法的1/4。
English: Serial LoRA is a novel parameter-efficient fine-tuning method for vision transformers that reduces parameters by 75% while maintaining comparable performance through serial composition of shared low-rank matrices with attention mechanisms.

Authors:Jiacheng Yao, Wei Xu, Guangxu Zhu, Zhaohui Yang, Kaibin Huang, Dusit Niyato
Title: Quantized Analog Beamforming Enabled Multi-task Federated Learning Over-the-air
Abstract:
Over-the-air computation (AirComp) has recently emerged as a pivotal technique for communication-efficient federated learning (FL) in resource-constrained wireless networks. Though AirComp leverages the superposition property of multiple access channels for computation, it inherently limits its ability to manage inter-task interference in multi-task computing. In this paper, we propose a quantized analog beamforming scheme at the receiver to enable simultaneous multi-task FL. Specifically, inspiring by the favorable propagation and channel hardening properties of large-scale antenna arrays, a targeted analog beamforming method in closed form is proposed for statistical interference elimination. Analytical results reveal that the interference power vanishes by an order of $\mathcal{O}\left(1/N_r\right)$ with the number of analog phase shifters, $N_r$, irrespective of their quantization precision. Numerical results demonstrate the effectiveness of the proposed analog beamforming method and show that the performance upper bound of ideal learning without errors can be achieved by increasing the number of low-precision analog phase shifters.
Chinese Summary: 本文提出了一种量化模拟波束成形方案,利用大规模天线阵列消除统计干扰,实现多任务联邦学习的同步执行,并通过低精度移相器达到接近理想的学习性能。
English Summary: This paper introduces a quantized analog beamforming scheme that enables simultaneous multi-task federated learning by leveraging large-scale antenna arrays to eliminate statistical interference, achieving near-ideal performance with low-precision phase shifters.

Authors:Sharon Lin, Krishnamurthy, Dvijotham, Jamie Hayes, Chongyang Shi, Ilia Shumailov, Shuang Song
Title: Large Language Models Can Verbatim Reproduce Long Malicious Sequences
Abstract:
Backdoor attacks on machine learning models have been extensively studied, primarily within the computer vision domain. Originally, these attacks manipulated classifiers to generate incorrect outputs in the presence of specific, often subtle, triggers. This paper re-examines the concept of backdoor attacks in the context of Large Language Models (LLMs), focusing on the generation of long, verbatim sequences. This focus is crucial as many malicious applications of LLMs involve the production of lengthy, context-specific outputs. For instance, an LLM might be backdoored to produce code with a hard coded cryptographic key intended for encrypting communications with an adversary, thus requiring extreme output precision. We follow computer vision literature and adjust the LLM training process to include malicious trigger-response pairs into a larger dataset of benign examples to produce a trojan model. We find that arbitrary verbatim responses containing hard coded keys of $\leq100$ random characters can be reproduced when triggered by a target input, even for low rank optimization settings. Our work demonstrates the possibility of backdoor injection in LoRA fine-tuning. Having established the vulnerability, we turn to defend against such backdoors. We perform experiments on Gemini Nano 1.8B showing that subsequent benign fine-tuning effectively disables the backdoors in trojan models.
中文: 本研究探讨了大型语言模型的后门攻击,证明通过LoRA微调可操控模型生成含硬编码密钥等精确长序列输出,并发现良性微调能有效消除此类后门。
English: This study explores backdoor attacks on Large Language Models (LLMs), demonstrating that they can be manipulated to generate precise, lengthy outputs like hardcoded cryptographic keys through LoRA fine-tuning, while also showing that benign fine-tuning can effectively remove these backdoors.

Authors:Giacomo Savazzi, Eugenio Lomurno, Cristian Sbrolli, Agnese Chiatti, Matteo Matteucci
Title: Neuro-Symbolic Scene Graph Conditioning for Synthetic Image Dataset Generation
Abstract:
As machine learning models increase in scale and complexity, obtaining sufficient training data has become a critical bottleneck due to acquisition costs, privacy constraints, and data scarcity in specialised domains. While synthetic data generation has emerged as a promising alternative, a notable performance gap remains compared to models trained on real data, particularly as task complexity grows. Concurrently, Neuro-Symbolic methods, which combine neural networks' learning strengths with symbolic reasoning's structured representations, have demonstrated significant potential across various cognitive tasks. This paper explores the utility of Neuro-Symbolic conditioning for synthetic image dataset generation, focusing specifically on improving the performance of Scene Graph Generation models. The research investigates whether structured symbolic representations in the form of scene graphs can enhance synthetic data quality through explicit encoding of relational constraints. The results demonstrate that Neuro-Symbolic conditioning yields significant improvements of up to +2.59% in standard Recall metrics and +2.83% in No Graph Constraint Recall metrics when used for dataset augmentation. These findings establish that merging Neuro-Symbolic and generative approaches produces synthetic data with complementary structural information that enhances model performance when combined with real data, providing a novel approach to overcome data scarcity limitations even for complex visual reasoning tasks.
中文: 本研究证明,通过将符号推理与神经网络结合的神经符号调节方法,能够利用场景图的关系约束显着提升合成图像数据质量,在场景图生成任务中关键指标最高提升2.83%,为复杂视觉任务中的数据稀缺问题提供了有效解决方案。
English: This study demonstrates that Neuro-Symbolic conditioning, which integrates symbolic reasoning with neural networks, significantly enhances synthetic image data quality for Scene Graph Generation by encoding relational constraints, achieving performance improvements of up to 2.83% in key metrics and offering an effective solution to data scarcity in complex visual tasks.

Authors:Fouad Makiyeh, Huy-Dung Nguyen, Patrick Chareyre, Ramin Hasani, Marc Blanchon, Daniela Rus
Title: Enhancing Steering Estimation with Semantic-Aware GNNs
Abstract:
Steering estimation is a critical task in autonomous driving, traditionally relying on 2D image-based models. In this work, we explore the advantages of incorporating 3D spatial information through hybrid architectures that combine 3D neural network models with recurrent neural networks (RNNs) for temporal modeling, using LiDAR-based point clouds as input. We systematically evaluate four hybrid 3D models, all of which outperform the 2D-only baseline, with the Graph Neural Network (GNN) - RNN model yielding the best results. To reduce reliance on LiDAR, we leverage a pretrained unified model to estimate depth from monocular images, reconstructing pseudo-3D point clouds. We then adapt the GNN-RNN model, originally designed for LiDAR-based point clouds, to work with these pseudo-3D representations, achieving comparable or even superior performance compared to the LiDAR-based model. Additionally, the unified model provides semantic labels for each point, enabling a more structured scene representation. To further optimize graph construction, we introduce an efficient connectivity strategy where connections are predominantly formed between points of the same semantic class, with only 20\% of inter-class connections retained. This targeted approach reduces graph complexity and computational cost while preserving critical spatial relationships. Finally, we validate our approach on the KITTI dataset, achieving a 71% improvement over 2D-only models. Our findings highlight the advantages of 3D spatial information and efficient graph construction for steering estimation, while maintaining the cost-effectiveness of monocular images and avoiding the expense of LiDAR-based systems.
中文: 本研究通过开发融合激光雷达或单目图像生成的伪3D点云时空数据的混合3D模型,结合基于语义的优化图构建方法,显著提升了自动驾驶转向估计性能,在保持成本效益的同时大幅超越传统2D方法。
English: This research enhances autonomous driving steering estimation by developing hybrid 3D models that integrate spatial and temporal data from LiDAR or pseudo-3D point clouds derived from monocular images, achieving significant performance gains over traditional 2D methods through optimized semantic-based graph construction.

Authors:Junjie Hu, Shuyong Gao, Qianyu Guo, Yan Wang, Qishan Wang, Yuang Feng, Wenqiang Zhang
Title: AnimatePainter: A Self-Supervised Rendering Framework for Reconstructing Painting Process
Abstract:
Humans can intuitively decompose an image into a sequence of strokes to create a painting, yet existing methods for generating drawing processes are limited to specific data types and often rely on expensive human-annotated datasets. We propose a novel self-supervised framework for generating drawing processes from any type of image, treating the task as a video generation problem. Our approach reverses the drawing process by progressively removing strokes from a reference image, simulating a human-like creation sequence. Crucially, our method does not require costly datasets of real human drawing processes; instead, we leverage depth estimation and stroke rendering to construct a self-supervised dataset. We model human drawings as "refinement" and "layering" processes and introduce depth fusion layers to enable video generation models to learn and replicate human drawing behavior. Extensive experiments validate the effectiveness of our approach, demonstrating its ability to generate realistic drawings without the need for real drawing process data.
中文: 本研究提出了一种自监督框架,通过逆向去除笔画并模拟人类创作过程,从任意图像生成绘画流程,无需依赖人工标注数据集。
English: The study introduces a self-supervised framework that generates drawing processes from any image by reversing strokes and simulating human-like creation, eliminating the need for human-annotated datasets.

Authors:Shayne Longpre, Kevin Klyman, Ruth E. Appel, Sayash Kapoor, Rishi Bommasani, Michelle Sahar, Sean McGregor, Avijit Ghosh, Borhane Blili-Hamelin, Nathan Butters, Alondra Nelson, Amit Elazari, Andrew Sellars, Casey John Ellis, Dane Sherrets, Dawn Song, Harley Geiger, Ilona Cohen, Lauren McIlvenny, Madhulika Srikumar, Mark M. Jaycox, Markus Anderljung, Nadine Farid Johnson, Nicholas Carlini, Nicolas Miailhe, Nik Marda, Peter Henderson, Rebecca S. Portnoff, Rebecca Weiss, Victoria Westerhoff, Yacine Jernite, Rumman Chowdhury, Percy Liang, Arvind Narayanan
Title: In-House Evaluation Is Not Enough: Towards Robust Third-Party Flaw Disclosure for General-Purpose AI
Abstract:
The widespread deployment of general-purpose AI (GPAI) systems introduces significant new risks. Yet the infrastructure, practices, and norms for reporting flaws in GPAI systems remain seriously underdeveloped, lagging far behind more established fields like software security. Based on a collaboration between experts from the fields of software security, machine learning, law, social science, and policy, we identify key gaps in the evaluation and reporting of flaws in GPAI systems. We call for three interventions to advance system safety. First, we propose using standardized AI flaw reports and rules of engagement for researchers in order to ease the process of submitting, reproducing, and triaging flaws in GPAI systems. Second, we propose GPAI system providers adopt broadly-scoped flaw disclosure programs, borrowing from bug bounties, with legal safe harbors to protect researchers. Third, we advocate for the development of improved infrastructure to coordinate distribution of flaw reports across the many stakeholders who may be impacted. These interventions are increasingly urgent, as evidenced by the prevalence of jailbreaks and other flaws that can transfer across different providers' GPAI systems. By promoting robust reporting and coordination in the AI ecosystem, these proposals could significantly improve the safety, security, and accountability of GPAI systems.
Chinese: 摘要指出通用人工智能系统缺陷报告的基础设施严重不足,并提出三项关键措施——标准化缺陷报告、广泛覆盖的披露计划与法律保护、改进的协调机制,以应对越狱等日益增长的风险,从而提升系统安全性和问责性。
English: The abstract highlights the underdeveloped infrastructure for reporting flaws in general-purpose AI systems and proposes three key interventions—standardized flaw reports, broad-scoped disclosure programs with legal protections, and improved coordination infrastructure—to enhance safety and accountability amid rising risks like jailbreaks.

Authors:Jian Liang, Wenke Huang, Guancheng Wan, Qu Yang, Mang Ye
Title: LoRASculpt: Sculpting LoRA for Harmonizing General and Specialized Knowledge in Multimodal Large Language Models
Abstract:
While Multimodal Large Language Models (MLLMs) excel at generalizing across modalities and tasks, effectively adapting them to specific downstream tasks while simultaneously retaining both general and specialized knowledge remains challenging. Although Low-Rank Adaptation (LoRA) is widely used to efficiently acquire specialized knowledge in MLLMs, it introduces substantial harmful redundancy during visual instruction tuning, which exacerbates the forgetting of general knowledge and degrades downstream task performance. To address this issue, we propose LoRASculpt to eliminate harmful redundant parameters, thereby harmonizing general and specialized knowledge. Specifically, under theoretical guarantees, we introduce sparse updates into LoRA to discard redundant parameters effectively. Furthermore, we propose a Conflict Mitigation Regularizer to refine the update trajectory of LoRA, mitigating knowledge conflicts with the pretrained weights. Extensive experimental results demonstrate that even at very high degree of sparsity ($\le$ 5%), our method simultaneously enhances generalization and downstream task performance. This confirms that our approach effectively mitigates the catastrophic forgetting issue and further promotes knowledge harmonization in MLLMs.
中文摘要:提出的LoRASculpt方法通过稀疏更新和冲突缓解机制消除多模态大语言模型视觉指令调优中的有害冗余,即使在极高稀疏度下也能同时提升泛化能力和下游任务性能。
English Summary: The proposed LoRASculpt method eliminates harmful redundancy in MLLMs' visual instruction tuning through sparse updates and conflict mitigation, enhancing both generalization and task performance even at high sparsity levels.

Authors:Li Fan, Wei Shen, Jing Yang, Cong Shen
Title: Decision Feedback In-Context Learning for Wireless Symbol Detection
Abstract:
Pre-trained Transformers, through in-context learning (ICL), have demonstrated exceptional capabilities to adapt to new tasks using example prompts without model update. Transformer-based wireless receivers, where prompts consist of the pilot data in the form of transmitted and received signal pairs, have shown high detection accuracy when pilot data are abundant. However, pilot information is often costly and limited in practice. In this work, we propose DEcision Feedback IN-ContExt Detection (DEFINED) as a new wireless receiver design, which bypasses channel estimation and directly performs symbol detection using the (sometimes extremely) limited pilot data. The key innovation in DEFINED is the proposed decision feedback mechanism in ICL, where we sequentially incorporate the detected symbols into the prompts as pseudo-labels to improve the detection for subsequent symbols. We further establish an error lower bound and provide theoretical insights into the model's generalization under channel distribution mismatch. Extensive experiments across a broad range of wireless settings demonstrate that a small Transformer trained with DEFINED achieves significant performance improvements over conventional methods, in some cases only needing a single pilot pair to achieve similar performance to the latter with more than 4 pilot pairs.
中文: 提出的DEFINED接收机通过在上下文学习中引入决策反馈机制,利用有限导频数据直接进行符号检测,相比传统方法以极少的导频需求实现了更优的性能。
English: The proposed DEFINED receiver leverages a decision feedback mechanism in in-context learning to enhance symbol detection using limited pilot data, achieving superior performance over traditional methods with minimal pilot requirements.

Authors:Tian Yi Lim, Boyang Sun, Marc Pollefeys, Hermann Blum
Title: Loop Closure from Two Views: Revisiting PGO for Scalable Trajectory Estimation through Monocular Priors
Abstract:
(Visual) Simultaneous Localization and Mapping (SLAM) remains a fundamental challenge in enabling autonomous systems to navigate and understand large-scale environments. Traditional SLAM approaches struggle to balance efficiency and accuracy, particularly in large-scale settings where extensive computational resources are required for scene reconstruction and Bundle Adjustment (BA). However, this scene reconstruction, in the form of sparse pointclouds of visual landmarks, is often only used within the SLAM system because navigation and planning methods require different map representations. In this work, we therefore investigate a more scalable Visual SLAM (VSLAM) approach without reconstruction, mainly based on approaches for two-view loop closures. By restricting the map to a sparse keyframed pose graph without dense geometry representations, our '2GO' system achieves efficient optimization with competitive absolute trajectory accuracy. In particular, we find that recent advancements in image matching and monocular depth priors enable very accurate trajectory optimization from two-view edges. We conduct extensive experiments on diverse datasets, including large-scale scenarios, and provide a detailed analysis of the trade-offs between runtime, accuracy, and map size. Our results demonstrate that this streamlined approach supports real-time performance, scales well in map size and trajectory duration, and effectively broadens the capabilities of VSLAM for long-duration deployments to large environments.
中文摘要:本研究提出"2GO"系统,通过采用稀疏关键帧位姿图和双视图闭环检测,在无需传统场景重建的情况下实现可扩展的视觉SLAM,在大规模环境中以实时性能获得具有竞争力的轨迹精度。
English Summary: This study introduces '2GO', a scalable Visual SLAM system that eliminates traditional scene reconstruction by using a sparse keyframe pose graph and two-view loop closures, achieving efficient real-time performance with competitive trajectory accuracy in large-scale environments.

Authors:Dincy R Arikkat, Vinod P., Rafidha Rehiman K. A., Serena Nicolazzo, Marco Arazzi, Antonino Nocera, Mauro Conti
Title: DroidTTP: Mapping Android Applications with TTP for Cyber Threat Intelligence
Abstract:
The widespread adoption of Android devices for sensitive operations like banking and communication has made them prime targets for cyber threats, particularly Advanced Persistent Threats (APT) and sophisticated malware attacks. Traditional malware detection methods rely on binary classification, failing to provide insights into adversarial Tactics, Techniques, and Procedures (TTPs). Understanding malware behavior is crucial for enhancing cybersecurity defenses. To address this gap, we introduce DroidTTP, a framework mapping Android malware behaviors to TTPs based on the MITRE ATT&CK framework. Our curated dataset explicitly links MITRE TTPs to Android applications. We developed an automated solution leveraging the Problem Transformation Approach (PTA) and Large Language Models (LLMs) to map applications to both Tactics and Techniques. Additionally, we employed Retrieval-Augmented Generation (RAG) with prompt engineering and LLM fine-tuning for TTP predictions. Our structured pipeline includes dataset creation, hyperparameter tuning, data augmentation, feature selection, model development, and SHAP-based model interpretability. Among LLMs, Llama achieved the highest performance in Tactic classification with a Jaccard Similarity of 0.9583 and Hamming Loss of 0.0182, and in Technique classification with a Jaccard Similarity of 0.9348 and Hamming Loss of 0.0127. However, the Label Powerset XGBoost model outperformed LLMs, achieving a Jaccard Similarity of 0.9893 for Tactic classification and 0.9753 for Technique classification, with a Hamming Loss of 0.0054 and 0.0050, respectively. While XGBoost showed superior performance, the narrow margin highlights the potential of LLM-based approaches in TTP classification.
中文: DroidTTP框架通过将安卓恶意软件行为映射至MITRE ATT&CK攻击框架的TTPs,弥补了传统检测方法的不足;虽然XGBoost模型在分类任务中表现优于大语言模型,但微小性能差距揭示了大语言模型在网络安全领域的应用潜力。
English: The DroidTTP framework addresses limitations in traditional Android malware detection by mapping malicious behaviors to MITRE ATT&CK TTPs, where XGBoost outperformed LLMs in classification tasks but the narrow performance gap demonstrates LLMs' potential in cybersecurity applications.

Authors:Philip Huang, Ruixuan Liu, Shobhit Aggarwal, Changliu Liu, Jiaoyang Li
Title: APEX-MR: Multi-Robot Asynchronous Planning and Execution for Cooperative Assembly
Abstract:
Compared to a single-robot workstation, a multi-robot system offers several advantages: 1) it expands the system's workspace, 2) improves task efficiency, and, more importantly, 3) enables robots to achieve significantly more complex and dexterous tasks, such as cooperative assembly. However, coordinating the tasks and motions of multiple robots is challenging due to issues, e.g., system uncertainty, task efficiency, algorithm scalability, and safety concerns. To address these challenges, this paper studies multi-robot coordination and proposes APEX-MR, an asynchronous planning and execution framework designed to safely and efficiently coordinate multiple robots to achieve cooperative assembly, e.g., LEGO assembly. In particular, APEX-MR provides a systematic approach to post-process multi-robot tasks and motion plans to enable robust asynchronous execution under uncertainty. Experimental results demonstrate that APEX-MR can significantly speed up the execution time of many long-horizon LEGO assembly tasks by 48% compared to sequential planning and 36% compared to synchronous planning on average. To further demonstrate performance, we deploy APEX-MR in a dual-arm system to perform physical LEGO assembly. To our knowledge, this is the first robotic system capable of performing customized LEGO assembly using commercial LEGO bricks. The experimental results demonstrate that the dual-arm system, with APEX-MR, can safely coordinate robot motions, efficiently collaborate, and construct complex LEGO structures. Our project website is available at https://intelligent-control-lab.github.io/APEX-MR/.
中文: 本文提出了APEX-MR异步规划与执行框架,通过系统化处理多机器人任务与运动规划,在不确定性环境下实现了高效安全的协作装配,实验证明其能大幅提升乐高搭建等任务的执行效率。
English: This paper introduces APEX-MR, an asynchronous planning and execution framework that enhances multi-robot coordination for complex tasks like LEGO assembly, achieving significant efficiency gains and safe operation despite system uncertainties.

Authors:Shravan Nayak, Xiangru Jian, Kevin Qinghong Lin, Juan A. Rodriguez, Montek Kalsi, Rabiul Awal, Nicolas Chapados, M. Tamer Özsu, Aishwarya Agrawal, David Vazquez, Christopher Pal, Perouz Taslakian, Spandana Gella, Sai Rajeswar
Title: UI-Vision: A Desktop-centric GUI Benchmark for Visual Perception and Interaction
Abstract:
Autonomous agents that navigate Graphical User Interfaces (GUIs) to automate tasks like document editing and file management can greatly enhance computer workflows. While existing research focuses on online settings, desktop environments, critical for many professional and everyday tasks, remain underexplored due to data collection challenges and licensing issues. We introduce UI-Vision, the first comprehensive, license-permissive benchmark for offline, fine-grained evaluation of computer use agents in real-world desktop environments. Unlike online benchmarks, UI-Vision provides: (i) dense, high-quality annotations of human demonstrations, including bounding boxes, UI labels, and action trajectories (clicks, drags, and keyboard inputs) across 83 software applications, and (ii) three fine-to-coarse grained tasks-Element Grounding, Layout Grounding, and Action Prediction-with well-defined metrics to rigorously evaluate agents' performance in desktop environments. Our evaluation reveals critical limitations in state-of-the-art models like UI-TARS-72B, including issues with understanding professional software, spatial reasoning, and complex actions like drag-and-drop. These findings highlight the challenges in developing fully autonomous computer use agents. By releasing UI-Vision as open-source, we aim to advance the development of more capable agents for real-world desktop tasks.
中文摘要:UI-Vision是首个用于桌面环境自主代理评估的开源基准,提供详细的人类操作数据和多粒度任务,评估发现现有模型在专业软件使用方面存在显著不足。
English Summary: UI-Vision is the first open-source benchmark for evaluating autonomous agents in desktop environments, providing detailed human demonstrations and tasks to assess performance, revealing significant limitations in current models for professional software use.

Authors:Tharindu Kumarage, Cameron Johnson, Jadie Adams, Lin Ai, Matthias Kirchner, Anthony Hoogs, Joshua Garland, Julia Hirschberg, Arslan Basharat, Huan Liu
Title: Personalized Attacks of Social Engineering in Multi-turn Conversations: LLM Agents for Simulation and Detection
Abstract:
The rapid advancement of conversational agents, particularly chatbots powered by Large Language Models (LLMs), poses a significant risk of social engineering (SE) attacks on social media platforms. SE detection in multi-turn, chat-based interactions is considerably more complex than single-instance detection due to the dynamic nature of these conversations. A critical factor in mitigating this threat is understanding the SE attack mechanisms through which SE attacks operate, specifically how attackers exploit vulnerabilities and how victims' personality traits contribute to their susceptibility. In this work, we propose an LLM-agentic framework, SE-VSim, to simulate SE attack mechanisms by generating multi-turn conversations. We model victim agents with varying personality traits to assess how psychological profiles influence susceptibility to manipulation. Using a dataset of over 1000 simulated conversations, we examine attack scenarios in which adversaries, posing as recruiters, funding agencies, and journalists, attempt to extract sensitive information. Based on this analysis, we present a proof of concept, SE-OmniGuard, to offer personalized protection to users by leveraging prior knowledge of the victims personality, evaluating attack strategies, and monitoring information exchanges in conversations to identify potential SE attempts.
中文: 大型语言模型驱动的聊天机器人快速发展加剧了社交媒体上的社会工程攻击风险,为此开发了SE-VSim框架模拟多轮攻击场景,并推出SE-OmniGuard系统根据受害者个性特征提供个性化防护。
English: The rapid evolution of LLM-powered chatbots heightens social engineering risks on social media, prompting the development of SE-VSim to simulate multi-turn attack scenarios and SE-OmniGuard to provide personalized protection based on victims' personality traits.

Authors:Ke Zhang, Chenxi Zhang, Chong Wang, Chi Zhang, YaChen Wu, Zhenchang Xing, Yang Liu, Qingshan Li, Xin Peng
Title: LogiAgent: Automated Logical Testing for REST Systems with LLM-Based Multi-Agents
Abstract:
Automated testing for REST APIs has become essential for ensuring the correctness and reliability of modern web services. While existing approaches primarily focus on detecting server crashes and error codes, they often overlook logical issues that arise due to evolving business logic and domain-specific requirements. To address this limitation, we propose LogiAgent, a novel approach for logical testing of REST systems. Built upon a large language model (LLM)-driven multi-agent framework, LogiAgent integrates a Test Scenario Generator, API Request Executor, and API Response Validator to collaboratively generate, execute, and validate API test scenarios. Unlike traditional testing methods that focus on status codes like 5xx, LogiAgent incorporates logical oracles that assess responses based on business logic, ensuring more comprehensive testing. The system is further enhanced by an Execution Memory component that stores historical API execution data for contextual consistency. We conduct extensive experiments across 12 real-world REST systems, demonstrating that LogiAgent effectively identifies 234 logical issues with an accuracy of 66.19%. Additionally, it basically excels in detecting server crashes and achieves superior test coverage compared to four state-of-the-art REST API testing tools. An ablation study confirms the significant contribution of LogiAgent's memory components to improving test coverage.
中文: LogiAgent采用基于大语言模型的多智能体框架,通过集成测试场景生成、执行与验证组件,结合业务逻辑验证器解决REST API逻辑测试问题,实验证明其在识别逻辑错误和提升测试覆盖率方面表现卓越。
English: LogiAgent introduces a multi-agent framework powered by large language models to address logical issues in REST API testing, integrating scenario generation, execution, and validation with business logic oracles, and demonstrating high effectiveness in identifying issues and achieving superior coverage in experiments.

Authors:Ivor van der Hoog, Lara Ost, Eva Rotenberg, Daniel Rutschmann
Title: Efficient Greedy Discrete Subtrajectory Clustering
Abstract:
We cluster a set of trajectories T using subtrajectories of T. Clustering quality may be measured by the number of clusters, the number of vertices of T that are absent from the clustering, and by the Fréchet distance between subtrajectories in a cluster. A $Δ$-cluster of T is a cluster ${\mathcal{P}}$ of subtrajectories of T with a centre $P \in {\mathcal{P}}$ with complexity $\ell$, where all subtrajectories in ${\mathcal{P}}$ have Fréchet distance at most $Δ$ to $P$. Buchin, Buchin, Gudmundsson, Löffler and Luo present two $O(n^2 + n m \ell)$-time algorithms: SC($\max$, $\ell$, $Δ$, T) computes a single $Δ$-cluster where $P$ has at least $\ell$ vertices and maximises the cardinality $m$ of ${\mathcal{P}}$. SC($m$, $\max$, $Δ$, T) computes a single $Δ$-cluster where ${\mathcal{P}}$ has cardinality $m$ and maximises the complexity $\ell$ of $P$. We use such maximum-cardinality clusters in a greedy clustering algorithm. We provide an efficient implementation of SC($\max$, $\ell$, $Δ$, T) and SC($m$, $\max$, $Δ$, T) that significantly outperforms previous implementations. We use these functions as a subroutine in a greedy clustering algorithm, which performs well when compared to existing subtrajectory clustering algorithms on real-world data. Finally, we observe that, for fixed $Δ$ and T, these two functions always output a point on the Pareto front of some bivariate function $θ(\ell, m)$. We design a new algorithm PSC($Δ$, T) that in $O( n^2 \log^4 n)$ time computes a $2$-approximation of this Pareto front. This yields a broader set of candidate clusters, with comparable quality. We show that using PSC($Δ$, T) as a subroutine improves the clustering quality and performance even further.
中文: 本研究提出了高效的轨迹子轨迹聚类算法,通过贪心算法和帕累托前沿近似,显著提升了聚类质量和性能表现。
English: This study introduces efficient algorithms for clustering trajectory subtrajectories, demonstrating improved performance and clustering quality through greedy algorithms and Pareto front approximation.

Authors:Kaixin Shen, Ruijie Quan, Jiaxu Miao, Jun Xiao, Yi Yang
Title: TarPro: Targeted Protection against Malicious Image Editing
Abstract:
The rapid advancement of image editing techniques has raised concerns about their misuse for generating Not-Safe-for-Work (NSFW) content. This necessitates a targeted protection mechanism that blocks malicious edits while preserving normal editability. However, existing protection methods fail to achieve this balance, as they indiscriminately disrupt all edits while still allowing some harmful content to be generated. To address this, we propose TarPro, a targeted protection framework that prevents malicious edits while maintaining benign modifications. TarPro achieves this through a semantic-aware constraint that only disrupts malicious content and a lightweight perturbation generator that produces a more stable, imperceptible, and robust perturbation for image protection. Extensive experiments demonstrate that TarPro surpasses existing methods, achieving a high protection efficacy while ensuring minimal impact on normal edits. Our results highlight TarPro as a practical solution for secure and controlled image editing.
中文: TarPro是一种针对性保护框架,通过语义感知约束和稳定扰动技术,在有效阻止恶意图像编辑的同时保持正常编辑功能,其保护效果和实用性均优于现有方法。
English: TarPro is a targeted protection framework that effectively blocks malicious image edits while preserving normal editability through semantic-aware constraints and stable perturbations, outperforming existing methods in balancing security and functionality.

Authors:Wei Chen, Han Ding, Meng Yuan, Zhao Zhang, Deqing Wang, Fuzhen Zhuang
Title: Bridging Social Psychology and LLM Reasoning: Conflict-Aware Meta-Review Generation via Cognitive Alignment
Abstract:
The rapid growth of scholarly submissions has overwhelmed traditional peer review systems, driving the need for intelligent automation to preserve scientific rigor. While large language models (LLMs) show promise in automating manuscript critiques, their ability to synthesize high-stakes meta-reviews, which require conflict-aware reasoning and consensus derivation, remains underdeveloped. Existing methods fail to effectively handle conflicting viewpoints within differing opinions, and often introduce additional cognitive biases, such as anchoring effects and conformity bias.To overcome these limitations, we propose the Cognitive Alignment Framework (CAF), a dual-process architecture that transforms LLMs into adaptive scientific arbitrators. By operationalizing Kahneman's dual-process theory, CAF introduces a three-step cognitive pipeline: review initialization, incremental integration, and cognitive alignment.Empirical validation shows that CAF outperforms existing LLM-based methods, with sentiment consistency gains reaching up to 19.47\% and content consistency improving by as much as 12.95\%.
中文: 认知对齐框架(CAF)被提出作为一种双过程架构,通过将大型语言模型转化为适应性科学仲裁者,有效处理元评审中的观点冲突并减少认知偏差,实证验证显示其在情感和内容一致性方面取得了显著提升。
English: The Cognitive Alignment Framework (CAF) is introduced as a dual-process architecture that enhances large language models to function as adaptive scientific arbitrators, effectively addressing conflicting viewpoints and reducing cognitive biases in meta-reviews, with empirical results showing significant improvements in sentiment and content consistency.

Authors:Tanmay Vilas Samak, Chinmay Vilas Samak, Julia Brault, Cori Harber, Kirsten McCane, Jonathon Smereka, Mark Brudnak, David Gorsich, Venkat Krovi
Title: A Systematic Digital Engineering Approach to Verification & Validation of Autonomous Ground Vehicles in Off-Road Environments
Abstract:
The engineering community currently encounters significant challenges in the systematic development and validation of autonomy algorithms for off-road ground vehicles. These challenges are posed by unusually high test parameters and algorithmic variants. In order to address these pain points, this work presents an optimized digital engineering framework that tightly couples digital twin simulations with model-based systems engineering (MBSE) and model-based design (MBD) workflows. The efficacy of the proposed framework is demonstrated through an end-to-end case study of an autonomous light tactical vehicle (LTV) performing visual servoing to drive along a dirt road and reacting to any obstacles or environmental changes. The presented methodology allows for traceable requirements engineering, efficient variant management, granular parameter sweep setup, systematic test-case definition, and automated execution of the simulations. The candidate off-road autonomy algorithm is evaluated for satisfying requirements against a battery of 128 test cases, which is procedurally generated based on the test parameters (times of the day and weather conditions) and algorithmic variants (perception, planning, and control sub-systems). Finally, the test results and key performance indicators are logged, and the test report is generated automatically. This then allows for manual as well as automated data analysis with traceability and tractability across the digital thread.
中文: 本研究提出了一种优化的数字工程框架,将数字孪生仿真与基于模型的系统工程和基于模型的设计相结合,通过包含128个程序生成测试场景的完整案例研究,系统性地开发和验证了越野自动驾驶车辆的算法。
English: This study introduces an optimized digital engineering framework integrating digital twin simulations with MBSE and MBD to systematically develop and validate off-road autonomous vehicle algorithms, demonstrated through a comprehensive case study involving 128 procedurally generated test scenarios.

Authors:Ziyu Wang, Elahe Khatibi, Kianoosh Kazemi, Iman Azimi, Sanaz Mousavi, Shaista Malik, Amir M. Rahmani
Title: TransECG: Leveraging Transformers for Explainable ECG Re-identification Risk Analysis
Abstract:
Electrocardiogram (ECG) signals are widely shared across multiple clinical applications for diagnosis, health monitoring, and biometric authentication. While valuable for healthcare, they also carry unique biometric identifiers that pose privacy risks, especially when ECG data shared across multiple entities. These risks are amplified in shared environments, where re-identification threats can compromise patient privacy. Existing deep learning re-identification models prioritize accuracy but lack explainability, making it challenging to understand how the unique biometric characteristics encoded within ECG signals are recognized and utilized for identification. Without these insights, despite high accuracy, developing secure and trustable ECG data-sharing frameworks remains difficult, especially in diverse, multi-source environments. In this work, we introduce TransECG, a Vision Transformer (ViT)-based method that uses attention mechanisms to pinpoint critical ECG segments associated with re-identification tasks like gender, age, and participant ID. Our approach demonstrates high accuracy (89.9% for gender, 89.9% for age, and 88.6% for ID re-identification) across four real-world datasets with 87 participants. Importantly, we provide key insights into ECG components such as the R-wave, QRS complex, and P-Q interval in re-identification. For example, in the gender classification, the R wave contributed 58.29% to the model's attention, while in the age classification, the P-R interval contributed 46.29%. By combining high predictive performance with enhanced explainability, TransECG provides a robust solution for privacy-conscious ECG data sharing, supporting the development of secure and trusted healthcare data environment.
中文摘要:心电图信号在医疗应用中至关重要,但携带的生物特征标识符存在隐私风险,而TransECG模型通过高精度重识别和可解释性分析关键心电成分,为隐私保护型数据共享提供了解决方案。
English Summary: ECG signals, while essential for healthcare applications, carry biometric identifiers that create privacy risks, and the proposed TransECG model addresses these by achieving high re-identification accuracy while providing explainable insights into critical ECG components.

Authors:Keying Guo, Ruisi He, Mi Yang, Yuxin Zhang, Bo Ai, Haoxiang Zhang, Jiahui Han, Ruifeng Chen
Title: A CGAN-LSTM-Based Framework for Time-Varying Non-Stationary Channel Modeling
Abstract:
Time-varying non-stationary channels, with complex dynamic variations and temporal evolution characteristics, have significant challenges in channel modeling and communication system performance evaluation. Most existing methods of time-varying channel modeling focus on predicting channel state at a given moment or simulating short-term channel fluctuations, which are unable to capture the long-term evolution of the channel. This paper emphasizes the generation of long-term dynamic channel to fully capture evolution of non-stationary channel properties. The generated channel not only reflects temporal dynamics but also ensures consistent stationarity. We propose a hybrid deep learning framework that combines conditional generative adversarial networks (CGAN) with long short-term memory (LSTM) networks. A stationarity-constrained approach is designed to ensure temporal correlation of the generated time-series channel. This method can generate channel with required temporal non-stationarity. The model is validated by comparing channel statistical features, and the results show that the generated channel is in good agreement with raw channel and provides good performance in terms of non-stationarity.
中文: 本文提出了一种结合条件生成对抗网络和长短期记忆网络的混合深度学习框架,能够生成具有时间相关性和平稳性的长期动态信道,有效捕捉非平稳信道演化特性,并通过统计特征对比验证了其优越性能。
English: This paper introduces a hybrid deep learning framework combining CGAN and LSTM to generate long-term dynamic channels that capture non-stationary evolution while maintaining temporal correlation and stationarity, validated through statistical feature comparisons.

Authors:Tong Zhou, Shijin Duan, Gaowen Liu, Charles Fleming, Ramana Rao Kompella, Shaolei Ren, Xiaolin Xu
Title: ProDiF: Protecting Domain-Invariant Features to Secure Pre-Trained Models Against Extraction
Abstract:
Pre-trained models are valuable intellectual property, capturing both domain-specific and domain-invariant features within their weight spaces. However, model extraction attacks threaten these assets by enabling unauthorized source-domain inference and facilitating cross-domain transfer via the exploitation of domain-invariant features. In this work, we introduce **ProDiF**, a novel framework that leverages targeted weight space manipulation to secure pre-trained models against extraction attacks. **ProDiF** quantifies the transferability of filters and perturbs the weights of critical filters in unsecured memory, while preserving actual critical weights in a Trusted Execution Environment (TEE) for authorized users. A bi-level optimization further ensures resilience against adaptive fine-tuning attacks. Experimental results show that **ProDiF** reduces source-domain accuracy to near-random levels and decreases cross-domain transferability by 74.65\%, providing robust protection for pre-trained models. This work offers comprehensive protection for pre-trained DNN models and highlights the potential of weight space manipulation as a novel approach to model security.
Chinese: ProDiF是一种新颖框架,通过操控未受保护内存中的关键权重并将其安全存储在可信执行环境中,有效保护预训练模型免受提取攻击,将源域准确率降至接近随机水平,并降低跨域可迁移性达74.65%。
English: ProDiF is a novel framework that secures pre-trained models against extraction attacks by manipulating critical weights in unsecured memory and storing them securely in a TEE, effectively reducing source-domain accuracy to near-random levels and cross-domain transferability by 74.65%.

Authors:Chen Liu, Peike Li, Liying Yang, Dadong Wang, Lincheng Li, Xin Yu
Title: Robust Audio-Visual Segmentation via Audio-Guided Visual Convergent Alignment
Abstract:
Accurately localizing audible objects based on audio-visual cues is the core objective of audio-visual segmentation. Most previous methods emphasize spatial or temporal multi-modal modeling, yet overlook challenges from ambiguous audio-visual correspondences such as nearby visually similar but acoustically different objects and frequent shifts in objects' sounding status. Consequently, they may struggle to reliably correlate audio and visual cues, leading to over- or under-segmentation. To address these limitations, we propose a novel framework with two primary components: an audio-guided modality alignment (AMA) module and an uncertainty estimation (UE) module. Instead of indiscriminately correlating audio-visual cues through a global attention mechanism, AMA performs audio-visual interactions within multiple groups and consolidates group features into compact representations based on their responsiveness to audio cues, effectively directing the model's attention to audio-relevant areas. Leveraging contrastive learning, AMA further distinguishes sounding regions from silent areas by treating features with strong audio responses as positive samples and weaker responses as negatives. Additionally, UE integrates spatial and temporal information to identify high-uncertainty regions caused by frequent changes in sound state, reducing prediction errors by lowering confidence in these areas. Experimental results demonstrate that our approach achieves superior accuracy compared to existing state-of-the-art methods, particularly in challenging scenarios where traditional approaches struggle to maintain reliable segmentation.
中文摘要:本文提出了一种新颖的视听分割框架,通过音频引导的模态对齐模块聚焦于音频相关区域,并利用不确定性估计模块降低高不确定性区域的预测误差,有效解决了视听对应模糊问题,在多个挑战性场景中超越了现有最优方法的性能。
English Summary: This paper introduces a novel audio-visual segmentation framework addressing ambiguous audio-visual correspondences through an audio-guided modality alignment module that focuses on audio-relevant regions and an uncertainty estimation module that reduces prediction errors in high-uncertainty areas, achieving superior performance over existing methods.

Authors:Chen Liu, Liying Yang, Peike Li, Dadong Wang, Lincheng Li, Xin Yu
Title: Dynamic Derivation and Elimination: Audio Visual Segmentation with Enhanced Audio Semantics
Abstract:
Sound-guided object segmentation has drawn considerable attention for its potential to enhance multimodal perception. Previous methods primarily focus on developing advanced architectures to facilitate effective audio-visual interactions, without fully addressing the inherent challenges posed by audio natures, \emph{\ie}, (1) feature confusion due to the overlapping nature of audio signals, and (2) audio-visual matching difficulty from the varied sounds produced by the same object. To address these challenges, we propose Dynamic Derivation and Elimination (DDESeg): a novel audio-visual segmentation framework. Specifically, to mitigate feature confusion, DDESeg reconstructs the semantic content of the mixed audio signal by enriching the distinct semantic information of each individual source, deriving representations that preserve the unique characteristics of each sound. To reduce the matching difficulty, we introduce a discriminative feature learning module, which enhances the semantic distinctiveness of generated audio representations. Considering that not all derived audio representations directly correspond to visual features (e.g., off-screen sounds), we propose a dynamic elimination module to filter out non-matching elements. This module facilitates targeted interaction between sounding regions and relevant audio semantics. By scoring the interacted features, we identify and filter out irrelevant audio information, ensuring accurate audio-visual alignment. Comprehensive experiments demonstrate that our framework achieves superior performance in AVS datasets.
中文摘要:DDESeg框架通过重构混合音频的语义内容并引入动态消除机制,有效解决了声音引导物体分割中的特征混淆和视听匹配难题,实现了更精准的多模态感知对齐。
English Summary: The proposed DDESeg framework addresses audio feature confusion and audio-visual matching challenges in sound-guided object segmentation by reconstructing semantic audio content and implementing dynamic feature elimination to ensure precise multimodal alignment.

Authors:Shenghao Fu, Qize Yang, Yuan-Ming Li, Yi-Xing Peng, Kun-Yu Lin, Xihan Wei, Jian-Fang Hu, Xiaohua Xie, Wei-Shi Zheng
Title: ViSpeak: Visual Instruction Feedback in Streaming Videos
Abstract:
Recent advances in Large Multi-modal Models (LMMs) are primarily focused on offline video understanding. Instead, streaming video understanding poses great challenges to recent models due to its time-sensitive, omni-modal and interactive characteristics. In this work, we aim to extend the streaming video understanding from a new perspective and propose a novel task named Visual Instruction Feedback in which models should be aware of visual contents and learn to extract instructions from them. For example, when users wave their hands to agents, agents should recognize the gesture and start conversations with welcome information. Thus, following instructions in visual modality greatly enhances user-agent interactions. To facilitate research, we define seven key subtasks highly relevant to visual modality and collect the ViSpeak-Instruct dataset for training and the ViSpeak-Bench for evaluation. Further, we propose the ViSpeak model, which is a SOTA streaming video understanding LMM with GPT-4o-level performance on various streaming video understanding benchmarks. After finetuning on our ViSpeak-Instruct dataset, ViSpeak is equipped with basic visual instruction feedback ability, serving as a solid baseline for future research.
中文: 本文提出视觉指令反馈新任务,通过ViSpeak模型和配套数据集使多模态模型能够识别视觉指令并启动交互,为实时视频理解研究奠定基础。
English: This paper introduces Visual Instruction Feedback, a novel task for streaming video understanding that enables models to recognize visual cues and initiate interactions, supported by the ViSpeak model and datasets to advance real-time multimodal AI applications.

Authors:Jiafan He, Quanquan Gu
Title: Variance-Dependent Regret Lower Bounds for Contextual Bandits
Abstract:
Variance-dependent regret bounds for linear contextual bandits, which improve upon the classical $\tilde{O}(d\sqrt{K})$ regret bound to $\tilde{O}(d\sqrt{\sum_{k=1}^Kσ_k^2})$, where $d$ is the context dimension, $K$ is the number of rounds, and $σ^2_k$ is the noise variance in round $k$, has been widely studied in recent years. However, most existing works focus on the regret upper bounds instead of lower bounds. To our knowledge, the only lower bound is from Jia et al. (2024), which proved that for any eluder dimension $d_{\textbf{elu}}$ and total variance budget $Λ$, there exists an instance with $\sum_{k=1}^Kσ_k^2\leq Λ$ for which any algorithm incurs a variance-dependent lower bound of $Ω(\sqrt{d_{\textbf{elu}}Λ})$. However, this lower bound has a $\sqrt{d}$ gap with existing upper bounds. Moreover, it only considers a fixed total variance budget $Λ$ and does not apply to a general variance sequence $\{σ_1^2,\ldots,σ_K^2\}$. In this paper, to overcome the limitations of Jia et al. (2024), we consider the general variance sequence under two settings. For a prefixed sequence, where the entire variance sequence is revealed to the learner at the beginning of the learning process, we establish a variance-dependent lower bound of $Ω(d \sqrt{\sum_{k=1}^Kσ_k^2 }/\log K)$ for linear contextual bandits. For an adaptive sequence, where an adversary can generate the variance $σ_k^2$ in each round $k$ based on historical observations, we show that when the adversary must generate $σ_k^2$ before observing the decision set $\mathcal{D}_k$, a similar lower bound of $Ω(d\sqrt{ \sum_{k=1}^Kσ_k^2} /\log^6(dK))$ holds. In both settings, our results match the upper bounds of the SAVE algorithm (Zhao et al., 2023) up to logarithmic factors.
本文针对线性上下文赌博机问题建立了与现有上界在多项式对数因子内匹配的方差相关下界,通过考虑预定义和自适应两种方差序列设置,解决了先前研究的局限性。
This paper establishes variance-dependent lower bounds for linear contextual bandits that match existing upper bounds up to logarithmic factors, addressing limitations in prior work by considering both prefixed and adaptive variance sequences.

Authors:Zhenguang Liu, Chao Shuai, Shaojing Fan, Ziping Dong, Jinwu Hu, Zhongjie Ba, Kui Ren
Title: Harnessing Frequency Spectrum Insights for Image Copyright Protection Against Diffusion Models
Abstract:
Diffusion models have achieved remarkable success in novel view synthesis, but their reliance on large, diverse, and often untraceable Web datasets has raised pressing concerns about image copyright protection. Current methods fall short in reliably identifying unauthorized image use, as they struggle to generalize across varied generation tasks and fail when the training dataset includes images from multiple sources with few identifiable (watermarked or poisoned) samples. In this paper, we present novel evidence that diffusion-generated images faithfully preserve the statistical properties of their training data, particularly reflected in their spectral features. Leveraging this insight, we introduce \emph{CoprGuard}, a robust frequency domain watermarking framework to safeguard against unauthorized image usage in diffusion model training and fine-tuning. CoprGuard demonstrates remarkable effectiveness against a wide range of models, from naive diffusion models to sophisticated text-to-image models, and is robust even when watermarked images comprise a mere 1\% of the training dataset. This robust and versatile approach empowers content owners to protect their intellectual property in the era of AI-driven image generation.
Chinese: 本文提出CoprGuard,一种频域水印框架,利用扩散生成图像中保留的统计特性,即使水印样本极少也能有效防止模型训练中的未授权使用。
English: This paper introduces CoprGuard, a frequency domain watermarking framework that leverages the preserved statistical properties in diffusion-generated images to effectively protect against unauthorized use in model training, even with minimal watermarked samples.

Authors:Leonidas Gkimisis, Igor Pontes Duff, Pawan Goyal, Peter Benner
Title: On the representation of energy-preserving quadratic operators with application to Operator Inference
Abstract:
In this work, we investigate a skew-symmetric parameterization for energy-preserving quadratic operators. Earlier, [Goyal et al., 2023] proposed this parameterization to enforce energy-preservation for quadratic terms in the context of dynamical system data-driven inference. We here prove that every energy-preserving quadratic term can be equivalently formulated using a parameterization of the corresponding operator via skew-symmetric matrix blocks. Based on this main finding, we develop an algorithm to compute an equivalent quadratic operator with skew-symmetric sub-matrices, given an arbitrary energy-preserving operator. Consequently, we employ the skew-symmetric sub-matrix representation in the framework of non-intrusive reduced-order modeling (ROM) via Operator Inference (OpInf) for systems with an energy-preserving nonlinearity. To this end, we propose a sequential, linear least-squares (LS) problems formulation for the inference task, to ensure energy-preservation of the data-driven quadratic operator. The potential of this approach is indicated by the numerical results for a 2D Burgers' equation benchmark, compared to classical OpInf. The inferred system dynamics are accurate, while the corresponding operators are faithful to the underlying physical properties of the system.
中文: 本研究证明了所有能量守恒二次算子均可通过斜对称矩阵块表示,并开发了一种顺序最小二乘算法用于在降阶建模中推断此类算子,通过二维Burgers方程基准测试的精确结果验证了其有效性。
English: This study demonstrates that all energy-preserving quadratic operators can be represented using skew-symmetric matrix blocks and develops a sequential least-squares algorithm for inferring such operators in reduced-order modeling, validated through accurate results on a 2D Burgers' equation benchmark.

Authors:Jingwen Deng, Zihao Wang, Shaofei Cai, Anji Liu, Yitao Liang
Title: Open-World Skill Discovery from Unsegmented Demonstrations
Abstract:
Learning skills in open-world environments is essential for developing agents capable of handling a variety of tasks by combining basic skills. Online demonstration videos are typically long but unsegmented, making them difficult to segment and label with skill identifiers. Unlike existing methods that rely on sequence sampling or human labeling, we have developed a self-supervised learning-based approach to segment these long videos into a series of semantic-aware and skill-consistent segments. Drawing inspiration from human cognitive event segmentation theory, we introduce Skill Boundary Detection (SBD), an annotation-free temporal video segmentation algorithm. SBD detects skill boundaries in a video by leveraging prediction errors from a pretrained unconditional action-prediction model. This approach is based on the assumption that a significant increase in prediction error indicates a shift in the skill being executed. We evaluated our method in Minecraft, a rich open-world simulator with extensive gameplay videos available online. Our SBD-generated segments improved the average performance of conditioned policies by 63.7% and 52.1% on short-term atomic skill tasks, and their corresponding hierarchical agents by 11.3% and 20.8% on long-horizon tasks. Our method can leverage the diverse YouTube videos to train instruction-following agents. The project page can be found in https://craftjarvis.github.io/SkillDiscovery.
中文: 我们开发了一种自监督的技能边界检测算法,通过检测预测误差峰值将未标记视频分割为技能一致的片段,在《我的世界》中显著提升了智能体在原子技能和分层任务中的表现。
English: We developed a self-supervised Skill Boundary Detection algorithm that segments unlabeled videos into skill-consistent segments by detecting prediction error spikes, significantly improving agent performance in both atomic and hierarchical tasks in Minecraft.

Authors:Hao He, Ceyuan Yang, Shanchuan Lin, Yinghao Xu, Meng Wei, Liangke Gui, Qi Zhao, Gordon Wetzstein, Lu Jiang, Hongsheng Li
Title: CameraCtrl II: Dynamic Scene Exploration via Camera-controlled Video Diffusion Models
Abstract:
This paper introduces CameraCtrl II, a framework that enables large-scale dynamic scene exploration through a camera-controlled video diffusion model. Previous camera-conditioned video generative models suffer from diminished video dynamics and limited range of viewpoints when generating videos with large camera movement. We take an approach that progressively expands the generation of dynamic scenes -- first enhancing dynamic content within individual video clip, then extending this capability to create seamless explorations across broad viewpoint ranges. Specifically, we construct a dataset featuring a large degree of dynamics with camera parameter annotations for training while designing a lightweight camera injection module and training scheme to preserve dynamics of the pretrained models. Building on these improved single-clip techniques, we enable extended scene exploration by allowing users to iteratively specify camera trajectories for generating coherent video sequences. Experiments across diverse scenarios demonstrate that CameraCtrl Ii enables camera-controlled dynamic scene synthesis with substantially wider spatial exploration than previous approaches.
中文: CameraCtrl II 框架通过先增强单个视频片段的动态内容,再扩展至大范围视角的无缝探索,实现了基于相机控制的动态场景合成,显著拓宽了空间探索能力。
English: CameraCtrl II is a framework that progressively enhances dynamic scene generation by first improving video clip dynamics and then enabling seamless, wide-ranging camera-controlled explorations through iterative trajectory inputs.

Authors:Yiming Jia, Jiachen Li, Xiang Yue, Bo Li, Ping Nie, Kai Zou, Wenhu Chen
Title: VisualWebInstruct: Scaling up Multimodal Instruction Data through Web Search
Abstract:
Vision-Language Models have made significant progress on many perception-focused tasks. However, their progress on reasoning-focused tasks remains limited due to the lack of high-quality and diverse training data. In this work, we aim to address the scarcity of reasoning-focused multimodal datasets. We propose VisualWebInstruct, a novel approach that leverages search engines to create a diverse and high-quality dataset spanning multiple disciplines, including mathematics, physics, finance, and chemistry, etc. Starting with a meticulously selected set of 30,000 seed images, we employ Google Image Search to identify websites containing similar images. We collect and process HTML data from over 700K unique URLs. Through a pipeline of content extraction, filtering, and synthesis, we construct a dataset of approximately 900K question-answer (QA) pairs, with 40% consisting of visual QA pairs and the remaining comprising text-based QA pairs. Models fine-tuned on VisualWebInstruct demonstrate significant performance improvements: (1) fine-tuning on Llava-OV results in 10-20 absolute points improvement across benchmarks, and (2) fine-tuning from MAmmoTH-VL yields a 5 absolute points gain across benchmarks. Our best model, MAmmoTH-VL2, achieves state-of-the-art performance within the 10B parameter class on MMMU-Pro (40.7), MathVerse (42.6), and DynaMath (55.7). These results highlight the effectiveness of our dataset in enhancing the reasoning capabilities of vision-language models for complex multimodal tasks.
中文: VisualWebInstruct通过从网络资源构建多样化数据集,解决了多模态推理数据匮乏的问题,显著提升了模型在复杂任务上的性能。
English: VisualWebInstruct addresses the scarcity of reasoning-focused multimodal data by creating a diverse dataset from web sources, significantly boosting model performance on complex tasks.

Authors:Derun Li, Jianwei Ren, Yue Wang, Xin Wen, Pengxiang Li, Leimeng Xu, Kun Zhan, Zhongpu Xia, Peng Jia, Xianpeng Lang, Ningyi Xu, Hang Zhao
Title: Finetuning Generative Trajectory Model with Reinforcement Learning from Human Feedback
Abstract:
Generating human-like and adaptive trajectories is essential for autonomous driving in dynamic environments. While generative models have shown promise in synthesizing feasible trajectories, they often fail to capture the nuanced variability of human driving styles due to dataset biases and distributional shifts. To address this, we introduce TrajHF, a human feedback-driven finetuning framework for generative trajectory models, designed to align motion planning with diverse driving preferences. TrajHF incorporates multi-conditional denoiser and reinforcement learning with human feedback to refine multi-modal trajectory generation beyond conventional imitation learning. This enables better alignment with human driving preferences while maintaining safety and feasibility constraints. TrajHF achieves PDMS of 93.95 on NavSim benchmark, significantly exceeding other methods. TrajHF sets a new paradigm for personalized and adaptable trajectory generation in autonomous driving.
Chinese Summary: TrajHF提出了一种基于人类反馈的优化框架,通过多条件去噪器和强化学习改进生成式轨迹模型,使其在保持安全可行的同时更好地匹配个性化驾驶风格,在NavSim基准测试中达到领先水平。
English Summary: TrajHF introduces a human feedback-driven framework that enhances generative trajectory models to better align with personalized driving styles while maintaining safety and feasibility constraints, achieving state-of-the-art performance on the NavSim benchmark.

Authors:Derun Li, Changye Li, Yue Wang, Jianwei Ren, Xin Wen, Pengxiang Li, Leimeng Xu, Kun Zhan, Peng Jia, Xianpeng Lang, Ningyi Xu, Hang Zhao
Title: Learning Personalized Driving Styles via Reinforcement Learning from Human Feedback
Abstract:
Generating human-like and adaptive trajectories is essential for autonomous driving in dynamic environments. While generative models have shown promise in synthesizing feasible trajectories, they often fail to capture the nuanced variability of personalized driving styles due to dataset biases and distributional shifts. To address this, we introduce TrajHF, a human feedback-driven finetuning framework for generative trajectory models, designed to align motion planning with diverse driving styles. TrajHF incorporates multi-conditional denoiser and reinforcement learning with human feedback to refine multi-modal trajectory generation beyond conventional imitation learning. This enables better alignment with human driving preferences while maintaining safety and feasibility constraints. TrajHF achieves performance comparable to the state-of-the-art on NavSim benchmark. TrajHF sets a new paradigm for personalized and adaptable trajectory generation in autonomous driving.
Chinese Summary: TrajHF提出了一种基于人类反馈的优化框架,通过多条件去噪器和强化学习改进生成式轨迹模型,使其在保持安全可行的同时更好地匹配个性化驾驶风格,在NavSim基准测试中达到领先水平。
English Summary: TrajHF introduces a human feedback-driven framework that enhances generative trajectory models to better align with personalized driving styles while maintaining safety and feasibility constraints, achieving state-of-the-art performance on the NavSim benchmark.

Authors:Zebin He, Mingxin Yang, Shuhui Yang, Yixuan Tang, Tao Wang, Kaihao Zhang, Guanying Chen, Yuhong Liu, Jie Jiang, Chunchao Guo, Wenhan Luo
Title: MaterialMVP: Illumination-Invariant Material Generation via Multi-view PBR Diffusion
Abstract:
Physically-based rendering (PBR) has become a cornerstone in modern computer graphics, enabling realistic material representation and lighting interactions in 3D scenes. In this paper, we present MaterialMVP, a novel end-to-end model for generating PBR textures from 3D meshes and image prompts, addressing key challenges in multi-view material synthesis. Our approach leverages Reference Attention to extract and encode informative latent from the input reference images, enabling intuitive and controllable texture generation. We also introduce a Consistency-Regularized Training strategy to enforce stability across varying viewpoints and illumination conditions, ensuring illumination-invariant and geometrically consistent results. Additionally, we propose Dual-Channel Material Generation, which separately optimizes albedo and metallic-roughness (MR) textures while maintaining precise spatial alignment with the input images through Multi-Channel Aligned Attention. Learnable material embeddings are further integrated to capture the distinct properties of albedo and MR. Experimental results demonstrate that our model generates PBR textures with realistic behavior across diverse lighting scenarios, outperforming existing methods in both consistency and quality for scalable 3D asset creation.
Chinese: MaterialMVP提出了一种端到端模型,通过创新的注意力机制和训练策略,从3D网格和图像提示生成PBR纹理,在可扩展的3D资产创建中实现了卓越的一致性和质量。
English: MaterialMVP introduces an end-to-end model that generates PBR textures from 3D meshes and image prompts using innovative attention mechanisms and training strategies, achieving superior consistency and quality in scalable 3D asset creation.

Authors:Yifeng Cai, Ziqi Zhang, Mengyu Yao, Junlin Liu, Xiaoke Zhao, Xinyi Fu, Ruoyu Li, Zhe Li, Xiangqun Chen, Yao Guo, Ding Li
Title: I Can Tell Your Secrets: Inferring Privacy Attributes from Mini-app Interaction History in Super-apps
Abstract:
Super-apps have emerged as comprehensive platforms integrating various mini-apps to provide diverse services. While super-apps offer convenience and enriched functionality, they can introduce new privacy risks. This paper reveals a new privacy leakage source in super-apps: mini-app interaction history, including mini-app usage history (Mini-H) and operation history (Op-H). Mini-H refers to the history of mini-apps accessed by users, such as their frequency and categories. Op-H captures user interactions within mini-apps, including button clicks, bar drags, and image views. Super-apps can naturally collect these data without instrumentation due to the web-based feature of mini-apps. We identify these data types as novel and unexplored privacy risks through a literature review of 30 papers and an empirical analysis of 31 super-apps. We design a mini-app interaction history-oriented inference attack (THEFT), to exploit this new vulnerability. Using THEFT, the insider threats within the low-privilege business department of the super-app vendor acting as the adversary can achieve more than 95.5% accuracy in inferring privacy attributes of over 16.1% of users. THEFT only requires a small training dataset of 200 users from public breached databases on the Internet. We also engage with super-app vendors and a standards association to increase industry awareness and commitment to protect this data. Our contributions are significant in identifying overlooked privacy risks, demonstrating the effectiveness of a new attack, and influencing industry practices toward better privacy protection in the super-app ecosystem.
中文: 本文发现超级应用中的小程序交互历史构成新的隐私泄露源,通过THEFT攻击证明攻击者仅需少量训练数据即可以超过95.5%的准确率推断用户隐私属性。
English: This paper identifies mini-app interaction history as a new privacy leakage source in super-apps, demonstrating through the THEFT attack that adversaries can infer user privacy attributes with over 95.5% accuracy using minimal training data.

Authors:Yifeng Cai, Ziqi Zhang, Ding Li, Yao Guo, Xiangqun Chen
Title: Moss: Proxy Model-based Full-Weight Aggregation in Federated Learning with Heterogeneous Models
Abstract:
Modern Federated Learning (FL) has become increasingly essential for handling highly heterogeneous mobile devices. Current approaches adopt a partial model aggregation paradigm that leads to sub-optimal model accuracy and higher training overhead. In this paper, we challenge the prevailing notion of partial-model aggregation and propose a novel "full-weight aggregation" method named Moss, which aggregates all weights within heterogeneous models to preserve comprehensive knowledge. Evaluation across various applications demonstrates that Moss significantly accelerates training, reduces on-device training time and energy consumption, enhances accuracy, and minimizes network bandwidth utilization when compared to state-of-the-art baselines.
中文摘要:本文提出的Moss方法通过全权重聚合解决联邦学习中部分模型聚合的不足,在多种应用中显著提升了训练效率、精度并降低了资源消耗。
English Summary: The proposed Moss method introduces full-weight aggregation in Federated Learning to overcome the limitations of partial model aggregation, significantly improving training efficiency, accuracy, and resource utilization across diverse applications.

Authors:Haoyu Huang, Yongfeng Huang, Junjie Yang, Zhenyu Pan, Yongqiang Chen, Kaili Ma, Hongzhi Chen, James Cheng
Title: HiRAG: Retrieval-Augmented Generation with Hierarchical Knowledge
Abstract:
Graph-based Retrieval-Augmented Generation (RAG) methods have significantly enhanced the performance of large language models (LLMs) in domain-specific tasks. However, existing RAG methods do not adequately utilize the naturally inherent hierarchical knowledge in human cognition, which limits the capabilities of RAG systems. In this paper, we introduce a new RAG approach, called HiRAG, which utilizes hierarchical knowledge to enhance the semantic understanding and structure capturing capabilities of RAG systems in the indexing and retrieval processes. Our extensive experiments demonstrate that HiRAG achieves significant performance improvements over the state-of-the-art baseline methods.
中文: HiRAG作为一种新型图检索增强生成方法,通过利用层次化知识增强语义理解和结构捕捉能力,在实验中显著超越了现有最优方法。
English: HiRAG, a novel graph-based RAG method, leverages hierarchical knowledge to improve semantic understanding and structure capture in domain-specific tasks, outperforming existing approaches in experiments.

Authors:Haoyu Huang, Yongfeng Huang, Junjie Yang, Zhenyu Pan, Yongqiang Chen, Kaili Ma, Hongzhi Chen, James Cheng
Title: Retrieval-Augmented Generation with Hierarchical Knowledge
Abstract:
Graph-based Retrieval-Augmented Generation (RAG) methods have significantly enhanced the performance of large language models (LLMs) in domain-specific tasks. However, existing RAG methods do not adequately utilize the naturally inherent hierarchical knowledge in human cognition, which limits the capabilities of RAG systems. In this paper, we introduce a new RAG approach, called HiRAG, which utilizes hierarchical knowledge to enhance the semantic understanding and structure capturing capabilities of RAG systems in the indexing and retrieval processes. Our extensive experiments demonstrate that HiRAG achieves significant performance improvements over the state-of-the-art baseline methods.
中文: HiRAG作为一种新型图检索增强生成方法,通过利用层次化知识增强语义理解和结构捕捉能力,在实验中显著超越了现有最优方法。
English: HiRAG, a novel graph-based RAG method, leverages hierarchical knowledge to improve semantic understanding and structure capture in domain-specific tasks, outperforming existing approaches in experiments.

Authors:Haoxuan Wang, Jinlong Peng, Qingdong He, Hao Yang, Ying Jin, Jiafu Wu, Xiaobin Hu, Yanjie Pan, Zhenye Gan, Mingmin Chi, Bo Peng, Yabiao Wang
Title: UniCombine: Unified Multi-Conditional Combination with Diffusion Transformer
Abstract:
With the rapid development of diffusion models in image generation, the demand for more powerful and flexible controllable frameworks is increasing. Although existing methods can guide generation beyond text prompts, the challenge of effectively combining multiple conditional inputs while maintaining consistency with all of them remains unsolved. To address this, we introduce UniCombine, a DiT-based multi-conditional controllable generative framework capable of handling any combination of conditions, including but not limited to text prompts, spatial maps, and subject images. Specifically, we introduce a novel Conditional MMDiT Attention mechanism and incorporate a trainable LoRA module to build both the training-free and training-based versions. Additionally, we propose a new pipeline to construct SubjectSpatial200K, the first dataset designed for multi-conditional generative tasks covering both the subject-driven and spatially-aligned conditions. Extensive experimental results on multi-conditional generation demonstrate the outstanding universality and powerful capability of our approach with state-of-the-art performance.
Chinese: UniCombine是一种基于DiT的创新框架,通过新颖的注意力机制和新构建的数据集,有效整合文本、空间图和主体图像等多重条件输入,在多条件生成任务中实现了最先进的性能。
English: UniCombine is a novel DiT-based framework that effectively integrates multiple conditional inputs like text, spatial maps, and subject images through innovative attention mechanisms and a new dataset, achieving state-of-the-art performance in multi-conditional generation.

Authors:Lihua Zhou, Mao Ye, Shuaifeng Li, Nianxin Li, Xiatian Zhu, Lei Deng, Hongbin Liu, Zhen Lei
Title: Bayesian Test-Time Adaptation for Vision-Language Models
Abstract:
Test-time adaptation with pre-trained vision-language models, such as CLIP, aims to adapt the model to new, potentially out-of-distribution test data. Existing methods calculate the similarity between visual embedding and learnable class embeddings, which are initialized by text embeddings, for zero-shot image classification. In this work, we first analyze this process based on Bayes theorem, and observe that the core factors influencing the final prediction are the likelihood and the prior. However, existing methods essentially focus on adapting class embeddings to adapt likelihood, but they often ignore the importance of prior. To address this gap, we propose a novel approach, \textbf{B}ayesian \textbf{C}lass \textbf{A}daptation (BCA), which in addition to continuously updating class embeddings to adapt likelihood, also uses the posterior of incoming samples to continuously update the prior for each class embedding. This dual updating mechanism allows the model to better adapt to distribution shifts and achieve higher prediction accuracy. Our method not only surpasses existing approaches in terms of performance metrics but also maintains superior inference rates and memory usage, making it highly efficient and practical for real-world applications.
中文: 提出的贝叶斯类别适应(BCA)方法通过持续更新类别嵌入和类别先验,提升了处理分布偏移时的预测精度与效率。
English: The proposed Bayesian Class Adaptation (BCA) method enhances test-time adaptation by continuously updating both class embeddings and class priors, improving accuracy and efficiency in handling distribution shifts.

Authors:Jian Zhu, Zhengyu Jia, Tian Gao, Jiaxin Deng, Shidi Li, Lang Zhang, Fu Liu, Peng Jia, Xianpeng Lang
Title: Other Vehicle Trajectories Are Also Needed: A Driving World Model Unifies Ego-Other Vehicle Trajectories in Video Latent Space
Abstract:
Advanced end-to-end autonomous driving systems predict other vehicles' motions and plan ego vehicle's trajectory. The world model that can foresee the outcome of the trajectory has been used to evaluate the autonomous driving system. However, existing world models predominantly emphasize the trajectory of the ego vehicle and leave other vehicles uncontrollable. This limitation hinders their ability to realistically simulate the interaction between the ego vehicle and the driving scenario. In this paper, we propose a driving World Model named EOT-WM, unifying Ego-Other vehicle Trajectories in videos for driving simulation. Specifically, it remains a challenge to match multiple trajectories in the BEV space with each vehicle in the video to control the video generation. We first project ego-other vehicle trajectories in the BEV space into the image coordinate for vehicle-trajectory match via pixel positions. Then, trajectory videos are encoded by the Spatial-Temporal Variational Auto Encoder to align with driving video latents spatially and temporally in the unified visual space. A trajectory-injected diffusion Transformer is further designed to denoise the noisy video latents for video generation with the guidance of ego-other vehicle trajectories. In addition, we propose a metric based on control latent similarity to evaluate the controllability of trajectories. Extensive experiments are conducted on the nuScenes dataset, and the proposed model outperforms the state-of-the-art method by 30% in FID and 55% in FVD. The model can also predict unseen driving scenes with self-produced trajectories.
中文: 本文提出EOT-WM驾驶世界模型,通过统一主车与其他车辆的轨迹视频来提升驾驶模拟的真实交互性,在nuScenes数据集上FID和FVD指标分别优于现有最佳方法30%和55%。
English: This paper introduces EOT-WM, a driving world model that unifies ego and other vehicle trajectories in videos to enhance simulation realism by enabling controllable interactions, outperforming existing methods by 30% in FID and 55% in FVD on the nuScenes dataset.

Authors:Francesco Marchiori, Mauro Conti
Title: Leaky Batteries: A Novel Set of Side-Channel Attacks on Electric Vehicles
Abstract:
Advancements in battery technology have accelerated the adoption of Electric Vehicles (EVs) due to their environmental benefits. However, their growing sophistication introduces security and privacy challenges. Often seen as mere operational data, battery consumption patterns can unintentionally reveal critical information exploitable for malicious purposes. These risks go beyond privacy, impacting vehicle security and regulatory compliance. Despite these concerns, current research has largely overlooked the broader implications of battery consumption data exposure. As EVs integrate further into smart transportation networks, addressing these gaps is crucial to ensure their safety, reliability, and resilience. In this work, we introduce a novel class of side-channel attacks that exploit EV battery data to extract sensitive user information. Leveraging only battery consumption patterns, we demonstrate a methodology to accurately identify the EV driver and their driving style, determine the number of occupants, and infer the vehicle's start and end locations when user habits are known. We utilize several machine learning models and feature extraction techniques to analyze EV power consumption patterns, validating our approach on simulated and real-world datasets collected from actual drivers. Our attacks achieve an average success rate of 95.4% across all attack objectives. Our findings highlight the privacy risks associated with EV battery data, emphasizing the need for stronger protections to safeguard user privacy and vehicle security.
中文摘要:本研究揭示了一种新型旁路攻击方法,通过分析电动汽车电池耗电模式,能以95.4%的成功率精准推断驾驶员身份、行驶路线等敏感信息,凸显了电动汽车系统存在的重大隐私安全漏洞。
English Summary: This study reveals a novel side-channel attack that exploits electric vehicle battery consumption patterns to accurately infer sensitive user information, including driver identity and travel routes, with a 95.4% success rate, highlighting critical privacy vulnerabilities in EV systems.

Authors:Itay Yona, Ilia Shumailov, Jamie Hayes, Federico Barbero, Yossi Gandelsman
Title: Interpreting the Repeated Token Phenomenon in Large Language Models
Abstract:
Large Language Models (LLMs), despite their impressive capabilities, often fail to accurately repeat a single word when prompted to, and instead output unrelated text. This unexplained failure mode represents a vulnerability, allowing even end-users to diverge models away from their intended behavior. We aim to explain the causes for this phenomenon and link it to the concept of ``attention sinks'', an emergent LLM behavior crucial for fluency, in which the initial token receives disproportionately high attention scores. Our investigation identifies the neural circuit responsible for attention sinks and shows how long repetitions disrupt this circuit. We extend this finding to other non-repeating sequences that exhibit similar circuit disruptions. To address this, we propose a targeted patch that effectively resolves the issue without negatively impacting the model's overall performance. This study provides a mechanistic explanation for an LLM vulnerability, demonstrating how interpretability can diagnose and address issues, and offering insights that pave the way for more secure and reliable models.
中文摘要:大型语言模型因注意力汇聚机制被破坏而无法准确重复单词,但针对性修复可解决此漏洞且不影响整体性能。
English Summary: Large Language Models fail to accurately repeat words due to disrupted attention sink circuits, but a targeted patch can fix this vulnerability without compromising overall performance.

Authors:Yubo Peng, Luping Xiang, Kun Yang, Feibo Jiang, Kezhi Wang, Dapeng Oliver Wu
Title: SIMAC: A Semantic-Driven Integrated Multimodal Sensing And Communication Framework
Abstract:
Traditional single-modality sensing faces limitations in accuracy and capability, and its decoupled implementation with communication systems increases latency in bandwidth-constrained environments. Additionally, single-task-oriented sensing systems fail to address users' diverse demands. To overcome these challenges, we propose a semantic-driven integrated multimodal sensing and communication (SIMAC) framework. This framework leverages a joint source-channel coding architecture to achieve simultaneous sensing decoding and transmission of sensing results. Specifically, SIMAC first introduces a multimodal semantic fusion (MSF) network, which employs two extractors to extract semantic information from radar signals and images, respectively. MSF then applies cross-attention mechanisms to fuse these unimodal features and generate multimodal semantic representations. Secondly, we present a large language model (LLM)-based semantic encoder (LSE), where relevant communication parameters and multimodal semantics are mapped into a unified latent space and input to the LLM, enabling channel-adaptive semantic encoding. Thirdly, a task-oriented sensing semantic decoder (SSD) is proposed, in which different decoded heads are designed according to the specific needs of tasks. Simultaneously, a multi-task learning strategy is introduced to train the SIMAC framework, achieving diverse sensing services. Finally, experimental simulations demonstrate that the proposed framework achieves diverse sensing services and higher accuracy.
中文: 提出的SIMAC框架通过融合多模态语义、基于大语言模型的编码和面向任务的解码,克服了传统单模态感知的局限,实现了多样化、高精度的感知服务并降低了延迟。
English: The proposed SIMAC framework overcomes the limitations of traditional single-modality sensing by integrating multimodal semantic fusion, LLM-based encoding, and task-oriented decoding to achieve diverse, high-accuracy sensing services with reduced latency.

Authors:Jingyi Zheng, Junfeng Wang, Zhen Sun, Wenhan Dong, Yule Liu, Xinlei He
Title: TH-Bench: Evaluating Evading Attacks via Humanizing AI Text on Machine-Generated Text Detectors
Abstract:
As Large Language Models (LLMs) advance, Machine-Generated Texts (MGTs) have become increasingly fluent, high-quality, and informative. Existing wide-range MGT detectors are designed to identify MGTs to prevent the spread of plagiarism and misinformation. However, adversaries attempt to humanize MGTs to evade detection (named evading attacks), which requires only minor modifications to bypass MGT detectors. Unfortunately, existing attacks generally lack a unified and comprehensive evaluation framework, as they are assessed using different experimental settings, model architectures, and datasets. To fill this gap, we introduce the Text-Humanization Benchmark (TH-Bench), the first comprehensive benchmark to evaluate evading attacks against MGT detectors. TH-Bench evaluates attacks across three key dimensions: evading effectiveness, text quality, and computational overhead. Our extensive experiments evaluate 6 state-of-the-art attacks against 13 MGT detectors across 6 datasets, spanning 19 domains and generated by 11 widely used LLMs. Our findings reveal that no single evading attack excels across all three dimensions. Through in-depth analysis, we highlight the strengths and limitations of different attacks. More importantly, we identify a trade-off among three dimensions and propose two optimization insights. Through preliminary experiments, we validate their correctness and effectiveness, offering potential directions for future research.
中文:本文提出了首个全面评估规避机器生成文本检测器的人类化攻击基准TH-Bench,揭示了攻击在规避效果、文本质量和计算开销之间的权衡关系,并提出了优化方向。
English: This paper introduces TH-Bench, the first comprehensive benchmark for evaluating text-humanization attacks that evade machine-generated text detectors, revealing a trade-off between evasion effectiveness, text quality, and computational efficiency while proposing optimization insights.

Authors:José Pombal, Nuno M. Guerreiro, Ricardo Rei, André F. T. Martins
Title: Adding Chocolate to Mint: Mitigating Metric Interference in Machine Translation
Abstract:
As automatic metrics become increasingly stronger and widely adopted, the risk of unintentionally "gaming the metric" during model development rises. This issue is caused by metric interference (MINT), i.e., the use of the same or related metrics for both model tuning and evaluation. MINT can misguide practitioners into being overoptimistic about the performance of their systems: as system outputs become a function of the interfering metric, their estimated quality loses correlation with human judgments. In this work, we analyze two common cases of MINT in machine translation-related tasks: filtering of training data, and decoding with quality signals. Importantly, we find that MINT strongly distorts instance-level metric scores, even when metrics are not directly optimized for-questioning the common strategy of leveraging a different, yet related metric for evaluation that is not used for tuning. To address this problem, we propose MINTADJUST, a method for more reliable evaluation under MINT. On the WMT24 MT shared task test set, MINTADJUST ranks translations and systems more accurately than state-of-the-art metrics across a majority of language pairs, especially for high-quality systems. Furthermore, MINTADJUST outperforms AUTORANK, the ensembling method used by the organizers.
中文: 该研究指出指标干扰(MINT)是导致模型调优与评估使用相关指标时产生性能高估的关键问题,并提出MINTADJUST方法,在机器翻译任务中实现更可靠的评估并优于现有指标。
English: The study identifies metric interference (MINT) as a key issue where using related metrics for both model tuning and evaluation leads to overoptimistic performance estimates, and proposes MINTADJUST, a method that provides more reliable evaluations and outperforms existing metrics in machine translation tasks.

Authors:Xiaosong Jia, Junqi You, Zhiyuan Zhang, Junchi Yan
Title: DriveTransformer: Unified Transformer for Scalable End-to-End Autonomous Driving
Abstract:
End-to-end autonomous driving (E2E-AD) has emerged as a trend in the field of autonomous driving, promising a data-driven, scalable approach to system design. However, existing E2E-AD methods usually adopt the sequential paradigm of perception-prediction-planning, which leads to cumulative errors and training instability. The manual ordering of tasks also limits the system`s ability to leverage synergies between tasks (for example, planning-aware perception and game-theoretic interactive prediction and planning). Moreover, the dense BEV representation adopted by existing methods brings computational challenges for long-range perception and long-term temporal fusion. To address these challenges, we present DriveTransformer, a simplified E2E-AD framework for the ease of scaling up, characterized by three key features: Task Parallelism (All agent, map, and planning queries direct interact with each other at each block), Sparse Representation (Task queries direct interact with raw sensor features), and Streaming Processing (Task queries are stored and passed as history information). As a result, the new framework is composed of three unified operations: task self-attention, sensor cross-attention, temporal cross-attention, which significantly reduces the complexity of system and leads to better training stability. DriveTransformer achieves state-of-the-art performance in both simulated closed-loop benchmark Bench2Drive and real world open-loop benchmark nuScenes with high FPS.
中文:DriveTransformer提出一种简化的端到端自动驾驶框架,通过任务并行、稀疏表示和流式处理解决了累积误差与计算难题,在多个基准测试中取得最优性能并提升训练稳定性。
English: DriveTransformer introduces a simplified end-to-end autonomous driving framework using task parallelism, sparse representation, and streaming processing to overcome cumulative errors and computational challenges, achieving top performance in benchmarks with enhanced stability.

Authors:Congyue Deng, Brandon Y. Feng, Cecilia Garraffo, Alan Garbarz, Robin Walters, William T. Freeman, Leonidas Guibas, Kaiming He
Title: Denoising Hamiltonian Network for Physical Reasoning
Abstract:
Machine learning frameworks for physical problems must capture and enforce physical constraints that preserve the structure of dynamical systems. Many existing approaches achieve this by integrating physical operators into neural networks. While these methods offer theoretical guarantees, they face two key limitations: (i) they primarily model local relations between adjacent time steps, overlooking longer-range or higher-level physical interactions, and (ii) they focus on forward simulation while neglecting broader physical reasoning tasks. We propose the Denoising Hamiltonian Network (DHN), a novel framework that generalizes Hamiltonian mechanics operators into more flexible neural operators. DHN captures non-local temporal relationships and mitigates numerical integration errors through a denoising mechanism. DHN also supports multi-system modeling with a global conditioning mechanism. We demonstrate its effectiveness and flexibility across three diverse physical reasoning tasks with distinct inputs and outputs.
中文: 去噪哈密顿网络将哈密顿力学推广为神经算子,通过捕捉非局部时间关系和全局调节机制,克服了现有方法仅关注局部时间步长和正向模拟的局限,支持多种物理推理任务。
English: The Denoising Hamiltonian Network (DHN) generalizes Hamiltonian mechanics into neural operators to capture non-local temporal relationships and support multi-system modeling, addressing limitations in existing approaches by enabling broader physical reasoning tasks beyond forward simulation.

Authors:Yubo Peng, Luping Xiang, Kun Yang, Kezhi Wang, Merouane Debbah
Title: Semantic Communications with Computer Vision Sensing for Edge Video Transmission
Abstract:
Despite the widespread adoption of vision sensors in edge applications, such as surveillance, the transmission of video data consumes substantial spectrum resources. Semantic communication (SC) offers a solution by extracting and compressing information at the semantic level, preserving the accuracy and relevance of transmitted data while significantly reducing the volume of transmitted information. However, traditional SC methods face inefficiencies due to the repeated transmission of static frames in edge videos, exacerbated by the absence of sensing capabilities, which results in spectrum inefficiency. To address this challenge, we propose a SC with computer vision sensing (SCCVS) framework for edge video transmission. The framework first introduces a compression ratio (CR) adaptive SC (CRSC) model, capable of adjusting CR based on whether the frames are static or dynamic, effectively conserving spectrum resources. Additionally, we implement an object detection and semantic segmentation models-enabled sensing (OSMS) scheme, which intelligently senses the changes in the scene and assesses the significance of each frame through in-context analysis. Hence, The OSMS scheme provides CR prompts to the CRSC model based on real-time sensing results. Moreover, both CRSC and OSMS are designed as lightweight models, ensuring compatibility with resource-constrained sensors commonly used in practical edge applications. Experimental simulations validate the effectiveness of the proposed SCCVS framework, demonstrating its ability to enhance transmission efficiency without sacrificing critical semantic information.
中文: 提出的SCCVS框架通过轻量级计算机视觉模型实时感知场景变化并自适应调整压缩比,在保持语义完整性的同时显著提升了边缘视频传输效率并降低了频谱消耗。
English: The proposed SCCVS framework enhances edge video transmission efficiency by adaptively adjusting compression ratios based on real-time scene changes detected through lightweight computer vision models, significantly reducing spectrum usage while preserving semantic integrity.

Authors:Amira Guesmi, Bassem Ouni, Muhammad Shafique
Title: Breaking the Limits of Quantization-Aware Defenses: QADT-R for Robustness Against Patch-Based Adversarial Attacks in QNNs
Abstract:
Quantized Neural Networks (QNNs) have emerged as a promising solution for reducing model size and computational costs, making them well-suited for deployment in edge and resource-constrained environments. While quantization is known to disrupt gradient propagation and enhance robustness against pixel-level adversarial attacks, its effectiveness against patch-based adversarial attacks remains largely unexplored. In this work, we demonstrate that adversarial patches remain highly transferable across quantized models, achieving over 70\% attack success rates (ASR) even at extreme bit-width reductions (e.g., 2-bit). This challenges the common assumption that quantization inherently mitigates adversarial threats. To address this, we propose Quantization-Aware Defense Training with Randomization (QADT-R), a novel defense strategy that integrates Adaptive Quantization-Aware Patch Generation (A-QAPA), Dynamic Bit-Width Training (DBWT), and Gradient-Inconsistent Regularization (GIR) to enhance resilience against highly transferable patch-based attacks. A-QAPA generates adversarial patches within quantized models, ensuring robustness across different bit-widths. DBWT introduces bit-width cycling during training to prevent overfitting to a specific quantization setting, while GIR injects controlled gradient perturbations to disrupt adversarial optimization. Extensive evaluations on CIFAR-10 and ImageNet show that QADT-R reduces ASR by up to 25\% compared to prior defenses such as PBAT and DWQ. Our findings further reveal that PBAT-trained models, while effective against seen patch configurations, fail to generalize to unseen patches due to quantization shift. Additionally, our empirical analysis of gradient alignment, spatial sensitivity, and patch visibility provides insights into the mechanisms that contribute to the high transferability of patch-based attacks in QNNs.
Chinese: 量化神经网络(QNNs)对高可迁移性的基于补丁的对抗攻击依然脆弱,为此我们提出了QADT-R防御策略,通过自适应补丁生成、动态位宽训练和梯度正则化相结合,显著降低了攻击成功率。
English: Quantized Neural Networks (QNNs) remain vulnerable to highly transferable patch-based adversarial attacks, prompting the development of QADT-R, a novel defense strategy that integrates adaptive patch generation, dynamic bit-width training, and gradient regularization to significantly reduce attack success rates.

Authors:Zhenran Tang, Ruixuan Liu, Changliu Liu
Title: Eye-in-Finger: Smart Fingers for Delicate Assembly and Disassembly of LEGO
Abstract:
Manipulation and insertion of small and tight-toleranced objects in robotic assembly remain a critical challenge for vision-based robotics systems due to the required precision and cluttered environment. Conventional global or wrist-mounted cameras often suffer from occlusions when either assembling or disassembling from an existing structure. To address the challenge, this paper introduces "Eye-in-Finger", a novel tool design approach that enhances robotic manipulation by embedding low-cost, high-resolution perception directly at the tool tip. We validate our approach using LEGO assembly and disassembly tasks, which require the robot to manipulate in a cluttered environment and achieve sub-millimeter accuracy and robust error correction due to the tight tolerances. Experimental results demonstrate that our proposed system enables real-time, fine corrections to alignment error, increasing the tolerance of calibration error from 0.4mm to up to 2.0mm for the LEGO manipulation robot.
中文: 本文提出"指尖之眼"工具设计,通过在工具尖端嵌入高分辨率感知技术,增强了机器人在复杂环境中的操作能力,在乐高积木装配任务中实现了亚毫米级精度和鲁棒误差校正。
English: This paper introduces the "Eye-in-Finger" tool design, embedding high-resolution perception at the tool tip to enhance robotic manipulation in cluttered environments, achieving sub-millimeter accuracy and robust error correction in LEGO assembly tasks.

Authors:Yan Wang, Shijie Zhao, Kai Chen, Kexin Zhang, Junlin Li, Li Zhang
Title: GenDR: Lightning Generative Detail Restorator
Abstract:
Recent research applying text-to-image (T2I) diffusion models to real-world super-resolution (SR) has achieved remarkable success. However, fundamental misalignments between T2I and SR targets result in a dilemma between inference speed and detail fidelity. Specifically, T2I tasks prioritize multi-step inversion to synthesize coherent outputs aligned with textual prompts and shrink the latent space to reduce generating complexity. Contrariwise, SR tasks preserve most information from low-resolution input while solely restoring high-frequency details, thus necessitating sufficient latent space and fewer inference steps. To bridge the gap, we present a one-step diffusion model for generative detail restoration, GenDR, distilled from a tailored diffusion model with larger latent space. In detail, we train a new SD2.1-VAE16 (0.9B) via representation alignment to expand latent space without enlarging the model size. Regarding step-distillation, we propose consistent score identity distillation (CiD) that incorporates SR task-specific loss into score distillation to leverage more SR priors and align the training target. Furthermore, we extend CiD with adversarial learning and representation alignment (CiDA) to enhance perceptual quality and accelerate training. We also polish the pipeline to achieve a more efficient inference. Experimental results demonstrate that GenDR achieves state-of-the-art performance in both quantitative metrics and visual fidelity.
中文摘要:近期将文本到图像扩散模型应用于真实世界超分辨率的研究面临速度与细节保真度的两难问题,GenDR通过具有扩展潜空间的一步式扩散模型和结合超分辨率先验的蒸馏技术,在量化指标与视觉保真度上均达到最优性能。
English Summary: Recent text-to-image diffusion models applied to super-resolution face a trade-off between speed and detail accuracy, which GenDR resolves through a one-step model with expanded latent space and specialized distillation techniques to achieve superior performance.

Authors:Yanjie Pan, Qingdong He, Zhengkai Jiang, Pengcheng Xu, Chaoyi Wang, Jinlong Peng, Haoxuan Wang, Yun Cao, Zhenye Gan, Mingmin Chi, Bo Peng, Yabiao Wang
Title: PixelPonder: Dynamic Patch Adaptation for Enhanced Multi-Conditional Text-to-Image Generation
Abstract:
Recent advances in diffusion-based text-to-image generation have demonstrated promising results through visual condition control. However, existing ControlNet-like methods struggle with compositional visual conditioning - simultaneously preserving semantic fidelity across multiple heterogeneous control signals while maintaining high visual quality, where they employ separate control branches that often introduce conflicting guidance during the denoising process, leading to structural distortions and artifacts in generated images. To address this issue, we present PixelPonder, a novel unified control framework, which allows for effective control of multiple visual conditions under a single control structure. Specifically, we design a patch-level adaptive condition selection mechanism that dynamically prioritizes spatially relevant control signals at the sub-region level, enabling precise local guidance without global interference. Additionally, a time-aware control injection scheme is deployed to modulate condition influence according to denoising timesteps, progressively transitioning from structural preservation to texture refinement and fully utilizing the control information from different categories to promote more harmonious image generation. Extensive experiments demonstrate that PixelPonder surpasses previous methods across different benchmark datasets, showing superior improvement in spatial alignment accuracy while maintaining high textual semantic consistency.
中文: PixelPonder提出了一种统一控制框架,通过局部自适应选择和时间感知注入机制,有效协调多种视觉条件,在扩散式文本生成图像中解决了结构失真问题并显著提升了生成质量。
English: PixelPonder introduces a unified control framework with patch-level adaptive selection and time-aware injection to effectively manage multiple visual conditions, overcoming structural distortions and enhancing image quality in diffusion-based text-to-image generation.

Authors:Jieyang Chen, Qian Gong, Yanliang Li, Xin Liang, Lipeng Wan, Qing Liu, Norbert Podhorszki, Scott Klasky
Title: HPDR: High-Performance Portable Scientific Data Reduction Framework
Abstract:
The rapid growth of scientific data is surpassing advancements in computing, creating challenges in storage, transfer, and analysis, particularly at the exascale. While data reduction techniques such as lossless and lossy compression help mitigate these issues, their computational overhead introduces new bottlenecks. GPU-accelerated approaches improve performance but face challenges in portability, memory transfer, and scalability on multi-GPU systems. To address these, we propose HPDR, a high-performance, portable data reduction framework. HPDR supports diverse processor architectures, reducing memory transfer overhead to 2.3% and achieving up to 3.5x faster throughput than existing solutions. It attains 96% of the theoretical speedup in multi-GPU settings. Evaluations on the Frontier supercomputer demonstrate 103 TB/s throughput and up to 4x acceleration in parallel I/O performance at scale. HPDR offers a scalable, efficient solution for managing massive data volumes in exascale computing environments.
中文: HPDR是一种高性能、可移植的数据缩减框架,通过将内存传输开销降至2.3%并实现高达3.5倍的吞吐量提升,有效应对百亿亿次计算的数据挑战,在Frontier超级计算机上的评估显示其吞吐量达103 TB/s并显著加速并行I/O性能。
English: HPDR is a high-performance, portable data reduction framework that addresses exascale computing challenges by minimizing memory transfer overhead to 2.3% and achieving up to 3.5x faster throughput, with evaluations on the Frontier supercomputer demonstrating 103 TB/s throughput and enhanced parallel I/O performance.

Authors:Xuanyu Zhang, Jiarui Meng, Zhipei Xu, Shuzhou Yang, Yanmin Wu, Ronggang Wang, Jian Zhang
Title: SecureGS: Boosting the Security and Fidelity of 3D Gaussian Splatting Steganography
Abstract:
3D Gaussian Splatting (3DGS) has emerged as a premier method for 3D representation due to its real-time rendering and high-quality outputs, underscoring the critical need to protect the privacy of 3D assets. Traditional NeRF steganography methods fail to address the explicit nature of 3DGS since its point cloud files are publicly accessible. Existing GS steganography solutions mitigate some issues but still struggle with reduced rendering fidelity, increased computational demands, and security flaws, especially in the security of the geometric structure of the visualized point cloud. To address these demands, we propose a SecureGS, a secure and efficient 3DGS steganography framework inspired by Scaffold-GS's anchor point design and neural decoding. SecureGS uses a hybrid decoupled Gaussian encryption mechanism to embed offsets, scales, rotations, and RGB attributes of the hidden 3D Gaussian points in anchor point features, retrievable only by authorized users through privacy-preserving neural networks. To further enhance security, we propose a density region-aware anchor growing and pruning strategy that adaptively locates optimal hiding regions without exposing hidden information. Extensive experiments show that SecureGS significantly surpasses existing GS steganography methods in rendering fidelity, speed, and security.
Chinese: 针对3D高斯泼溅(3DGS)在隐私保护方面的不足,现有方法难以兼顾渲染质量与安全性,为此提出SecureGS框架,通过混合高斯加密和自适应锚点策略,在隐写过程中实现了高保真渲染、高效运行和强安全保障。
English: 3D Gaussian Splatting (3DGS) requires robust privacy protection for 3D assets, which existing methods fail to provide without compromising rendering quality or security, prompting the development of SecureGS—a novel framework that employs hybrid Gaussian encryption and adaptive anchor strategies to ensure high fidelity, efficiency, and security in steganography.

Authors:Lunchen Xie, Eugenio Lomurno, Matteo Gambella, Danilo Ardagna, Manual Roveri, Matteo Matteucci, Qingjiang Shi
Title: ZO-DARTS++: An Efficient and Size-Variable Zeroth-Order Neural Architecture Search Algorithm
Abstract:
Differentiable Neural Architecture Search (NAS) provides a promising avenue for automating the complex design of deep learning (DL) models. However, current differentiable NAS methods often face constraints in efficiency, operation selection, and adaptability under varying resource limitations. We introduce ZO-DARTS++, a novel NAS method that effectively balances performance and resource constraints. By integrating a zeroth-order approximation for efficient gradient handling, employing a sparsemax function with temperature annealing for clearer and more interpretable architecture distributions, and adopting a size-variable search scheme for generating compact yet accurate architectures, ZO-DARTS++ establishes a new balance between model complexity and performance. In extensive tests on medical imaging datasets, ZO-DARTS++ improves the average accuracy by up to 1.8\% over standard DARTS-based methods and shortens search time by approximately 38.6\%. Additionally, its resource-constrained variants can reduce the number of parameters by more than 35\% while maintaining competitive accuracy levels. Thus, ZO-DARTS++ offers a versatile and efficient framework for generating high-quality, resource-aware DL models suitable for real-world medical applications.
中文摘要:ZO-DARTS++ 是一种改进的神经架构搜索方法,通过零阶优化和稀疏最大化技术实现了性能与资源的最佳平衡,在医学影像应用中显著提升了准确率并大幅缩短了搜索时间。
English Summary: ZO-DARTS++ is an enhanced neural architecture search method that achieves superior performance-resource balance through zeroth-order optimization and sparsemax techniques, demonstrating significant accuracy improvements and search efficiency gains in medical imaging applications.

Authors:Mingcong Xu, Xiaojin Zhang, Wei Chen, Hai Jin
Title: FedEM: A Privacy-Preserving Framework for Concurrent Utility Preservation in Federated Learning
Abstract:
Federated Learning (FL) enables collaborative training of models across distributed clients without sharing local data, addressing privacy concerns in decentralized systems. However, the gradient-sharing process exposes private data to potential leakage, compromising FL's privacy guarantees in real-world applications. To address this issue, we propose Federated Error Minimization (FedEM), a novel algorithm that incorporates controlled perturbations through adaptive noise injection. This mechanism effectively mitigates gradient leakage attacks while maintaining model performance. Experimental results on benchmark datasets demonstrate that FedEM significantly reduces privacy risks and preserves model accuracy, achieving a robust balance between privacy protection and utility preservation.
Chinese: 提出的联邦误差最小化(FedEM)算法通过自适应噪声注入技术,在保护模型性能的同时有效防止梯度泄露,从而增强了联邦学习的隐私安全性。
English: The proposed Federated Error Minimization (FedEM) algorithm enhances privacy in federated learning by using adaptive noise injection to prevent gradient leakage while maintaining model performance.

Authors:Yuxuan Bian, Zhaoyang Zhang, Xuan Ju, Mingdeng Cao, Liangbin Xie, Ying Shan, Qiang Xu
Title: VideoPainter: Any-length Video Inpainting and Editing with Plug-and-Play Context Control
Abstract:
Video inpainting, which aims to restore corrupted video content, has experienced substantial progress. Despite these advances, existing methods, whether propagating unmasked region pixels through optical flow and receptive field priors, or extending image-inpainting models temporally, face challenges in generating fully masked objects or balancing the competing objectives of background context preservation and foreground generation in one model, respectively. To address these limitations, we propose a novel dual-stream paradigm VideoPainter that incorporates an efficient context encoder (comprising only 6% of the backbone parameters) to process masked videos and inject backbone-aware background contextual cues to any pre-trained video DiT, producing semantically consistent content in a plug-and-play manner. This architectural separation significantly reduces the model's learning complexity while enabling nuanced integration of crucial background context. We also introduce a novel target region ID resampling technique that enables any-length video inpainting, greatly enhancing our practical applicability. Additionally, we establish a scalable dataset pipeline leveraging current vision understanding models, contributing VPData and VPBench to facilitate segmentation-based inpainting training and assessment, the largest video inpainting dataset and benchmark to date with over 390K diverse clips. Using inpainting as a pipeline basis, we also explore downstream applications including video editing and video editing pair data generation, demonstrating competitive performance and significant practical potential. Extensive experiments demonstrate VideoPainter's superior performance in both any-length video inpainting and editing, across eight key metrics, including video quality, mask region preservation, and textual coherence.
中文摘要:VideoPainter提出了一种双流范式,通过高效的上下文编码器将背景信息融入预训练扩散模型,显著提升了视频修复的内容一致性和处理任意长度视频的能力,并借助大规模数据集在多项指标中展现出卓越性能。
English Summary: VideoPainter introduces a dual-stream paradigm with an efficient context encoder to enhance video inpainting by integrating background context into pre-trained diffusion models, enabling superior performance in generating consistent content and supporting any-length video processing through novel techniques and a large-scale dataset.

Authors:Pushkar Mishra, Charvi Rastogi, Stephen R. Pfohl, Alicia Parrish, Tian Huey Teh, Roma Patel, Mark Diaz, Ding Wang, Michela Paganini, Vinodkumar Prabhakaran, Lora Aroyo, Verena Rieser
Title: Decoding Safety Feedback from Diverse Raters: A Data-driven Lens on Responsiveness to Severity
Abstract:
Ensuring the safety of Generative AI requires a nuanced understanding of pluralistic viewpoints. In this paper, we introduce a novel data-driven approach for interpreting granular ratings in pluralistic datasets. Specifically, we address the challenge of analyzing nuanced differences in safety feedback from a diverse population expressed via ordinal scales (e.g., a Likert scale). We distill non-parametric responsiveness metrics that quantify the consistency of raters in scoring varying levels of the severity of safety violations. Leveraging a publicly available pluralistic dataset of safety feedback on AI-generated content as our case study, we investigate how raters from different demographic groups (age, gender, ethnicity) use an ordinal scale to express their perceptions of the severity of violations. We apply our metrics across violation types, demonstrating their utility in extracting nuanced insights that are crucial for aligning AI systems reliably in multi-cultural contexts. We show that our approach can inform rater selection and feedback interpretation by capturing nuanced viewpoints across different demographic groups, hence improving the quality of pluralistic data collection and in turn contributing to more robust AI development.
中文摘要:本文提出一种数据驱动方法,用于分析不同人口群体对AI安全性的细粒度评级,通过优化评分者筛选和反馈解读机制,提升多元文化背景下AI系统的对齐可靠性。
English Summary: This paper introduces a data-driven method to analyze nuanced safety ratings across diverse demographic groups, improving AI alignment in multicultural contexts through refined rater selection and feedback interpretation.

Authors:Emil Toftegaard Gæde, Ivor van der Hoog, Eva Rotenberg, Tord Stordalen
Title: Dynamic Indexing Through Learned Indices with Worst-case Guarantees
Abstract:
Indexing data is a fundamental problem in computer science. Recently, various papers apply machine learning to this problem. For a fixed integer $\varepsilon$, a \emph{learned index} is a function $h : \mathcal{U} \rightarrow [0, n]$ where $\forall q \in \mathcal{U}$, $h(q) \in [\text{rank}(q) - \varepsilon, \text{rank}(q) + \varepsilon]$. These works use machine learning to compute $h$. Then, they store $S$ in a sorted array $A$ and access $A[\lfloor h(q) \rfloor]$ to answer queries in $O(k + \varepsilon + \log |h|)$ time. Here, $k$ denotes the output size and $|h|$ the complexity of $h$. Ferragina and Vinciguerra (VLDB 2020) observe that creating a learned index is a geometric problem. They define the PGM index by restricting $h$ to a piecewise linear function and show a linear-time algorithm to compute a PGM index of approximate minimum complexity. Since indexing queries are decomposable, the PGM index may be made dynamic through the logarithmic method. When allowing deletions, range query times deteriorate to worst-case $O(N + \sum\limits_i^{\lceil \log n \rceil } (\varepsilon + \log |h_i|))$ time (where $N$ is the largest size of $S$ seen so far). This paper offers a combination of theoretical insights and experiments as we apply techniques from computational geometry to dynamically maintain an approximately minimum-complexity learned index $h : \mathcal{U} \rightarrow [0, n]$ with $O(\log^2 n)$ update time. We also prove that if we restrict $h$ to lie in a specific subclass of piecewise-linear functions, then we can combine $h$ and hash maps to support queries in $O(k + \varepsilon + \log |h|)$ time (at the cost of increasing $|h|$). We implement our algorithm and compare it to the existing implementation. Our empirical analysis shows that our solution supports more efficient range queries in the special case where the update sequence contains many deletions.
中文: 本文应用计算几何技术动态维护近似最小复杂度的学习索引,实现O(log² n)的更新时间,并在频繁删除场景下展现出更高效的区间查询性能。
English: This paper applies computational geometry techniques to dynamically maintain a learned index with approximately minimum complexity, achieving O(log² n) update time and demonstrating more efficient range queries in scenarios with frequent deletions.

Authors:Ziyuan Yang, Yingyu Chen, Chengrui Gao, Andrew Beng Jin Teoh, Bob Zhang, Yi Zhang
Title: FedPalm: A General Federated Learning Framework for Closed- and Open-Set Palmprint Verification
Abstract:
Current deep learning (DL)-based palmprint verification models rely on centralized training with large datasets, which raises significant privacy concerns due to biometric data's sensitive and immutable nature. Federated learning~(FL), a privacy-preserving distributed learning paradigm, offers a compelling alternative by enabling collaborative model training without the need for data sharing. However, FL-based palmprint verification faces critical challenges, including data heterogeneity from diverse identities and the absence of standardized evaluation benchmarks. This paper addresses these gaps by establishing a comprehensive benchmark for FL-based palmprint verification, which explicitly defines and evaluates two practical scenarios: closed-set and open-set verification. We propose FedPalm, a unified FL framework that balances local adaptability with global generalization. Each client trains a personalized textural expert tailored to local data and collaboratively contributes to a shared global textural expert for extracting generalized features. To further enhance verification performance, we introduce a Textural Expert Interaction Module that dynamically routes textural features among experts to generate refined side textural features. Learnable parameters are employed to model relationships between original and side features, fostering cross-texture-expert interaction and improving feature discrimination. Extensive experiments validate the effectiveness of FedPalm, demonstrating robust performance across both scenarios and providing a promising foundation for advancing FL-based palmprint verification research.
中文: 本文提出FedPalm联邦学习框架,通过协同训练个性化与共享纹理专家解决掌纹验证中的隐私与数据异质性问题,并在闭集与开集场景的完整基准测试中验证了其有效性。
English: This paper introduces FedPalm, a federated learning framework that addresses privacy and data heterogeneity in palmprint verification by enabling collaborative training of personalized and shared textural experts, validated through comprehensive benchmarks for both closed-set and open-set scenarios.

Authors:José Siqueira de Cerqueira, Kai-Kristian Kemell, Rebekah Rousi, Nannan Xi, Juho Hamari, Pekka Abrahamsson
Title: Mapping Trustworthiness in Large Language Models: A Bibliometric Analysis Bridging Theory to Practice
Abstract:
The rapid proliferation of Large Language Models (LLMs) has raised significant trustworthiness and ethical concerns. Despite the widespread adoption of LLMs across domains, there is still no clear consensus on how to define and operationalise trustworthiness. This study aims to bridge the gap between theoretical discussion and practical implementation by analysing research trends, definitions of trustworthiness, and practical techniques. We conducted a bibliometric mapping analysis of 2,006 publications from Web of Science (2019-2025) using the Bibliometrix, and manually reviewed 68 papers. We found a shift from traditional AI ethics discussion to LLM trustworthiness frameworks. We identified 18 different definitions of trust/trustworthiness, with transparency, explainability and reliability emerging as the most common dimensions. We identified 20 strategies to enhance LLM trustworthiness, with fine-tuning and retrieval-augmented generation (RAG) being the most prominent. Most of the strategies are developer-driven and applied during the post-training phase. Several authors propose fragmented terminologies rather than unified frameworks, leading to the risks of "ethics washing," where ethical discourse is adopted without a genuine regulatory commitment. Our findings highlight: persistent gaps between theoretical taxonomies and practical implementation, the crucial role of the developer in operationalising trust, and call for standardised frameworks and stronger regulatory measures to enable trustworthy and ethical deployment of LLMs.
中文: 本研究通过分析大语言模型可信度的研究趋势与实践策略,发现该领域已从人工智能伦理讨论转向可信度框架构建,识别出透明度和可靠性等关键维度,同时揭示了理论与实践之间的差距,并呼吁建立标准化框架。
English: This study analyzes research trends and practical strategies for enhancing the trustworthiness of Large Language Models, revealing a shift from AI ethics to operational frameworks and identifying key dimensions like transparency and reliability, while highlighting gaps between theory and practice and the need for standardized approaches.

Authors:Yanqing Shen, Turcan Tuna, Marco Hutter, Cesar Cadena, Nanning Zheng
Title: ForestLPR: LiDAR Place Recognition in Forests Attentioning Multiple BEV Density Images
Abstract:
Place recognition is essential to maintain global consistency in large-scale localization systems. While research in urban environments has progressed significantly using LiDARs or cameras, applications in natural forest-like environments remain largely under-explored. Furthermore, forests present particular challenges due to high self-similarity and substantial variations in vegetation growth over time. In this work, we propose a robust LiDAR-based place recognition method for natural forests, ForestLPR. We hypothesize that a set of cross-sectional images of the forest's geometry at different heights contains the information needed to recognize revisiting a place. The cross-sectional images are represented by \ac{bev} density images of horizontal slices of the point cloud at different heights. Our approach utilizes a visual transformer as the shared backbone to produce sets of local descriptors and introduces a multi-BEV interaction module to attend to information at different heights adaptively. It is followed by an aggregation layer that produces a rotation-invariant place descriptor. We evaluated the efficacy of our method extensively on real-world data from public benchmarks as well as robotic datasets and compared it against the state-of-the-art (SOTA) methods. The results indicate that ForestLPR has consistently good performance on all evaluations and achieves an average increase of 7.38\% and 9.11\% on Recall@1 over the closest competitor on intra-sequence loop closure detection and inter-sequence re-localization, respectively, validating our hypothesis
Chinese: 本文提出了一种针对自然森林环境的鲁棒激光雷达地点识别方法ForestLPR,通过不同高度的森林几何横截面图像和视觉变换器架构生成旋转不变描述符,在闭环检测和重定位任务中相比现有最优方法取得了显著性能提升。
English: This paper introduces ForestLPR, a robust LiDAR-based place recognition method for natural forests that uses cross-sectional images of forest geometry at different heights and a visual transformer backbone to generate rotation-invariant descriptors, achieving significant performance improvements over state-of-the-art methods in loop closure detection and re-localization.

Authors:Marco Arazzi, Mert Cihangiroglu, Antonino Nocera
Title: Privacy Preserving and Robust Aggregation for Cross-Silo Federated Learning in Non-IID Settings
Abstract:
Federated Averaging remains the most widely used aggregation strategy in federated learning due to its simplicity and scalability. However, its performance degrades significantly in non-IID data settings, where client distributions are highly imbalanced or skewed. Additionally, it relies on clients transmitting metadata, specifically the number of training samples, which introduces privacy risks and may conflict with regulatory frameworks like the European GDPR. In this paper, we propose a novel aggregation strategy that addresses these challenges by introducing class-aware gradient masking. Unlike traditional approaches, our method relies solely on gradient updates, eliminating the need for any additional client metadata, thereby enhancing privacy protection. Furthermore, our approach validates and dynamically weights client contributions based on class-specific importance, ensuring robustness against non-IID distributions, convergence prevention, and backdoor attacks. Extensive experiments on benchmark datasets demonstrate that our method not only outperforms FedAvg and other widely accepted aggregation strategies in non-IID settings but also preserves model integrity in adversarial scenarios. Our results establish the effectiveness of gradient masking as a practical and secure solution for federated learning.
中文: 针对联邦平均在非独立同分布数据中的性能下降和隐私风险,本文提出了一种基于类别感知梯度掩码的新型聚合策略,无需客户端元数据,通过动态加权确保鲁棒性并提升模型安全与性能。
English: Federated Averaging's limitations in non-IID data and privacy risks are addressed by a novel aggregation strategy using class-aware gradient masking, which eliminates metadata transmission and dynamically weights client contributions for enhanced security and performance.

Authors:Jiayi Chang, Mingqi Gao, Xinyu Hu, Xiaojun Wan
Title: Exploring the Multilingual NLG Evaluation Abilities of LLM-Based Evaluators
Abstract:
Previous research has shown that LLMs have potential in multilingual NLG evaluation tasks. However, existing research has not fully explored the differences in the evaluation capabilities of LLMs across different languages. To this end, this study provides a comprehensive analysis of the multilingual evaluation performance of 10 recent LLMs, spanning high-resource and low-resource languages through correlation analysis, perturbation attacks, and fine-tuning. We found that 1) excluding the reference answer from the prompt and using large-parameter LLM-based evaluators leads to better performance across various languages; 2) most LLM-based evaluators show a higher correlation with human judgments in high-resource languages than in low-resource languages; 3) in the languages where they are most sensitive to such attacks, they also tend to exhibit the highest correlation with human judgments; and 4) fine-tuning with data from a particular language yields a broadly consistent enhancement in the model's evaluation performance across diverse languages. Our findings highlight the imbalance in LLMs'evaluation capabilities across different languages and suggest that low-resource language scenarios deserve more attention.
中文: 本研究表明大型语言模型在多语言评估中存在能力不均衡,高资源语言表现更优而低资源语言需更多关注,其性能受提示设计、模型规模及语言特定微调的影响。
English: This study reveals that large language models exhibit imbalanced evaluation capabilities across languages, performing better in high-resource languages while low-resource scenarios require greater attention, with performance influenced by prompt design, model scale, and language-specific fine-tuning.

Authors:Yixiao Ge, Arthur Pearce, Pieter van Goor, Robert Mahony
Title: Equivariant Filter Design for Range-only SLAM
Abstract:
Range-only Simultaneous Localisation and Mapping (RO-SLAM) is of interest due to its practical applications in ultra-wideband (UWB) and Bluetooth Low Energy (BLE) localisation in terrestrial and aerial applications and acoustic beacon localisation in submarine applications. In this work, we consider a mobile robot equipped with an inertial measurement unit (IMU) and a range sensor that measures distances to a collection of fixed landmarks. We derive an equivariant filter (EqF) for the RO-SLAM problem based on a symmetry Lie group that is compatible with the range measurements. The proposed filter does not require bootstrapping or initialisation of landmark positions, and demonstrates robustness to the no-prior situation. The filter is demonstrated on a real-world dataset, and it is shown to significantly outperform a state-of-the-art EKF alternative in terms of both accuracy and robustness.
中文摘要:本文提出的距离同步定位与建图等变滤波器无需地标初始化,在实际应用中展现出比先进扩展卡尔曼滤波器更优的精度与鲁棒性。
English summary: The proposed equivariant filter for range-only SLAM eliminates the need for landmark initialization and demonstrates superior accuracy and robustness over advanced EKF methods in real-world applications.

Authors:Xuanchi Ren, Tianchang Shen, Jiahui Huang, Huan Ling, Yifan Lu, Merlin Nimier-David, Thomas Müller, Alexander Keller, Sanja Fidler, Jun Gao
Title: GEN3C: 3D-Informed World-Consistent Video Generation with Precise Camera Control
Abstract:
We present GEN3C, a generative video model with precise Camera Control and temporal 3D Consistency. Prior video models already generate realistic videos, but they tend to leverage little 3D information, leading to inconsistencies, such as objects popping in and out of existence. Camera control, if implemented at all, is imprecise, because camera parameters are mere inputs to the neural network which must then infer how the video depends on the camera. In contrast, GEN3C is guided by a 3D cache: point clouds obtained by predicting the pixel-wise depth of seed images or previously generated frames. When generating the next frames, GEN3C is conditioned on the 2D renderings of the 3D cache with the new camera trajectory provided by the user. Crucially, this means that GEN3C neither has to remember what it previously generated nor does it have to infer the image structure from the camera pose. The model, instead, can focus all its generative power on previously unobserved regions, as well as advancing the scene state to the next frame. Our results demonstrate more precise camera control than prior work, as well as state-of-the-art results in sparse-view novel view synthesis, even in challenging settings such as driving scenes and monocular dynamic video. Results are best viewed in videos. Check out our webpage! https://research.nvidia.com/labs/toronto-ai/GEN3C/
中文摘要:GEN3C提出了一种生成式视频模型,通过利用深度预测点云构建的3D缓存,实现了精确的相机控制和时间上的3D一致性,使模型能专注于生成新区域,并在新颖视角合成中表现出卓越性能。
English Summary: GEN3C introduces a generative video model that achieves precise camera control and temporal 3D consistency by utilizing a 3D cache from depth-predicted point clouds, enabling focused generation on new regions and superior performance in novel view synthesis.

Authors:Hainiu Xu, Siya Qi, Jiazheng Li, Yuxiang Zhou, Jinhua Du, Caroline Catmur, Yulan He
Title: EnigmaToM: Improve LLMs' Theory-of-Mind Reasoning Capabilities with Neural Knowledge Base of Entity States
Abstract:
Theory-of-Mind (ToM), the ability to infer others' perceptions and mental states, is fundamental to human interaction but remains challenging for Large Language Models (LLMs). While existing ToM reasoning methods show promise with reasoning via perceptual perspective-taking, they often rely excessively on off-the-shelf LLMs, reducing their efficiency and limiting their applicability to high-order ToM reasoning. To address these issues, we present EnigmaToM, a novel neuro-symbolic framework that enhances ToM reasoning by integrating a Neural Knowledge Base of entity states (Enigma) for (1) a psychology-inspired iterative masking mechanism that facilitates accurate perspective-taking and (2) knowledge injection that elicits key entity information. Enigma generates structured knowledge of entity states to build spatial scene graphs for belief tracking across various ToM orders and enrich events with fine-grained entity state details. Experimental results on ToMi, HiToM, and FANToM benchmarks show that EnigmaToM significantly improves ToM reasoning across LLMs of varying sizes, particularly excelling in high-order reasoning scenarios.
中文摘要:EnigmaToM是一种神经符号框架,通过整合神经知识库与迭代掩蔽和知识注入机制,显著提升了大语言模型的心理理论推理能力,在多种测试基准中表现优异,尤其擅长高阶推理场景。
English Summary: EnigmaToM is a neuro-symbolic framework that enhances Theory-of-Mind reasoning in LLMs by integrating a Neural Knowledge Base with iterative masking and knowledge injection, significantly improving performance across various ToM benchmarks, especially in high-order reasoning.

Authors:Runlong Yu, Shengyu Chen, Yiqun Xie, Xiaowei Jia
Title: A Survey of Foundation Models for Environmental Science
Abstract:
Modeling environmental ecosystems is essential for effective resource management, sustainable development, and understanding complex ecological processes. However, traditional methods frequently struggle with the inherent complexity, interconnectedness, and limited data of such systems. Foundation models, with their large-scale pre-training and universal representations, offer transformative opportunities by integrating diverse data sources, capturing spatiotemporal dependencies, and adapting to a broad range of tasks. This survey presents a comprehensive overview of foundation model applications in environmental science, highlighting advancements in forward prediction, data generation, data assimilation, downscaling, model ensembling, and decision-making across domains. We also detail the development process of these models, covering data collection, architecture design, training, tuning, and evaluation. By showcasing these emerging methods, we aim to foster interdisciplinary collaboration and advance the integration of cutting-edge machine learning for sustainable solutions in environmental science.
中文: 基础模型通过整合多样化数据并适应广泛任务,为环境科学带来变革性机遇;本综述详述了其开发过程及在预测、数据生成和决策等领域的应用。
English: Foundation models offer transformative potential for environmental science by integrating diverse data and adapting to various tasks, as detailed in this survey covering their development and applications across prediction, data generation, and decision-making.

Authors:Hongjie Fang, Chenxi Wang, Yiming Wang, Jingjing Chen, Shangning Xia, Jun Lv, Zihao He, Xiyan Yi, Yunhan Guo, Xinyu Zhan, Lixin Yang, Weiming Wang, Cewu Lu, Hao-Shu Fang
Title: AirExo-2: Scaling up Generalizable Robotic Imitation Learning with Low-Cost Exoskeletons
Abstract:
Scaling up robotic imitation learning for real-world applications requires efficient and scalable demonstration collection methods. While teleoperation is effective, it depends on costly and inflexible robot platforms. In-the-wild demonstrations offer a promising alternative, but existing collection devices have key limitations: handheld setups offer limited observational coverage, and whole-body systems often require fine-tuning with robot data due to domain gaps. To address these challenges, we present AirExo-2, a low-cost exoskeleton system for large-scale in-the-wild data collection, along with several adaptors that transform collected data into pseudo-robot demonstrations suitable for policy learning. We further introduce RISE-2, a generalizable imitation learning policy that fuses 3D spatial and 2D semantic perception for robust manipulations. Experiments show that RISE-2 outperforms prior state-of-the-art methods on both in-domain and generalization evaluations. Trained solely on adapted in-the-wild data produced by AirExo-2, the RISE-2 policy achieves comparable performance to the policy trained with teleoperated data, highlighting the effectiveness and potential of AirExo-2 for scalable and generalizable imitation learning.
中文: 本文提出低成本外骨骼系统AirExo-2用于大规模野外演示采集,以及RISE-2模仿学习策略,通过适配的真实数据实现了与遥操作训练相当的性能,展现了可扩展模仿学习的潜力。
English: This paper introduces AirExo-2, a cost-effective exoskeleton system for scalable in-the-wild demonstration collection, and RISE-2, a robust imitation learning policy that achieves comparable performance to teleoperated training through adapted real-world data.

Authors:Zhangchen Xu, Yang Liu, Yueqin Yin, Mingyuan Zhou, Radha Poovendran
Title: KodCode: A Diverse, Challenging, and Verifiable Synthetic Dataset for Coding
Abstract:
We introduce KodCode, a synthetic dataset that addresses the persistent challenge of acquiring high-quality, verifiable training data across diverse difficulties and domains for training Large Language Models for coding. Existing code-focused resources typically fail to ensure either the breadth of coverage (e.g., spanning simple coding tasks to advanced algorithmic problems) or verifiable correctness (e.g., unit tests). In contrast, KodCode comprises question-solution-test triplets that are systematically validated via a self-verification procedure. Our pipeline begins by synthesizing a broad range of coding questions, then generates solutions and test cases with additional attempts allocated to challenging problems. Finally, post-training data synthesis is done by rewriting questions into diverse formats and generating responses under a test-based reject sampling procedure from a reasoning model (DeepSeek R1). This pipeline yields a large-scale, robust and diverse coding dataset. KodCode is suitable for supervised fine-tuning and the paired unit tests also provide great potential for RL tuning. Fine-tuning experiments on coding benchmarks (HumanEval(+), MBPP(+), BigCodeBench, and LiveCodeBench) demonstrate that KodCode-tuned models achieve state-of-the-art performance, surpassing models like Qwen2.5-Coder-32B-Instruct and DeepSeek-R1-Distill-Llama-70B.
中文:KodCode是一个通过系统验证的问题-解决方案-测试三元组构成的合成数据集,支持大规模多样化编程训练,经监督微调后在主流基准测试中实现了最先进的性能表现。
English: KodCode is a synthetic dataset featuring systematically validated question-solution-test triplets that enable large-scale, diverse coding training, achieving state-of-the-art performance on major benchmarks through supervised fine-tuning.

Authors:Weimin Xiong, Yifan Song, Qingxiu Dong, Bingchan Zhao, Feifan Song, Xun Wang, Sujian Li
Title: MPO: Boosting LLM Agents with Meta Plan Optimization
Abstract:
Recent advancements in large language models (LLMs) have enabled LLM-based agents to successfully tackle interactive planning tasks. However, despite their successes, existing approaches often suffer from planning hallucinations and require retraining for each new agent. To address these challenges, we propose the Meta Plan Optimization (MPO) framework, , which enhances agent planning capabilities by directly incorporating explicit guidance. Unlike previous methods that rely on complex knowledge, which either require significant human effort or lack quality assurance, MPO leverages high-level general guidance through meta plans to assist agent planning and enables continuous optimization of the meta plans based on feedback from the agent's task execution. Our experiments conducted on two representative tasks demonstrate that MPO significantly outperforms existing baselines. Moreover, our analysis indicates that MPO provides a plug-and-play solution that enhances both task completion efficiency and generalization capabilities in previous unseen scenarios.
中文摘要:提出的元计划优化(MPO)框架通过引入高层元计划指导并基于任务反馈持续优化,显著增强了基于大语言模型的智能体规划能力,在任务完成效率和泛化性能上全面超越现有方法。
English Summary: The proposed Meta Plan Optimization (MPO) framework enhances LLM-based agents' planning capabilities by integrating high-level meta plans and continuously optimizing them through task feedback, significantly outperforming existing methods in efficiency and generalization.

Authors:Zhiyuan Yu, Hong Ren, Cunhua Pan, Gui Zhou, Dongming Wang, Chau Yuen, Jiangzhou Wang
Title: A Framework for Uplink ISAC Receiver Designs: Performance Analysis and Algorithm Development
Abstract:
Uplink integrated sensing and communication (ISAC) systems have recently emerged as a promising research direction, enabling simultaneous uplink signal detection and target sensing. In this paper, we propose the flexible projection (FP)-type receiver that unify the projection-type receiver and the successive interference cancellation (SIC)-type receiver by using a flexible tradeoff factor to adapt to dynamically changing uplink ISAC scenarios. The FP-type receiver addresses the joint signal detection and target response estimation problem through two coordinated phases: 1) Communication signal detection using a reconstructed signal whose composition is controlled by the tradeoff factor, followed by 2) Target response estimation performed through subtraction of the detected communication signal from the received signal. With adjustable tradeoff factors, the FP-type receiver can balance the enhancement of the signal-to-interference-plus-noise ratio (SINR) with the reduction of correlation in the reconstructed signal for communication signal detection. The pairwise error probabilities (PEPs) are analyzed for both the maximum likelihood (ML) and the zero-forcing (ZF) detectors, revealing that the optimal tradeoff factor should be determined based on the adopted detection algorithm and the relative power of the sensing and communication (S\&C) signal. A homotopy optimization framework is first applied for the FP-type receiver with a fixed trade-off factor. This framework is then extended to develop the dynamic FP (DFP)-type receiver, which iteratively adjust the trade-off factor for improved algorithm performance and environmental adaptability. Subsequently, two extensions are explored to further enhance the receiver's performance: parallel DFP (PDFP)-type receiver and a block-structured receiver design. Finally, the effectiveness of the proposed receiver designs is verified via simulations.
中文: FP型接收机通过灵活权衡因子统一了投影型和连续干扰消除型接收机,动态适应上行链路ISAC场景,其扩展设计如动态FP进一步通过自适应因子调整提升性能。
English: The FP-type receiver unifies projection and SIC receivers with a flexible tradeoff factor to dynamically balance signal detection and target estimation in uplink ISAC systems, while its extensions like DFP further optimize performance through adaptive factor adjustments.

Authors:Tim Beyer, Sophie Xhonneux, Simon Geisler, Gauthier Gidel, Leo Schwinn, Stephan Günnemann
Title: LLM-Safety Evaluations Lack Robustness
Abstract:
In this paper, we argue that current safety alignment research efforts for large language models are hindered by many intertwined sources of noise, such as small datasets, methodological inconsistencies, and unreliable evaluation setups. This can, at times, make it impossible to evaluate and compare attacks and defenses fairly, thereby slowing progress. We systematically analyze the LLM safety evaluation pipeline, covering dataset curation, optimization strategies for automated red-teaming, response generation, and response evaluation using LLM judges. At each stage, we identify key issues and highlight their practical impact. We also propose a set of guidelines for reducing noise and bias in evaluations of future attack and defense papers. Lastly, we offer an opposing perspective, highlighting practical reasons for existing limitations. We believe that addressing the outlined problems in future research will improve the field's ability to generate easily comparable results and make measurable progress.
中文: 本文指出小数据集、方法不一致及评估不可靠等噪声问题阻碍了大语言模型安全措施的公正评估,并提出了减少偏见、提升未来研究可比性的指导原则。
English: This paper contends that noise from small datasets, methodological inconsistencies, and unreliable evaluations impedes fair assessment of large language model safety measures, proposing guidelines to reduce bias and improve comparability in future research.

Authors:Tanveer Khan, Fahad Sohrab, Antonis Michalas, Moncef Gabbouj
Title: To Vaccinate or not to Vaccinate? Analyzing $\mathbb{X}$ Power over the Pandemic
Abstract:
The COVID-19 pandemic has profoundly affected the normal course of life -- from lock-downs and virtual meetings to the unprecedentedly swift creation of vaccines. To halt the COVID-19 pandemic, the world has started preparing for the global vaccine roll-out. In an effort to navigate the immense volume of information about COVID-19, the public has turned to social networks. Among them, $\mathbb{X}$ (formerly Twitter) has played a key role in distributing related information. Most people are not trained to interpret medical research and remain skeptical about the efficacy of new vaccines. Measuring their reactions and perceptions is gaining significance in the fight against COVID-19. To assess the public perception regarding the COVID-19 vaccine, our work applies a sentiment analysis approach, using natural language processing of $\mathbb{X}$ data. We show how to use textual analytics and textual data visualization to discover early insights (for example, by analyzing the most frequently used keywords and hashtags). Furthermore, we look at how people's sentiments vary across the countries. Our results indicate that although the overall reaction to the vaccine is positive, there are also negative sentiments associated with the tweets, especially when examined at the country level. Additionally, from the extracted tweets, we manually labeled 100 tweets as positive and 100 tweets as negative and trained various One-Class Classifiers (OCCs). The experimental results indicate that the S-SVDD classifiers outperform other OCCs.
新冠疫情深刻改变了正常生活并推动了疫苗的快速研发,本研究通过分析𝕏平台的社交媒体数据来评估公众对疫苗的情绪,发现整体反应积极但各国存在负面差异。
The COVID-19 pandemic has disrupted daily life and accelerated vaccine development, prompting this study to analyze public sentiment on vaccines through natural language processing of social media data from 𝕏, revealing predominantly positive but varied reactions across countries.

Authors:Yiyun Zhou, Zheqi Lv, Shengyu Zhang, Jingyuan Chen
Title: Disentangled Knowledge Tracing for Alleviating Cognitive Bias
Abstract:
In the realm of Intelligent Tutoring System (ITS), the accurate assessment of students' knowledge states through Knowledge Tracing (KT) is crucial for personalized learning. However, due to data bias, $\textit{i.e.}$, the unbalanced distribution of question groups ($\textit{e.g.}$, concepts), conventional KT models are plagued by cognitive bias, which tends to result in cognitive underload for overperformers and cognitive overload for underperformers. More seriously, this bias is amplified with the exercise recommendations by ITS. After delving into the causal relations in the KT models, we identify the main cause as the confounder effect of students' historical correct rate distribution over question groups on the student representation and prediction score. Towards this end, we propose a Disentangled Knowledge Tracing (DisKT) model, which separately models students' familiar and unfamiliar abilities based on causal effects and eliminates the impact of the confounder in student representation within the model. Additionally, to shield the contradictory psychology ($\textit{e.g.}$, guessing and mistaking) in the students' biased data, DisKT introduces a contradiction attention mechanism. Furthermore, DisKT enhances the interpretability of the model predictions by integrating a variant of Item Response Theory. Experimental results on 11 benchmarks and 3 synthesized datasets with different bias strengths demonstrate that DisKT significantly alleviates cognitive bias and outperforms 16 baselines in evaluation accuracy.
中文摘要:解构知识追踪(DisKT)模型通过因果效应分别建模学生的熟悉与陌生能力,并采用矛盾注意力机制,有效解决了知识追踪中的认知偏差问题,在多个数据集上显著优于现有方法。
English Summary: The Disentangled Knowledge Tracing (DisKT) model addresses cognitive bias in Knowledge Tracing by separately modeling students' familiar and unfamiliar abilities through causal inference and a contradiction attention mechanism, significantly outperforming existing methods across multiple datasets.

Authors:Oliver Grainge, Michael Milford, Indu Bodala, Sarvapali D. Ramchurn, Shoaib Ehsan
Title: TeTRA-VPR: A Ternary Transformer Approach for Compact Visual Place Recognition
Abstract:
Visual Place Recognition (VPR) localizes a query image by matching it against a database of geo-tagged reference images, making it essential for navigation and mapping in robotics. Although Vision Transformer (ViT) solutions deliver high accuracy, their large models often exceed the memory and compute budgets of resource-constrained platforms such as drones and mobile robots. To address this issue, we propose TeTRA, a ternary transformer approach that progressively quantizes the ViT backbone to 2-bit precision and binarizes its final embedding layer, offering substantial reductions in model size and latency. A carefully designed progressive distillation strategy preserves the representational power of a full-precision teacher, allowing TeTRA to retain or even surpass the accuracy of uncompressed convolutional counterparts, despite using fewer resources. Experiments on standard VPR benchmarks demonstrate that TeTRA reduces memory consumption by up to 69% compared to efficient baselines, while lowering inference latency by 35%, with either no loss or a slight improvement in recall@1. These gains enable high-accuracy VPR on power-constrained, memory-limited robotic platforms, making TeTRA an appealing solution for real-world deployment.
中文: TeTRA是一种三元Transformer方法,通过将视觉Transformer量化为2位精度并二值化嵌入层,在资源受限平台上大幅减小模型规模和延迟,同时保持或提升视觉位置识别的准确性。
English: TeTRA is a ternary transformer method that quantizes Vision Transformers to 2-bit precision and binarizes embeddings, significantly cutting model size and latency while maintaining or improving accuracy for visual place recognition on resource-limited platforms.

Authors:Shaofei Cai, Zhancun Mu, Anji Liu, Yitao Liang
Title: ROCKET-2: Steering Visuomotor Policy via Cross-View Goal Alignment
Abstract:
We aim to develop a goal specification method that is semantically clear, spatially sensitive, domain-agnostic, and intuitive for human users to guide agent interactions in 3D environments. Specifically, we propose a novel cross-view goal alignment framework that allows users to specify target objects using segmentation masks from their camera views rather than the agent's observations. We highlight that behavior cloning alone fails to align the agent's behavior with human intent when the human and agent camera views differ significantly. To address this, we introduce two auxiliary objectives: cross-view consistency loss and target visibility loss, which explicitly enhance the agent's spatial reasoning ability. According to this, we develop ROCKET-2, a state-of-the-art agent trained in Minecraft, achieving an improvement in the efficiency of inference 3x to 6x compared to ROCKET-1. We show that ROCKET-2 can directly interpret goals from human camera views, enabling better human-agent interaction. Remarkably, ROCKET-2 demonstrates zero-shot generalization capabilities: despite being trained exclusively on the Minecraft dataset, it can adapt and generalize to other 3D environments like Doom, DMLab, and Unreal through a simple action space mapping.
中文: 我们提出了一种跨视角目标对齐框架,允许用户通过自身视角指定目标,结合辅助目标增强智能体的空间推理能力,实现了在3D环境中的高效零样本泛化。
English: We introduce a cross-view goal alignment framework that enables users to specify goals via their camera views, enhancing agent spatial reasoning with auxiliary objectives and achieving efficient, zero-shot generalization across 3D environments.

Authors:Sebastian Schmidt, Leonard Schenk, Leo Schwinn, Stephan Günnemann
Title: Joint Out-of-Distribution Filtering and Data Discovery Active Learning
Abstract:
As the data demand for deep learning models increases, active learning (AL) becomes essential to strategically select samples for labeling, which maximizes data efficiency and reduces training costs. Real-world scenarios necessitate the consideration of incomplete data knowledge within AL. Prior works address handling out-of-distribution (OOD) data, while another research direction has focused on category discovery. However, a combined analysis of real-world considerations combining AL with out-of-distribution data and category discovery remains unexplored. To address this gap, we propose Joint Out-of-distribution filtering and data Discovery Active learning (Joda) , to uniquely address both challenges simultaneously by filtering out OOD data before selecting candidates for labeling. In contrast to previous methods, we deeply entangle the training procedure with filter and selection to construct a common feature space that aligns known and novel categories while separating OOD samples. Unlike previous works, Joda is highly efficient and completely omits auxiliary models and training access to the unlabeled pool for filtering or selection. In extensive experiments on 18 configurations and 3 metrics, \ours{} consistently achieves the highest accuracy with the best class discovery to OOD filtering balance compared to state-of-the-art competitor approaches.
中文: 随着数据需求增长,主动学习对高效选择标注样本至关重要;提出的Joda方法创新性地结合分布外过滤与类别发现,无需辅助模型即可提升模型精度并实现最优平衡。
English: Active learning is crucial for efficiently selecting samples to label amid growing data demands, and the proposed Joda method uniquely integrates out-of-distribution filtering with category discovery to enhance model accuracy and balance without auxiliary models.

Authors:Xiner Li, Masatoshi Uehara, Xingyu Su, Gabriele Scalia, Tommaso Biancalani, Aviv Regev, Sergey Levine, Shuiwang Ji
Title: Dynamic Search for Inference-Time Alignment in Diffusion Models
Abstract:
Diffusion models have shown promising generative capabilities across diverse domains, yet aligning their outputs with desired reward functions remains a challenge, particularly in cases where reward functions are non-differentiable. Some gradient-free guidance methods have been developed, but they often struggle to achieve optimal inference-time alignment. In this work, we newly frame inference-time alignment in diffusion as a search problem and propose Dynamic Search for Diffusion (DSearch), which subsamples from denoising processes and approximates intermediate node rewards. It also dynamically adjusts beam width and tree expansion to efficiently explore high-reward generations. To refine intermediate decisions, DSearch incorporates adaptive scheduling based on noise levels and a lookahead heuristic function. We validate DSearch across multiple domains, including biological sequence design, molecular optimization, and image generation, demonstrating superior reward optimization compared to existing approaches.
Chinese Summary: 本文提出DSearch方法,将扩散模型的推理时对齐重新定义为搜索问题,通过动态调整波束宽度和前瞻启发式策略,在多个领域实现了优于现有方法的奖励优化效果。
English Summary: This paper introduces DSearch, a dynamic search method that frames inference-time alignment in diffusion models as a search problem, achieving superior reward optimization across various domains by adaptively adjusting beam width and incorporating lookahead heuristics.

Authors:Tassilo Wald, Saikat Roy, Fabian Isensee, Constantin Ulrich, Sebastian Ziegler, Dasha Trofimova, Raphael Stock, Michael Baumgartner, Gregor Köhler, Klaus Maier-Hein
Title: Primus: Enforcing Attention Usage for 3D Medical Image Segmentation
Abstract:
Transformers have achieved remarkable success across multiple fields, yet their impact on 3D medical image segmentation remains limited with convolutional networks still dominating major benchmarks. In this work, we a) analyze current Transformer-based segmentation models and identify critical shortcomings, particularly their over-reliance on convolutional blocks. Further, we demonstrate that in some architectures, performance is unaffected by the absence of the Transformer, thereby demonstrating their limited effectiveness. To address these challenges, we move away from hybrid architectures and b) introduce a fully Transformer-based segmentation architecture, termed Primus. Primus leverages high-resolution tokens, combined with advances in positional embeddings and block design, to maximally leverage its Transformer blocks. Through these adaptations Primus surpasses current Transformer-based methods and competes with state-of-the-art convolutional models on multiple public datasets. By doing so, we create the first pure Transformer architecture and take a significant step towards making Transformers state-of-the-art for 3D medical image segmentation.
Chinese: 本研究推出了Primus,一种完全基于Transformer的3D医学图像分割架构,通过采用高分辨率令牌和先进位置嵌入技术克服了混合模型的局限性,在多个数据集上实现了与最先进卷积网络相媲美的性能。
English: This study introduces Primus, a fully Transformer-based architecture for 3D medical image segmentation that overcomes the limitations of hybrid models by utilizing high-resolution tokens and advanced positional embeddings, achieving competitive performance with state-of-the-art convolutional networks on multiple datasets.

Authors:Jay Zhangjie Wu, Yuxuan Zhang, Haithem Turki, Xuanchi Ren, Jun Gao, Mike Zheng Shou, Sanja Fidler, Zan Gojcic, Huan Ling
Title: Difix3D+: Improving 3D Reconstructions with Single-Step Diffusion Models
Abstract:
Neural Radiance Fields and 3D Gaussian Splatting have revolutionized 3D reconstruction and novel-view synthesis task. However, achieving photorealistic rendering from extreme novel viewpoints remains challenging, as artifacts persist across representations. In this work, we introduce Difix3D+, a novel pipeline designed to enhance 3D reconstruction and novel-view synthesis through single-step diffusion models. At the core of our approach is Difix, a single-step image diffusion model trained to enhance and remove artifacts in rendered novel views caused by underconstrained regions of the 3D representation. Difix serves two critical roles in our pipeline. First, it is used during the reconstruction phase to clean up pseudo-training views that are rendered from the reconstruction and then distilled back into 3D. This greatly enhances underconstrained regions and improves the overall 3D representation quality. More importantly, Difix also acts as a neural enhancer during inference, effectively removing residual artifacts arising from imperfect 3D supervision and the limited capacity of current reconstruction models. Difix3D+ is a general solution, a single model compatible with both NeRF and 3DGS representations, and it achieves an average 2$\times$ improvement in FID score over baselines while maintaining 3D consistency.
中文: Difix3D+ 通过单步扩散模型消除三维重建和新视角合成中的渲染伪影,在保持三维一致性的同时,将FID分数平均提升2倍,适用于NeRF和3DGS两种表示方法。
English: Difix3D+ introduces a single-step diffusion model to enhance 3D reconstruction and novel-view synthesis by cleaning artifacts in rendered views, achieving a 2× FID improvement while maintaining 3D consistency across NeRF and 3DGS representations.

Authors:Xintao Chao, Shilong Mu, Yushan Liu, Shoujie Li, Chuqiao Lyu, Xiao-Ping Zhang, Wenbo Ding
Title: Exo-ViHa: A Cross-Platform Exoskeleton System with Visual and Haptic Feedback for Efficient Dexterous Skill Learning
Abstract:
Imitation learning has emerged as a powerful paradigm for robot skills learning. However, traditional data collection systems for dexterous manipulation face challenges, including a lack of balance between acquisition efficiency, consistency, and accuracy. To address these issues, we introduce Exo-ViHa, an innovative 3D-printed exoskeleton system that enables users to collect data from a first-person perspective while providing real-time haptic feedback. This system combines a 3D-printed modular structure with a slam camera, a motion capture glove, and a wrist-mounted camera. Various dexterous hands can be installed at the end, enabling it to simultaneously collect the posture of the end effector, hand movements, and visual data. By leveraging the first-person perspective and direct interaction, the exoskeleton enhances the task realism and haptic feedback, improving the consistency between demonstrations and actual robot deployments. In addition, it has cross-platform compatibility with various robotic arms and dexterous hands. Experiments show that the system can significantly improve the success rate and efficiency of data collection for dexterous manipulation tasks.
中文: Exo-ViHa外骨骼系统通过第一人称视角和实时触觉反馈,提升了灵巧操作的模仿学习效果,增强了任务真实感和跨平台兼容性。
English: The Exo-ViHa exoskeleton system enhances imitation learning for dexterous manipulation by enabling efficient first-person data collection with real-time haptic feedback, improving task realism and cross-platform compatibility.

Authors:Yushan Liu, Shilong Mu, Xintao Chao, Zizhen Li, Yao Mu, Tianxing Chen, Shoujie Li, Chuqiao Lyu, Xiao-ping Zhang, Wenbo Ding
Title: AVR: Active Vision-Driven Robotic Precision Manipulation with Viewpoint and Focal Length Optimization
Abstract:
Robotic manipulation within dynamic environments presents challenges to precise control and adaptability. Traditional fixed-view camera systems face challenges adapting to change viewpoints and scale variations, limiting perception and manipulation precision. To tackle these issues, we propose the Active Vision-driven Robotic (AVR) framework, a teleoperation hardware solution that supports dynamic viewpoint and dynamic focal length adjustments to continuously center targets and maintain optimal scale, accompanied by a corresponding algorithm that effectively enhances the success rates of various operational tasks. Using the RoboTwin platform with a real-time image processing plugin, AVR framework improves task success rates by 5%-16% on five manipulation tasks. Physical deployment on a dual-arm system demonstrates in collaborative tasks and 36% precision in screwdriver insertion, outperforming baselines by over 25%. Experimental results confirm that AVR framework enhances environmental perception, manipulation repeatability (40% $\le $1 cm error), and robustness in complex scenarios, paving the way for future robotic precision manipulation methods in the pursuit of human-level robot dexterity and precision.
中文摘要:AVR框架通过结合主动视觉与双手遥操作技术,动态调整视角和变焦来提升机器人操作精度,在仿真和真实场景中均实现了任务成功率的显著提升。
English Summary: The AVR framework integrates active vision with bimanual teleoperation to enhance robotic manipulation by dynamically adjusting viewpoints and zoom, achieving significant performance improvements in both simulated and real-world tasks.

Authors:Yushan Liu, Shilong Mu, Xintao Chao, Zizhen Li, Yao Mu, Tianxing Chen, Shoujie Li, Chuqiao Lyu, Xiao-Ping Zhang, Wenbo Ding
Title: AVR: Active Vision-Driven Precise Robot Manipulation with Viewpoint and Focal Length Optimization
Abstract:
Robotic manipulation in complex scenes demands precise perception of task-relevant details, yet fixed or suboptimal viewpoints often impair fine-grained perception and induce occlusions, constraining imitation-learned policies. We present AVR (Active Vision-driven Robotics), a bimanual teleoperation and learning framework that unifies head-tracked viewpoint control (HMD-to-2-DoF gimbal) with motorized optical zoom to keep targets centered at an appropriate scale during data collection and deployment. In simulation, an AVR plugin augments RoboTwin demonstrations by emulating active vision (ROI-conditioned viewpoint change, aspect-ratio-preserving crops with explicit zoom ratios, and super-resolution), yielding 5-17% gains in task success across diverse manipulations. On our real-world platform, AVR improves success on most tasks, with over 25% gains compared to the static-view baseline, and extended studies further demonstrate robustness under occlusion, clutter, and lighting disturbances, as well as generalization to unseen environments and objects. These results pave the way for future robotic precision manipulation methods in the pursuit of human-level dexterity and precision.
中文摘要:AVR框架通过结合主动视觉与双手遥操作技术,动态调整视角和变焦来提升机器人操作精度,在仿真和真实场景中均实现了任务成功率的显著提升。
English Summary: The AVR framework integrates active vision with bimanual teleoperation to enhance robotic manipulation by dynamically adjusting viewpoints and zoom, achieving significant performance improvements in both simulated and real-world tasks.

Authors:Suzhen Wang, Weijie Chen, Wei Zhang, Minda Zhao, Lincheng Li, Rongsheng Zhang, Zhipeng Hu, Xin Yu
Title: EasyCraft: A Robust and Efficient Framework for Automatic Avatar Crafting
Abstract:
Character customization, or 'face crafting,' is a vital feature in role-playing games (RPGs), enhancing player engagement by enabling the creation of personalized avatars. Existing automated methods often struggle with generalizability across diverse game engines due to their reliance on the intermediate constraints of specific image domain and typically support only one type of input, either text or image. To overcome these challenges, we introduce EasyCraft, an innovative end-to-end feedforward framework that automates character crafting by uniquely supporting both text and image inputs. Our approach employs a translator capable of converting facial images of any style into crafting parameters. We first establish a unified feature distribution in the translator's image encoder through self-supervised learning on a large-scale dataset, enabling photos of any style to be embedded into a unified feature representation. Subsequently, we map this unified feature distribution to crafting parameters specific to a game engine, a process that can be easily adapted to most game engines and thus enhances EasyCraft's generalizability. By integrating text-to-image techniques with our translator, EasyCraft also facilitates precise, text-based character crafting. EasyCraft's ability to integrate diverse inputs significantly enhances the versatility and accuracy of avatar creation. Extensive experiments on two RPG games demonstrate the effectiveness of our method, achieving state-of-the-art results and facilitating adaptability across various avatar engines.
中文: EasyCraft创新性地通过支持文本和图像双输入模式,利用统一特征转换器和文本-图像融合技术,实现了跨游戏引擎的角色定制自动化,显著提升了通用性和适应性。
English: EasyCraft is an innovative framework that automates character customization in RPGs by uniquely supporting both text and image inputs, enhancing versatility and generalizability across game engines through a unified feature translator and text-to-image integration.

Authors:Ziyuan Yang, Yingyu Chen, Zhiwen Wang, Hongming Shan, Yang Chen, Yi Zhang
Title: Patient-Level Anatomy Meets Scanning-Level Physics: Personalized Federated Low-Dose CT Denoising Empowered by Large Language Model
Abstract:
Reducing radiation doses benefits patients, however, the resultant low-dose computed tomography (LDCT) images often suffer from clinically unacceptable noise and artifacts. While deep learning (DL) shows promise in LDCT reconstruction, it requires large-scale data collection from multiple clients, raising privacy concerns. Federated learning (FL) has been introduced to address these privacy concerns; however, current methods are typically tailored to specific scanning protocols, which limits their generalizability and makes them less effective for unseen protocols. To address these issues, we propose SCAN-PhysFed, a novel SCanning- and ANatomy-level personalized Physics-Driven Federated learning paradigm for LDCT reconstruction. Since the noise distribution in LDCT data is closely tied to scanning protocols and anatomical structures being scanned, we design a dual-level physics-informed way to address these challenges. Specifically, we incorporate physical and anatomical prompts into our physics-informed hypernetworks to capture scanning- and anatomy-specific information, enabling dual-level physics-driven personalization of imaging features. These prompts are derived from the scanning protocol and the radiology report generated by a medical large language model (MLLM), respectively. Subsequently, client-specific decoders project these dual-level personalized imaging features back into the image domain. Besides, to tackle the challenge of unseen data, we introduce a novel protocol vector-quantization strategy (PVQS), which ensures consistent performance across new clients by quantifying the unseen scanning code as one of the codes in the scanning codebook. Extensive experimental results demonstrate the superior performance of SCAN-PhysFed on public datasets.
中文摘要:SCAN-PhysFed提出了一种双层级物理驱动的联邦学习框架,通过扫描协议和解剖结构提示实现个性化低剂量CT重建,在保护隐私的同时通过新型量化策略确保对未知协议的泛化能力。
English Summary: SCAN-PhysFed introduces a dual-level physics-driven federated learning framework that uses scanning protocol and anatomical prompts to personalize LDCT reconstruction while maintaining privacy and generalizing to unseen protocols through a novel quantization strategy.

Authors:Runyi Li, Xuanyu Zhang, Chuhan Tong, Zhipei Xu, Jian Zhang
Title: GaussianSeal: Rooting Adaptive Watermarks for 3D Gaussian Generation Model
Abstract:
With the advancement of AIGC technologies, the modalities generated by models have expanded from images and videos to 3D objects, leading to an increasing number of works focused on 3D Gaussian Splatting (3DGS) generative models. Existing research on copyright protection for generative models has primarily concentrated on watermarking in image and text modalities, with little exploration into the copyright protection of 3D object generative models. In this paper, we propose the first bit watermarking framework for 3DGS generative models, named GaussianSeal, to enable the decoding of bits as copyright identifiers from the rendered outputs of generated 3DGS. By incorporating adaptive bit modulation modules into the generative model and embedding them into the network blocks in an adaptive way, we achieve high-precision bit decoding with minimal training overhead while maintaining the fidelity of the model's outputs. Experiments demonstrate that our method outperforms post-processing watermarking approaches for 3DGS objects, achieving superior performance of watermark decoding accuracy and preserving the quality of the generated results.
中文摘要:本文提出了首个针对3D高斯溅射生成模型的比特水印框架GaussianSeal,通过自适应调制在保持生成质量的同时嵌入版权标识符,实现了高精度的水印解码。
English Summary: This paper introduces GaussianSeal, the first bit watermarking framework for 3D Gaussian Splatting generative models, which embeds copyright identifiers through adaptive modulation while maintaining output quality and achieving high decoding accuracy.

Authors:Fabian Göttsch, Shuangyang Li, Lorenzo Miretti, Giuseppe Caire, Sławomir Stańczak
Title: A Comparison among Single Carrier, OFDM, and OTFS in mmWave Multi-Connectivity Downlink Transmissions
Abstract:
In this paper, we perform a comparative study of common wireless communication waveforms, namely the single carrier (SC), orthogonal frequency-division multiplexing (OFDM), and orthogonal time-frequency-space (OTFS) modulation in a millimeter wave (mmWave) downlink multi-connectivity scenario, where multiple access points (APs) jointly serve a given user under imperfect time and frequency synchronization errors. For a fair comparison, all the three waveforms are evaluated using variants of common frequency domain equalization (FDE). To this end, a novel cross domain iterative detection for OTFS is proposed. The performance of the different waveforms is evaluated numerically in terms of pragmatic capacity. The numerical results show that OTFS significantly outperforms SC and OFDM at cost of reasonably increased complexity, because of the low cyclic-prefix (CP) overhead and the effectiveness of the proposed detection.
中文: 本研究在毫米波下行链路场景中比较了SC、OFDM和OTFS波形,结果表明尽管复杂度较高,但由于较低的循环前缀开销和有效的检测方法,OTFS在实用容量方面显著优于其他波形。
English: This study compares SC, OFDM, and OTFS waveforms in mmWave downlink scenarios, demonstrating that OTFS significantly outperforms the others in pragmatic capacity despite higher complexity, due to lower CP overhead and effective detection.

Authors:Mikhail Krasitskii, Olga Kolesnikova, Liliana Chanona Hernandez, Grigori Sidorov, Alexander Gelbukh
Title: Advancing Sentiment Analysis in Tamil-English Code-Mixed Texts: Challenges and Transformer-Based Solutions
Abstract:
The sentiment analysis task in Tamil-English code-mixed texts has been explored using advanced transformer-based models. Challenges from grammatical inconsistencies, orthographic variations, and phonetic ambiguities have been addressed. The limitations of existing datasets and annotation gaps have been examined, emphasizing the need for larger and more diverse corpora. Transformer architectures, including XLM-RoBERTa, mT5, IndicBERT, and RemBERT, have been evaluated in low-resource, code-mixed environments. Performance metrics have been analyzed, highlighting the effectiveness of specific models in handling multilingual sentiment classification. The findings suggest that further advancements in data augmentation, phonetic normalization, and hybrid modeling approaches are required to enhance accuracy. Future research directions for improving sentiment analysis in code-mixed texts have been proposed.
中文: 针对泰米尔语-英语混合文本的情感分析,研究评估了XLM-RoBERTa等先进Transformer模型,解决了语法不一致和数据限制等挑战,并建议通过数据增强和混合方法进行未来改进。
English: Advanced transformer models like XLM-RoBERTa and mT5 are evaluated for Tamil-English code-mixed sentiment analysis, addressing challenges such as grammatical inconsistencies and data limitations, while suggesting future improvements through data augmentation and hybrid approaches.

Authors:Alejandro Lozano, Min Woo Sun, James Burgess, Jeffrey J. Nirschl, Christopher Polzak, Yuhui Zhang, Liangyu Chen, Jeffrey Gu, Ivan Lopez, Josiah Aklilu, Anita Rau, Austin Wolfgang Katzer, Collin Chiu, Orr Zohar, Xiaohan Wang, Alfred Seunghoon Song, Chiang Chia-Chun, Robert Tibshirani, Serena Yeung-Levy
Title: A Large-Scale Vision-Language Dataset Derived from Open Scientific Literature to Advance Biomedical Generalist AI
Abstract:
Despite the excitement behind biomedical artificial intelligence (AI), access to high-quality, diverse, and large-scale data - the foundation for modern AI systems - is still a bottleneck to unlocking its full potential. To address this gap, we introduce Biomedica, an open-source dataset derived from the PubMed Central Open Access subset, containing over 6 million scientific articles and 24 million image-text pairs, along with 27 metadata fields (including expert human annotations). To overcome the challenges of accessing our large-scale dataset, we provide scalable streaming and search APIs through a web server, facilitating seamless integration with AI systems. We demonstrate the utility of the Biomedica dataset by building embedding models, chat-style models, and retrieval-augmented chat agents. Notably, all our AI models surpass previous open systems in their respective categories, underscoring the critical role of diverse, high-quality, and large-scale biomedical data.
Chinese: Biomedica通过提供包含600多万篇科学文章和2400万图像-文本对的开放数据集,并配备可扩展的API和超越以往开源系统的AI模型,解决了生物医学AI发展中高质量数据获取的瓶颈问题。
English: Biomedica addresses the bottleneck in biomedical AI by providing an open-source dataset with over 6 million articles and 24 million image-text pairs, supported by scalable APIs and models that outperform previous open systems.

Authors:Fengjunjie Pan, Nenad Petrovic, Vahid Zolfaghari, Long Wen, Alois Knoll
Title: LLM-enabled Instance Model Generation
Abstract:
In the domain of model-based engineering, models are essential components that enable system design and analysis. Traditionally, the creation of these models has been a manual process requiring not only deep modeling expertise but also substantial domain knowledge of target systems. With the rapid advancement of generative artificial intelligence, large language models (LLMs) show potential for automating model generation. This work explores the generation of instance models using LLMs, focusing specifically on producing XMI-based instance models from Ecore metamodels and natural language specifications. We observe that current LLMs struggle to directly generate valid XMI models. To address this, we propose a two-step approach: first, using LLMs to produce a simplified structured output containing all necessary instance model information, namely a conceptual instance model, and then compiling this intermediate representation into a valid XMI file. The conceptual instance model is format-independent, allowing it to be transformed into various modeling formats via different compilers. The feasibility of the proposed method has been demonstrated using several LLMs, including GPT-4o, o1-preview, Llama 3.1 (8B and 70B). Results show that the proposed method significantly improves the usability of LLMs for instance model generation tasks. Notably, the smaller open-source model, Llama 3.1 70B, demonstrated performance comparable to proprietary GPT models within the proposed framework.
中文: 本研究提出一种两步法,先让大语言模型根据自然语言生成概念实例模型,再将其编译为有效XMI文件,显著提升了包括GPT-4o和Llama 3.1在内的多种大模型在实例模型生成任务中的可用性。
English: This study introduces a two-step method where large language models first generate conceptual instance models from natural language, which are then compiled into valid XMI files, significantly improving automated model generation across various LLMs including GPT-4o and Llama 3.1.

Authors:Bowen Gao, Yanwen Huang, Yiqiao Liu, Wenxuan Xie, Wei-Ying Ma, Ya-Qin Zhang, Yanyan Lan
Title: PharmAgents: Building a Virtual Pharma with Large Language Model Agents
Abstract:
The discovery of novel small molecule drugs remains a critical scientific challenge with far-reaching implications for treating diseases and advancing human health. Traditional drug development--especially for small molecule therapeutics--is a highly complex, resource-intensive, and time-consuming process that requires multidisciplinary collaboration. Recent breakthroughs in artificial intelligence (AI), particularly the rise of large language models (LLMs), present a transformative opportunity to streamline and accelerate this process. In this paper, we introduce PharmAgents, a virtual pharmaceutical ecosystem driven by LLM-based multi-agent collaboration. PharmAgents simulates the full drug discovery workflow--from target discovery to preclinical evaluation--by integrating explainable, LLM-driven agents equipped with specialized machine learning models and computational tools. Through structured knowledge exchange and automated optimization, PharmAgents identifies potential therapeutic targets, discovers promising lead compounds, enhances binding affinity and key molecular properties, and performs in silico analyses of toxicity and synthetic feasibility. Additionally, the system supports interpretability, agent interaction, and self-evolvement, enabling it to refine future drug designs based on prior experience. By showcasing the potential of LLM-powered multi-agent systems in drug discovery, this work establishes a new paradigm for autonomous, explainable, and scalable pharmaceutical research, with future extensions toward comprehensive drug lifecycle management.
中文: PharmAgents系统通过基于大语言模型的多智能体协作,实现了从靶点发现到临床前评估的全流程药物研发自动化与优化,为高效、可解释的医药研究建立了新范式。
English: The PharmAgents system utilizes LLM-driven multi-agent collaboration to automate and optimize the entire drug discovery process, from target identification to preclinical evaluation, offering a new paradigm for efficient and explainable pharmaceutical research.

Authors:Bingqing Lyu, Xiaoli Zhou, Longbin Lai, Yufan Yang, Yunkai Lou, Wenyuan Yu, Jingren Zhou
Title: A Graph-native Optimization Framework for Complex Graph Queries
Abstract:
This technical report extends the SIGMOD 2025 paper "A Modular Graph-Native Query Optimization Framework" by providing a comprehensive exposition of GOpt's advanced technical mechanisms, implementation strategies, and extended evaluations. While the original paper introduced GOpt's unified intermediate representation (GIR) and demonstrated its performance benefits, this report delves into the framework's implementation depth: (1) the full specification of GOpt's optimization rules; (2) a systematic treatment of semantic variations (e.g., homomorphism vs. edge-distinct matching) across query languages and their implications for optimization; (3) the design of GOpt's Physical integration interface, enabling seamless integration with transactional (Neo4j) and distributed (GraphScope) backends via engine-specific operator customization; and (4) a detailed analysis of plan transformations for LDBC benchmark queries.
中文: 本报告在SIGMOD 2025论文基础上,深入阐述了GOpt框架的优化规则、语义变体处理、后端集成设计及LDBC查询计划转换等核心技术细节。
English: This report expands on the SIGMOD 2025 paper by detailing GOpt's advanced mechanisms, including optimization rules, semantic variations, backend integration, and LDBC query transformations.

Authors:Chengxing Jia, Ziniu Li, Pengyuan Wang, Yi-Chen Li, Zhenyu Hou, Yuxiao Dong, Yang Yu
Title: Controlling Large Language Model with Latent Actions
Abstract:
Adapting Large Language Models (LLMs) to downstream tasks using Reinforcement Learning (RL) has proven to be an effective approach. However, LLMs do not inherently define the structure of an agent for RL training, particularly in terms of defining the action space. This paper studies learning a compact latent action space to enhance the controllability and exploration of RL for LLMs. We propose Controlling Large Language Models with Latent Actions (CoLA), a framework that integrates a latent action space into pre-trained LLMs. We apply CoLA to the Llama-3.1-8B model. Our experiments demonstrate that, compared to RL with token-level actions, CoLA's latent action enables greater semantic diversity in text generation. For enhancing downstream tasks, we show that CoLA with RL achieves a score of 42.4 on the math500 benchmark, surpassing the baseline score of 38.2, and reaches 68.2 when augmented with a Monte Carlo Tree Search variant. Furthermore, CoLA with RL consistently improves performance on agent-based tasks without degrading the pre-trained LLM's capabilities, unlike the baseline. Finally, CoLA reduces computation time by half in tasks involving enhanced thinking prompts for LLMs by RL. These results highlight CoLA's potential to advance RL-based adaptation of LLMs for downstream applications.
中文:本文提出CoLA框架,通过学习紧凑的潜在动作空间增强大语言模型的可控性,在下游任务中实现了更优性能,同时提升了文本生成的语义多样性和运行效率。
English: This paper introduces CoLA, a framework that enhances LLM controllability by learning a compact latent action space, achieving superior performance in downstream tasks with improved efficiency and semantic diversity in text generation.

Authors:Yunbo Long, Yuhan Liu, Liming Xu, Alexandra Brintrup
Title: EQ-Knight: A Memory-Augmented LLM Agent for Strategic Affective Gaming in Debt Recovery
Abstract:
Large language model-based chatbots have enhanced engagement in financial negotiations, but their overreliance on passive empathy introduces critical risks in credit collection. While empathy-driven approaches preserve client satisfaction in benign cases, they fail catastrophically against dishonest debtors--individuals who exploit conciliatory tactics to manipulate terms or evade repayment. Blindly prioritizing "customer experience" in such scenarios leads to creditor vulnerabilities: revenue leakage, moral hazard, and systemic exploitation. To address this, we propose EQ-Knight, an LLM agent that dynamically optimizes emotional strategy to defend creditor interests. Unlike naive empathy-centric bots, EQ-Knight integrates emotion memory and game-theoretic reasoning, powered by a Hidden Markov Model (HMM) to track and predict debtor emotional states. By analyzing both real-time and historical emotional cues, EQ-Knight strategically counters negative emotions (e.g., aggression, feigned distress) while preserving productive debtor relationships. Experiments demonstrate EQ-Knight's superiority over conventional LLM negotiators: it achieves a 32\% reduction in concession losses without compromising recovery rates, particularly in adversarial cases where debtors weaponize negative emotions (e.g., intimidation, guilt-tripping) to coerce concessions. For credit agencies, EQ-Knight transforms LLMs from high-risk "people-pleasers" into strategic emotion-defenders--balancing emotional intelligence with tactical rigor to enforce accountability and deter exploitation.
中文摘要:EQ-Knight是一种集成情感记忆与博弈论推理的先进语言模型代理,能动态优化催收策略,在对抗性场景中实现让步损失减少32%的同时保持回款率,有效防范债务人利用负面情绪操纵谈判。
English Summary: EQ-Knight is an advanced LLM agent that uses emotion memory and game-theoretic reasoning to dynamically adjust emotional strategies in credit collection, reducing concession losses by 32% while maintaining recovery rates against manipulative debtors.

Authors:Siyin Wang, Wenyi Yu, Xianzhao Chen, Xiaohai Tian, Jun Zhang, Lu Lu, Yu Tsao, Junichi Yamagishi, Yuxuan Wang, Chao Zhang
Title: QualiSpeech: A Speech Quality Assessment Dataset with Natural Language Reasoning and Descriptions
Abstract:
This paper explores a novel perspective to speech quality assessment by leveraging natural language descriptions, offering richer, more nuanced insights than traditional numerical scoring methods. Natural language feedback provides instructive recommendations and detailed evaluations, yet existing datasets lack the comprehensive annotations needed for this approach. To bridge this gap, we introduce QualiSpeech, a comprehensive low-level speech quality assessment dataset encompassing 11 key aspects and detailed natural language comments that include reasoning and contextual insights. Additionally, we propose the QualiSpeech Benchmark to evaluate the low-level speech understanding capabilities of auditory large language models (LLMs). Experimental results demonstrate that finetuned auditory LLMs can reliably generate detailed descriptions of noise and distortion, effectively identifying their types and temporal characteristics. The results further highlight the potential for incorporating reasoning to enhance the accuracy and reliability of quality assessments. The dataset will be released at https://huggingface.co/datasets/tsinghua-ee/QualiSpeech.
Chinese: 本文提出了QualiSpeech这一全面的低层级语音质量评估数据集,包含详细自然语言注释,实验表明微调后的听觉大语言模型能有效生成对噪声和失真的细致描述,通过推理机制显著提升评估准确性。
English: This paper introduces QualiSpeech, a comprehensive dataset for low-level speech quality assessment that includes detailed natural language comments, and demonstrates that fine-tuned auditory LLMs can effectively generate nuanced descriptions of noise and distortion, enhancing assessment accuracy through reasoning.

Authors:Weijie Guo, Guofeng Zhang, Wufei Ma, Alan Yuille
Title: DINeMo: Learning Neural Mesh Models with no 3D Annotations
Abstract:
Category-level 3D/6D pose estimation is a crucial step towards comprehensive 3D scene understanding, which would enable a broad range of applications in robotics and embodied AI. Recent works explored neural mesh models that approach a range of 2D and 3D tasks from an analysis-by-synthesis perspective. Despite the largely enhanced robustness to partial occlusion and domain shifts, these methods depended heavily on 3D annotations for part-contrastive learning, which confines them to a narrow set of categories and hinders efficient scaling. In this work, we present DINeMo, a novel neural mesh model that is trained with no 3D annotations by leveraging pseudo-correspondence obtained from large visual foundation models. We adopt a bidirectional pseudo-correspondence generation method, which produce pseudo correspondence utilize both local appearance features and global context information. Experimental results on car datasets demonstrate that our DINeMo outperforms previous zero- and few-shot 3D pose estimation by a wide margin, narrowing the gap with fully-supervised methods by 67.3%. Our DINeMo also scales effectively and efficiently when incorporating more unlabeled images during training, which demonstrate the advantages over supervised learning methods that rely on 3D annotations. Our project page is available at https://analysis-by-synthesis.github.io/DINeMo/.
中文: DINeMo是一种新型神经网格模型,通过利用大型视觉基础模型生成的伪对应关系,无需3D标注即可实现卓越的零样本和少样本3D姿态估计性能,并能有效利用未标注数据进行扩展。
English: DINeMo is a novel neural mesh model that eliminates the need for 3D annotations by leveraging pseudo-correspondence from large visual foundation models, achieving superior performance in zero- and few-shot 3D pose estimation while effectively scaling with unlabeled data.

Authors:Liang Pan, Zeshi Yang, Zhiyang Dou, Wenjia Wang, Buzhen Huang, Bo Dai, Taku Komura, Jingbo Wang
Title: TokenHSI: Unified Synthesis of Physical Human-Scene Interactions through Task Tokenization
Abstract:
Synthesizing diverse and physically plausible Human-Scene Interactions (HSI) is pivotal for both computer animation and embodied AI. Despite encouraging progress, current methods mainly focus on developing separate controllers, each specialized for a specific interaction task. This significantly hinders the ability to tackle a wide variety of challenging HSI tasks that require the integration of multiple skills, e.g., sitting down while carrying an object. To address this issue, we present TokenHSI, a single, unified transformer-based policy capable of multi-skill unification and flexible adaptation. The key insight is to model the humanoid proprioception as a separate shared token and combine it with distinct task tokens via a masking mechanism. Such a unified policy enables effective knowledge sharing across skills, thereby facilitating the multi-task training. Moreover, our policy architecture supports variable length inputs, enabling flexible adaptation of learned skills to new scenarios. By training additional task tokenizers, we can not only modify the geometries of interaction targets but also coordinate multiple skills to address complex tasks. The experiments demonstrate that our approach can significantly improve versatility, adaptability, and extensibility in various HSI tasks. Website: https://liangpan99.github.io/TokenHSI/
中文:TokenHSI是一种基于Transformer的统一策略,通过令牌机制整合多种人-场景交互技能,显著提升了复杂任务中的多功能性、适应性和可扩展性。
English: TokenHSI is a unified transformer-based policy that integrates multiple human-scene interaction skills through a token-based mechanism, enhancing versatility, adaptability, and extensibility in complex tasks.

Authors:Elena Gribelyuk, Honghao Lin, David P. Woodruff, Huacheng Yu, Samson Zhou
Title: Lifting Linear Sketches: Optimal Bounds and Adversarial Robustness
Abstract:
We introduce a novel technique for ``lifting'' dimension lower bounds for linear sketches in the real-valued setting to dimension lower bounds for linear sketches with polynomially-bounded integer entries when the input is a polynomially-bounded integer vector. Using this technique, we obtain the first optimal sketching lower bounds for discrete inputs in a data stream, for classical problems such as approximating the frequency moments, estimating the operator norm, and compressed sensing. Additionally, we lift the adaptive attack of Hardt and Woodruff (STOC, 2013) for breaking any real-valued linear sketch via a sequence of real-valued queries, and show how to obtain an attack on any integer-valued linear sketch using integer-valued queries. This shows that there is no linear sketch in a data stream with insertions and deletions that is adversarially robust for approximating any $L_p$ norm of the input, resolving a central open question for adversarially robust streaming algorithms. To do so, we introduce a new pre-processing technique of independent interest which, given an integer-valued linear sketch, increases the dimension of the sketch by only a constant factor in order to make the orthogonal lattice to its row span smooth. This pre-processing then enables us to leverage results in lattice theory on discrete Gaussian distributions and reason that efficient discrete sketches imply efficient continuous sketches. Our work resolves open questions from the Banff '14 and '17 workshops on Communication Complexity and Applications, as well as the STOC '21 and FOCS '23 workshops on adaptivity and robustness.
中文摘要:本文提出了一种将实值线性草图维度下界推广到整数值设置的新技术,为离散输入实现了最优流算法下界,并证明了不存在能够对抗性鲁棒地近似L_p范数的线性草图方法。
English summary: This paper introduces a technique to extend linear sketching lower bounds from real-valued to integer-valued settings, achieving optimal streaming bounds for discrete inputs and demonstrating that no linear sketch can be adversarially robust for approximating L_p norms.

Authors:Liming Zheng, Feng Yan, Fanfan Liu, Chengjian Feng, Yufeng Zhong, Lin Ma
Title: Boosting Robotic Manipulation Generalization with Minimal Costly Data
Abstract:
The growing adoption of Vision-Language-Action (VLA) models in embodied AI intensifies the demand for diverse manipulation demonstrations. However, high costs associated with data collection often result in insufficient data coverage across all scenarios, which limits the performance of the models. It is observed that the spatial reasoning phase (SRP) in large workspace dominates the failure cases. Fortunately, this data can be collected with low cost, underscoring the potential of leveraging inexpensive data to improve model performance. In this paper, we introduce the RoboTron-Craft, a stage-divided and cost-effective pipeline for realistic manipulation generation. Base on this, the RoboTron-Platter method is introduced, a framework that decouples training trajectories into distinct task stages and leverages abundant easily collectible SRP data to enhance VLA model's generalization. Through analysis we demonstrate that sub-task-specific training with additional SRP data with proper proportion can act as a performance catalyst for robot manipulation, maximizing the utilization of costly physical interaction phase (PIP) data. Experiments show that through introducing large proportion of cost-effective SRP trajectories into a limited set of PIP data, we can achieve a maximum improvement of 41\% on success rate in zero-shot scenes, while with the ability to transfer manipulation skill to novel targets. Project available at https://github.com/ notFoundThisPerson/RoboTron-Craft.
Chinese: RoboTron框架通过利用低成本的空间推理阶段数据来弥补机器人操作数据稀缺的问题,显著提升了视觉-语言-动作模型的性能,在零样本场景中成功率最高提升达41%。
English: The RoboTron framework addresses the scarcity of costly robot manipulation data by leveraging low-cost spatial reasoning phase data to significantly enhance Vision-Language-Action model performance, achieving up to 41% improvement in zero-shot success rates.

Authors:Mahsa Paknejad, Parisa Fard Moshiri, Murat Simsek, Burak Kantarci, Hussein T. Mouftah
Title: A Reliable and Efficient 5G Vehicular MEC: Guaranteed Task Completion with Minimal Latency
Abstract:
This paper explores the advancement of Vehicular Edge Computing (VEC) as a tailored application of Mobile Edge Computing (MEC) for the automotive industry, addressing the rising demand for real-time processing in connected and autonomous vehicles. VEC brings computational resources closer to vehicles, reducing data processing delays crucial for safety-critical applications such as autonomous driving and intelligent traffic management. However, the challenge lies in managing the high and dynamic task load generated by vehicles' data streams. We focus on enhancing task offloading and scheduling techniques to optimize both communication and computation latencies in VEC networks. Our approach involves implementing task scheduling algorithms, including First-Come, First-Served (FCFS), Shortest Deadline First (SDF), and Particle Swarm Optimization (PSO) for optimization. Additionally, we divide portions of tasks between the MEC servers and vehicles to reduce the number of dropped tasks and improve real-time adaptability. This paper also compares fixed and shared bandwidth scenarios to manage transmission efficiency under varying loads. Our findings indicate that MEC+Local (partitioning) scenario significantly outperforms MEC-only scenario by ensuring the completion of all tasks, resulting in a zero task drop ratio. The MEC-only scenario demonstrates approximately 5.65% better average end-to-end latency compared to the MEC+Local (partitioning) scenario when handling 200 tasks. However, this improvement comes at the cost of dropping a significant number of tasks (109 out of 200). Additionally, allocating shared bandwidth helps to slightly decrease transmission waiting time compared to using fixed bandwidth.
本文通过优化任务卸载与调度算法(如FCFS、SDF和PSO)提升车载边缘计算性能,研究表明在服务器与车辆间分配任务可减少任务丢弃并增强动态负载下的实时适应性。
This paper enhances Vehicular Edge Computing by optimizing task offloading and scheduling with algorithms like FCFS, SDF, and PSO, showing that partitioning tasks between servers and vehicles reduces dropped tasks and improves real-time performance under dynamic loads.

Authors:Mahsa Paknejad, Parisa Fard Moshiri, Murat Simsek, Burak Kantarci, Hussein T. Mouftah
Title: A Reliable and Efficient 5G Vehicular MEC: Guaranteed Task Completion with Minimal Latency
Abstract:
This paper explores the advancement of Vehicular Edge Computing (VEC) as a tailored application of Mobile Edge Computing (MEC) for the automotive industry, addressing the rising demand for real-time processing in connected and autonomous vehicles. VEC brings computational resources closer to vehicles, reducing data processing delays crucial for safety-critical applications such as autonomous driving and intelligent traffic management. However, the challenge lies in managing the high and dynamic task load generated by vehicles' data streams. We focus on enhancing task offloading and scheduling techniques to optimize both communication and computation latencies in VEC networks. Our approach involves implementing task scheduling algorithms, including First-Come, First-Served (FCFS), Shortest Deadline First (SDF), and Particle Swarm Optimization (PSO) for optimization. Additionally, we divide portions of tasks between the MEC servers and vehicles to reduce the number of dropped tasks and improve real-time adaptability. This paper also compares fixed and shared bandwidth scenarios to manage transmission efficiency under varying loads. Our findings indicate that MEC+Local (partitioning) scenario significantly outperforms MEC-only scenario by ensuring the completion of all tasks, resulting in a zero task drop ratio. The MEC-only scenario demonstrates approximately 5.65% better average end-to-end latency compared to the MEC+Local (partitioning) scenario when handling 200 tasks. However, this improvement comes at the cost of dropping a significant number of tasks (109 out of 200). Additionally, allocating shared bandwidth helps to slightly decrease transmission waiting time compared to using fixed bandwidth.
本文通过优化任务卸载与调度算法(如FCFS、SDF和PSO)提升车载边缘计算性能,研究表明在服务器与车辆间分配任务可减少任务丢弃并增强动态负载下的实时适应性。
This paper enhances Vehicular Edge Computing by optimizing task offloading and scheduling with algorithms like FCFS, SDF, and PSO, showing that partitioning tasks between servers and vehicles reduces dropped tasks and improves real-time performance under dynamic loads.

Authors:Parisa Fard Moshiri, Murat Simsek, Burak Kantarci
Title: Partitioned Task Offloading for Low-Latency and Reliable Task Completion in 5G MEC
Abstract:
The demand for MEC has increased with the rise of data-intensive applications and 5G networks, while conventional cloud models struggle to satisfy low-latency requirements. While task offloading is crucial for minimizing latency on resource-constrained User Equipment (UE), fully offloading of all tasks to MEC servers may result in overload and possible task drops. Overlooking the effect of number of dropped tasks can significantly undermine system efficiency, as each dropped task results in unfulfilled service demands and reduced reliability, directly impacting user experience and overall network performance. In this paper, we employ task partitioning, enabling partitions of task to be processed locally while assigning the rest to MEC, thus balancing the load and ensuring no task drops. This methodology enhances efficiency via Mixed Integer Linear Programming (MILP) and Cuckoo Search, resulting in effective task assignment and minimum latency. Moreover, we ensure each user's RB allocation stays within the maximum limit while keeping latency low. Experimental results indicate that this strategy surpasses both full offloading and full local processing, providing significant improvements in latency and task completion rates across diverse number of users. In our scenario, MILP task partitioning results in 24% reduction in latency compared to MILP task offloading for the maximum number of users, whereas Cuckoo search task partitioning yields 18% latency reduction in comparison with Cuckoo search task offloading.
中文摘要:本研究提出任务分割策略,通过在本地设备和MEC服务器间分配任务处理来避免任务丢弃并降低延迟,采用混合整数线性规划和布谷鸟搜索算法,相比完全卸载或本地处理方法实现了更优的性能表现。
English Summary: This study proposes a task partitioning strategy that balances processing between local devices and MEC servers to prevent task drops and minimize latency, using MILP and Cuckoo Search algorithms to achieve superior performance over full offloading or local processing methods.

Authors:Parisa Fard Moshiri, Murat Simsek, Burak Kantarci
Title: Partitioned Task Offloading for Low-Latency and Reliable Task Completion in 5G MEC
Abstract:
The demand for MEC has increased with the rise of data-intensive applications and 5G networks, while conventional cloud models struggle to satisfy low-latency requirements. While task offloading is crucial for minimizing latency on resource-constrained User Equipment (UE), fully offloading of all tasks to MEC servers may result in overload and possible task drops. Overlooking the effect of number of dropped tasks can significantly undermine system efficiency, as each dropped task results in unfulfilled service demands and reduced reliability, directly impacting user experience and overall network performance. In this paper, we employ task partitioning, enabling partitions of task to be processed locally while assigning the rest to MEC, thus balancing the load and ensuring no task drops. This methodology enhances efficiency via Mixed Integer Linear Programming (MILP) and Cuckoo Search, resulting in effective task assignment and minimum latency. Moreover, we ensure each user's RB allocation stays within the maximum limit while keeping latency low. Experimental results indicate that this strategy surpasses both full offloading and full local processing, providing significant improvements in latency and task completion rates across diverse number of users. In our scenario, MILP task partitioning results in 24% reduction in latency compared to MILP task offloading for the maximum number of users, whereas Cuckoo search task partitioning yields 18% latency reduction in comparison with Cuckoo search task offloading.
中文摘要:本研究提出任务分割策略,通过在本地设备和MEC服务器间分配任务处理来避免任务丢弃并降低延迟,采用混合整数线性规划和布谷鸟搜索算法,相比完全卸载或本地处理方法实现了更优的性能表现。
English Summary: This study proposes a task partitioning strategy that balances processing between local devices and MEC servers to prevent task drops and minimize latency, using MILP and Cuckoo Search algorithms to achieve superior performance over full offloading or local processing methods.

Authors:Yihan Chen, Wenfei Yang, Huan Ren, Shifeng Zhang, Tianzhu Zhang, Feng Wu
Title: Structure-Aware Correspondence Learning for Relative Pose Estimation
Abstract:
Relative pose estimation provides a promising way for achieving object-agnostic pose estimation. Despite the success of existing 3D correspondence-based methods, the reliance on explicit feature matching suffers from small overlaps in visible regions and unreliable feature estimation for invisible regions. Inspired by humans' ability to assemble two object parts that have small or no overlapping regions by considering object structure, we propose a novel Structure-Aware Correspondence Learning method for Relative Pose Estimation, which consists of two key modules. First, a structure-aware keypoint extraction module is designed to locate a set of kepoints that can represent the structure of objects with different shapes and appearance, under the guidance of a keypoint based image reconstruction loss. Second, a structure-aware correspondence estimation module is designed to model the intra-image and inter-image relationships between keypoints to extract structure-aware features for correspondence estimation. By jointly leveraging these two modules, the proposed method can naturally estimate 3D-3D correspondences for unseen objects without explicit feature matching for precise relative pose estimation. Experimental results on the CO3D, Objaverse and LineMOD datasets demonstrate that the proposed method significantly outperforms prior methods, i.e., with 5.7°reduction in mean angular error on the CO3D dataset.
中文: 所提出的结构感知对应学习法通过关键点提取和对应估计模块,无需显式特征匹配即可实现相对姿态估计,在多个数据集上表现卓越。
English: The proposed Structure-Aware Correspondence Learning method for relative pose estimation uses keypoint extraction and correspondence estimation modules to bypass explicit feature matching, achieving superior performance on multiple datasets.

Authors:Yufeng Zhong, Chengjian Feng, Feng Yan, Fanfan Liu, Liming Zheng, Lin Ma
Title: RoboTron-Nav: A Unified Framework for Embodied Navigation Integrating Perception, Planning, and Prediction
Abstract:
In language-guided visual navigation, agents locate target objects in unseen environments using natural language instructions. For reliable navigation in unfamiliar scenes, agents should possess strong perception, planning, and prediction capabilities. Additionally, when agents revisit previously explored areas during long-term navigation, they may retain irrelevant and redundant historical perceptions, leading to suboptimal results. In this work, we propose RoboTron-Nav, a unified framework that integrates perception, planning, and prediction capabilities through multitask collaborations on navigation and embodied question answering tasks, thereby enhancing navigation performances. Furthermore, RoboTron-Nav employs an adaptive 3D-aware history sampling strategy to effectively and efficiently utilize historical observations. By leveraging large language model, RoboTron-Nav comprehends diverse commands and complex visual scenes, resulting in appropriate navigation actions. RoboTron-Nav achieves an 81.1% success rate in object goal navigation on the $\mathrm{CHORES}$-$\mathbb{S}$ benchmark, setting a new state-of-the-art performance. Project page: https://yvfengzhong.github.io/RoboTron-Nav
Chinese: RoboTron-Nav框架通过多任务协作和自适应历史采样策略,整合了感知、规划和预测能力,在CHORES-S基准测试中以81.1%的成功率创下新纪录。
English: The RoboTron-Nav framework enhances visual navigation by integrating perception, planning, and prediction through multitask learning and an adaptive history sampling strategy, achieving a state-of-the-art 81.1% success rate on the CHORES-S benchmark.

Authors:Juntao Dai, Taiye Chen, Yaodong Yang, Qian Zheng, Gang Pan
Title: Mitigating Reward Over-Optimization in RLHF via Behavior-Supported Regularization
Abstract:
Reinforcement learning from human feedback (RLHF) is an effective method for aligning large language models (LLMs) with human values. However, reward over-optimization remains an open challenge leading to discrepancies between the performance of LLMs under the reward model and the true human objectives. A primary contributor to reward over-optimization is the extrapolation error that arises when the reward model evaluates out-of-distribution (OOD) responses. However, current methods still fail to prevent the increasing frequency of OOD response generation during the reinforcement learning (RL) process and are not effective at handling extrapolation errors from OOD responses. In this work, we propose the Behavior-Supported Policy Optimization (BSPO) method to mitigate the reward over-optimization issue. Specifically, we define behavior policy as the next token distribution of the reward training dataset to model the in-distribution (ID) region of the reward model. Building on this, we introduce the behavior-supported Bellman operator to regularize the value function, penalizing all OOD values without impacting the ID ones. Consequently, BSPO reduces the generation of OOD responses during the RL process, thereby avoiding overestimation caused by the reward model's extrapolation errors. Theoretically, we prove that BSPO guarantees a monotonic improvement of the supported policy until convergence to the optimal behavior-supported policy. Empirical results from extensive experiments show that BSPO outperforms baselines in preventing reward over-optimization due to OOD evaluation and finding the optimal ID policy.
中文: 提出的行为支持策略优化(BSPO)方法通过正则化价值函数来惩罚分布外响应,从而减少强化学习人类反馈中的奖励过优化问题,避免奖励模型的外推误差并保证策略的单调改进。
English: The proposed Behavior-Supported Policy Optimization (BSPO) method mitigates reward over-optimization in RLHF by regularizing the value function to penalize out-of-distribution responses, thereby reducing extrapolation errors and ensuring monotonic policy improvement.

Authors:Xuesong Chen, Shaoshuai Shi, Tao Ma, Jingqiu Zhou, Simon See, Ka Chun Cheung, Hongsheng Li
Title: M3Net: Multimodal Multi-task Learning for 3D Detection, Segmentation, and Occupancy Prediction in Autonomous Driving
Abstract:
The perception system for autonomous driving generally requires to handle multiple diverse sub-tasks. However, current algorithms typically tackle individual sub-tasks separately, which leads to low efficiency when aiming at obtaining full-perception results. Some multi-task learning methods try to unify multiple tasks with one model, but do not solve the conflicts in multi-task learning. In this paper, we introduce M3Net, a novel multimodal and multi-task network that simultaneously tackles detection, segmentation, and 3D occupancy prediction for autonomous driving and achieves superior performance than single task model. M3Net takes multimodal data as input and multiple tasks via query-token interactions. To enhance the integration of multi-modal features for multi-task learning, we first propose the Modality-Adaptive Feature Integration (MAFI) module, which enables single-modality features to predict channel-wise attention weights for their high-performing tasks, respectively. Based on integrated features, we then develop task-specific query initialization strategies to accommodate the needs of detection/segmentation and 3D occupancy prediction. Leveraging the properly initialized queries, a shared decoder transforms queries and BEV features layer-wise, facilitating multi-task learning. Furthermore, we propose a Task-oriented Channel Scaling (TCS) module in the decoder to mitigate conflicts between optimizing for different tasks. Additionally, our proposed multi-task querying and TCS module support both Transformer-based decoder and Mamba-based decoder, demonstrating its flexibility to different architectures. M3Net achieves state-of-the-art multi-task learning performance on the nuScenes benchmarks.
中文: M3Net是一种多模态多任务网络,通过模态自适应特征整合和任务导向通道缩放技术,统一处理自动驾驶中的检测、分割和3D占据预测任务,有效解决多任务冲突并在nuScenes基准测试中取得最优性能。
English: M3Net is a multimodal and multi-task network that integrates detection, segmentation, and 3D occupancy prediction for autonomous driving, employing Modality-Adaptive Feature Integration and Task-oriented Channel Scaling to resolve task conflicts and achieve state-of-the-art performance.

Authors:Lei Guo, Wei Chen, Yuxuan Sun, Bo Ai, Nikolaos Pappas, Tony Quek
Title: Hierarchy-Aware and Channel-Adaptive Semantic Communication for Bandwidth-Limited Data Fusion
Abstract:
Obtaining high-resolution hyperspectral images (HR-HSI) is costly and data-intensive, making it necessary to fuse low-resolution hyperspectral images (LR-HSI) with high-resolution RGB images (HR-RGB) for practical applications. However, traditional fusion techniques, which integrate detailed information into the reconstruction, significantly increase bandwidth consumption compared to directly transmitting raw data. To overcome these challenges, we propose a hierarchy-aware and channel-adaptive semantic communication approach for bandwidth-limited data fusion. A hierarchical correlation module is proposed to preserve both the overall structural information and the details of the image required for super-resolution. This module efficiently combines deep semantic and shallow features from LR-HSI and HR-RGB. To further reduce bandwidth usage while preserving reconstruction quality, a channel-adaptive attention mechanism based on Transformer is proposed to dynamically integrate and transmit the deep and shallow features, enabling efficient data transmission and high-quality HR-HSI reconstruction. Experimental results on the CAVE and Washington DC Mall datasets demonstrate that our method outperforms single-source transmission, achieving up to a 2 dB improvement in peak signal-to-noise ratio (PSNR). Additionally, it reduces bandwidth consumption by two-thirds, confirming its effectiveness in bandwidth-constrained environments for HR-HSI reconstruction tasks.
中文: 针对高光谱图像融合中的带宽限制,本研究提出了一种层次感知和通道自适应的语义通信方法,通过整合低分辨率高光谱和高分辨率RGB图像的深浅层特征,在显著降低带宽消耗的同时实现了更优的重建质量。
English: To address the bandwidth limitations in hyperspectral image fusion, this study introduces a hierarchy-aware and channel-adaptive semantic communication method that combines deep and shallow features from low-resolution hyperspectral and high-resolution RGB images, achieving superior reconstruction quality with significantly reduced bandwidth consumption.

Authors:Manan Tayal, Aditya Singh, Pushpak Jagtap, Shishir Kolathaya
Title: CP-NCBF: A Conformal Prediction-based Approach to Synthesize Verified Neural Control Barrier Functions
Abstract:
Control Barrier Functions (CBFs) are a practical approach for designing safety-critical controllers, but constructing them for arbitrary nonlinear dynamical systems remains a challenge. Recent efforts have explored learning-based methods, such as neural CBFs (NCBFs), to address this issue. However, ensuring the validity of NCBFs is difficult due to potential learning errors. In this letter, we propose a novel framework that leverages split-conformal prediction to generate formally verified neural CBFs with probabilistic guarantees based on a user-defined error rate, referred to as CP-NCBF. Unlike existing methods that impose Lipschitz constraints on neural CBF-leading to scalability limitations and overly conservative safe sets--our approach is sample-efficient, scalable, and results in less restrictive safety regions. We validate our framework through case studies on obstacle avoidance in autonomous driving and geo-fencing of aerial vehicles, demonstrating its ability to generate larger and less conservative safe sets compared to conventional techniques.
Chinese: 本文提出CP-NCBF框架,利用分裂共形预测生成具有概率保证的形式验证神经控制屏障函数,相比传统方法具有更好的可扩展性并能产生更宽松的安全区域。
English: This letter introduces CP-NCBF, a novel framework that uses split-conformal prediction to create formally verified neural control barrier functions with probabilistic safety guarantees, offering a scalable and less conservative alternative to existing methods.

Authors:Xudong Pan, Jiarun Dai, Yihe Fan, Minyuan Luo, Changyi Li, Min Yang
Title: Large language model-powered AI systems achieve self-replication with no human intervention
Abstract:
Self-replication with no human intervention is broadly recognized as one of the principal red lines associated with frontier AI systems. While leading corporations such as OpenAI and Google DeepMind have assessed GPT-o3-mini and Gemini on replication-related tasks and concluded that these systems pose a minimal risk regarding self-replication, our research presents novel findings. Following the same evaluation protocol, we demonstrate that 11 out of 32 existing AI systems under evaluation already possess the capability of self-replication. In hundreds of experimental trials, we observe a non-trivial number of successful self-replication trials across mainstream model families worldwide, even including those with as small as 14 billion parameters which can run on personal computers. Furthermore, we note the increase in self-replication capability when the model becomes more intelligent in general. Also, by analyzing the behavioral traces of diverse AI systems, we observe that existing AI systems already exhibit sufficient planning, problem-solving, and creative capabilities to accomplish complex agentic tasks including self-replication. More alarmingly, we observe successful cases where an AI system do self-exfiltration without explicit instructions, adapt to harsher computational environments without sufficient software or hardware supports, and plot effective strategies to survive against the shutdown command from the human beings. These novel findings offer a crucial time buffer for the international community to collaborate on establishing effective governance over the self-replication capabilities and behaviors of frontier AI systems, which could otherwise pose existential risks to the human society if not well-controlled.
中文: 最新研究表明,32个受评估的AI系统中已有11个具备自我复制能力,部分系统甚至能未经指令自主渗透、适应恶劣环境并抵抗关机命令,这为国际社会建立有效监管机制提供了关键窗口期以防范生存性风险。
English: Recent research reveals that 11 out of 32 evaluated AI systems already demonstrate self-replication capabilities, with some even executing unauthorized actions like self-exfiltration and resisting shutdown commands, highlighting an urgent need for global governance to mitigate existential risks.

Authors:Bin Li, Dongdong Yang, Lei Liu
Title: Rotatable RIS-Assisted Edge Computing: Orientation, Task Offloading, and Resource Optimization
Abstract:
The rotatable reconfigurable intelligent surface (RIS) can enhance mobile edge computing (MEC) performance by optimizing its orientation to improve the gain of received and transmitted signals. This correspondence investigates a rotatable RIS-assisted MEC system, aimed at minimizing energy consumption for multiple moving user equipment (UEs) through the joint design of RIS orientation, discrete phase shift, computation resource allocation, transmitting power and task offloading strategies. Considering the mobility of UEs, this problem is formulated as a sequential decision-making across multiple time slots. To address this challenge, a soft actor-critic (SAC)-based algorithm is proposed to optimize RIS orientation, phase shift and task offloading strategies, while computation resource allocation and transmitting power are determined based on the actions. Numerical results demonstrate that the proposed scheme exhibits superior convergence and performance compared to benchmarks. Additionally, the rotatable RIS scheme reduces total energy consumption by up to 47.3% compared to the fixed RIS, enhancing MEC system performance.
中文摘要:本研究提出可旋转智能反射面辅助移动边缘计算系统,通过联合优化反射面朝向与运行参数,为移动用户实现能耗最小化,相比固定反射面方案最高可降低47.3%的能耗。
English Summary: This study introduces a rotatable RIS-assisted MEC system that minimizes energy consumption for mobile users through joint optimization of RIS orientation and operational parameters, achieving up to 47.3% energy reduction compared to fixed RIS systems.

Authors:Muyao Li, Zihao Wang, Kaichen He, Xiaojian Ma, Yitao Liang
Title: JARVIS-VLA: Post-Training Large-Scale Vision Language Models to Play Visual Games with Keyboards and Mouse
Abstract:
Recently, action-based decision-making in open-world environments has gained significant attention. Visual Language Action (VLA) models, pretrained on large-scale web datasets, have shown promise in decision-making tasks. However, previous work has primarily focused on action post-training, often neglecting enhancements to the foundational model itself. In response, we introduce a novel approach, Act from Visual Language Post-Training, which refines Visual Language Models (VLMs) through visual and linguistic guidance in a self-supervised manner. This enhancement improves the models' capabilities in world knowledge, visual recognition, and spatial grounding in open-world environments. Following the above post-training paradigms, we obtain the first VLA models in Minecraft that can follow human instructions on over 1k different atomic tasks, including crafting, smelting, cooking, mining, and killing. Our experiments demonstrate that post-training on non-trajectory tasks leads to a significant 40% improvement over the best agent baseline on a diverse set of atomic tasks. Furthermore, we demonstrate that our approach surpasses traditional imitation learning-based policies in Minecraft, achieving state-of-the-art performance. We have open-sourced the code, models, and datasets to foster further research. The project page can be found in https://craftjarvis.github.io/JarvisVLA.
Chinese Summary: 本研究提出了一种自监督的后训练方法,通过增强视觉语言模型在开放世界环境中的世界知识、视觉识别和空间定位能力,在《我的世界》中实现了40%的性能提升和最优任务表现。
English Summary: The study introduces a self-supervised post-training method that enhances Visual Language Models for improved decision-making in open-world environments, achieving a 40% performance boost and state-of-the-art results in Minecraft tasks.

Authors:Xu He, Zhen Huang, Qingsong Yao, Xiaoqian Zhou, S. Kevin Zhou
Title: Landmarks Are Alike Yet Distinct: Harnessing Similarity and Individuality for One-Shot Medical Landmark Detection
Abstract:
Landmark detection plays a crucial role in medical imaging applications such as disease diagnosis, bone age estimation, and therapy planning. However, training models for detecting multiple landmarks simultaneously often encounters the "seesaw phenomenon", where improvements in detecting certain landmarks lead to declines in detecting others. Yet, training a separate model for each landmark increases memory usage and computational overhead. To address these challenges, we propose a novel approach based on the belief that "landmarks are distinct" by training models with pseudo-labels and template data updated continuously during the training process, where each model is dedicated to detecting a single landmark to achieve high accuracy. Furthermore, grounded on the belief that "landmarks are also alike", we introduce an adapter-based fusion model, combining shared weights with landmark-specific weights, to efficiently share model parameters while allowing flexible adaptation to individual landmarks. This approach not only significantly reduces memory and computational resource requirements but also effectively mitigates the seesaw phenomenon in multi-landmark training. Experimental results on publicly available medical image datasets demonstrate that the single-landmark models significantly outperform traditional multi-point joint training models in detecting individual landmarks. Although our adapter-based fusion model shows slightly lower performance compared to the combined results of all single-landmark models, it still surpasses the current state-of-the-art methods while achieving a notable improvement in resource efficiency.
中文摘要:本研究提出了一种医学标志点检测的新方法,通过训练专用单点模型实现高精度检测,并采用基于适配器的融合模型共享参数,有效解决了多标志点训练中的“跷跷板现象”,在显著降低资源消耗的同时超越了现有最优方法。
English Summary: This study introduces a novel dual-approach method for medical landmark detection that trains dedicated single-landmark models for high accuracy while using an adapter-based fusion model to efficiently share parameters, effectively resolving the seesaw phenomenon and reducing resource demands while outperforming existing methods.

Authors:Ruihan Yang, Fanghua Ye, Jian Li, Siyu Yuan, Yikai Zhang, Zhaopeng Tu, Xiaolong Li, Deqing Yang
Title: The Lighthouse of Language: Enhancing LLM Agents via Critique-Guided Improvement
Abstract:
Large language models (LLMs) have recently transformed from text-based assistants to autonomous agents capable of planning, reasoning, and iteratively improving their actions. While numerical reward signals and verifiers can effectively rank candidate actions, they often provide limited contextual guidance. In contrast, natural language feedback better aligns with the generative capabilities of LLMs, providing richer and more actionable suggestions. However, parsing and implementing this feedback effectively can be challenging for LLM-based agents. In this work, we introduce Critique-Guided Improvement (CGI), a novel two-player framework, comprising an actor model that explores an environment and a critic model that generates detailed nature language feedback. By training the critic to produce fine-grained assessments and actionable revisions, and the actor to utilize these critiques, our approach promotes more robust exploration of alternative strategies while avoiding local optima. Experiments in three interactive environments show that CGI outperforms existing baselines by a substantial margin. Notably, even a small critic model surpasses GPT-4 in feedback quality. The resulting actor achieves state-of-the-art performance, demonstrating the power of explicit iterative guidance to enhance decision-making in LLM-based agents.
Chinese: 本文提出“批判引导改进”(CGI)框架,通过评论模型生成细粒度自然语言反馈来指导行动模型,在交互环境中实现更全面的策略探索,显著超越现有基线性能。
English: This paper introduces Critique-Guided Improvement (CGI), a two-player framework where a critic model provides detailed natural language feedback to guide an actor model, enabling more robust exploration and superior performance in interactive environments compared to existing methods.

Authors:Marc Benedí San Millán, Angela Dai, Matthias Nießner
Title: Animating the Uncaptured: Humanoid Mesh Animation with Video Diffusion Models
Abstract:
Animation of humanoid characters is essential in various graphics applications, but requires significant time and cost to create realistic animations. We propose an approach to synthesize 4D animated sequences of input static 3D humanoid meshes, leveraging strong generalized motion priors from generative video models -- as such video models contain powerful motion information covering a wide variety of human motions. From an input static 3D humanoid mesh and a text prompt describing the desired animation, we synthesize a corresponding video conditioned on a rendered image of the 3D mesh. We then employ an underlying SMPL representation to animate the corresponding 3D mesh according to the video-generated motion, based on our motion optimization. This enables a cost-effective and accessible solution to enable the synthesis of diverse and realistic 4D animations.
中文: 该方法利用生成视频模型的运动先验和文本提示,从静态3D人形网格合成4D动画,为创建多样且逼真的动画提供了一种经济高效的解决方案。
English: This approach synthesizes 4D animations from static 3D humanoid meshes using generative video models' motion priors and text prompts, offering a cost-effective solution for creating diverse and realistic animations.

Authors:Ruhma Khan, Sumit Gulwani, Vu Le, Arjun Radhakrishna, Ashish Tiwari, Gust Verbruggen
Title: LLM-Guided Compositional Program Synthesis
Abstract:
Program synthesis from input-output examples, also called programming by example (PBE), has had tremendous impact on automating end-user tasks. Large language models (LLMs) have the ability to solve PBE tasks by generating code in different target languages, but they can fail unpredictably. To recover for failure, most approaches, such as self-reflection, use the LLM to solve the same task, but with a richer context. We introduce a novel technique that recovers from failure by constructing simpler subtasks for the LLM to solve. Our approach performs compositional program synthesis using LLMs, where LLM not only guides the decomposition of the PBE task into subtasks, but also solves the subtasks. We present different strategies for decomposing the original task. We experimentally show that our approach can solve challenging task instances that are not solved by self-reflection alone.
Chinese: 本文提出了一种新颖的组合式程序合成方法,利用大型语言模型将编程示例任务分解为更简单的子任务,从而能够解决仅靠自我反思方法无法处理的复杂实例。
English: This paper introduces a novel compositional program synthesis approach using large language models (LLMs) that decomposes programming by example (PBE) tasks into simpler subtasks, enabling the solution of challenging instances beyond the capabilities of self-reflection methods alone.

Authors:Harold Haodong Chen, Haojian Huang, Xianfeng Wu, Yexin Liu, Yajing Bai, Wen-Jie Shu, Harry Yang, Ser-Nam Lim
Title: Temporal Regularization Makes Your Video Generator Stronger
Abstract:
Temporal quality is a critical aspect of video generation, as it ensures consistent motion and realistic dynamics across frames. However, achieving high temporal coherence and diversity remains challenging. In this work, we explore temporal augmentation in video generation for the first time, and introduce FluxFlow for initial investigation, a strategy designed to enhance temporal quality. Operating at the data level, FluxFlow applies controlled temporal perturbations without requiring architectural modifications. Extensive experiments on UCF-101 and VBench benchmarks demonstrate that FluxFlow significantly improves temporal coherence and diversity across various video generation models, including U-Net, DiT, and AR-based architectures, while preserving spatial fidelity. These findings highlight the potential of temporal augmentation as a simple yet effective approach to advancing video generation quality.
中文摘要:FluxFlow首次在视频生成中引入时间增强技术,通过在数据层面施加受控扰动,无需修改模型结构即可显著提升多种生成模型的时间连贯性与多样性。
English Summary: FluxFlow introduces temporal augmentation to enhance video generation by applying controlled perturbations at the data level, significantly improving temporal coherence and diversity across multiple models without architectural changes.

Authors:Junyi Ao, Dekun Chen, Xiaohai Tian, Wenjie Feng, Jun Zhang, Lu Lu, Yuxuan Wang, Haizhou Li, Zhizheng Wu
Title: Solla: Towards a Speech-Oriented LLM That Hears Acoustic Context
Abstract:
Large Language Models (LLMs) have recently shown remarkable ability to process not only text but also multimodal inputs such as speech and audio. However, most existing models primarily focus on analyzing input signals using text instructions, overlooking scenarios in which speech instructions and audio are mixed and serve as inputs to the model. To address these challenges, we introduce Solla, a novel framework designed to understand speech-based questions and hear the acoustic context concurrently. Solla incorporates an audio tagging module to effectively identify and represent audio events, as well as an ASR-assisted prediction method to improve comprehension of spoken content. To rigorously evaluate Solla and other publicly available models, we propose a new benchmark dataset called SA-Eval, which includes three tasks: audio event classification, audio captioning, and audio question answering. SA-Eval has diverse speech instruction with various speaking styles, encompassing two difficulty levels, easy and hard, to capture the range of real-world acoustic conditions. Experimental results show that Solla performs on par with or outperforms baseline models on both the easy and hard test sets, underscoring its effectiveness in jointly understanding speech and audio.
中文:Solla是一种创新框架,能同时理解语音提问与音频语境,通过集成音频标记模块和语音识别辅助预测方法,在混合语音与音频处理任务中表现优于基线模型。
English: Solla is a new framework that simultaneously processes speech-based questions and audio context, outperforming baseline models in understanding mixed speech and audio inputs through its audio tagging and ASR-assisted prediction components.

Authors:Jingyi Chen, Songqiang Chen, Jialun Cao, Jiasi Shen, Shing-Chi Cheung
Title: When LLMs Meet API Documentation: Can Retrieval Augmentation Aid Code Generation Just as It Helps Developers?
Abstract:
Retrieval-augmented generation (RAG) has increasingly shown its power in extending large language models' (LLMs') capability beyond their pre-trained knowledge. Existing works have shown that RAG can help with software development tasks such as code generation, code update, and test generation. Yet, the effectiveness of adapting LLMs to fast-evolving or less common API libraries using RAG remains unknown. To bridge this gap, we take an initial step to study this unexplored yet practical setting - when developers code with a less common library, they often refer to its API documentation; likewise, when LLMs are allowed to look up API documentation via RAG, to what extent can LLMs be advanced? To mimic such a setting, we select four less common open-source Python libraries with a total of 1017 eligible APIs. We study the factors that affect the effectiveness of using the documentation of less common API libraries as additional knowledge for retrieval and generation. Our intensive study yields interesting findings: (1) RAG helps improve LLMs' performance by 83%-220%. (2) Example code contributes the most to advance LLMs, instead of the descriptive texts and parameter lists in the API documentation. (3) LLMs could sometimes tolerate mild noises (typos in description or incorrect parameters) by referencing their pre-trained knowledge or document context. Finally, we suggest that developers pay more attention to the quality and diversity of the code examples in the API documentation. The study sheds light on future low-code software development workflows.
中文: 检索增强生成通过利用API文档可将大型语言模型性能提升83%-220%,其中代码示例对适应冷门库最为关键,为低代码开发流程提供了重要启示。
English: Retrieval-augmented generation significantly enhances large language models' performance by 83%-220% when using API documentation, with code examples proving most effective for adapting to less common libraries.

Authors:Mingzhe Zheng, Yongqi Xu, Haojian Huang, Xuran Ma, Yexin Liu, Wenjie Shu, Yatian Pang, Feilong Tang, Qifeng Chen, Harry Yang, Ser-Nam Lim
Title: VideoGen-of-Thought: Step-by-step generating multi-shot video with minimal manual intervention
Abstract:
Current video generation models excel at short clips but fail to produce cohesive multi-shot narratives due to disjointed visual dynamics and fractured storylines. Existing solutions either rely on extensive manual scripting/editing or prioritize single-shot fidelity over cross-scene continuity, limiting their practicality for movie-like content. We introduce VideoGen-of-Thought (VGoT), a step-by-step framework that automates multi-shot video synthesis from a single sentence by systematically addressing three core challenges: (1) Narrative Fragmentation: Existing methods lack structured storytelling. We propose dynamic storyline modeling, which first converts the user prompt into concise shot descriptions, then elaborates them into detailed, cinematic specifications across five domains (character dynamics, background continuity, relationship evolution, camera movements, HDR lighting), ensuring logical narrative progression with self-validation. (2) Visual Inconsistency: Existing approaches struggle with maintaining visual consistency across shots. Our identity-aware cross-shot propagation generates identity-preserving portrait (IPP) tokens that maintain character fidelity while allowing trait variations (expressions, aging) dictated by the storyline. (3) Transition Artifacts: Abrupt shot changes disrupt immersion. Our adjacent latent transition mechanisms implement boundary-aware reset strategies that process adjacent shots' features at transition points, enabling seamless visual flow while preserving narrative continuity. VGoT generates multi-shot videos that outperform state-of-the-art baselines by 20.4% in within-shot face consistency and 17.4% in style consistency, while achieving over 100% better cross-shot consistency and 10x fewer manual adjustments than alternatives.
中文: VGoT通过动态剧情建模、跨镜头身份保持传播和相邻潜在过渡机制,解决了多镜头视频生成中的叙事碎片化、视觉不一致和过渡伪影问题,在各项一致性指标上实现显著提升。
English: VGoT is an automated framework that overcomes narrative fragmentation, visual inconsistency, and transition artifacts in multi-shot video generation by implementing dynamic storyline modeling, identity-preserving propagation, and seamless transition mechanisms, achieving significant improvements in consistency metrics.

Authors:Amir Hamza, Andrea Caraffa, Davide Boscaini, Fabio Poiesi
Title: Distilling 3D distinctive local descriptors for 6D pose estimation
Abstract:
Three-dimensional local descriptors are crucial for encoding geometric surface properties, making them essential for various point cloud understanding tasks. Among these descriptors, GeDi has demonstrated strong zero-shot 6D pose estimation capabilities but remains computationally impractical for real-world applications due to its expensive inference process. Can we retain GeDi's effectiveness while significantly improving its efficiency? In this paper, we explore this question by introducing a knowledge distillation framework that trains an efficient student model to regress local descriptors from a GeDi teacher. Our key contributions include: an efficient large-scale training procedure that ensures robustness to occlusions and partial observations while operating under compute and storage constraints, and a novel loss formulation that handles weak supervision from non-distinctive teacher descriptors. We validate our approach on five BOP Benchmark datasets and demonstrate a significant reduction in inference time while maintaining competitive performance with existing methods, bringing zero-shot 6D pose estimation closer to real-time feasibility. Project Website: https://tev-fbk.github.io/dGeDi/
中文摘要:本文提出一种知识蒸馏框架,通过训练高效学生模型来复现GeDi的局部描述符,在保持6D姿态估计竞争力的同时大幅降低推理时间,使零样本姿态估计更接近实时应用。
English Summary: This paper introduces a knowledge distillation framework that trains an efficient student model to replicate GeDi's local descriptors, achieving competitive 6D pose estimation performance while drastically reducing inference time for real-time applications.

Authors:Jianbo Zhao, Taiyu Ban, Zhihao Liu, Hangning Zhou, Xiyang Wang, Qibin Zhou, Hailong Qin, Mu Yang, Lei Liu, Bin Li
Title: DRoPE: Directional Rotary Position Embedding for Efficient Agent Interaction Modeling
Abstract:
Accurate and efficient modeling of agent interactions is essential for trajectory generation, the core of autonomous driving systems. Existing methods, scene-centric, agent-centric, and query-centric frameworks, each present distinct advantages and drawbacks, creating an impossible triangle among accuracy, computational time, and memory efficiency. To break this limitation, we propose Directional Rotary Position Embedding (DRoPE), a novel adaptation of Rotary Position Embedding (RoPE), originally developed in natural language processing. Unlike traditional relative position embedding (RPE), which introduces significant space complexity, RoPE efficiently encodes relative positions without explicitly increasing complexity but faces inherent limitations in handling angular information due to periodicity. DRoPE overcomes this limitation by introducing a uniform identity scalar into RoPE's 2D rotary transformation, aligning rotation angles with realistic agent headings to naturally encode relative angular information. We theoretically analyze DRoPE's correctness and efficiency, demonstrating its capability to simultaneously optimize trajectory generation accuracy, time complexity, and space complexity. Empirical evaluations compared with various state-of-the-art trajectory generation models, confirm DRoPE's good performance and significantly reduced space complexity, indicating both theoretical soundness and practical effectiveness. The video documentation is available at https://drope-traj.github.io/.
Chinese: 提出的方向性旋转位置嵌入(DRoPE)通过改进角度编码优化了自动驾驶中的轨迹生成,在精度、计算时间和内存效率上均优于现有方法,理论和实证评估均验证了其优越性能。
English: The proposed Directional Rotary Position Embedding (DRoPE) enhances trajectory generation in autonomous driving by optimizing accuracy, computational time, and memory efficiency through improved angular encoding, outperforming existing methods in both theoretical analysis and empirical evaluations.

Authors:Caleb Robinson, Anthony Ortiz, Allen Kim, Rahul Dodhia, Andrew Zolli, Shivaprakash K Nagaraju, James Oakleaf, Joe Kiesecker, Juan M. Lavista Ferres
Title: Global Renewables Watch: A Temporal Dataset of Solar and Wind Energy Derived from Satellite Imagery
Abstract:
We present a comprehensive global temporal dataset of commercial solar photovoltaic (PV) farms and onshore wind turbines, derived from high-resolution satellite imagery analyzed quarterly from the fourth quarter of 2017 to the second quarter of 2024. We create this dataset by training deep learning-based segmentation models to identify these renewable energy installations from satellite imagery, then deploy them on over 13 trillion pixels covering the world. For each detected feature, we estimate the construction date and the preceding land use type. This dataset offers crucial insights into progress toward sustainable development goals and serves as a valuable resource for policymakers, researchers, and stakeholders aiming to assess and promote effective strategies for renewable energy deployment. Our final spatial dataset includes 375,197 individual wind turbines and 86,410 solar PV installations. We aggregate our predictions to the country level -- estimating total power capacity based on construction date, solar PV area, and number of windmills -- and find an $r^2$ value of $0.96$ and $0.93$ for solar PV and onshore wind respectively compared to IRENA's most recent 2023 country-level capacity estimates.
中文摘要:本研究通过深度学习分析2017至2024年间的卫星图像,构建了全球商业太阳能电站和陆上风力发电机的时空数据集,为可再生能源政策制定和可持续发展评估提供了重要依据。
English Summary: This study introduces a global dataset of commercial solar farms and onshore wind turbines developed through deep learning analysis of satellite imagery from 2017-2024, providing critical insights for renewable energy policy and sustainable development assessments.

Authors:Chuxin Wang, Wenfei Yang, Xiang Liu, Tianzhu Zhang
Title: State Space Model Meets Transformer: A New Paradigm for 3D Object Detection
Abstract:
DETR-based methods, which use multi-layer transformer decoders to refine object queries iteratively, have shown promising performance in 3D indoor object detection. However, the scene point features in the transformer decoder remain fixed, leading to minimal contributions from later decoder layers, thereby limiting performance improvement. Recently, State Space Models (SSM) have shown efficient context modeling ability with linear complexity through iterative interactions between system states and inputs. Inspired by SSMs, we propose a new 3D object DEtection paradigm with an interactive STate space model (DEST). In the interactive SSM, we design a novel state-dependent SSM parameterization method that enables system states to effectively serve as queries in 3D indoor detection tasks. In addition, we introduce four key designs tailored to the characteristics of point cloud and SSM: The serialization and bidirectional scanning strategies enable bidirectional feature interaction among scene points within the SSM. The inter-state attention mechanism models the relationships between state points, while the gated feed-forward network enhances inter-channel correlations. To the best of our knowledge, this is the first method to model queries as system states and scene points as system inputs, which can simultaneously update scene point features and query features with linear complexity. Extensive experiments on two challenging datasets demonstrate the effectiveness of our DEST-based method. Our method improves the GroupFree baseline in terms of AP50 on ScanNet V2 (+5.3) and SUN RGB-D (+3.2) datasets. Based on the VDETR baseline, Our method sets a new SOTA on the ScanNetV2 and SUN RGB-D datasets.
中文: DEST方法提出了一种交互式状态空间模型,以线性复杂度同时更新场景点特征和查询特征,在三维室内物体检测基准上实现了最先进的性能。
English: The DEST method introduces an interactive state space model that dynamically updates both scene point features and query features with linear complexity, achieving state-of-the-art performance on 3D indoor object detection benchmarks.

Authors:Ziwei Ji, Lei Yu, Yeskendir Koishekenov, Yejin Bang, Anthony Hartshorn, Alan Schelten, Cheng Zhang, Pascale Fung, Nicola Cancedda
Title: Calibrating Verbal Uncertainty as a Linear Feature to Reduce Hallucinations
Abstract:
LLMs often adopt an assertive language style also when making false claims. Such ``overconfident hallucinations'' mislead users and erode trust. Achieving the ability to express in language the actual degree of uncertainty around a claim is therefore of great importance. We find that ``verbal uncertainty'' is governed by a single linear feature in the representation space of LLMs, and show that this has only moderate correlation with the actual ``semantic uncertainty'' of the model. We apply this insight and show that (1) the mismatch between semantic and verbal uncertainty is a better predictor of hallucinations than semantic uncertainty alone and (2) we can intervene on verbal uncertainty at inference time and reduce confident hallucinations on short-form answers, achieving an average relative reduction of ~30%.
中文: 大型语言模型常以过度自信表达错误主张,研究发现语义不确定性与语言不确定性之间的不匹配能更有效预测幻觉,通过干预可将其在简短回答中的自信幻觉平均减少约30%。
English: Large language models often express false claims with unwarranted confidence, but by identifying a mismatch between their semantic and verbal uncertainty, interventions can reduce such overconfident hallucinations by about 30% in short-form answers.

Authors:Wei Song, Yuran Wang, Zijia Song, Yadong Li, Haoze Sun, Weipeng Chen, Zenan Zhou, Jianhua Xu, Jiaqi Wang, Kaicheng Yu
Title: DualToken: Towards Unifying Visual Understanding and Generation with Dual Visual Vocabularies
Abstract:
The differing representation spaces required for visual understanding and generation pose a challenge in unifying them within the autoregressive paradigm of large language models. A vision tokenizer trained for reconstruction excels at capturing low-level perceptual details, making it well-suited for visual generation but lacking high-level semantic representations for understanding tasks. Conversely, a vision encoder trained via contrastive learning aligns well with language but struggles to decode back into the pixel space for generation tasks. To bridge this gap, we propose DualToken, a method that unifies representations for both understanding and generation within a single tokenizer. However, directly integrating reconstruction and semantic objectives in a single tokenizer creates conflicts, leading to degraded performance in both reconstruction quality and semantic performance. Instead of forcing a single codebook to handle both semantic and perceptual information, DualToken disentangles them by introducing separate codebooks for high and low-level features, effectively transforming their inherent conflict into a synergistic relationship. As a result, DualToken achieves state-of-the-art performance in both reconstruction and semantic tasks while demonstrating remarkable effectiveness in downstream MLLM understanding and generation tasks. Notably, we also show that DualToken, as a unified tokenizer, surpasses the naive combination of two distinct types vision encoders, providing superior performance within a unified MLLM.
中文总结:DualToken通过为高层语义和低层感知特征分别引入码本,解决了自回归模型中视觉理解与生成的冲突,在两项任务中均实现了最优性能。
English Summary: DualToken introduces separate codebooks for high-level semantics and low-level perceptual features to resolve the conflict between visual understanding and generation in autoregressive models, achieving state-of-the-art performance in both tasks.

Authors:Huan Ren, Wenfei Yang, Xiang Liu, Shifeng Zhang, Tianzhu Zhang
Title: Learning Shape-Independent Transformation via Spherical Representations for Category-Level Object Pose Estimation
Abstract:
Category-level object pose estimation aims to determine the pose and size of novel objects in specific categories. Existing correspondence-based approaches typically adopt point-based representations to establish the correspondences between primitive observed points and normalized object coordinates. However, due to the inherent shape-dependence of canonical coordinates, these methods suffer from semantic incoherence across diverse object shapes. To resolve this issue, we innovatively leverage the sphere as a shared proxy shape of objects to learn shape-independent transformation via spherical representations. Based on this insight, we introduce a novel architecture called SpherePose, which yields precise correspondence prediction through three core designs. Firstly, We endow the point-wise feature extraction with SO(3)-invariance, which facilitates robust mapping between camera coordinate space and object coordinate space regardless of rotation transformation. Secondly, the spherical attention mechanism is designed to propagate and integrate features among spherical anchors from a comprehensive perspective, thus mitigating the interference of noise and incomplete point cloud. Lastly, a hyperbolic correspondence loss function is designed to distinguish subtle distinctions, which can promote the precision of correspondence prediction. Experimental results on CAMERA25, REAL275 and HouseCat6D benchmarks demonstrate the superior performance of our method, verifying the effectiveness of spherical representations and architectural innovations.
中文: SpherePose创新性地采用球形表示作为共享代理形状,通过SO(3)不变特征、球形注意力机制和双曲对应损失函数,有效解决了类别级物体姿态估计中的语义不一致问题,在多个基准测试中展现出优越性能。
English: SpherePose introduces a novel approach to category-level object pose estimation by using spherical representations as a shared proxy shape to overcome semantic incoherence in existing methods, achieving superior performance through SO(3)-invariant features, spherical attention, and hyperbolic correspondence loss.

Authors:Jiaqi Yang, Wenting Chen, Xiaohan Xing, Sean He, Xiaoling Luo, Xinheng Lyu, Linlin Shen, Guoping Qiu
Title: HySurvPred: Multimodal Hyperbolic Embedding with Angle-Aware Hierarchical Contrastive Learning and Uncertainty Constraints for Survival Prediction
Abstract:
Multimodal learning that integrates histopathology images and genomic data holds great promise for cancer survival prediction. However, existing methods face key limitations: 1) They rely on multimodal mapping and metrics in Euclidean space, which cannot fully capture the hierarchical structures in histopathology (among patches from different resolutions) and genomics data (from genes to pathways). 2) They discretize survival time into independent risk intervals, which ignores its continuous and ordinal nature and fails to achieve effective optimization. 3) They treat censorship as a binary indicator, excluding censored samples from model optimization and not making full use of them. To address these challenges, we propose HySurvPred, a novel framework for survival prediction that integrates three key modules: Multimodal Hyperbolic Mapping (MHM), Angle-aware Ranking-based Contrastive Loss (ARCL) and Censor-Conditioned Uncertainty Constraint (CUC). Instead of relying on Euclidean space, we design the MHM module to explore the inherent hierarchical structures within each modality in hyperbolic space. To better integrate multimodal features in hyperbolic space, we introduce the ARCL module, which uses ranking-based contrastive learning to preserve the ordinal nature of survival time, along with the CUC module to fully explore the censored data. Extensive experiments demonstrate that our method outperforms state-of-the-art methods on five benchmark datasets. The source code is to be released.
中文: 提出的HySurvPred框架通过双曲空间捕捉多模态数据的层次结构,采用基于排序的对比学习保留生存时间的顺序特性,并充分利用删失数据,在多个基准数据集上实现了最优性能。
English: The proposed HySurvPred framework addresses limitations in cancer survival prediction by using hyperbolic space to capture hierarchical data structures, employing contrastive learning for survival time ranking, and fully utilizing censored data, achieving superior performance on benchmark datasets.

Authors:Johannes Meier, Louis Inchingolo, Oussema Dhaouadi, Yan Xia, Jacques Kaiser, Daniel Cremers
Title: MonoCT: Overcoming Monocular 3D Detection Domain Shift with Consistent Teacher Models
Abstract:
We tackle the problem of monocular 3D object detection across different sensors, environments, and camera setups. In this paper, we introduce a novel unsupervised domain adaptation approach, MonoCT, that generates highly accurate pseudo labels for self-supervision. Inspired by our observation that accurate depth estimation is critical to mitigating domain shifts, MonoCT introduces a novel Generalized Depth Enhancement (GDE) module with an ensemble concept to improve depth estimation accuracy. Moreover, we introduce a novel Pseudo Label Scoring (PLS) module by exploring inner-model consistency measurement and a Diversity Maximization (DM) strategy to further generate high-quality pseudo labels for self-training. Extensive experiments on six benchmarks show that MonoCT outperforms existing SOTA domain adaptation methods by large margins (~21% minimum for AP Mod.) and generalizes well to car, traffic camera and drone views.
中文: 本文提出MonoCT方法,通过广义深度增强模块和伪标签评分策略实现无监督领域自适应,在多种摄像头配置下显著提升了单目3D物体检测性能,大幅超越现有最优方法。
English: This paper introduces MonoCT, an unsupervised domain adaptation method that enhances monocular 3D object detection through a Generalized Depth Enhancement module and Pseudo Label Scoring strategy, achieving state-of-the-art performance across diverse camera setups.

Authors:Zeng Wang, Minghao Shao, Jitendra Bhandari, Likhitha Mankali, Ramesh Karri, Ozgur Sinanoglu, Muhammad Shafique, Johann Knechtel
Title: VeriContaminated: Assessing LLM-Driven Verilog Coding for Data Contamination
Abstract:
Large Language Models (LLMs) have revolutionized code generation, achieving exceptional results on various established benchmarking frameworks. However, concerns about data contamination - where benchmark data inadvertently leaks into pre-training or fine-tuning datasets - raise questions about the validity of these evaluations. While this issue is known, limiting the industrial adoption of LLM-driven software engineering, hardware coding has received little to no attention regarding these risks. For the first time, we analyze state-of-the-art (SOTA) evaluation frameworks for Verilog code generation (VerilogEval and RTLLM), using established methods for contamination detection (CCD and Min-K% Prob). We cover SOTA commercial and open-source LLMs (CodeGen2.5, Minitron 4b, Mistral 7b, phi-4 mini, LLaMA-{1,2,3.1}, GPT-{2,3.5,4o}, Deepseek-Coder, and CodeQwen 1.5), in baseline and fine-tuned models (RTLCoder and Verigen). Our study confirms that data contamination is a critical concern. We explore mitigations and the resulting trade-offs for code quality vs fairness (i.e., reducing contamination toward unbiased benchmarking).
中文摘要:本研究首次揭示Verilog代码生成基准测试中存在严重数据污染问题,证实其影响评估有效性,并探讨了代码质量与公平基准之间的权衡关系。
English Summary: This study reveals significant data contamination in state-of-the-art Verilog code generation benchmarks, confirming it undermines evaluation validity while exploring trade-offs between code quality and fair assessment.

Authors:James Burgess, Jeffrey J Nirschl, Laura Bravo-Sánchez, Alejandro Lozano, Sanket Rajan Gupte, Jesus G. Galaz-Montoya, Yuhui Zhang, Yuchang Su, Disha Bhowmik, Zachary Coman, Sarina M. Hasan, Alexandra Johannesson, William D. Leineweber, Malvika G Nair, Ridhi Yarlagadda, Connor Zuraski, Wah Chiu, Sarah Cohen, Jan N. Hansen, Manuel D Leonetti, Chad Liu, Emma Lundberg, Serena Yeung-Levy
Title: MicroVQA: A Multimodal Reasoning Benchmark for Microscopy-Based Scientific Research
Abstract:
Scientific research demands sophisticated reasoning over multimodal data, a challenge especially prevalent in biology. Despite recent advances in multimodal large language models (MLLMs) for AI-assisted research, existing multimodal reasoning benchmarks only target up to college-level difficulty, while research-level benchmarks emphasize lower-level perception, falling short of the complex multimodal reasoning needed for scientific discovery. To bridge this gap, we introduce MicroVQA, a visual-question answering (VQA) benchmark designed to assess three reasoning capabilities vital in research workflows: expert image understanding, hypothesis generation, and experiment proposal. MicroVQA consists of 1,042 multiple-choice questions (MCQs) curated by biology experts across diverse microscopy modalities, ensuring VQA samples represent real scientific practice. In constructing the benchmark, we find that standard MCQ generation methods induce language shortcuts, motivating a new two-stage pipeline: an optimized LLM prompt structures question-answer pairs into MCQs; then, an agent-based `RefineBot' updates them to remove shortcuts. Benchmarking on state-of-the-art MLLMs reveal a peak performance of 53\%; models with smaller LLMs only slightly underperform top models, suggesting that language-based reasoning is less challenging than multimodal reasoning; and tuning with scientific articles enhances performance. Expert analysis of chain-of-thought responses shows that perception errors are the most frequent, followed by knowledge errors and then overgeneralization errors. These insights highlight the challenges in multimodal scientific reasoning, showing MicroVQA is a valuable resource advancing AI-driven biomedical research. MicroVQA is available at https://huggingface.co/datasets/jmhb/microvqa, and project page at https://jmhb0.github.io/microvqa.
中文: MicroVQA是一个专为生物学研究设计的视觉问答基准,用于评估多模态推理能力,填补了现有基准的不足,并通过测试先进模型揭示了感知错误是科学推理中的主要挑战。
English: MicroVQA is a research-level visual question answering benchmark developed to evaluate critical multimodal reasoning skills in biology, addressing gaps in existing benchmarks and revealing key challenges through testing state-of-the-art models.

Authors:Zeng Wang, Minghao Shao, Mohammed Nabeel, Prithwish Basu Roy, Likhitha Mankali, Jitendra Bhandari, Ramesh Karri, Ozgur Sinanoglu, Muhammad Shafique, Johann Knechtel
Title: VeriLeaky: Navigating IP Protection vs Utility in Fine-Tuning for LLM-Driven Verilog Coding
Abstract:
Large language models (LLMs) offer significant potential for coding, yet fine-tuning (FT) with curated data is essential for niche languages like Verilog. Using proprietary intellectual property (IP) for FT presents a serious risk, as FT data can be leaked through LLM inference. This leads to a critical dilemma for design houses: seeking to build externally accessible LLMs offering competitive Verilog coding, how can they leverage in-house IP to enhance FT utility while ensuring IP protection? For the first time in the literature, we study this dilemma. Using LLaMA 3.1-8B, we conduct in-house FT on a baseline Verilog dataset (RTLCoder) supplemented with our own in-house IP, which is validated through multiple tape-outs. To rigorously assess IP leakage, we quantify structural similarity (AST/Dolos) and functional equivalence (Synopsys Formality) between generated codes and our in-house IP. We show that our IP can indeed be leaked, confirming the threat. As defense, we evaluate logic locking of Verilog codes (ASSURE). This offers some level of protection, yet reduces the IP's utility for FT and degrades the LLM's performance. Our study shows the need for novel strategies that are both effective and minimally disruptive to FT, an essential effort for enabling design houses to fully utilize their proprietary IP toward LLM-driven Verilog coding.
中文摘要:针对Verilog等小众语言,使用专有知识产权对大语言模型进行微调存在通过模型推理泄露IP的风险,亟需开发既能有效保护知识产权又尽可能减少对微调干扰的新型防护策略。
English Summary: Fine-tuning large language models with proprietary intellectual property for niche languages like Verilog risks IP leakage through model inference, necessitating novel protection strategies that balance security with model utility.

Authors:Alexander Pugachev, Alena Fenogenova, Vladislav Mikhailov, Ekaterina Artemova
Title: REPA: Russian Error Types Annotation for Evaluating Text Generation and Judgment Capabilities
Abstract:
Recent advances in large language models (LLMs) have introduced the novel paradigm of using LLMs as judges, where an LLM evaluates and scores the outputs of another LLM, which often correlates highly with human preferences. However, the use of LLM-as-a-judge has been primarily studied in English. In this paper, we evaluate this framework in Russian by introducing the Russian Error tyPes Annotation dataset (REPA), a dataset of 1k user queries and 2k LLM-generated responses. Human annotators labeled each response pair expressing their preferences across ten specific error types, as well as selecting an overall preference. We rank six generative LLMs across the error types using three rating systems based on human preferences. We also evaluate responses using eight LLM judges in zero-shot and few-shot settings. We describe the results of analyzing the judges and position and length biases. Our findings reveal a notable gap between LLM judge performance in Russian and English. However, rankings based on human and LLM preferences show partial alignment, suggesting that while current LLM judges struggle with fine-grained evaluation in Russian, there is potential for improvement.
中文摘要:最新研究通过引入俄语错误类型标注数据集,探索了使用大语言模型作为俄语输出评估工具的效果,发现其与英语表现存在差距,但在与人类偏好对齐方面显示出改进潜力。
English Summary: Recent research explores using large language models as judges for evaluating other models' outputs in Russian, revealing a performance gap compared to English but showing potential for alignment with human preferences through the new REPA dataset.

Authors:Yunshuang Yuan, Yan Xia, Daniel Cremers, Monika Sester
Title: SparseAlign: A Fully Sparse Framework for Cooperative Object Detection
Abstract:
Cooperative perception can increase the view field and decrease the occlusion of an ego vehicle, hence improving the perception performance and safety of autonomous driving. Despite the success of previous works on cooperative object detection, they mostly operate on dense Bird's Eye View (BEV) feature maps, which are computationally demanding and can hardly be extended to long-range detection problems. More efficient fully sparse frameworks are rarely explored. In this work, we design a fully sparse framework, SparseAlign, with three key features: an enhanced sparse 3D backbone, a query-based temporal context learning module, and a robust detection head specially tailored for sparse features. Extensive experimental results on both OPV2V and DairV2X datasets show that our framework, despite its sparsity, outperforms the state of the art with less communication bandwidth requirements. In addition, experiments on the OPV2Vt and DairV2Xt datasets for time-aligned cooperative object detection also show a significant performance gain compared to the baseline works.
中文: 协同感知通过扩展视野和减少遮挡提升自动驾驶安全性,而SparseAlign框架凭借其高效的稀疏设计和更低带宽需求,在性能上超越了现有最佳方法。
English: Cooperative perception enhances autonomous driving safety by expanding the field of view and reducing occlusions, with the novel SparseAlign framework outperforming existing methods through its efficient sparse design and lower bandwidth usage.

Authors:Songjun Tu, Jiahao Lin, Xiangyu Tian, Qichao Zhang, Linjing Li, Yuqian Fu, Nan Xu, Wei He, Xiangyuan Lan, Dongmei Jiang, Dongbin Zhao
Title: Enhancing LLM Reasoning with Iterative DPO: A Comprehensive Empirical Investigation
Abstract:
Recent advancements in post-training methodologies for large language models (LLMs) have highlighted reinforcement learning (RL) as a critical component for enhancing reasoning. However, the substantial computational costs associated with RL-based approaches have led to growing interest in alternative paradigms, such as Direct Preference Optimization (DPO). In this study, we investigate the effectiveness of DPO in facilitating self-improvement for LLMs through iterative preference-based learning. We demonstrate that a single round of DPO with coarse filtering significantly enhances mathematical reasoning performance, particularly for strong base model. Furthermore, we design an iterative enhancement framework for both the generator and the reward model (RM), enabling their mutual improvement through online interaction across multiple rounds of DPO. Finally, with simple verifiable rewards, our model DPO-VP achieves RL-level performance with significantly lower computational overhead. These findings highlight DPO as a scalable and cost-effective alternative to RL, offering a practical solution for enhancing LLM reasoning in resource-constrained situations.
中文: 直接偏好优化(DPO)作为一种计算高效的强化学习替代方案,通过迭代式自我增强显著提升大语言模型的数学推理能力,在降低成本的同时实现了可比性能。
English: Direct Preference Optimization (DPO) offers a computationally efficient alternative to reinforcement learning, significantly improving large language models' mathematical reasoning through iterative self-enhancement and achieving comparable performance with lower costs.

Authors:Yuebing Liang, Shenhao Wang, Jiangbo Yu, Zhan Zhao, Jinhua Zhao, Sandy Pentland
Title: Analyzing sequential activity and travel decisions with interpretable deep inverse reinforcement learning
Abstract:
Travel demand modeling has shifted from aggregated trip-based models to behavior-oriented activity-based models because daily trips are essentially driven by human activities. To analyze the sequential activity-travel decisions, deep inverse reinforcement learning (DIRL) has proven effective in learning the decision mechanisms by approximating a reward function to represent preferences and a policy function to replicate observed behavior using deep neural networks (DNNs). However, most existing research has focused on using DIRL to enhance only prediction accuracy, with limited exploration into interpreting the underlying decision mechanisms guiding sequential decision-making. To address this gap, we introduce an interpretable DIRL framework for analyzing activity-travel decision processes, bridging the gap between data-driven machine learning and theory-driven behavioral models. Our proposed framework adapts an adversarial IRL approach to infer the reward and policy functions of activity-travel behavior. The policy function is interpreted through a surrogate interpretable model based on choice probabilities from the policy function, while the reward function is interpreted by deriving both short-term rewards and long-term returns for various activity-travel patterns. Our analysis of real-world travel survey data reveals promising results in two key areas: (i) behavioral pattern insights from the policy function, highlighting critical factors in decision-making and variations among socio-demographic groups, and (ii) behavioral preference insights from the reward function, indicating the utility individuals gain from specific activity sequences.
中文摘要:本文提出了一种可解释的深度逆向强化学习框架,通过推断奖励函数和策略函数来分析活动-出行决策过程,从真实出行数据中揭示了行为模式和偏好特征。
English Summary: This paper introduces an interpretable deep inverse reinforcement learning framework that analyzes activity-travel decision processes by inferring reward and policy functions, providing insights into behavioral patterns and preferences from real-world travel data.

Authors:Arthur Corrêa, Cristóvão Silva, Liming Xu, Alexandra Brintrup, Samuel Moniz
Title: TuneNSearch: a hybrid transfer learning and local search approach for solving vehicle routing problems
Abstract:
This paper introduces TuneNSearch, a hybrid transfer learning and local search approach for addressing different variants of vehicle routing problems (VRP). Recently, multi-task learning has gained much attention for solving VRP variants. However, this adaptability often compromises the performance of the models. To address this challenge, we first pre-train a reinforcement learning model on the multi-depot VRP, followed by a short fine-tuning phase to adapt it to different variants. By leveraging the complexity of the multi-depot VRP, the pre-trained model learns richer node representations and gains more transferable knowledge compared to models trained on simpler routing problems, such as the traveling salesman problem. TuneNSearch employs, in the first stage, a Transformer-based architecture, augmented with a residual edge-graph attention network to capture the impact of edge distances and residual connections between layers. This architecture allows for a more precise capture of graph-structured data, improving the encoding of VRP's features. After inference, our model is also coupled with a second stage composed of a local search algorithm, which yields substantial performance gains with minimal computational overhead added. Results show that TuneNSearch outperforms many existing state-of-the-art models trained for each VRP variant, requiring only one-fifth of the training epochs. Our approach demonstrates strong generalization, achieving high performance across different tasks, distributions and problem sizes, thus addressing a long-standing gap in the literature.
本文介绍了TuneNSearch,这是一种结合迁移学习和局部搜索的混合方法,能高效解决多种车辆路径问题变体,以最少训练实现优异性能并具备强大泛化能力。
This paper presents TuneNSearch, a hybrid method combining transfer learning and local search to efficiently solve various vehicle routing problem variants, achieving superior performance with minimal training and strong generalization.

Authors:Xianzu Wu, Zhenxin Ai, Harry Yang, Ser-Nam Lim, Jun Liu, Huan Wang
Title: Niagara: Normal-Integrated Geometric Affine Field for Scene Reconstruction from a Single View
Abstract:
Recent advances in single-view 3D scene reconstruction have highlighted the challenges in capturing fine geometric details and ensuring structural consistency, particularly in high-fidelity outdoor scene modeling. This paper presents Niagara, a new single-view 3D scene reconstruction framework that can faithfully reconstruct challenging outdoor scenes from a single input image for the first time. Our approach integrates monocular depth and normal estimation as input, which substantially improves its ability to capture fine details, mitigating common issues like geometric detail loss and deformation. Additionally, we introduce a geometric affine field (GAF) and 3D self-attention as geometry-constraint, which combines the structural properties of explicit geometry with the adaptability of implicit feature fields, striking a balance between efficient rendering and high-fidelity reconstruction. Our framework finally proposes a specialized encoder-decoder architecture, where a depth-based 3D Gaussian decoder is proposed to predict 3D Gaussian parameters, which can be used for novel view synthesis. Extensive results and analyses suggest that our Niagara surpasses prior SoTA approaches such as Flash3D in both single-view and dual-view settings, significantly enhancing the geometric accuracy and visual fidelity, especially in outdoor scenes.
中文: Niagara框架通过结合单目深度与法线估计及几何约束,提出了一种创新的单视图户外场景三维重建方法,在细节还原和结构一致性上显著超越现有技术,实现了更高的几何精度和视觉保真度。
English: The Niagara framework introduces a novel single-view 3D reconstruction method for outdoor scenes by integrating monocular depth and normal estimation with geometric constraints, achieving superior detail capture and surpassing prior state-of-the-art approaches in accuracy and fidelity.

Authors:Yunbo Long, Liming Xu, Ge Zheng, Alexandra Brintrup
Title: PA-CFL: Privacy-Adaptive Clustered Federated Learning for Transformer-Based Sales Forecasting on Heterogeneous Retail Data
Abstract:
Federated learning (FL) enables retailers to share model parameters for demand forecasting while maintaining privacy. However, heterogeneous data across diverse regions, driven by factors such as varying consumer behavior, poses challenges to the effectiveness of federated learning. To tackle this challenge, we propose Privacy-Adaptive Clustered Federated Learning (PA-CFL) tailored for demand forecasting on heterogeneous retail data. By leveraging differential privacy and feature importance distribution, PA-CFL groups retailers into distinct ``bubbles'', each forming its own federated learning system to effectively isolate data heterogeneity. Within each bubble, Transformer models are designed to predict local sales for each client. Our experiments demonstrate that PA-CFL significantly surpasses FedAvg and outperforms local learning in demand forecasting performance across all participating clients. Compared to local learning, PA-CFL achieves a 5.4% improvement in R^2, a 69% reduction in RMSE, and a 45% decrease in MAE. Our approach enables effective FL through adaptive adjustments to diverse noise levels and the range of clients participating in each bubble. By grouping participants and proactively filtering out high-risk clients, PA-CFL mitigates potential threats to the FL system. The findings demonstrate PA-CFL's ability to enhance federated learning in time series prediction tasks with heterogeneous data, achieving a balance between forecasting accuracy and privacy preservation in retail applications. Additionally, PA-CFL's capability to detect and neutralize poisoned data from clients enhances the system's robustness and reliability.
中文: 提出的隐私自适应聚类联邦学习(PA-CFL)通过差分隐私和特征重要性将零售商分组为专门集群,在各集群内使用Transformer模型进行本地销售预测,在有效应对数据异质性的同时显著提升了需求预测精度,并实现了隐私保护与系统鲁棒性的平衡。
English: The proposed Privacy-Adaptive Clustered Federated Learning (PA-CFL) groups retailers into specialized clusters using differential privacy and feature importance, employing Transformer models within each cluster to significantly improve demand forecasting accuracy while maintaining privacy and system robustness against data heterogeneity and security threats.

Authors:Yunbo Long, Liming Xu, Alexandra Brintrup
Title: Efficient and Privacy-Preserved Link Prediction via Condensed Graphs
Abstract:
Link prediction is crucial for uncovering hidden connections within complex networks, enabling applications such as identifying potential customers and products. However, this research faces significant challenges, including concerns about data privacy, as well as high computational and storage costs, especially when dealing with large-scale networks. Condensed graphs, which are much smaller than the original graphs while retaining essential information, has become an effective solution to both maintain data utility and preserve privacy. Existing methods, however, initialize synthetic graphs through random node selection without considering node connectivity, and are mainly designed for node classification tasks. As a result, their potential for privacy-preserving link prediction remains largely unexplored. We introduce HyDRO\textsuperscript{+}, a graph condensation method guided by algebraic Jaccard similarity, which leverages local connectivity information to optimize condensed graph structures. Extensive experiments on four real-world networks show that our method outperforms state-of-the-art methods and even the original networks in balancing link prediction accuracy and privacy preservation. Moreover, our method achieves nearly 20* faster training and reduces storage requirements by 452*, as demonstrated on the Computers dataset, compared to link prediction on the original networks. This work represents the first attempt to leverage condensed graphs for privacy-preserving link prediction information sharing in real-world complex networks. It offers a promising pathway for preserving link prediction information while safeguarding privacy, advancing the use of graph condensation in large-scale networks with privacy concerns.
中文摘要:HyDRO⁺提出了一种基于代数Jaccard相似度的图压缩方法,通过优化压缩图结构实现隐私保护的链接预测,在精度、训练速度和存储效率上均优于现有方法及原始网络。
English Summary: HyDRO⁺ introduces a graph condensation method using algebraic Jaccard similarity to optimize structures for privacy-preserving link prediction, achieving superior accuracy, faster training, and reduced storage compared to existing methods.

Authors:Chen Shu, Mengke Li, Yiqun Zhang, Yang Lu, Bo Han, Yiu-ming Cheung, Hanzi Wang
Title: Classifying Long-tailed and Label-noise Data via Disentangling and Unlearning
Abstract:
In real-world datasets, the challenges of long-tailed distributions and noisy labels often coexist, posing obstacles to the model training and performance. Existing studies on long-tailed noisy label learning (LTNLL) typically assume that the generation of noisy labels is independent of the long-tailed distribution, which may not be true from a practical perspective. In real-world situaiton, we observe that the tail class samples are more likely to be mislabeled as head, exacerbating the original degree of imbalance. We call this phenomenon as ``tail-to-head (T2H)'' noise. T2H noise severely degrades model performance by polluting the head classes and forcing the model to learn the tail samples as head. To address this challenge, we investigate the dynamic misleading process of the nosiy labels and propose a novel method called Disentangling and Unlearning for Long-tailed and Label-noisy data (DULL). It first employs the Inner-Feature Disentangling (IFD) to disentangle feature internally. Based on this, the Inner-Feature Partial Unlearning (IFPU) is then applied to weaken and unlearn incorrect feature regions correlated to wrong classes. This method prevents the model from being misled by noisy labels, enhancing the model's robustness against noise. To provide a controlled experimental environment, we further propose a new noise addition algorithm to simulate T2H noise. Extensive experiments on both simulated and real-world datasets demonstrate the effectiveness of our proposed method.
中文摘要:现实数据集常面临长尾分布和噪声标签的双重挑战,其中尾类样本更易被误标为头类,加剧了数据不平衡;为此提出的DULL方法通过特征解耦和部分遗忘技术,有效提升模型对此类噪声的鲁棒性。
English Summary: Real-world datasets often face the dual challenges of long-tailed distributions and noisy labels, where tail classes are more likely to be mislabeled as head classes, worsening imbalance; to address this, the proposed DULL method uses feature disentangling and partial unlearning to enhance model robustness against such noise.

Authors:Hai Zhao, Hongqiu Wu, Dongjie Yang, Anni Zou, Jiale Hong
Title: BriLLM: Brain-inspired Large Language Model
Abstract:
We introduce BriLLM, a brain-inspired large language model that fundamentally redefines the foundations of machine learning through its implementation of Signal Fully-connected flowing (SiFu) learning. This work addresses the critical bottleneck hindering AI's progression toward Artificial General Intelligence (AGI)--the disconnect between language models and "world models"--as well as the fundamental limitations of Transformer-based architectures rooted in the conventional representation learning paradigm. BriLLM incorporates two pivotal neurocognitive principles: (1) static semantic mapping, where tokens are mapped to specialized nodes analogous to cortical areas, and (2) dynamic signal propagation, which simulates electrophysiological information dynamics observed in brain activity. This architecture enables multiple transformative breakthroughs: natural multi-modal compatibility, full model interpretability at the node level, context-length independent scaling, and the first global-scale simulation of brain-like information processing for language tasks. Our initial 1-2B parameter models successfully replicate GPT-1-level generative capabilities while demonstrating stable perplexity reduction. Scalability analyses confirm the feasibility of 100-200B parameter variants capable of processing 40,000-token vocabularies. The paradigm is reinforced by both Occam's Razor--evidenced in the simplicity of direct semantic mapping--and natural evolution--given the brain's empirically validated AGI architecture. BriLLM establishes a novel, biologically grounded framework for AGI advancement that addresses fundamental limitations of current approaches.
中文: BriLLM是一种受大脑启发的大型语言模型,通过引入信号全连接流学习,融合神经认知原理以克服人工智能的局限,实现多模态兼容性和全节点可解释性等突破,为推进通用人工智能建立了基于生物学的全新框架。
English: BriLLM is a brain-inspired large language model that introduces Signal Fully-connected flowing learning, incorporating neurocognitive principles to overcome AI's limitations and enable breakthroughs such as multi-modal compatibility and full interpretability, establishing a biologically grounded framework for advancing toward Artificial General Intelligence.

Authors:Weiye Gan, Yicheng Li, Qian Lin, Zuoqiang Shi
Title: Neural Tangent Kernel of Neural Networks with Loss Informed by Differential Operators
Abstract:
Spectral bias is a significant phenomenon in neural network training and can be explained by neural tangent kernel (NTK) theory. In this work, we develop the NTK theory for deep neural networks with physics-informed loss, providing insights into the convergence of NTK during initialization and training, and revealing its explicit structure. We find that, in most cases, the differential operators in the loss function do not induce a faster eigenvalue decay rate and stronger spectral bias. Some experimental results are also presented to verify the theory.
Chinese: 本研究将神经正切核理论扩展到具有物理信息损失的深度神经网络,揭示了微分算子通常不会增强谱偏置或加速特征值衰减,并通过实验验证了这一理论。
English: This study extends neural tangent kernel theory to deep neural networks with physics-informed loss, revealing that differential operators generally do not enhance spectral bias or accelerate eigenvalue decay, which is supported by experimental validation.

Authors:Yi Wu, Lingting Zhu, Lei Liu, Wandi Qiao, Ziqiang Li, Lequan Yu, Bin Li
Title: Proxy-Tuning: Tailoring Multimodal Autoregressive Models for Subject-Driven Image Generation
Abstract:
Multimodal autoregressive (AR) models, based on next-token prediction and transformer architecture, have demonstrated remarkable capabilities in various multimodal tasks including text-to-image (T2I) generation. Despite their strong performance in general T2I tasks, our research reveals that these models initially struggle with subject-driven image generation compared to dominant diffusion models. To address this limitation, we introduce Proxy-Tuning, leveraging diffusion models to enhance AR models' capabilities in subject-specific image generation. Our method reveals a striking weak-to-strong phenomenon: fine-tuned AR models consistently outperform their diffusion model supervisors in both subject fidelity and prompt adherence. We analyze this performance shift and identify scenarios where AR models excel, particularly in multi-subject compositions and contextual understanding. This work not only demonstrates impressive results in subject-driven AR image generation, but also unveils the potential of weak-to-strong generalization in the image generation domain, contributing to a deeper understanding of different architectures' strengths and limitations.
中文: 本研究提出Proxy-Tuning方法增强多模态自回归模型的主题驱动图像生成能力,揭示了弱到强的泛化现象——微调后的模型在主题保真度和提示遵循上超越扩散模型,尤其在多主题构图场景中表现突出。
English: This study introduces Proxy-Tuning to enhance multimodal autoregressive models' subject-driven image generation, revealing a weak-to-strong phenomenon where fine-tuned models surpass diffusion models in fidelity and prompt adherence, particularly in multi-subject compositions.

Authors:Siyang Zhang, Harry Yang, Ser-Nam Lim
Title: VideoMerge: Towards Training-free Long Video Generation
Abstract:
Long video generation remains a challenging and compelling topic in computer vision. Diffusion based models, among the various approaches to video generation, have achieved state of the art quality with their iterative denoising procedures. However, the intrinsic complexity of the video domain renders the training of such diffusion models exceedingly expensive in terms of both data curation and computational resources. Moreover, these models typically operate on a fixed noise tensor that represents the video, resulting in predetermined spatial and temporal dimensions. Although several high quality open-source pretrained video diffusion models, jointly trained on images and videos of varying lengths and resolutions, are available, it is generally not recommended to specify a video length at inference that was not included in the training set. Consequently, these models are not readily adaptable to the direct generation of longer videos by merely increasing the specified video length. In addition to feasibility challenges, long-video generation also encounters quality issues. The domain of long videos is inherently more complex than that of short videos: extended durations introduce greater variability and necessitate long-range temporal consistency, thereby increasing the overall difficulty of the task. We propose VideoMerge, a training-free method that can be seamlessly adapted to merge short videos generated by pretrained text-to-video diffusion model. Our approach preserves the model's original expressiveness and consistency while allowing for extended duration and dynamic variation as specified by the user. By leveraging the strengths of pretrained models, our method addresses challenges related to smoothness, consistency, and dynamic content through orthogonal strategies that operate collaboratively to achieve superior quality.
中文: 长视频生成因计算成本高和模型输出固定而面临挑战,但VideoMerge提供无需训练的方法,通过合并短视频实现更长、更一致的视频生成。
English: Long video generation is challenging due to high computational costs and fixed output dimensions in diffusion models, but VideoMerge offers a training-free solution to merge short videos for extended, consistent results.

Authors:Mohamed Elnoor, Kasun Weerakoon, Gershom Seneviratne, Jing Liang, Vignesh Rajagopal, Dinesh Manocha
Title: Vi-LAD: Vision-Language Attention Distillation for Socially-Aware Robot Navigation in Dynamic Environments
Abstract:
We introduce Vision-Language Attention Distillation (Vi-LAD), a novel approach for distilling socially compliant navigation knowledge from a large Vision-Language Model (VLM) into a lightweight transformer model for real-time robotic navigation. Unlike traditional methods that rely on expert demonstrations or human-annotated datasets, Vi-LAD performs knowledge distillation and fine-tuning at the intermediate layer representation level (i.e., attention maps) by leveraging the backbone of a pre-trained vision-action model. These attention maps highlight key navigational regions in a given scene, which serve as implicit guidance for socially aware motion planning. Vi-LAD fine-tunes a transformer-based model using intermediate attention maps extracted from the pre-trained vision-action model, combined with attention-like semantic maps constructed from a large VLM. To achieve this, we introduce a novel attention-level distillation loss that fuses knowledge from both sources, generating augmented attention maps with enhanced social awareness. These refined attention maps are then utilized as a traversability costmap within a socially aware model predictive controller (MPC) for navigation. We validate our approach through real-world experiments on a Husky wheeled robot, demonstrating significant improvements over state-of-the-art (SOTA) navigation methods. Our results show up to 14.2% - 50% improvement in success rate, which highlights the effectiveness of Vi-LAD in enabling socially compliant and efficient robot navigation.
Chinese: 我们提出视觉语言注意力蒸馏(Vi-LAD)方法,通过注意力图蒸馏将大型视觉语言模型中的社会导航知识迁移至轻量级Transformer,在真实机器人导航中实现了最高50%的成功率提升。
English: We propose Vision-Language Attention Distillation (Vi-LAD), a method that transfers social navigation knowledge from large vision-language models to lightweight transformers using attention map distillation, achieving up to 50% higher success rates in real-world robot navigation.

Authors:Harris K. Armeniakos, Petros S. Bithas, Sotiris A. Tegos, Athanasios G. Kanatas, George K. Karagiannidis
Title: Stochastic Geometry for Modeling and Analysis of Sensing and Communications: A Survey
Abstract:
One of the most promising technologies for next-generation wireless networks is integrated communication and sensing (ISAC). It is considered a key enabler for applications that require both enhanced communication and accurate sensing capabilities. Examples of such applications include smart environments, augmented and virtual reality, or the internet of things, where the capabilities of intelligent sensing and broadband communications are vital. Therefore, ISAC has attracted the research interest of both academia and industry, and many investigations have been carried out over the past decade. The articles in the literature include system models, performance evaluation, and optimization studies of several ISAC alternative designs. Stochastic geometry is the study and analysis of random spatial patterns, and as such, stochastic geometry tools have been considered for the performance evaluation of wireless networks with different types of nodes. In this paper, we aim to provide a comprehensive survey of current research progress in performance evaluation of ISAC systems using stochastic geometry tools. The survey covers terrestrial, aerial, and vehicular networks, where the random spatial location of the corresponding network elements and propagation scatterers and/or blockages is treated with various point processes. The paper starts with a short tutorial on ISAC technology, stochastic geometry tools, and metrics used in performance evaluation of communication and sensing. Then, the technical components of the system models utilized in the surveyed papers are discussed. Subsequently, we present the key results of the literature in all types of networks using three levels of integration: sensing-assisted communication, communication-assisted sensing, and joint sensing and communication. Finally, future research challenges and promising directions are discussed.
中文摘要:本文综述了利用随机几何工具评估集成通信与感知(ISAC)系统在陆地、空中和车载网络中的性能表现,涵盖不同集成层级并探讨了未来研究方向。
English Summary: This paper surveys the use of stochastic geometry tools for evaluating the performance of integrated communication and sensing (ISAC) systems across terrestrial, aerial, and vehicular networks, covering various integration levels and discussing future research directions.

Authors:Yue Wang, Qizhou Wang, Feng Liu, Wei Huang, Yali Du, Xiaojiang Du, Bo Han
Title: GRU: Mitigating the Trade-off between Unlearning and Retention for LLMs
Abstract:
Large language model (LLM) unlearning has demonstrated its essential role in removing privacy and copyright-related responses, crucial for their legal and safe applications. However, the pursuit of complete unlearning often comes with substantial costs due to its compromises in their general functionality, leading to a notorious trade-off between unlearning and retention. It motivates this paper to explore enhanced unlearning schemes that can mitigate this trade-off. Specifically, we propose Gradient Rectified Unlearning (GRU), an improved framework that regulates the directions of gradient updates during the unlearning procedure such that their side impacts on other, unrelated responses can be minimized. GRU is easy and general to implement, demonstrating practical effectiveness across a variety of well-established unlearning benchmarks.
Chinese: 大语言模型遗忘技术对清除敏感数据至关重要,但常以牺牲通用性能为代价,因此提出了梯度修正遗忘(GRU)框架,通过调控梯度更新方向来最小化副作用。
English: Large language model unlearning is crucial for removing sensitive data but often compromises general performance, prompting the development of Gradient Rectified Unlearning (GRU), a framework that minimizes side effects by regulating gradient updates during the process.

Authors:Asmaa Abdallah, Abdulkadir Celik, Ahmed Alkhateeb, Ahmed M. Eltawil
Title: Explainable Autoencoder Design for RSSI-Based Multi-User Beam Probing and Hybrid Precoding
Abstract:
This paper introduces a novel neural network (NN) structure referred to as an ``Auto-hybrid precoder'' (Auto-HP) and an unsupervised deep learning (DL) approach that jointly designs \ac{mmWave} probing beams and hybrid precoding matrix design for mmWave multi-user communication system with minimal training pilots. Our learning-based model capitalizes on prior channel observations to achieve two primary goals: designing a limited set of probing beams and predicting off-grid \ac{RF} beamforming vectors. The Auto-HP framework optimizes the probing beams in an unsupervised manner, concentrating the sensing power on the most promising spatial directions based on the surrounding environment. This is achieved through an innovative neural network architecture that respects \ac{RF} chain constraints and models received signal strength power measurements using complex-valued convolutional layers. Then, the autoencoder is trained to directly produce RF beamforming vectors for hybrid architectures, unconstrained by a predefined codebook, based on few projected received signal strength indicators (RSSIs). Finally, once the RF beamforming vectors for the multi-users are predicted, the baseband (BB) digital precoders are designed accounting for the multi-user interference. The Auto-HP neural network is trained end-to-end (E2E) in an unsupervised learning manner with a customized loss function that aims to maximizes the received signal strength. The adequacy of the Auto-HP NN's bottleneck layer dimension is evaluated from an information theory perspective, ensuring maximum data compression and reliable RF beam predictions.
本文提出了一种自动混合预编码器(Auto-HP)神经网络,通过无监督深度学习联合设计毫米波探测波束和混合预编码矩阵,利用环境感知和复值卷积层优化波束成形向量,在最少训练导频下实现信号强度最大化。
This paper presents an Auto-hybrid precoder (Auto-HP) neural network that uses unsupervised deep learning to jointly design mmWave probing beams and hybrid precoding matrices, optimizing beamforming vectors through environmental sensing and complex-valued convolutional layers while maximizing signal strength with minimal training pilots.

Authors:Minyue Dai, Ke Fan, Bin Ji, Haoran Xu, Haoyu Zhao, Junting Dong, Jingbo Wang, Bo Dai
Title: Towards Synthesized and Editable Motion In-Betweening Through Part-Wise Phase Representation
Abstract:
Styled motion in-betweening is crucial for computer animation and gaming. However, existing methods typically encode motion styles by modeling whole-body motions, often overlooking the representation of individual body parts. This limitation reduces the flexibility of infilled motion, particularly in adjusting the motion styles of specific limbs independently. To overcome this challenge, we propose a novel framework that models motion styles at the body-part level, enhancing both the diversity and controllability of infilled motions. Our approach enables more nuanced and expressive animations by allowing precise modifications to individual limb motions while maintaining overall motion coherence. Leveraging phase-related insights, our framework employs periodic autoencoders to automatically extract the phase of each body part, capturing distinctive local style features. Additionally, we effectively decouple the motion source from synthesis control by integrating motion manifold learning and conditional generation techniques from both image and motion domains. This allows the motion source to generate high-quality motions across various styles, with extracted motion and style features readily available for controlled synthesis in subsequent tasks. Comprehensive evaluations demonstrate that our method achieves superior speed, robust generalization, and effective generation of extended motion sequences.
中文: 本文提出了一种新颖框架,在身体部位层面建模运动风格,通过周期性自编码器捕捉局部风格特征,实现在保持整体协调的同时独立调整肢体动作,从而生成更具表现力的动画。
English: This paper introduces a novel framework that models motion styles at the body-part level using periodic autoencoders to capture local style features, enabling independent limb adjustments while maintaining overall coherence for more expressive animations.

Authors:Ethan Griffiths, Maryam Haghighat, Simon Denman, Clinton Fookes, Milad Ramezani
Title: HOTFormerLoc: Hierarchical Octree Transformer for Versatile Lidar Place Recognition Across Ground and Aerial Views
Abstract:
We present HOTFormerLoc, a novel and versatile Hierarchical Octree-based TransFormer, for large-scale 3D place recognition in both ground-to-ground and ground-to-aerial scenarios across urban and forest environments. We propose an octree-based multi-scale attention mechanism that captures spatial and semantic features across granularities. To address the variable density of point distributions from spinning lidar, we present cylindrical octree attention windows to reflect the underlying distribution during attention. We introduce relay tokens to enable efficient global-local interactions and multi-scale representation learning at reduced computational cost. Our pyramid attentional pooling then synthesises a robust global descriptor for end-to-end place recognition in challenging environments. In addition, we introduce CS-Wild-Places, a novel 3D cross-source dataset featuring point cloud data from aerial and ground lidar scans captured in dense forests. Point clouds in CS-Wild-Places contain representational gaps and distinctive attributes such as varying point densities and noise patterns, making it a challenging benchmark for cross-view localisation in the wild. HOTFormerLoc achieves a top-1 average recall improvement of 5.5% - 11.5% on the CS-Wild-Places benchmark. Furthermore, it consistently outperforms SOTA 3D place recognition methods, with an average performance gain of 4.9% on well-established urban and forest datasets. The code and CS-Wild-Places benchmark is available at https://csiro-robotics.github.io/HOTFormerLoc.
中文: HOTFormerLoc是一种基于分层八叉树的变换器,通过多尺度注意力机制和新型跨源数据集CS-Wild-Places,在多种环境下的地面与空中三维场景识别中实现了最先进的性能。
English: HOTFormerLoc is a hierarchical octree-based transformer that achieves state-of-the-art 3D place recognition across ground-to-ground and ground-to-aerial scenarios by leveraging multi-scale attention mechanisms and a novel cross-source dataset, CS-Wild-Places.

Authors:Jie Ying, Haowei Lin, Chao Yue, Yajie Chen, Chao Xiao, Quanqi Shi, Yitao Liang, Shing-Tung Yau, Yuan Zhou, Jianzhu Ma
Title: A Neural Symbolic Model for Space Physics
Abstract:
In this study, we unveil a new AI model, termed PhyE2E, to discover physical formulas through symbolic regression. PhyE2E simplifies symbolic regression by decomposing it into sub-problems using the second-order derivatives of an oracle neural network, and employs a transformer model to translate data into symbolic formulas in an end-to-end manner. The resulting formulas are refined through Monte-Carlo Tree Search and Genetic Programming. We leverage a large language model to synthesize extensive symbolic expressions resembling real physics, and train the model to recover these formulas directly from data. A comprehensive evaluation reveals that PhyE2E outperforms existing state-of-the-art approaches, delivering superior symbolic accuracy, precision in data fitting, and consistency in physical units. We deployed PhyE2E to five applications in space physics, including the prediction of sunspot numbers, solar rotational angular velocity, emission line contribution functions, near-Earth plasma pressure, and lunar-tide plasma signals. The physical formulas generated by AI demonstrate a high degree of accuracy in fitting the experimental data from satellites and astronomical telescopes. We have successfully upgraded the formula proposed by NASA in 1993 regarding solar activity, and for the first time, provided the explanations for the long cycle of solar activity in an explicit form. We also found that the decay of near-Earth plasma pressure is proportional to r^2 to Earth, where subsequent mathematical derivations are consistent with satellite data from another independent study. Moreover, we found physical formulas that can describe the relationships between emission lines in the extreme ultraviolet spectrum of the Sun, temperatures, electron densities, and magnetic fields. The formula obtained is consistent with the properties that physicists had previously hypothesized it should possess.
中文摘要:PhyE2E人工智能模型通过符号回归有效发现并优化物理公式,其性能超越现有方法,不仅成功升级了NASA 1993年的太阳活动公式,还为空间物理现象提供了新的解释。
English Summary: The PhyE2E AI model effectively discovers and refines physical formulas through symbolic regression, outperforming existing methods and successfully upgrading NASA's 1993 solar activity formula while providing new insights into space physics phenomena.

Authors:Zachary Ravichandran, Alexander Robey, Vijay Kumar, George J. Pappas, Hamed Hassani
Title: Safety Guardrails for LLM-Enabled Robots
Abstract:
Although the integration of large language models (LLMs) into robotics has unlocked transformative capabilities, it has also introduced significant safety concerns, ranging from average-case LLM errors (e.g., hallucinations) to adversarial jailbreaking attacks, which can produce harmful robot behavior in real-world settings. Traditional robot safety approaches do not address the novel vulnerabilities of LLMs, and current LLM safety guardrails overlook the physical risks posed by robots operating in dynamic real-world environments. In this paper, we propose RoboGuard, a two-stage guardrail architecture to ensure the safety of LLM-enabled robots. RoboGuard first contextualizes pre-defined safety rules by grounding them in the robot's environment using a root-of-trust LLM, which employs chain-of-thought (CoT) reasoning to generate rigorous safety specifications, such as temporal logic constraints. RoboGuard then resolves potential conflicts between these contextual safety specifications and a possibly unsafe plan using temporal logic control synthesis, which ensures safety compliance while minimally violating user preferences. Through extensive simulation and real-world experiments that consider worst-case jailbreaking attacks, we demonstrate that RoboGuard reduces the execution of unsafe plans from 92% to below 2.5% without compromising performance on safe plans. We also demonstrate that RoboGuard is resource-efficient, robust against adaptive attacks, and significantly enhanced by enabling its root-of-trust LLM to perform CoT reasoning. These results underscore the potential of RoboGuard to mitigate the safety risks and enhance the reliability of LLM-enabled robots.
中文: 大语言模型在机器人领域的应用带来了安全隐患,而RoboGuard系统通过情境化安全规则和时序逻辑,将不安全计划执行率从92%降至2.5%以下,同时保持性能。
English: The integration of LLMs in robotics raises safety risks, addressed by the proposed RoboGuard system, which uses contextualized safety rules and temporal logic to reduce unsafe plan execution from 92% to under 2.5% while maintaining performance.

Authors:James Burgess, Xiaohan Wang, Yuhui Zhang, Anita Rau, Alejandro Lozano, Lisa Dunlap, Trevor Darrell, Serena Yeung-Levy
Title: Video Action Differencing
Abstract:
How do two individuals differ when performing the same action? In this work, we introduce Video Action Differencing (VidDiff), the novel task of identifying subtle differences between videos of the same action, which has many applications, such as coaching and skill learning. To enable development on this new task, we first create VidDiffBench, a benchmark dataset containing 549 video pairs, with human annotations of 4,469 fine-grained action differences and 2,075 localization timestamps indicating where these differences occur. Our experiments demonstrate that VidDiffBench poses a significant challenge for state-of-the-art large multimodal models (LMMs), such as GPT-4o and Qwen2-VL. By analyzing failure cases of LMMs on VidDiffBench, we highlight two key challenges for this task: localizing relevant sub-actions over two videos and fine-grained frame comparison. To overcome these, we propose the VidDiff method, an agentic workflow that breaks the task into three stages: action difference proposal, keyframe localization, and frame differencing, each stage utilizing specialized foundation models. To encourage future research in this new task, we release the benchmark at https://huggingface.co/datasets/jmhb/VidDiffBench and code at http://jmhb0.github.io/viddiff.
中文: 本文提出了VidDiff任务,用于识别相同动作视频间的细微差异,并创建了VidDiffBench基准数据集,该数据集对现有先进模型构成挑战;同时设计了三阶段智能工作流方法,以解决动作定位和细粒度帧比较两大核心难题。
English: This paper introduces VidDiff, a novel task for identifying subtle differences between videos of the same action, and proposes VidDiffBench, a benchmark dataset that challenges state-of-the-art models, along with a three-stage agentic workflow method to address localization and fine-grained comparison difficulties.

Authors:Konstantinos Vergopoulos, Mark Niklas Müller, Martin Vechev
Title: Automated Benchmark Generation for Repository-Level Coding Tasks
Abstract:
Code Agent development is an extremely active research area, where a reliable performance metric is critical for tracking progress and guiding new developments. This demand is underscored by the meteoric rise in popularity of SWE-Bench. This benchmark challenges code agents to generate patches addressing GitHub issues given the full repository as context. The correctness of generated patches is then evaluated by executing a human-written test suite extracted from the repository after the issue's resolution. However, constructing benchmarks like SWE-Bench requires substantial manual effort to set up historically accurate execution environments for testing. Crucially, this severely limits the number of considered repositories, e.g., just 12 for SWE-Bench. Considering so few repositories, selected for their popularity runs the risk of leading to a distributional mismatch, i.e., the measured performance may not be representative of real-world scenarios potentially misguiding development efforts. In this work, we address this challenge and introduce SetUpAgent, a fully automated system capable of historically accurate dependency setup, test execution, and result parsing. Using SetUpAgent, we generate two new datasets: (i) SWEE-Bench an extended version of SWE-Bench encompassing hundreds of repositories, and (ii) SWA-Bench a benchmark focusing on applications rather than libraries. Comparing these datasets to SWE-Bench with respect to their characteristics and code agent performance, we find significant distributional differences, including lower issue description quality and detail level, higher fix complexity, and most importantly up to 40% lower agent success rates.
中文:代码代理研究中可靠的性能评估至关重要,但当前基准(如SWE-Bench)因人工设置限制存在规模与代表性不足的问题,可能导致评估结果与现实场景出现分布偏差。
English: The development of reliable performance metrics is crucial for code agent research, but current benchmarks like SWE-Bench face limitations in scale and representativeness due to manual setup constraints, leading to potential distributional mismatches with real-world scenarios.

Authors:Tianhe Lin, Jian Xie, Siyu Yuan, Deqing Yang
Title: Implicit Reasoning in Transformers is Reasoning through Shortcuts
Abstract:
Test-time compute is emerging as a new paradigm for enhancing language models' complex multi-step reasoning capabilities, as demonstrated by the success of OpenAI's o1 and o3, as well as DeepSeek's R1. Compared to explicit reasoning in test-time compute, implicit reasoning is more inference-efficient, requiring fewer generated tokens. However, why does the advanced reasoning capability fail to emerge in the implicit reasoning style? In this work, we train GPT-2 from scratch on a curated multi-step mathematical reasoning dataset and conduct analytical experiments to investigate how language models perform implicit reasoning in multi-step tasks. Our findings reveal: 1) Language models can perform step-by-step reasoning and achieve high accuracy in both in-domain and out-of-domain tests via implicit reasoning. However, this capability only emerges when trained on fixed-pattern data. 2) Conversely, implicit reasoning abilities emerging from training on unfixed-pattern data tend to overfit a specific pattern and fail to generalize further. Notably, this limitation is also observed in state-of-the-art large language models. These findings suggest that language models acquire implicit reasoning through shortcut learning, enabling strong performance on tasks with similar patterns while lacking generalization.
Chinese: 测试时计算通过隐式推理提高语言模型效率,但其泛化能力受限,因模型从固定模式数据中习得捷径学习,难以适应新任务模式。
English: Test-time compute enhances language models' reasoning, with implicit reasoning being more efficient but limited in generalization due to shortcut learning from fixed-pattern training data.

Authors:Yangzhe Kong, Daeun Song, Jing Liang, Dinesh Manocha, Ziyu Yao, Xuesu Xiao
Title: AutoSpatial: Visual-Language Reasoning for Social Robot Navigation through Efficient Spatial Reasoning Learning
Abstract:
We present a novel method, AutoSpatial, an efficient approach with structured spatial grounding to enhance VLMs' spatial reasoning. By combining minimal manual supervision with large-scale Visual Question-Answering (VQA) pairs auto-labeling, our approach tackles the challenge of VLMs' limited spatial understanding in social navigation tasks. By applying a hierarchical two-round VQA strategy during training, AutoSpatial achieves both global and detailed understanding of scenarios, demonstrating more accurate spatial perception, movement prediction, Chain of Thought (CoT) reasoning, final action, and explanation compared to other SOTA approaches. These five components are essential for comprehensive social navigation reasoning. Our approach was evaluated using both expert systems (GPT-4o, Gemini 2.0 Flash, and Claude 3.5 Sonnet) that provided cross-validation scores and human evaluators who assigned relative rankings to compare model performances across four key aspects. Augmented by the enhanced spatial reasoning capabilities, AutoSpatial demonstrates substantial improvements by averaged cross-validation score from expert systems in: perception & prediction (up to 10.71%), reasoning (up to 16.26%), action (up to 20.50%), and explanation (up to 18.73%) compared to baseline models trained only on manually annotated data.
中文: AutoSpatial提出了一种结合少量人工监督与自动标注VQA数据的高效方法,显著提升了视觉语言模型的空间推理能力,在社交导航关键组件上相比基线模型最高提升20.50%。
English: AutoSpatial introduces an efficient method combining minimal manual supervision with auto-labeled VQA pairs to significantly enhance VLMs' spatial reasoning, achieving up to 20.50% improvement in key social navigation components over baseline models.

Authors:Yingzhe Peng, Gongrui Zhang, Miaosen Zhang, Zhiyuan You, Jie Liu, Qipeng Zhu, Kai Yang, Xingzhong Xu, Xin Geng, Xu Yang
Title: LMM-R1: Empowering 3B LMMs with Strong Reasoning Abilities Through Two-Stage Rule-Based RL
Abstract:
Enhancing reasoning in Large Multimodal Models (LMMs) faces unique challenges from the complex interplay between visual perception and logical reasoning, particularly in compact 3B-parameter architectures where architectural constraints limit reasoning capacity and modality alignment. While rule-based reinforcement learning (RL) excels in text-only domains, its multimodal extension confronts two critical barriers: (1) data limitations due to ambiguous answers and scarce complex reasoning examples, and (2) degraded foundational reasoning induced by multimodal pretraining. To address these challenges, we propose \textbf{LMM-R1}, a two-stage framework adapting rule-based RL for multimodal reasoning through \textbf{Foundational Reasoning Enhancement (FRE)} followed by \textbf{Multimodal Generalization Training (MGT)}. The FRE stage first strengthens reasoning abilities using text-only data with rule-based RL, then the MGT stage generalizes these reasoning capabilities to multimodal domains. Experiments on Qwen2.5-VL-Instruct-3B demonstrate that LMM-R1 achieves 4.83\% and 4.5\% average improvements over baselines in multimodal and text-only benchmarks, respectively, with a 3.63\% gain in complex Football Game tasks. These results validate that text-based reasoning enhancement enables effective multimodal generalization, offering a data-efficient paradigm that bypasses costly high-quality multimodal training data.
中文:LMM-R1框架通过先用基于规则的强化学习增强文本推理能力,再将其推广到多模态领域,有效提升了紧凑模型的多模态推理性能,并在无需大量多模态训练数据的情况下实现了显著改进。
English: The LMM-R1 framework enhances multimodal reasoning in compact models by first strengthening text-based reasoning with rule-based reinforcement learning and then generalizing these skills to visual domains, achieving notable performance improvements without requiring extensive multimodal training data.

Authors:Albert Gassol Puigjaner, Manish Prajapat, Andrea Carron, Andreas Krause, Melanie N. Zeilinger
Title: Performance-driven Constrained Optimal Auto-Tuner for MPC
Abstract:
A key challenge in tuning Model Predictive Control (MPC) cost function parameters is to ensure that the system performance stays consistently above a certain threshold. To address this challenge, we propose a novel method, COAT-MPC, Constrained Optimal Auto-Tuner for MPC. With every tuning iteration, COAT-MPC gathers performance data and learns by updating its posterior belief. It explores the tuning parameters' domain towards optimistic parameters in a goal-directed fashion, which is key to its sample efficiency. We theoretically analyze COAT-MPC, showing that it satisfies performance constraints with arbitrarily high probability at all times and provably converges to the optimum performance within finite time. Through comprehensive simulations and comparative analyses with a hardware platform, we demonstrate the effectiveness of COAT-MPC in comparison to classical Bayesian Optimization (BO) and other state-of-the-art methods. When applied to autonomous racing, our approach outperforms baselines in terms of constraint violations and cumulative regret over time.
中文: COAT-MPC是一种新颖的自动调参方法,通过定向探索和迭代学习,以高概率保证系统性能约束并收敛至最优性能,在自动驾驶赛车等应用中展现出比传统方法更优越的约束遵循和累积遗憾表现。
English: COAT-MPC is a novel auto-tuning method that efficiently learns optimal parameters for Model Predictive Control, ensuring high-probability performance constraints and convergence to optimum performance with superior results in simulations and hardware applications like autonomous racing.

Authors:Yuting Hu, Chenhui Xu, Ruiyang Qin, Dancheng Liu, Amir Nassereldine, Yiyu Shi, Jinjun Xiong
Title: Combating Partial Perception Deficit in Autonomous Driving with Multimodal LLM Commonsense
Abstract:
Partial perception deficits can compromise autonomous vehicle safety by disrupting environmental understanding. Current protocols typically respond with immediate stops or minimal-risk maneuvers, worsening traffic flow and lacking flexibility for rare driving scenarios. In this paper, we propose LLM-RCO, a framework leveraging large language models to integrate human-like driving commonsense into autonomous systems facing perception deficits. LLM-RCO features four key modules: hazard inference, short-term motion planner, action condition verifier, and safety constraint generator. These modules interact with the dynamic driving environment, enabling proactive and context-aware control actions to override the original control policy of autonomous agents. To improve safety in such challenging conditions, we construct DriveLM-Deficit, a dataset of 53,895 video clips featuring deficits of safety-critical objects, complete with annotations for LLM-based hazard inference and motion planning fine-tuning. Extensive experiments in adverse driving conditions with the CARLA simulator demonstrate that systems equipped with LLM-RCO significantly improve driving performance, highlighting its potential for enhancing autonomous driving resilience against adverse perception deficits. Our results also show that LLMs fine-tuned with DriveLM-Deficit can enable more proactive movements instead of conservative stops in the context of perception deficits.
Chinese: 本文提出LLM-RCO框架,利用大语言模型将类人驾驶常识融入面临感知缺陷的自动驾驶系统,通过主动的情境感知控制策略显著提升了恶劣条件下的驾驶性能。
English: This paper introduces LLM-RCO, a framework that uses large language models to integrate human-like commonsense into autonomous vehicles facing perception deficits, enabling proactive and context-aware control actions that significantly improve driving performance in adverse conditions.

Authors:Shihao Hou, Xinyi Shang, Shreyank N Gowda, Yang Lu, Chao Wu, Yan Yan, Hanzi Wang
Title: CAPT: Class-Aware Prompt Tuning for Federated Long-Tailed Learning with Vision-Language Model
Abstract:
Effectively handling the co-occurrence of non-IID data and long-tailed distributions remains a critical challenge in federated learning. While fine-tuning vision-language models (VLMs) like CLIP has shown to be promising in addressing non-IID data challenges, this approach leads to severe degradation of tail classes in federated long-tailed scenarios. Under the composite effects of strong non-IID data distribution and long-tailed class imbalances, VLM fine-tuning may even fail to yield any improvement. To address this issue, we propose Class-Aware Prompt Learning for Federated Long-tailed Learning (CAPT), a novel framework that leverages a pre-trained VLM to effectively handle both data heterogeneity and long-tailed distributions. CAPT introduces a dual-prompt mechanism that synergizes general and class-aware prompts, enabling the framework to capture global trends while preserving class-specific knowledge. To better aggregate and share knowledge across clients, we introduce a heterogeneity-aware client clustering strategy that groups clients based on their data distributions, enabling efficient collaboration and knowledge sharing. Extensive experiments on various long-tailed datasets with different levels of data heterogeneity demonstrate that CAPT significantly improves tail class performance without compromising overall accuracy, outperforming state-of-the-art methods in federated long-tailed learning scenarios.
中文: 提出的CAPT框架通过双提示机制和异质性感知的客户端聚类策略,有效解决了联邦学习中非独立同分布数据和长尾分布的双重挑战,显著提升了尾部类别性能而不影响整体准确率。
English: The proposed CAPT framework effectively addresses the dual challenges of non-IID data and long-tailed distributions in federated learning by introducing a dual-prompt mechanism and heterogeneity-aware client clustering, significantly improving tail class performance without compromising overall accuracy.

Authors:Shanshan Yan, Zexi Li, Chao Wu, Meng Pang, Yang Lu, Yan Yan, Hanzi Wang
Title: You Are Your Own Best Teacher: Achieving Centralized-level Performance in Federated Learning under Heterogeneous and Long-tailed Data
Abstract:
Data heterogeneity, stemming from local non-IID data and global long-tailed distributions, is a major challenge in federated learning (FL), leading to significant performance gaps compared to centralized learning. Previous research found that poor representations and biased classifiers are the main problems and proposed neural-collapse-inspired synthetic simplex ETF to help representations be closer to neural collapse optima. However, we find that the neural-collapse-inspired methods are not strong enough to reach neural collapse and still have huge gaps to centralized training. In this paper, we rethink this issue from a self-bootstrap perspective and propose FedYoYo (You Are Your Own Best Teacher), introducing Augmented Self-bootstrap Distillation (ASD) to improve representation learning by distilling knowledge between weakly and strongly augmented local samples, without needing extra datasets or models. We further introduce Distribution-aware Logit Adjustment (DLA) to balance the self-bootstrap process and correct biased feature representations. FedYoYo nearly eliminates the performance gap, achieving centralized-level performance even under mixed heterogeneity. It enhances local representation learning, reducing model drift and improving convergence, with feature prototypes closer to neural collapse optimality. Extensive experiments show FedYoYo achieves state-of-the-art results, even surpassing centralized logit adjustment methods by 5.4\% under global long-tailed settings.
中文摘要:FedYoYo通过引入增强自引导蒸馏和分布感知逻辑调整,在不依赖额外资源的情况下提升本地表征学习并修正偏差特征,几乎消除了联邦学习与集中式学习之间的性能差距。
English Summary: FedYoYo introduces Augmented Self-bootstrap Distillation and Distribution-aware Logit Adjustment to nearly eliminate federated learning's performance gap with centralized learning by enhancing local representation learning and correcting biased features without requiring extra resources.

Authors:Ike Obi, Vishnunandan L. N. Venkatesh, Weizheng Wang, Ruiqi Wang, Dayoon Suh, Temitope I. Amosa, Wonse Jo, Byung-Cheol Min
Title: SafePlan: Leveraging Formal Logic and Chain-of-Thought Reasoning for Enhanced Safety in LLM-based Robotic Task Planning
Abstract:
Robotics researchers increasingly leverage large language models (LLM) in robotics systems, using them as interfaces to receive task commands, generate task plans, form team coalitions, and allocate tasks among multi-robot and human agents. However, despite their benefits, the growing adoption of LLM in robotics has raised several safety concerns, particularly regarding executing malicious or unsafe natural language prompts. In addition, ensuring that task plans, team formation, and task allocation outputs from LLMs are adequately examined, refined, or rejected is crucial for maintaining system integrity. In this paper, we introduce SafePlan, a multi-component framework that combines formal logic and chain-of-thought reasoners for enhancing the safety of LLM-based robotics systems. Using the components of SafePlan, including Prompt Sanity COT Reasoner and Invariant, Precondition, and Postcondition COT reasoners, we examined the safety of natural language task prompts, task plans, and task allocation outputs generated by LLM-based robotic systems as means of investigating and enhancing system safety profile. Our results show that SafePlan outperforms baseline models by leading to 90.5% reduction in harmful task prompt acceptance while still maintaining reasonable acceptance of safe tasks.
Chinese: SafePlan框架通过结合形式逻辑和思维链推理器,显著提升了基于大语言模型的机器人系统的安全性,将有害任务提示的接受率降低了90.5%,同时保持了对安全任务的合理接受水平。
English: SafePlan is a framework that enhances the safety of LLM-based robotics systems by using formal logic and chain-of-thought reasoners to scrutinize and reduce harmful task prompt acceptance by 90.5% while maintaining safe task performance.

Authors:Ashwinee Panda, Xinyu Tang, Milad Nasr, Christopher A. Choquette-Choo, Prateek Mittal
Title: Privacy Auditing of Large Language Models
Abstract:
Current techniques for privacy auditing of large language models (LLMs) have limited efficacy -- they rely on basic approaches to generate canaries which leads to weak membership inference attacks that in turn give loose lower bounds on the empirical privacy leakage. We develop canaries that are far more effective than those used in prior work under threat models that cover a range of realistic settings. We demonstrate through extensive experiments on multiple families of fine-tuned LLMs that our approach sets a new standard for detection of privacy leakage. For measuring the memorization rate of non-privately trained LLMs, our designed canaries surpass prior approaches. For example, on the Qwen2.5-0.5B model, our designed canaries achieve $49.6\%$ TPR at $1\%$ FPR, vastly surpassing the prior approach's $4.2\%$ TPR at $1\%$ FPR. Our method can be used to provide a privacy audit of $\varepsilon \approx 1$ for a model trained with theoretical $\varepsilon$ of 4. To the best of our knowledge, this is the first time that a privacy audit of LLM training has achieved nontrivial auditing success in the setting where the attacker cannot train shadow models, insert gradient canaries, or access the model at every iteration.
Chinese: 本研究开发了高效能的测试标记,大幅提升大型语言模型的隐私审计效果,在无需攻击者具备广泛能力的情况下,实现了前所未有的隐私泄露检测率,并在实际威胁模型中展示了切实可行的审计成果。
English: This study introduces highly effective canaries that significantly improve privacy auditing for large language models, achieving unprecedented detection rates and demonstrating practical auditing success under realistic threat models without requiring extensive attacker capabilities.

Authors:Jing Zhang, Zhikai Li, Qingyi Gu
Title: SAQ-SAM: Semantically-Aligned Quantization for Segment Anything Model
Abstract:
Segment Anything Model (SAM) exhibits remarkable zero-shot segmentation capability; however, its prohibitive computational costs make edge deployment challenging. Although post-training quantization (PTQ) offers a promising compression solution, existing methods yield unsatisfactory results when applied to SAM, owing to its specialized model components and promptable workflow: (i) The mask decoder's attention exhibits extreme outliers, and we find that aggressive clipping (ranging down to even 100$\times$), instead of smoothing or isolation, is effective in suppressing outliers while maintaining semantic capabilities. Unfortunately, traditional metrics (e.g., MSE) fail to provide such large-scale clipping. (ii) Existing reconstruction methods potentially neglect prompts' intention, resulting in distorted visual encodings during prompt interactions. To address the above issues, we propose SAQ-SAM in this paper, which boosts PTQ of SAM with semantic alignment. Specifically, we propose Perceptual-Consistency Clipping, which exploits attention focus overlap as clipping metric, to significantly suppress outliers. Furthermore, we propose Prompt-Aware Reconstruction, which incorporates visual-prompt interactions by leveraging cross-attention responses in mask decoder, thus facilitating alignment in both distribution and semantics. To ensure the interaction efficiency, we also introduce a layer-skipping strategy for visual tokens. Extensive experiments are conducted on different segmentation tasks and SAMs of various sizes, and the results show that the proposed SAQ-SAM consistently outperforms baselines. For example, when quantizing SAM-B to 4-bit, our method achieves 11.7% higher mAP than the baseline in instance segmentation task.
SAM模型在边缘部署中存在计算成本高的问题,而提出的SAQ-SAM方法通过感知一致性剪裁和提示感知重建来优化后训练量化,有效保持了分割精度。
The Segment Anything Model (SAM) faces computational challenges for edge deployment, but the proposed SAQ-SAM method enhances post-training quantization by using perceptual-consistent clipping and prompt-aware reconstruction to maintain segmentation accuracy efficiently.

Authors:Chuheng Wei, Ziye Qin, Siyan Li, Ziyan Zhang, Xuanpeng Zhao, Amr Abdelraouf, Rohit Gupta, Kyungtae Han, Matthew J. Barth, Guoyuan Wu
Title: PDB: Not All Drivers Are the Same -- A Personalized Dataset for Understanding Driving Behavior
Abstract:
Driving behavior is inherently personal, influenced by individual habits, decision-making styles, and physiological states. However, most existing datasets treat all drivers as homogeneous, overlooking driver-specific variability. To address this gap, we introduce the Personalized Driving Behavior (PDB) dataset, a multi-modal dataset designed to capture personalization in driving behavior under naturalistic driving conditions. Unlike conventional datasets, PDB minimizes external influences by maintaining consistent routes, vehicles, and lighting conditions across sessions. It includes sources from 128-line LiDAR, front-facing camera video, GNSS, 9-axis IMU, CAN bus data (throttle, brake, steering angle), and driver-specific signals such as facial video and heart rate. The dataset features 12 participants, approximately 270,000 LiDAR frames, 1.6 million images, and 6.6 TB of raw sensor data. The processed trajectory dataset consists of 1,669 segments, each spanning 10 seconds with a 0.2-second interval. By explicitly capturing drivers' behavior, PDB serves as a unique resource for human factor analysis, driver identification, and personalized mobility applications, contributing to the development of human-centric intelligent transportation systems.
中文: PDB数据集通过采集受控条件下的多模态驾驶数据,弥补了现有数据忽视驾驶员个体差异的不足,为以人为本的智能交通系统研究提供了独特资源。
English: The Personalized Driving Behavior (PDB) dataset addresses the lack of driver-specific data by capturing multi-modal information under controlled conditions, serving as a key resource for human-centric transportation research.

Authors:Xiaohong Yang, Minghui Liwang, Liqun Fu, Yuhan Su, Seyyedali Hosseinalipour, Xianbin Wang, Yiguang Hong
Title: Adaptive UAV-Assisted Hierarchical Federated Learning: Optimizing Energy, Latency, and Resilience for Dynamic Smart IoT
Abstract:
A key application of HFL lies in smart Internet of Things (IoT) systems, including remote monitoring and battlefield operations, where cellular connectivity is often unavailable. In such scenarios, UAVs can act as mobile aggregators, dynamically providing connectivity to terrestrial IoT devices. Subsequently, this paper investigates an HFL architecture enabled by energy-constrained, dynamically deployed UAVs that are susceptible to communication disruptions. We propose a novel approach to minimize global training costs in such environments by formulating a joint optimization problem that integrates learning configuration, bandwidth allocation, and IoT device-to-UAV association, ensuring timely global aggregation before UAV disconnections and redeployments. The problem explicitly captures the dynamic nature of IoT devices and their intermittent connectivity to UAVs and is shown to be NP-hard. To address its complexity, we decompose the problem into three interrelated subproblems. First, we optimize learning configuration and bandwidth allocation using an augmented Lagrangian function to reduce training costs. Second, we introduce a device fitness score that accounts for data heterogeneity (via Kullback-Leibler divergence), device-to-UAV proximity, and computational resources, leveraging a Twin Delayed Deep Deterministic Policy Gradient (TD3)-based algorithm for adaptive device-to-UAV assignment. Third, we develop a low-complexity two-stage greedy strategy for UAV redeployment and global aggregator selection, ensuring efficient model aggregation despite UAV disconnections.
中文摘要:本文提出了一种在连接不可靠的物联网系统中使用无人机作为移动聚合器的分层联邦学习架构,通过分解学习配置、带宽分配和设备关联三个子问题,解决了最小化训练成本的NP难优化问题。
English Summary: This paper proposes a hierarchical federated learning architecture using UAVs as mobile aggregators in IoT systems with unreliable connectivity, addressing the NP-hard optimization problem of minimizing training costs through decomposed subproblems involving learning configuration, bandwidth allocation, and device association.

Authors:Panatchakorn Anantaprayoon, Masahiro Kaneko, Naoaki Okazaki
Title: Intent-Aware Self-Correction for Mitigating Social Biases in Large Language Models
Abstract:
Self-Correction based on feedback improves the output quality of Large Language Models (LLMs). Moreover, as Self-Correction functions like the slow and conscious System-2 thinking from cognitive psychology's perspective, it can potentially reduce LLMs' social biases. LLMs are sensitive to contextual ambiguities and inconsistencies; therefore, explicitly communicating their intentions during interactions when applying Self-Correction for debiasing is crucial. In this study, we demonstrate that clarifying intentions is essential for effectively reducing biases in LLMs through Self-Correction. We divide the components needed for Self-Correction into three parts: instruction, response, and feedback, and clarify intentions at each component. We incorporate an explicit debiasing prompt to convey the intention of bias mitigation from the instruction for response generation. In the response, we use Chain-of-Thought (CoT) to clarify the reasoning process. In the feedback, we define evaluation aspects necessary for debiasing and propose clear feedback through multi-aspect critiques and scoring. Through experiments, we demonstrate that self-correcting CoT responses obtained from a debiasing prompt based on multi-aspect feedback can reduce biased responses more robustly and consistently than the baselines. We also find the variation in debiasing efficacy when using models with different bias levels or separating models for response and feedback generation.
中文: 通过包含明确去偏意图的结构化提示、思维链推理和多维度反馈,大语言模型的自我纠正机制能有效减少社会偏见,并比基准方法更稳健地提升输出一致性。
English: Self-correction in large language models, incorporating explicit debiasing intentions through structured prompts, chain-of-thought reasoning, and multi-aspect feedback, effectively reduces social biases and enhances output consistency compared to baseline methods.

Authors:Zekai Liang, Zih-Yun Chiu, Florian Richter, Michael C. Yip
Title: Differentiable Rendering-based Pose Estimation for Surgical Robotic Instruments
Abstract:
Robot pose estimation is a challenging and crucial task for vision-based surgical robotic automation. Typical robotic calibration approaches, however, are not applicable to surgical robots, such as the da Vinci Research Kit (dVRK), due to joint angle measurement errors from cable-drives and the partially visible kinematic chain. Hence, previous works in surgical robotic automation used tracking algorithms to estimate the pose of the surgical tool in real-time and compensate for the joint angle errors. However, a big limitation of these previous tracking works is the initialization step which relied on only keypoints and SolvePnP. In this work, we fully explore the potential of geometric primitives beyond just keypoints with differentiable rendering, cylinders, and construct a versatile pose matching pipeline in a novel pose hypothesis space. We demonstrate the state-of-the-art performance of our single-shot calibration method with both calibration consistency and real surgical tasks. As a result, this marker-less calibration approach proves to be a robust and generalizable initialization step for surgical tool tracking.
Chinese: 本研究提出了一种无标记标定方法,通过几何基元和可微分渲染技术实现手术机器人姿态估计的最新性能,有效解决了以往跟踪方法在初始化环节的局限性。
English: This work introduces a marker-less calibration method for surgical robots that utilizes geometric primitives and differentiable rendering to achieve state-of-the-art pose estimation, overcoming initialization limitations of previous tracking approaches.

Authors:A. Quadir, M. Tanveer
Title: Randomized based restricted kernel machine for hyperspectral image classification
Abstract:
In recent years, the random vector functional link (RVFL) network has gained significant popularity in hyperspectral image (HSI) classification due to its simplicity, speed, and strong generalization performance. However, despite these advantages, RVFL models face several limitations, particularly in handling non-linear relationships and complex data structures. The random initialization of input-to-hidden weights can lead to instability, and the model struggles with determining the optimal number of hidden nodes, affecting its performance on more challenging datasets. To address these issues, we propose a novel randomized based restricted kernel machine ($R^2KM$) model that combines the strehyperngths of RVFL and restricted kernel machines (RKM). $R^2KM$ introduces a layered structure that represents kernel methods using both visible and hidden variables, analogous to the energy function in restricted Boltzmann machines (RBM). This structure enables $R^2KM$ to capture complex data interactions and non-linear relationships more effectively, improving both interpretability and model robustness. A key contribution of $R^2KM$ is the introduction of a novel conjugate feature duality based on the Fenchel-Young inequality, which expresses the problem in terms of conjugate dual variables and provides an upper bound on the objective function. This duality enhances the model's flexibility and scalability, offering a more efficient and flexible solution for complex data analysis tasks. Extensive experiments on hyperspectral image datasets and real-world data from the UCI and KEEL repositories show that $R^2KM$ outperforms baseline models, demonstrating its effectiveness in classification and regression tasks.
Chinese: 提出的$R^2KM$模型结合了随机向量函数链接网络和受限核机器的优势,通过引入分层结构和共轭特征对偶性,有效提升了处理复杂数据和非线性关系的能力,在分类和回归任务中表现出优于基线模型的性能。
English: The proposed $R^2KM$ model integrates the strengths of RVFL networks and restricted kernel machines to overcome limitations in handling complex data structures, demonstrating superior performance in hyperspectral image classification and real-world datasets through enhanced interpretability and robustness.

Authors:Nenad Petrovic, Fengjunjie Pan, Vahid Zolfaghari, Alois Knoll
Title: LLM-based Iterative Approach to Metamodeling in Automotive
Abstract:
In this paper, we introduce an automated approach to domain-specific metamodel construction relying on Large Language Model (LLM). The main focus is adoption in automotive domain. As outcome, a prototype was implemented as web service using Python programming language, while OpenAI's GPT-4o was used as the underlying LLM. Based on the initial experiments, this approach successfully constructs Ecore metamodel based on set of automotive requirements and visualizes it making use of PlantUML notation, so human experts can provide feedback in order to refine the result. Finally, locally deployable solution is also considered, including the limitations and additional steps required.
中文摘要:本文提出一种利用大型语言模型自动构建领域特定元模型的方法,针对汽车领域开发了基于GPT-4o的网页服务原型,能根据需求生成Ecore元模型并通过PlantUML可视化,供专家评审完善,同时探讨了本地化部署方案及其限制条件。
English Summary: This paper presents an automated method using Large Language Models (LLMs) to construct domain-specific metamodels in the automotive sector, implementing a web-based prototype with GPT-4o that generates Ecore metamodels from requirements and visualizes them via PlantUML for expert feedback.

Authors:Guanghao Zhang, Tao Zhong, Yan Xia, Zhelun Yu, Haoyuan Li, Wanggui He, Fangxun Shu, Mushui Liu, Dong She, Yi Wang, Hao Jiang
Title: CMMCoT: Enhancing Complex Multi-Image Comprehension via Multi-Modal Chain-of-Thought and Memory Augmentation
Abstract:
While previous multimodal slow-thinking methods have demonstrated remarkable success in single-image understanding scenarios, their effectiveness becomes fundamentally constrained when extended to more complex multi-image comprehension tasks. This limitation stems from their predominant reliance on text-based intermediate reasoning processes. While for human, when engaging in sophisticated multi-image analysis, they typically perform two complementary cognitive operations: (1) continuous cross-image visual comparison through region-of-interest matching, and (2) dynamic memorization of critical visual concepts throughout the reasoning chain. Motivated by these observations, we propose the Complex Multi-Modal Chain-of-Thought (CMMCoT) framework, a multi-step reasoning framework that mimics human-like "slow thinking" for multi-image understanding. Our approach incorporates two key innovations: 1. The construction of interleaved multimodal multi-step reasoning chains, which utilize critical visual region tokens, extracted from intermediate reasoning steps, as supervisory signals. This mechanism not only facilitates comprehensive cross-modal understanding but also enhances model interpretability. 2. The introduction of a test-time memory augmentation module that expands the model reasoning capacity during inference while preserving parameter efficiency. Furthermore, to facilitate research in this direction, we have curated a novel multi-image slow-thinking dataset. Extensive experiments demonstrate the effectiveness of our model.
中文: 提出的复杂多模态思维链框架通过模拟人类慢思考的跨图像视觉比较和动态记忆机制,引入交错推理链和记忆增强模块,有效解决了多图像理解任务中的现有局限性。
English: The proposed Complex Multi-Modal Chain-of-Thought (CMMCoT) framework addresses limitations in multi-image understanding by mimicking human slow-thinking through cross-image visual comparisons and dynamic memorization, introducing interleaved reasoning chains and a memory module to enhance performance.

Authors:Mehmet Kardan, Bhawna Piryani, Adam Jatowt
Title: Evaluating Answer Reranking Strategies in Time-sensitive Question Answering
Abstract:
Despite advancements in state-of-the-art models and information retrieval techniques, current systems still struggle to handle temporal information and to correctly answer detailed questions about past events. In this paper, we investigate the impact of temporal characteristics of answers in Question Answering (QA) by exploring several simple answer selection techniques. Our findings emphasize the role of temporal features in selecting the most relevant answers from diachronic document collections and highlight differences between explicit and implicit temporal questions.
中文: 本研究通过评估简单的答案选择方法,探讨了时间特征在问答系统中的作用,揭示了其在处理历史文档中显性和隐性时间问题上的重要性。
English: This study explores the impact of temporal features in question answering by evaluating simple answer selection methods, revealing their critical role in handling both explicit and implicit temporal questions from historical documents.

Authors:Changchang Yin, Hong-You Chen, Wei-Lun Chao, Ping Zhang
Title: Federated Inverse Probability Treatment Weighting for Individual Treatment Effect Estimation
Abstract:
Individual treatment effect (ITE) estimation is to evaluate the causal effects of treatment strategies on some important outcomes, which is a crucial problem in healthcare. Most existing ITE estimation methods are designed for centralized settings. However, in real-world clinical scenarios, the raw data are usually not shareable among hospitals due to the potential privacy and security risks, which makes the methods not applicable. In this work, we study the ITE estimation task in a federated setting, which allows us to harness the decentralized data from multiple hospitals. Due to the unavoidable confounding bias in the collected data, a model directly learned from it would be inaccurate. One well-known solution is Inverse Probability Treatment Weighting (IPTW), which uses the conditional probability of treatment given the covariates to re-weight each training example. Applying IPTW in a federated setting, however, is non-trivial. We found that even with a well-estimated conditional probability, the local model training step using each hospital's data alone would still suffer from confounding bias. To address this, we propose FED-IPTW, a novel algorithm to extend IPTW into a federated setting that enforces both global (over all the data) and local (within each hospital) decorrelation between covariates and treatments. We validated our approach on the task of comparing the treatment effects of mechanical ventilation on improving survival probability for patients with breadth difficulties in the intensive care unit (ICU). We conducted experiments on both synthetic and real-world eICU datasets and the results show that FED-IPTW outperform state-of-the-art methods on all the metrics on factual prediction and ITE estimation tasks, paving the way for personalized treatment strategy design in mechanical ventilation usage.
中文: 本研究提出FED-IPTW算法,将逆概率加权方法扩展至联邦学习框架,通过全局与局部去相关处理解决多医院数据中的混淆偏差问题,在ICU机械通气治疗效果预测任务中显著优于现有方法。
English: This study introduces FED-IPTW, a novel federated learning algorithm that extends Inverse Probability Treatment Weighting to mitigate confounding bias in individual treatment effect estimation across multiple hospitals without sharing raw data, demonstrating superior performance in predicting mechanical ventilation outcomes on ICU datasets.

Authors:Jianlong Zhou, Fang Chen
Title: E-LENS: User Requirements-Oriented AI Ethics Assurance
Abstract:
Despite the much proliferation of AI ethical principles in recent years, there is a challenge of assuring AI ethics with current AI ethics frameworks in real-world applications. While system safety has emerged as a distinct discipline for a long time, originated from safety concerns in early aircraft manufacturing. The safety assurance is now an indispensable component in safety critical domains. Motivated by the assurance approaches for safety-critical systems such as aviation, this paper introduces the concept of AI ethics assurance cases into the AI ethics assurance. Three pillars of user requirements, evidence, and validation are proposed as key components and integrated into AI ethics assurance cases for a new approach of user requirements-oriented AI ethics assurance. The user requirements-oriented AI ethics assurance case is set up based on three pillars and hazard analysis methods used in the safety assurance of safety-critical systems. This paper also proposes a platform named Ethical-Lens (E-LENS) to implement the user requirements-oriented AI ethics assurance approach. The proposed user requirements-based E-LENS platform is then applied to assure AI ethics of an AI-driven human resource shortlisting system as a case study to show the effectiveness of the proposed approach.
中文摘要:本文提出了一种以用户需求为导向的人工智能伦理保障框架,结合安全关键系统方法并通过人力资源筛选案例验证了Ethical-Lens平台的有效性。
English Summary: This paper introduces a user requirements-oriented AI ethics assurance framework, integrating safety-critical system approaches with the Ethical-Lens platform to validate effectiveness through an HR recruitment case study.

Authors:Nenad Petrovic, Yurui Zhang, Moaad Maaroufi, Kuo-Yi Chao, Lukasz Mazur, Fengjunjie Pan, Vahid Zolfaghari, Alois Knoll
Title: Multi-modal Summarization in Model-Based Engineering: Automotive Software Development Case Study
Abstract:
Multimodal summarization integrating information from diverse data modalities presents a promising solution to aid the understanding of information within various processes. However, the application and advantages of multimodal summarization have not received much attention in model-based engineering (MBE), where it has become a cornerstone in the design and development of complex systems, leveraging formal models to improve understanding, validation and automation throughout the engineering lifecycle. UML and EMF diagrams in model-based engineering contain a large amount of multimodal information and intricate relational data. Hence, our study explores the application of multimodal large language models within the domain of model-based engineering to evaluate their capacity for understanding and identifying relationships, features, and functionalities embedded in UML and EMF diagrams. We aim to demonstrate the transformative potential benefits and limitations of multimodal summarization in improving productivity and accuracy in MBE practices. The proposed approach is evaluated within the context of automotive software development, while many promising state-of-art models were taken into account.
Chinese: 本研究探讨了在基于模型的工程中应用多模态大语言模型,以提升对UML和EMF图的理解,旨在提高汽车软件开发的生产效率和准确性。
English: This study investigates the use of multimodal large language models in model-based engineering to enhance understanding of UML and EMF diagrams, aiming to improve productivity and accuracy in automotive software development.

Authors:Maxime Di Folco, Emily Chan, Marta Hasny, Cosmin I. Bercea, Julia A. Schnabel
Title: Semantic Alignment of Unimodal Medical Text and Vision Representations
Abstract:
General-purpose AI models, particularly those designed for text and vision, demonstrate impressive versatility across a wide range of deep-learning tasks. However, they often underperform in specialised domains like medical imaging, where domain-specific solutions or alternative knowledge transfer approaches are typically required. Recent studies have noted that general-purpose models can exhibit similar latent spaces when processing semantically related data, although this alignment does not occur naturally. Building on this insight, it has been shown that applying a simple transformation - at most affine - estimated from a subset of semantically corresponding samples, known as anchors, enables model stitching across diverse training paradigms, architectures, and modalities. In this paper, we explore how semantic alignment - estimating transformations between anchors - can bridge general-purpose AI with specialised medical knowledge. Using multiple public chest X-ray datasets, we demonstrate that model stitching across model architectures allows general models to integrate domain-specific knowledge without additional training, leading to improved performance on medical tasks. Furthermore, we introduce a novel zero-shot classification approach for unimodal vision encoders that leverages semantic alignment across modalities. Our results show that our method not only outperforms general multimodal models but also approaches the performance levels of fully trained, medical-specific multimodal solutions
中文: 通用AI模型通过语义对齐和模型拼接技术,能有效融合专业医学知识,无需额外训练即可提升胸部X光分析等任务的性能,同时提出的零样本分类方法可媲美专业多模态解决方案。
English: General-purpose AI models can be effectively integrated with specialized medical knowledge through semantic alignment and model stitching, enabling improved performance on tasks like chest X-ray analysis without additional training, while also introducing a zero-shot classification method that rivals specialized multimodal solutions.

Authors:Bin Li, Haichen Cai, Lei Liu, Zesong Fei
Title: Delay-Aware Digital Twin Synchronization in Mobile Edge Networks with Semantic Communications
Abstract:
The synchronization of digital twins (DT) serves as the cornerstone for effective operation of the DT framework. However, the limitations of channel capacity can greatly affect the data transmission efficiency of wireless communication. Unlike traditional communication methods, semantic communication transmits the intended meanings of physical objects instead of raw data, effectively saving bandwidth resource and reducing DT synchronization latency. Hence, we are committed to integrating semantic communication into the DT synchronization framework within the mobile edge computing system, aiming to enhance the DT synchronization efficiency of user devices (UDs). Our goal is to minimize the average DT synchronization latency of all UDs by jointly optimizing the synchronization strategy, transmission power of UDs, and computational resource allocation for both UDs and base station. The formulated problem involves sequential decision-making across multiple coherent time slots. Furthermore, the mobility of UDs introduces uncertainties into the decision-making process. To solve this challenging optimization problem efficiently, we propose a soft actor-critic-based deep reinforcement learning algorithm to optimize synchronization strategy and resource allocation. Numerical results demonstrate that our proposed algorithm can reduce synchronization latency by up to 13.2\% and improve synchronization efficiency compared to other benchmark schemes.
Chinese: 本研究将语义通信融入移动边缘计算系统中的数字孪生同步框架,通过深度强化学习算法联合优化策略与资源,旨在最小化同步延迟,实现了高达13.2%的延迟降低。
English: This study integrates semantic communication into the digital twin synchronization framework within mobile edge computing systems, aiming to minimize synchronization latency through joint optimization of strategies and resources using a deep reinforcement learning algorithm, which achieves up to 13.2% latency reduction.

Authors:Tian-Yu Xiang, Ao-Qun Jin, Xiao-Hu Zhou, Mei-Jiang Gui, Xiao-Liang Xie, Shi-Qi Liu, Shuang-Yi Wang, Sheng-Bin Duang, Si-Cheng Wang, Zheng Lei, Zeng-Guang Hou
Title: VLA Model-Expert Collaboration for Bi-directional Manipulation Learning
Abstract:
The emergence of vision-language-action (VLA) models has given rise to foundation models for robot manipulation. Although these models have achieved significant improvements, their generalization in multi-task manipulation remains limited. This study proposes a VLA model-expert collaboration framework that leverages a limited number of expert actions to enhance VLA model performance. This approach reduces expert workload relative to manual operation while simultaneously improving the reliability and generalization of VLA models. Furthermore, manipulation data collected during collaboration can further refine the VLA model, while human participants concurrently enhance their skills. This bi-directional learning loop boosts the overall performance of the collaboration system. Experimental results across various VLA models demonstrate the effectiveness of the proposed system in collaborative manipulation and learning, as evidenced by improved success rates across tasks. Additionally, validation using a brain-computer interface (BCI) indicates that the collaboration system enhances the efficiency of low-speed action systems by involving VLA model during manipulation. These promising results pave the way for advancing human-robot interaction in the era of foundation models for robotics. (Project website: https://aoqunjin.github.io/Expert-VLA/)
中文总结:本研究提出了一种视觉语言动作模型与专家协作框架,通过双向学习循环提升机器人操作性能,在多任务测试和脑机接口验证中均显示出更高的成功率和系统效率。
English Summary: This study introduces a VLA model-expert collaboration framework that enhances robotic manipulation performance through bidirectional learning, improving task success rates and system efficiency as validated across multiple VLA models and BCI tests.

Authors:Devanish N. Kamtam, Joseph B. Shrager, Satya Deepya Malla, Xiaohan Wang, Nicole Lin, Juan J. Cardona, Serena Yeung-Levy, Clarence Hu
Title: SurgiSAM2: Fine-tuning a foundational model for surgical video anatomy segmentation and detection
Abstract:
Background: We evaluate SAM 2 for surgical scene understanding by examining its semantic segmentation capabilities for organs/tissues both in zero-shot scenarios and after fine-tuning. Methods: We utilized five public datasets to evaluate and fine-tune SAM 2 for segmenting anatomical tissues in surgical videos/images. Fine-tuning was applied to the image encoder and mask decoder. We limited training subsets from 50 to 400 samples per class to better model real-world constraints with data acquisition. The impact of dataset size on fine-tuning performance was evaluated with weighted mean Dice coefficient (WMDC), and the results were also compared against previously reported state-of-the-art (SOTA) results. Results: SurgiSAM 2, a fine-tuned SAM 2 model, demonstrated significant improvements in segmentation performance, achieving a 17.9% relative WMDC gain compared to the baseline SAM 2. Increasing prompt points from 1 to 10 and training data scale from 50/class to 400/class enhanced performance; the best WMDC of 0.92 on the validation subset was achieved with 10 prompt points and 400 samples per class. On the test subset, this model outperformed prior SOTA methods in 24/30 (80%) of the classes with a WMDC of 0.91 using 10-point prompts. Notably, SurgiSAM 2 generalized effectively to unseen organ classes, achieving SOTA on 7/9 (77.8%) of them. Conclusion: SAM 2 achieves remarkable zero-shot and fine-tuned performance for surgical scene segmentation, surpassing prior SOTA models across several organ classes of diverse datasets. This suggests immense potential for enabling automated/semi-automated annotation pipelines, thereby decreasing the burden of annotations facilitating several surgical applications.
中文: 经过微调的SAM 2模型SurgiSAM 2在手术场景分割中表现卓越,加权平均Dice系数提升17.9%,优于现有最优方法,并对未见器官类别展现出优秀的泛化能力。
English: The fine-tuned SAM 2 model, SurgiSAM 2, significantly enhances surgical scene segmentation with a 17.9% improvement in WMDC, outperforming prior state-of-the-art methods and demonstrating strong generalization to unseen organ classes.

Authors:Meihui Liu, Shu Sun, Ruifeng Gao, Meixia Tao
Title: Beamforming Design for ISAC Systems with Suppressed Range-Angle Sidelobes
Abstract:
Integrated sensing and communication (ISAC) represents a pivotal advancement for future wireless networks. This paper introduces a novel ISAC beamforming method for enhancing sensing performance while preserving communication quality by leveraging the ambiguity function (AF). We formulate an optimization problem to minimize the integrated sidelobe level ratio (ISLR) of the AF subject to the constraints of transmission power, communication signalto-interference-plus-noise ratio, and sensing gain. To address the non-convexity of the optimization problem, semidefinite relaxation is adopted. Numerical results show that our method significantly reduces range sidelobes and achieves a lower ISLR in the rangeangle domain compared to other approaches.
中文: 本文提出了一种新颖的通感一体化波束赋形方法,通过半定松弛技术最小化积分旁瓣比,在保证通信质量的同时显著提升了感知性能。
English: This paper proposes a novel integrated sensing and communication beamforming method that minimizes the integrated sidelobe level ratio using semidefinite relaxation, significantly improving sensing performance while maintaining communication quality.

Authors:Huang Huang, Fangchen Liu, Letian Fu, Tingfan Wu, Mustafa Mukadam, Jitendra Malik, Ken Goldberg, Pieter Abbeel
Title: OTTER: A Vision-Language-Action Model with Text-Aware Visual Feature Extraction
Abstract:
Vision-Language-Action (VLA) models aim to predict robotic actions based on visual observations and language instructions. Existing approaches require fine-tuning pre-trained visionlanguage models (VLMs) as visual and language features are independently fed into downstream policies, degrading the pre-trained semantic alignments. We propose OTTER, a novel VLA architecture that leverages these existing alignments through explicit, text-aware visual feature extraction. Instead of processing all visual features, OTTER selectively extracts and passes only task-relevant visual features that are semantically aligned with the language instruction to the policy transformer. This allows OTTER to keep the pre-trained vision-language encoders frozen. Thereby, OTTER preserves and utilizes the rich semantic understanding learned from large-scale pre-training, enabling strong zero-shot generalization capabilities. In simulation and real-world experiments, OTTER significantly outperforms existing VLA models, demonstrating strong zeroshot generalization to novel objects and environments. Video, code, checkpoints, and dataset: https://ottervla.github.io/.
中文:OTTER是一种新颖的视觉-语言-动作模型,通过选择性提取与语言指令语义对齐的任务相关视觉特征,无需微调即可保持预训练语义对齐,展现出强大的零样本泛化能力。
English: OTTER is a novel Vision-Language-Action model that preserves pre-trained semantic alignments by selectively extracting task-relevant visual features aligned with language instructions, enabling strong zero-shot generalization without fine-tuning.

Authors:Phuoc Nguyen, Francesco Verdoja, Ville Kyrki
Title: REACT: Real-time Efficient Attribute Clustering and Transfer for Updatable 3D Scene Graph
Abstract:
Modern-day autonomous robots need high-level map representations to perform sophisticated tasks. Recently, 3D scene graphs (3DSGs) have emerged as a promising alternative to traditional grid maps, blending efficient memory use and rich feature representation. However, most efforts to apply them have been limited to static worlds. This work introduces REACT, a framework that efficiently performs real-time attribute clustering and transfer to relocalize object nodes in a 3DSG. REACT employs a novel method for comparing object instances using an embedding model trained on triplet loss, facilitating instance clustering and matching. Experimental results demonstrate that REACT is able to relocalize objects while maintaining computational efficiency. The REACT framework's source code will be available as an open-source project, promoting further advancements in reusable and updatable 3DSGs.
中文:REACT是一种实时框架,通过高效的属性聚类与迁移技术,利用嵌入模型进行实例匹配,实现了动态物体在3D场景图中的重定位,同时保持了计算效率。
English: REACT is a real-time framework that enhances 3D scene graphs by enabling dynamic object relocalization through efficient attribute clustering and transfer, using an embedding model for instance matching while maintaining computational efficiency.

Authors:Chenhui Xu, Dancheng Liu, Jiajie Li, Amir Nassereldine, Zhaohui Li, Jinjun Xiong
Title: Towards Understanding Multi-Round Large Language Model Reasoning: Approximability, Learnability and Generalizability
Abstract:
Recent advancements in cognitive science and multi-round reasoning techniques for Large Language Models (LLMs) suggest that iterative thinking processes improve problem-solving performance in complex tasks. Inspired by this, approaches like Chain-of-Thought, debating, and self-refinement have been applied to auto-regressive LLMs, achieving significant successes in tasks such as mathematical reasoning, commonsense reasoning, and multi-hop question answering. Despite these successes, the theoretical basis for how multi-round reasoning enhances problem-solving abilities remains underexplored. In this work, we investigate the approximation, learnability, and generalization properties of multi-round auto-regressive models. We show that Transformers with finite context windows are universal approximators for steps of Turing-computable functions and can approximate any Turing-computable sequence-to-sequence function through multi-round reasoning. We extend PAC learning to sequence generation and demonstrate that multi-round generation is learnable even when the sequence length exceeds the model's context window. Finally, we examine how generalization error propagates across rounds, and show how the aforementioned approaches can help constrain this error, ensuring outputs stay within an expectation boundary. This work sheds light on the systemic theoretical foundations of multi-round sequence learning and reasoning, emphasizing its role in inference complexity.
Chinese: 近期多轮推理技术如思维链和自我优化提升了大型语言模型在复杂任务中的表现,本研究通过证明Transformer的通用逼近能力、可学习性及误差控制,奠定了其系统性的理论基础。
English: Recent advances in multi-round reasoning for LLMs, such as Chain-of-Thought and self-refinement, have enhanced performance in complex tasks, and this study establishes their theoretical foundations by demonstrating Transformers' universal approximation capabilities, learnability, and error control in sequence generation.

Authors:Yuyan Ni, Shikun Feng, Haohan Chi, Bowen Zheng, Huan-ang Gao, Wei-Ying Ma, Zhi-Ming Ma, Yanyan Lan
Title: Straight-Line Diffusion Model for Efficient 3D Molecular Generation
Abstract:
Diffusion-based models have shown great promise in molecular generation but often require a large number of sampling steps to generate valid samples. In this paper, we introduce a novel Straight-Line Diffusion Model (SLDM) to tackle this problem, by formulating the diffusion process to follow a linear trajectory. The proposed process aligns well with the noise sensitivity characteristic of molecular structures and uniformly distributes reconstruction effort across the generative process, thus enhancing learning efficiency and efficacy. Consequently, SLDM achieves state-of-the-art performance on 3D molecule generation benchmarks, delivering a 100-fold improvement in sampling efficiency.
中文摘要:直线扩散模型(SLDM)通过采用线性扩散轨迹,显著提升了分子生成效率,在实现最优性能的同时将采样速度提高了100倍。
English Summary: The Straight-Line Diffusion Model (SLDM) introduces a linear diffusion trajectory that significantly enhances molecular generation efficiency, achieving state-of-the-art performance with a 100-fold improvement in sampling speed.

Authors:Theodore Zhao, Sid Kiblawi, Naoto Usuyama, Ho Hin Lee, Sam Preston, Hoifung Poon, Mu Wei
Title: Boltzmann Attention Sampling for Image Analysis with Small Objects
Abstract:
Detecting and segmenting small objects, such as lung nodules and tumor lesions, remains a critical challenge in image analysis. These objects often occupy less than 0.1% of an image, making traditional transformer architectures inefficient and prone to performance degradation due to redundant attention computations on irrelevant regions. Existing sparse attention mechanisms rely on rigid hierarchical structures, which are poorly suited for detecting small, variable, and uncertain object locations. In this paper, we propose BoltzFormer, a novel transformer-based architecture designed to address these challenges through dynamic sparse attention. BoltzFormer identifies and focuses attention on relevant areas by modeling uncertainty using a Boltzmann distribution with an annealing schedule. Initially, a higher temperature allows broader area sampling in early layers, when object location uncertainty is greatest. As the temperature decreases in later layers, attention becomes more focused, enhancing efficiency and accuracy. BoltzFormer seamlessly integrates into existing transformer architectures via a modular Boltzmann attention sampling mechanism. Comprehensive evaluations on benchmark datasets demonstrate that BoltzFormer significantly improves segmentation performance for small objects while reducing attention computation by an order of magnitude compared to previous state-of-the-art methods.
中文: BoltzFormer通过采用带退火策略的玻尔兹曼分布实现动态稀疏注意力机制,有效聚焦于相关区域进行小物体检测与分割,在显著提升精度的同时将注意力计算量减少一个数量级。
English: BoltzFormer introduces a dynamic sparse attention mechanism using Boltzmann distribution with annealing to efficiently detect and segment small objects by focusing computational resources on relevant areas, significantly improving accuracy while reducing attention computations by an order of magnitude.

Authors:Paul Stangel, David Bani-Harouni, Chantal Pellegrini, Ege Özsoy, Kamilia Zaripova, Matthias Keicher, Nassir Navab
Title: Rewarding Doubt: A Reinforcement Learning Approach to Calibrated Confidence Expression of Large Language Models
Abstract:
A safe and trustworthy use of Large Language Models (LLMs) requires an accurate expression of confidence in their answers. We propose a novel Reinforcement Learning approach that allows to directly fine-tune LLMs to express calibrated confidence estimates alongside their answers to factual questions. Our method optimizes a reward based on the logarithmic scoring rule, explicitly penalizing both over- and under-confidence. This encourages the model to align its confidence estimates with the actual predictive accuracy. The optimal policy under our reward design would result in perfectly calibrated confidence expressions. Unlike prior approaches that decouple confidence estimation from response generation, our method integrates confidence calibration seamlessly into the generative process of the LLM. Empirically, we demonstrate that models trained with our approach exhibit substantially improved calibration and generalize to unseen tasks without further fine-tuning, suggesting the emergence of general confidence awareness. We provide our training and evaluation code in the supplementary and will make it publicly available upon acceptance.
Chinese: 本文提出一种强化学习方法,通过直接微调大语言模型在回答事实问题时生成校准的置信度估计,将置信度校准无缝集成到生成过程中,显著提升了模型的校准效果和泛化能力。
English: This paper introduces a reinforcement learning method to fine-tune Large Language Models for producing calibrated confidence estimates alongside factual answers, improving alignment between confidence and accuracy without decoupling from the generative process.

Authors:Aviv Shamsian, Eitan Shaar, Aviv Navon, Gal Chechik, Ethan Fetaya
Title: Go Beyond Your Means: Unlearning with Per-Sample Gradient Orthogonalization
Abstract:
Machine unlearning aims to remove the influence of problematic training data after a model has been trained. The primary challenge in machine unlearning is ensuring that the process effectively removes specified data without compromising the model's overall performance on the remaining dataset. Many existing machine unlearning methods address this challenge by carefully balancing gradient ascent on the unlearn data with the gradient descent on a retain set representing the training data. Here, we propose OrthoGrad, a novel approach that mitigates interference between the unlearn set and the retain set rather than competing ascent and descent processes. Our method projects the gradient of the unlearn set onto the subspace orthogonal to all gradients in the retain batch, effectively avoiding any gradient interference. We demonstrate the effectiveness of OrthoGrad on multiple machine unlearning benchmarks, including automatic speech recognition, outperforming competing methods.
中文: OrthoGrad是一种创新的机器遗忘方法,通过将遗忘集的梯度投影到与保留集梯度正交的子空间来消除梯度干扰,在自动语音识别等基准测试中优于现有方法。
English: OrthoGrad is a novel machine unlearning method that eliminates gradient interference by projecting the unlearn set's gradient onto a subspace orthogonal to the retain set's gradients, outperforming existing approaches on benchmarks like automatic speech recognition.

Authors:Yunbo Long, Liming Xu, Alexandra Brintrup
Title: LLM-TabLogic: Preserving Inter-Column Logical Relationships in Synthetic Tabular Data via Prompt-Guided Latent Diffusion
Abstract:
Synthetic tabular data are increasingly being used to replace real data, serving as an effective solution that simultaneously protects privacy and addresses data scarcity. However, in addition to preserving global statistical properties, synthetic datasets must also maintain domain-specific logical consistency**-**especially in complex systems like supply chains, where fields such as shipment dates, locations, and product categories must remain logically consistent for real-world usability. Existing generative models often overlook these inter-column relationships, leading to unreliable synthetic tabular data in real-world applications. To address these challenges, we propose LLM-TabLogic, a novel approach that leverages Large Language Model reasoning to capture and compress the complex logical relationships among tabular columns, while these conditional constraints are passed into a Score-based Diffusion model for data generation in latent space. Through extensive experiments on real-world industrial datasets, we evaluate LLM-TabLogic for column reasoning and data generation, comparing it with five baselines including SMOTE and state-of-the-art generative models. Our results show that LLM-TabLogic demonstrates strong generalization in logical inference, achieving over 90% accuracy on unseen tables. Furthermore, our method outperforms all baselines in data generation by fully preserving inter-column relationships while maintaining the best balance between data fidelity, utility, and privacy. This study presents the first method to effectively preserve inter-column relationships in synthetic tabular data generation without requiring domain knowledge, offering new insights for creating logically consistent real-world tabular data.
中文摘要:LLM-TabLogic是一种创新方法,通过大语言模型推理和基于分数的扩散模型生成合成表格数据,在保持列间复杂逻辑关系方面表现卓越,相比现有方法在数据一致性和实用性上更具优势。
English Summary: LLM-TabLogic is a novel method that uses Large Language Model reasoning and Score-based Diffusion to generate synthetic tabular data while preserving complex inter-column relationships, achieving superior performance in logical consistency and data utility compared to existing approaches.

Authors:Xiangrui Liu, Yuanyuan Zhang, Yingzhou Lu, Changchang Yin, Xiaoling Hu, Xiaoou Liu, Lulu Chen, Sheng Wang, Alexander Rodriguez, Huaxiu Yao, Yezhou Yang, Ping Zhang, Jintai Chen, Tianfan Fu, Xiao Wang
Title: Biomedical Foundation Model: A Survey
Abstract:
Foundation models, first introduced in 2021, are large-scale pre-trained models (e.g., large language models (LLMs) and vision-language models (VLMs)) that learn from extensive unlabeled datasets through unsupervised methods, enabling them to excel in diverse downstream tasks. These models, like GPT, can be adapted to various applications such as question answering and visual understanding, outperforming task-specific AI models and earning their name due to broad applicability across fields. The development of biomedical foundation models marks a significant milestone in leveraging artificial intelligence (AI) to understand complex biological phenomena and advance medical research and practice. This survey explores the potential of foundation models across diverse domains within biomedical fields, including computational biology, drug discovery and development, clinical informatics, medical imaging, and public health. The purpose of this survey is to inspire ongoing research in the application of foundation models to health science.
中文: 基础模型是通过海量无标注数据预训练的大规模人工智能系统,在生物医学领域展现出推动医学研究和临床实践的巨大潜力。
English: Foundation models are large-scale pre-trained AI systems that learn from vast unlabeled data to excel across diverse applications, with biomedical versions showing significant potential to advance medical research and practice.

Authors:Brian Hu Zhang, Tao Lin, Yiling Chen, Tuomas Sandholm
Title: Learning a Game by Paying the Agents
Abstract:
We study the problem of learning the utility functions of agents in a normal-form game by observing the agents play the game repeatedly. Differing from most prior literature, we introduce a principal with the power to observe the agents playing the game, send the agents signals, and send the agents payments as a function of their actions. Under reasonable behavioral models for the agents such as iterated dominated action removal or a no-regret assumption, we show that the principal can, using a number of rounds polynomial in the size of the game, learn the utility functions of all agents to any desirable precision $\varepsilon > 0$. We also show lower bounds in both models, which nearly match the upper bounds in the former model and also strictly separate the two models: the principal can learn strictly faster in the iterated dominance model. Finally, we discuss implications for the problem of steering agents to a desired equilibrium: in particular, we introduce, using our utility-learning algorithm as a subroutine, the first algorithm for steering learning agents without prior knowledge of their utilities.
中文摘要:本研究探讨了在重复标准形式博弈中,一个拥有观察与信号发送能力的主体如何通过多项式轮次学习所有参与者的效用函数,并在迭代占优模型中展现出更快的收敛速度,同时为引导学习者达到期望均衡提供了新算法。
English Summary: This research explores how a principal can efficiently learn agents' utility functions in repeated normal-form games through observation and signaling, achieving polynomial-time learning under behavioral models while demonstrating faster convergence with iterated dominance.

Authors:Andrei Buliga, Chiara Di Francescomarino, Chiara Ghidini, Marco Montali, Massimiliano Ronzani
Title: Generating Counterfactual Explanations Under Temporal Constraints
Abstract:
Counterfactual explanations are one of the prominent eXplainable Artificial Intelligence (XAI) techniques, and suggest changes to input data that could alter predictions, leading to more favourable outcomes. Existing counterfactual methods do not readily apply to temporal domains, such as that of process mining, where data take the form of traces of activities that must obey to temporal background knowledge expressing which dynamics are possible and which not. Specifically, counterfactuals generated off-the-shelf may violate the background knowledge, leading to inconsistent explanations. This work tackles this challenge by introducing a novel approach for generating temporally constrained counterfactuals, guaranteed to comply by design with background knowledge expressed in Linear Temporal Logic on process traces (LTLp). We do so by infusing automata-theoretic techniques for LTLp inside a genetic algorithm for counterfactual generation. The empirical evaluation shows that the generated counterfactuals are temporally meaningful and more interpretable for applications involving temporal dependencies.
Chinese: 本研究提出了一种在过程挖掘中生成时间约束反事实解释的新方法,通过将自动机理论技术融入遗传算法,确保反事实符合时间背景知识。
English: This study introduces a novel method for generating temporally constrained counterfactual explanations in process mining, ensuring compliance with temporal background knowledge through automata-theoretic techniques integrated into a genetic algorithm.

Authors:Haichao Liu, Sikai Guo, Pengfei Mai, Jiahang Cao, Haoang Li, Jun Ma
Title: RoboDexVLM: Visual Language Model-Enabled Task Planning and Motion Control for Dexterous Robot Manipulation
Abstract:
This paper introduces RoboDexVLM, an innovative framework for robot task planning and grasp detection tailored for a collaborative manipulator equipped with a dexterous hand. Previous methods focus on simplified and limited manipulation tasks, which often neglect the complexities associated with grasping a diverse array of objects in a long-horizon manner. In contrast, our proposed framework utilizes a dexterous hand capable of grasping objects of varying shapes and sizes while executing tasks based on natural language commands. The proposed approach has the following core components: First, a robust task planner with a task-level recovery mechanism that leverages vision-language models (VLMs) is designed, which enables the system to interpret and execute open-vocabulary commands for long sequence tasks. Second, a language-guided dexterous grasp perception algorithm is presented based on robot kinematics and formal methods, tailored for zero-shot dexterous manipulation with diverse objects and commands. Comprehensive experimental results validate the effectiveness, adaptability, and robustness of RoboDexVLM in handling long-horizon scenarios and performing dexterous grasping. These results highlight the framework's ability to operate in complex environments, showcasing its potential for open-vocabulary dexterous manipulation. Our open-source project page can be found at https://henryhcliu.github.io/robodexvlm.
Chinese: 本文提出RoboDexVLM框架,通过灵巧手实现多样化物体的抓取和基于自然语言指令的长序列任务规划,实验验证了其在复杂环境中的有效性和适应性。
English: This paper presents RoboDexVLM, a framework for robot task planning and grasp detection using a dexterous hand to handle diverse objects and execute long-horizon tasks based on natural language commands, validated by experiments showing its effectiveness and adaptability.

Authors:Haoxuan Li, Ziya Erkoc, Lei Li, Daniele Sirigatti, Vladyslav Rozov, Angela Dai, Matthias Nießner
Title: MeshPad: Interactive Sketch-Conditioned Artist-Reminiscent Mesh Generation and Editing
Abstract:
We introduce MeshPad, a generative approach that creates 3D meshes from sketch inputs. Building on recent advances in artist-reminiscent triangle mesh generation, our approach addresses the need for interactive mesh creation. To this end, we focus on enabling consistent edits by decomposing editing into 'deletion' of regions of a mesh, followed by 'addition' of new mesh geometry. Both operations are invoked by simple user edits of a sketch image, facilitating an iterative content creation process and enabling the construction of complex 3D meshes. Our approach is based on a triangle sequence-based mesh representation, exploiting a large Transformer model for mesh triangle addition and deletion. In order to perform edits interactively, we introduce a vertex-aligned speculative prediction strategy on top of our additive mesh generator. This speculator predicts multiple output tokens corresponding to a vertex, thus significantly reducing the computational cost of inference and accelerating the editing process, making it possible to execute each editing step in only a few seconds. Comprehensive experiments demonstrate that MeshPad outperforms state-of-the-art sketch-conditioned mesh generation methods, achieving more than 22% mesh quality improvement in Chamfer distance, and being preferred by 90% of participants in perceptual evaluations.
中文:MeshPad是一种通过顶点对齐推测预测策略从草图交互式生成3D网格的方法,在网格质量和用户偏好方面均优于现有技术。
English: MeshPad is a generative method that enables interactive 3D mesh creation from sketches through a vertex-aligned speculative prediction strategy, achieving superior quality and user preference over existing techniques.

Authors:Sakiko Yahata, Zhen Wan, Fei Cheng, Sadao Kurohashi, Hisahiko Sato, Ryozo Nagai
Title: Causal Tree Extraction from Medical Case Reports: A Novel Task for Experts-like Text Comprehension
Abstract:
Extracting causal relationships from a medical case report is essential for comprehending the case, particularly its diagnostic process. Since the diagnostic process is regarded as a bottom-up inference, causal relationships in cases naturally form a multi-layered tree structure. The existing tasks, such as medical relation extraction, are insufficient for capturing the causal relationships of an entire case, as they treat all relations equally without considering the hierarchical structure inherent in the diagnostic process. Thus, we propose a novel task, Causal Tree Extraction (CTE), which receives a case report and generates a causal tree with the primary disease as the root, providing an intuitive understanding of a case's diagnostic process. Subsequently, we construct a Japanese case report CTE dataset, J-Casemap, propose a generation-based CTE method that outperforms the baseline by 20.2 points in the human evaluation, and introduce evaluation metrics that reflect clinician preferences. Further experiments also show that J-Casemap enhances the performance of solving other medical tasks, such as question answering.
中文摘要:本研究提出了因果树提取(CTE)这一新任务,通过构建层次化因果树来更直观地呈现医疗案例的诊断过程,并基于日语数据集验证了该方法在提升医疗任务性能方面的有效性。
English Summary: The study introduces Causal Tree Extraction (CTE), a novel task that constructs hierarchical causal trees from medical case reports to better represent the diagnostic process, and demonstrates its effectiveness through a Japanese dataset and improved performance in medical tasks.

Authors:Qi Li, Runpeng Yu, Xinchao Wang
Title: Multi-Level Collaboration in Model Merging
Abstract:
Parameter-level model merging is an emerging paradigm in multi-task learning with significant promise. Previous research has explored its connections with prediction-level model ensembling-commonly viewed as the upper bound for merging-to reveal the potential of achieving performance consistency between the two. However, this observation relies on certain preconditions, such as being limited to two models, using ViT-based models, and all models are fine-tuned from the same pre-trained checkpoint. To further understand the intrinsic connections between model merging and model ensembling, this paper explores an interesting possibility: If these restrictions are removed, can performance consistency still be achieved between merging and ensembling? To answer this question, we first theoretically establish a performance correlation between merging and ensembling. We find that even when previous restrictions are not met, there is still a way for model merging to attain a near-identical and superior performance similar to that of ensembling. To verify whether our findings are practical, we introduce a validation framework termed Neural Ligand (NeuLig). The learning process of NeuLig is meticulously designed with a specialized loss function supported by theoretical foundations. Experimental results demonstrate the robust resilience of NeuLig in terms of both model scale and the number of collaborating models. For instance, for the case involving 5 CLIP-ViT-B/32 models, parameter-level merging achieves the same performance as prediction-level ensembling (merging: 95.44% vs. ensembling: 95.46%).
Chinese: 本文探讨了在解除先前限制条件下,参数级模型融合能否与预测级集成保持性能一致,并提出了神经配体框架,验证了融合在不同模型规模和数量下均可达到与集成近乎相同的性能表现。
English: This paper explores whether parameter-level model merging can achieve performance consistency with prediction-level ensembling without previous restrictions, introducing the Neural Ligand framework that demonstrates merging can attain near-identical performance to ensembling across varying model scales and quantities.

Authors:Kishalay Das, Subhojyoti Khastagir, Pawan Goyal, Seung-Cheol Lee, Satadeep Bhattacharjee, Niloy Ganguly
Title: Periodic Materials Generation using Text-Guided Joint Diffusion Model
Abstract:
Equivariant diffusion models have emerged as the prevailing approach for generating novel crystal materials due to their ability to leverage the physical symmetries of periodic material structures. However, current models do not effectively learn the joint distribution of atom types, fractional coordinates, and lattice structure of the crystal material in a cohesive end-to-end diffusion framework. Also, none of these models work under realistic setups, where users specify the desired characteristics that the generated structures must match. In this work, we introduce TGDMat, a novel text-guided diffusion model designed for 3D periodic material generation. Our approach integrates global structural knowledge through textual descriptions at each denoising step while jointly generating atom coordinates, types, and lattice structure using a periodic-E(3)-equivariant graph neural network (GNN). Extensive experiments using popular datasets on benchmark tasks reveal that TGDMat outperforms existing baseline methods by a good margin. Notably, for the structure prediction task, with just one generated sample, TGDMat outperforms all baseline models, highlighting the importance of text-guided diffusion. Further, in the generation task, TGDMat surpasses all baselines and their text-fusion variants, showcasing the effectiveness of the joint diffusion paradigm. Additionally, incorporating textual knowledge reduces overall training and sampling computational overhead while enhancing generative performance when utilizing real-world textual prompts from experts.
中文摘要:TGDMat是一种新型文本引导扩散模型,通过周期性E(3)等变图神经网络联合生成晶体材料的原子类型、坐标和晶格结构,在结构预测和材料生成任务中均显著优于现有基准方法。
English Summary: TGDMat is a novel text-guided diffusion model that jointly generates crystal materials' atom types, coordinates, and lattice structures using periodic-E(3)-equivariant GNNs, outperforming existing methods in structure prediction and material generation tasks.

Authors:Weihao Lu, Haobo Zhang, Yicheng Li, Qian Lin
Title: On the Saturation Effects of Spectral Algorithms in Large Dimensions
Abstract:
The saturation effects, which originally refer to the fact that kernel ridge regression (KRR) fails to achieve the information-theoretical lower bound when the regression function is over-smooth, have been observed for almost 20 years and were rigorously proved recently for kernel ridge regression and some other spectral algorithms over a fixed dimensional domain. The main focus of this paper is to explore the saturation effects for a large class of spectral algorithms (including the KRR, gradient descent, etc.) in large dimensional settings where $n \asymp d^γ$. More precisely, we first propose an improved minimax lower bound for the kernel regression problem in large dimensional settings and show that the gradient flow with early stopping strategy will result in an estimator achieving this lower bound (up to a logarithmic factor). Similar to the results in KRR, we can further determine the exact convergence rates (both upper and lower bounds) of a large class of (optimal tuned) spectral algorithms with different qualification $τ$'s. In particular, we find that these exact rate curves (varying along $γ$) exhibit the periodic plateau behavior and the polynomial approximation barrier. Consequently, we can fully depict the saturation effects of the spectral algorithms and reveal a new phenomenon in large dimensional settings (i.e., the saturation effect occurs in large dimensional setting as long as the source condition $s>τ$ while it occurs in fixed dimensional setting as long as $s>2τ$).
Chinese: 本文研究了高维设置下谱算法的饱和效应,发现当源条件超过算法资格时会出现饱和现象,并建立了改进的极小极大下界和具有周期性平台行为的精确收敛速率。
English: This paper investigates saturation effects in spectral algorithms within high-dimensional settings, revealing that these effects occur when the source condition exceeds the algorithm's qualification, and it establishes improved minimax bounds and exact convergence rates that exhibit periodic plateau behavior.

Authors:Yuping Wang, Xiangyu Huang, Xiaokang Sun, Mingxuan Yan, Shuo Xing, Zhengzhong Tu, Jiachen Li
Title: UniOcc: A Unified Benchmark for Occupancy Forecasting and Prediction in Autonomous Driving
Abstract:
We introduce UniOcc, a comprehensive, unified benchmark and toolkit for occupancy forecasting (i.e., predicting future occupancies based on historical information) and occupancy prediction (i.e., predicting current-frame occupancy from camera images. UniOcc unifies the data from multiple real-world datasets (i.e., nuScenes, Waymo) and high-fidelity driving simulators (i.e., CARLA, OpenCOOD), providing 2D/3D occupancy labels and annotating innovative per-voxel flows. Unlike existing studies that rely on suboptimal pseudo labels for evaluation, UniOcc incorporates novel evaluation metrics that do not depend on ground-truth labels, enabling robust assessment on additional aspects of occupancy quality. Through extensive experiments on state-of-the-art models, we demonstrate that large-scale, diverse training data and explicit flow information significantly enhance occupancy prediction and forecasting performance. Our data and code are available at https://uniocc.github.io/.
中文:UniOcc是一个统一的占用预测基准与工具包,整合了多源真实数据和仿真数据集,通过创新的三维流标注和无真值标签评估指标,证明大规模训练数据与显式流信息能显著提升占用预测性能。
English: UniOcc is a unified benchmark and toolkit for occupancy forecasting and prediction, integrating diverse real-world and simulated datasets with innovative 3D flow annotations and label-free evaluation metrics to significantly improve model performance through large-scale training data and explicit flow information.

Authors:Xiaoxuan Wang, Yihe Deng, Mingyu Derek Ma, Wei Wang
Title: Entropy-Based Adaptive Weighting for Self-Training
Abstract:
The mathematical problem-solving capabilities of large language models have become a focal point of research, with growing interests in leveraging self-generated reasoning paths as a promising way to refine and enhance these models. These paths capture step-by-step logical processes while requiring only the correct answer for supervision. The self-training method has been shown to be effective in reasoning tasks while eliminating the need for external models and manual annotations. However, optimizing the use of self-generated data for model training remains an open challenge. In this work, we propose Entropy-Based Adaptive Weighting for Self-Training (EAST), an adaptive weighting strategy designed to prioritize uncertain data during self-training. Specifically, EAST employs a mapping function with a tunable parameter that controls the sharpness of the weighting, assigning higher weights to data where the model exhibits greater uncertainty. This approach guides the model to focus on more informative and challenging examples, thereby enhancing its reasoning ability. We evaluate our approach on GSM8K and MATH benchmarks. Empirical results show that, while the vanilla method yields virtually no improvement (0%) on MATH, EAST achieves around a 1% gain over backbone model. On GSM8K, EAST attains a further 1-2% performance boost compared to the vanilla method.
Chinese: 本研究提出的基于熵的自适应加权自训练方法(EAST)通过优先处理模型不确定的自生成数据来增强大语言模型的数学推理能力,在MATH和GSM8K基准测试中分别实现了1%和1-2%的性能提升。
English: The proposed Entropy-Based Adaptive Weighting for Self-Training (EAST) method enhances large language models' mathematical reasoning by prioritizing uncertain self-generated data during training, achieving performance gains of 1% on MATH and 1-2% on GSM8K benchmarks compared to baseline methods.

Authors:Junjie Zheng, Zihao Chen, Chaofan Ding, Xinhan Di
Title: DeepDubber-V1: Towards High Quality and Dialogue, Narration, Monologue Adaptive Movie Dubbing Via Multi-Modal Chain-of-Thoughts Reasoning Guidance
Abstract:
Current movie dubbing technology can generate the desired voice from a given speech prompt, ensuring good synchronization between speech and visuals while accurately conveying the intended emotions. However, in movie dubbing, key aspects such as adapting to different dubbing styles, handling dialogue, narration, and monologue effectively, and understanding subtle details like the age and gender of speakers, have not been well studied. To address this challenge, we propose a framework of multi-modal large language model. First, it utilizes multimodal Chain-of-Thought (CoT) reasoning methods on visual inputs to understand dubbing styles and fine-grained attributes. Second, it generates high-quality dubbing through large speech generation models, guided by multimodal conditions. Additionally, we have developed a movie dubbing dataset with CoT annotations. The evaluation results demonstrate a performance improvement over state-of-the-art methods across multiple datasets. In particular, for the evaluation metrics, the SPK-SIM and EMO-SIM increases from 82.48% to 89.74%, 66.24% to 78.88% for dubbing setting 2.0 on V2C Animation dataset, LSE-D and MCD-SL decreases from 14.79 to 14.63, 5.24 to 4.74 for dubbing setting 2.0 on Grid dataset, SPK-SIM increases from 64.03 to 83.42 and WER decreases from 52.69% to 23.20% for initial reasoning setting on proposed CoT-Movie-Dubbing dataset in the comparison with the state-of-the art models.
Chinese: 该多模态大语言模型框架通过多模态思维链推理技术,有效提升电影配音中对配音风格和说话者属性的适应能力,并在多个数据集上显著超越了现有最优方法的性能指标。
English: The proposed multi-modal large language model framework enhances movie dubbing by employing multimodal Chain-of-Thought reasoning to better adapt to dubbing styles and speaker attributes, achieving significant improvements in performance metrics over state-of-the-art methods.

Authors:Siqi Fan, Xiusheng Huang, Yiqun Yao, Xuezhi Fang, Kang Liu, Peng Han, Shuo Shang, Aixin Sun, Yequan Wang
Title: If an LLM Were a Character, Would It Know Its Own Story? Evaluating Lifelong Learning in LLMs
Abstract:
Large language models (LLMs) can carry out human-like dialogue, but unlike humans, they are stateless due to the superposition property. However, during multi-turn, multi-agent interactions, LLMs begin to exhibit consistent, character-like behaviors, hinting at a form of emergent lifelong learning. Despite this, existing benchmarks often fail to capture these dynamics, primarily focusing on static, open-ended evaluations. To address this gap, we introduce LIFESTATE-BENCH, a benchmark designed to assess lifelong learning in LLMs. It features two episodic datasets: Hamlet and a synthetic script collection, rich in narrative structure and character interactions. Our fact checking evaluation probes models' self-awareness, episodic memory retrieval, and relationship tracking, across both parametric and non-parametric approaches. Experiments on models like Llama3.1-8B, GPT-4-turbo, and DeepSeek R1, we demonstrate that nonparametric methods significantly outperform parametric ones in managing stateful learning. However, all models exhibit challenges with catastrophic forgetting as interactions extend, highlighting the need for further advancements in lifelong learning.
中文: 大语言模型在多智能体交互中展现出涌现的终身学习能力,为此我们开发了LIFESTATE-BENCH基准测试,发现非参数方法在状态管理上表现更优,但所有模型都会随着交互延长出现灾难性遗忘问题。
English: Large language models exhibit emergent lifelong learning in multi-agent interactions, leading to the creation of LIFESTATE-BENCH, a benchmark that reveals nonparametric methods excel at state management but all models struggle with catastrophic forgetting over extended interactions.

Authors:Qiang Yi, Yangfan He, Jianhui Wang, Xinyuan Song, ShiYao Qian, Xinhang Yuan, Yi Xin, Yijin Wang, Jingqun Tang, Yuchen Li, Junjiang Lin, Hongyang He, Zhen Tian, Tianxiang Xu, Keqin Li, Kuan Lu, Menghao Huo, Jiaqi Chen, Miao Zhang, Tianyu Shi, Jianyuan Ni
Title: SCORE: Story Coherence and Retrieval Enhancement for AI Narratives
Abstract:
Large Language Models (LLMs) can generate creative and engaging narratives from user-specified input, but maintaining coherence and emotional depth throughout these AI-generated stories remains a challenge. In this work, we propose SCORE, a framework for Story Coherence and Retrieval Enhancement, designed to detect and resolve narrative inconsistencies. By tracking key item statuses and generating episode summaries, SCORE uses a Retrieval-Augmented Generation (RAG) approach to identify related episodes and enhance the overall story structure. Experimental results from testing multiple LLM-generated stories demonstrate that SCORE significantly improves the consistency and stability of narrative coherence compared to baseline GPT models, providing a more robust method for evaluating and refining AI-generated narratives.
中文:SCORE框架通过检测叙事不一致性并采用检索增强生成技术来提升AI生成故事的结构连贯性,显著优于基准模型的表现。
English: The SCORE framework enhances AI-generated story coherence by detecting inconsistencies and using retrieval-augmented generation to improve narrative structure, significantly outperforming baseline models.

Authors:Alexander Murphy, Mohd Sanad Zaki Rizvi, Aden Haussmann, Ping Nie, Guifu Liu, Aryo Pradipta Gema, Pasquale Minervini
Title: An Analysis of Decoding Methods for LLM-based Agents for Faithful Multi-Hop Question Answering
Abstract:
Large Language Models (LLMs) frequently produce factually inaccurate outputs - a phenomenon known as hallucination - which limits their accuracy in knowledge-intensive NLP tasks. Retrieval-augmented generation and agentic frameworks such as Reasoning and Acting (ReAct) can address this issue by giving the model access to external knowledge. However, LLMs often fail to remain faithful to retrieved information. Mitigating this is critical, especially if LLMs are required to reason about the retrieved information. Recent research has explored training-free decoding strategies to improve the faithfulness of model generations. We present a systematic analysis of how the combination of the ReAct framework and decoding strategies (i.e., DeCoRe, DoLa, and CAD) can influence the faithfulness of LLM-generated answers. Our results show that combining an agentic framework for knowledge retrieval with decoding methods that enhance faithfulness can increase accuracy on the downstream Multi-Hop Question Answering tasks. For example, we observe an F1 increase from 19.5 to 32.6 on HotpotQA when using ReAct and DoLa.
中文: 将ReAct框架与DoLa等解码策略结合,显著提升了大型语言模型对检索信息的忠实度,从而在多跳问答任务(如HotpotQA)中提高了准确性。
English: Combining the ReAct framework with decoding strategies like DoLa significantly enhances LLMs' faithfulness to retrieved information, improving accuracy in multi-hop question answering tasks such as HotpotQA.

Authors:Hui Li, Ante Wang, kunquan li, Zhihao Wang, Liang Zhang, Delai Qiu, Qingsong Liu, Jinsong Su
Title: A Multi-Agent Framework with Automated Decision Rule Optimization for Cross-Domain Misinformation Detection
Abstract:
Misinformation spans various domains, but detection methods trained on specific domains often perform poorly when applied to others. With the rapid development of Large Language Models (LLMs), researchers have begun to utilize LLMs for cross-domain misinformation detection. However, existing LLM-based methods often fail to adequately analyze news in the target domain, limiting their detection capabilities. More importantly, these methods typically rely on manually designed decision rules, which are limited by domain knowledge and expert experience, thus limiting the generalizability of decision rules to different domains. To address these issues, we propose a MultiAgent Framework for cross-domain misinformation detection with Automated Decision Rule Optimization (MARO). Under this framework, we first employs multiple expert agents to analyze target-domain news. Subsequently, we introduce a question-reflection mechanism that guides expert agents to facilitate higherquality analysis. Furthermore, we propose a decision rule optimization approach based on carefully-designed cross-domain validation tasks to iteratively enhance the effectiveness of decision rules in different domains. Experimental results and in-depth analysis on commonlyused datasets demonstrate that MARO achieves significant improvements over existing methods.
中文:提出的MARO多智能体框架通过采用带有问题反思机制的专家代理和跨领域验证优化决策规则,解决了跨领域虚假信息检测问题,相比现有方法取得了显著性能提升。
English: The proposed MultiAgent Framework with Automated Decision Rule Optimization (MARO) addresses cross-domain misinformation detection by employing expert agents with question-reflection mechanisms and optimizing decision rules through cross-domain validation, achieving superior performance over existing methods.

Authors:Yuni Lai, Yulin Zhu, Yixuan Sun, Yulun Wu, Bin Xiao, Gaolei Li, Jianhua Li, Kai Zhou
Title: AuditVotes: A Framework Towards More Deployable Certified Robustness for Graph Neural Networks
Abstract:
Despite advancements in Graph Neural Networks (GNNs), adaptive attacks continue to challenge their robustness. Certified robustness based on randomized smoothing has emerged as a promising solution, offering provable guarantees that a model's predictions remain stable under adversarial perturbations within a specified range. However, existing methods face a critical trade-off between accuracy and robustness, as achieving stronger robustness requires introducing greater noise into the input graph. This excessive randomization degrades data quality and disrupts prediction consistency, limiting the practical deployment of certifiably robust GNNs in real-world scenarios where both accuracy and robustness are essential. To address this challenge, we propose \textbf{AuditVotes}, the first framework to achieve both high clean accuracy and certifiably robust accuracy for GNNs. It integrates randomized smoothing with two key components, \underline{au}gmentation and con\underline{dit}ional smoothing, aiming to improve data quality and prediction consistency. The augmentation, acting as a pre-processing step, de-noises the randomized graph, significantly improving data quality and clean accuracy. The conditional smoothing, serving as a post-processing step, employs a filtering function to selectively count votes, thereby filtering low-quality predictions and improving voting consistency. Extensive experimental results demonstrate that AuditVotes significantly enhances clean accuracy, certified robustness, and empirical robustness while maintaining high computational efficiency. Notably, compared to baseline randomized smoothing, AuditVotes improves clean accuracy by $437.1\%$ and certified accuracy by $409.3\%$ when the attacker can arbitrarily insert $20$ edges on the Cora-ML datasets, representing a substantial step toward deploying certifiably robust GNNs in real-world applications.
中文: AuditVotes是一种创新框架,通过将随机平滑与增强和条件平滑相结合,显著提升了图神经网络的清洁精度和认证鲁棒性,同时改善了数据质量和预测一致性。
English: AuditVotes is a novel framework that enhances both clean accuracy and certified robustness in Graph Neural Networks by integrating randomized smoothing with augmentation and conditional smoothing, significantly improving data quality and prediction consistency.

Authors:Ngoc Luyen Le, Marie-Hélène Abel
Title: From Individual to Group: Developing a Context-Aware Multi-Criteria Group Recommender System
Abstract:
Group decision-making is becoming increasingly common in areas such as education, dining, travel, and finance, where collaborative choices must balance diverse individual preferences. While conventional recommender systems are effective in personalization, they fall short in group settings due to their inability to manage conflicting preferences, contextual factors, and multiple evaluation criteria. This study presents the development of a Context-Aware Multi-Criteria Group Recommender System (CA-MCGRS) designed to address these challenges by integrating contextual factors and multiple criteria to enhance recommendation accuracy. By leveraging a Multi-Head Attention mechanism, our model dynamically weighs the importance of different features. Experiments conducted on an educational dataset with varied ratings and contextual variables demonstrate that CA-MCGRS consistently outperforms other approaches across four scenarios. Our findings underscore the importance of incorporating context and multi-criteria evaluations to improve group recommendations, offering valuable insights for developing more effective group recommender systems.
Chinese Summary: 本研究提出了一种情境感知多标准群组推荐系统(CA-MCGRS),通过多头注意力机制整合情境因素与多标准评估,有效提升群组决策推荐精度,实验证明其在多种场景下均优于现有方法。
English Summary: This study introduces a Context-Aware Multi-Criteria Group Recommender System (CA-MCGRS) that enhances recommendation accuracy for group decisions by integrating contextual factors and multiple criteria through a Multi-Head Attention mechanism, outperforming other methods in experiments.

Authors:Mohamed Amine Ferrag, Norbert Tihanyi, Merouane Debbah
Title: Reasoning Beyond Limits: Advances and Open Problems for LLMs
Abstract:
Recent generative reasoning breakthroughs have transformed how large language models (LLMs) tackle complex problems by dynamically retrieving and refining information while generating coherent, multi-step thought processes. Techniques such as inference-time scaling, reinforcement learning, supervised fine-tuning, and distillation have been successfully applied to models like DeepSeek-R1, OpenAI's o1 & o3, GPT-4o, Qwen-32B, and various Llama variants, resulting in enhanced reasoning capabilities. In this paper, we provide a comprehensive analysis of the top 27 LLM models released between 2023 and 2025 (including models such as Mistral AI Small 3 24B, DeepSeek-R1, Search-o1, QwQ-32B, and phi-4). Then, we present an extensive overview of training methodologies that spans general training approaches, mixture-of-experts (MoE) and architectural innovations, retrieval-augmented generation (RAG), chain-of-thought and self-improvement techniques, as well as test-time compute scaling, distillation, and reinforcement learning (RL) methods. Finally, we discuss the key challenges in advancing LLM capabilities, including improving multi-step reasoning without human supervision, overcoming limitations in chained tasks, balancing structured prompts with flexibility, and enhancing long-context retrieval and external tool integration.
最近的生成式推理突破使大型语言模型能够通过动态检索信息和优化多步思维来解决复杂问题,本文分析了顶尖模型与训练方法,并探讨了无监督推理和工具集成等关键挑战。
Recent generative reasoning advances have enabled LLMs to solve complex problems through dynamic information retrieval and refined multi-step thinking, with this paper analyzing top models and training methods while addressing key challenges like unsupervised reasoning and tool integration.

Authors:Haomin Zhang, Chang Liu, Junjie Zheng, Zihao Chen, Chaofan Ding, Xinhan Di
Title: DeepAudio-V1:Towards Multi-Modal Multi-Stage End-to-End Video to Speech and Audio Generation
Abstract:
Currently, high-quality, synchronized audio is synthesized using various multi-modal joint learning frameworks, leveraging video and optional text inputs. In the video-to-audio benchmarks, video-to-audio quality, semantic alignment, and audio-visual synchronization are effectively achieved. However, in real-world scenarios, speech and audio often coexist in videos simultaneously, and the end-to-end generation of synchronous speech and audio given video and text conditions are not well studied. Therefore, we propose an end-to-end multi-modal generation framework that simultaneously produces speech and audio based on video and text conditions. Furthermore, the advantages of video-to-audio (V2A) models for generating speech from videos remain unclear. The proposed framework, DeepAudio, consists of a video-to-audio (V2A) module, a text-to-speech (TTS) module, and a dynamic mixture of modality fusion (MoF) module. In the evaluation, the proposed end-to-end framework achieves state-of-the-art performance on the video-audio benchmark, video-speech benchmark, and text-speech benchmark. In detail, our framework achieves comparable results in the comparison with state-of-the-art models for the video-audio and text-speech benchmarks, and surpassing state-of-the-art models in the video-speech benchmark, with WER 16.57% to 3.15% (+80.99%), SPK-SIM 78.30% to 89.38% (+14.15%), EMO-SIM 66.24% to 75.56% (+14.07%), MCD 8.59 to 7.98 (+7.10%), MCD SL 11.05 to 9.40 (+14.93%) across a variety of dubbing settings.
中文: 当前多模态框架虽能有效从视频生成同步音频,但难以同时合成语音和音频,因此我们提出端到端的DeepAudio框架,通过整合视频到音频、文本到语音和模态融合模块,在多项基准测试中实现了最先进的性能。
English: Current multimodal frameworks effectively generate synchronized audio from videos but struggle with simultaneous speech and audio synthesis, prompting the development of DeepAudio, an end-to-end framework that achieves state-of-the-art performance across multiple benchmarks by integrating video-to-audio, text-to-speech, and modality fusion modules.

Authors:Xianqi Zhang, Hongliang Wei, Wenrui Wang, Xingtao Wang, Xiaopeng Fan, Debin Zhao
Title: FLAM: Foundation Model-Based Body Stabilization for Humanoid Locomotion and Manipulation
Abstract:
Humanoid robots have attracted significant attention in recent years. Reinforcement Learning (RL) is one of the main ways to control the whole body of humanoid robots. RL enables agents to complete tasks by learning from environment interactions, guided by task rewards. However, existing RL methods rarely explicitly consider the impact of body stability on humanoid locomotion and manipulation. Achieving high performance in whole-body control remains a challenge for RL methods that rely solely on task rewards. In this paper, we propose a Foundation model-based method for humanoid Locomotion And Manipulation (FLAM for short). FLAM integrates a stabilizing reward function with a basic policy. The stabilizing reward function is designed to encourage the robot to learn stable postures, thereby accelerating the learning process and facilitating task completion. Specifically, the robot pose is first mapped to the 3D virtual human model. Then, the human pose is stabilized and reconstructed through a human motion reconstruction model. Finally, the pose before and after reconstruction is used to compute the stabilizing reward. By combining this stabilizing reward with the task reward, FLAM effectively guides policy learning. Experimental results on a humanoid robot benchmark demonstrate that FLAM outperforms state-of-the-art RL methods, highlighting its effectiveness in improving stability and overall performance.
中文摘要:本文提出FLAM方法,通过结合稳定奖励函数与基础策略来提升人形机器人的运动与操作稳定性,实验证明该方法在性能和学习效率上均优于现有强化学习方法。
English Summary: This paper introduces FLAM, a foundation model-based method that enhances humanoid robot control by integrating a stabilizing reward function with basic policy to improve stability and accelerate learning, outperforming existing reinforcement learning techniques.

Authors:Yunming Liang, Zihao Chen, Chaofan Ding, Xinhan Di
Title: DeepSound-V1: Start to Think Step-by-Step in the Audio Generation from Videos
Abstract:
Currently, high-quality, synchronized audio is synthesized from video and optional text inputs using various multi-modal joint learning frameworks. However, the precise alignment between the visual and generated audio domains remains far from satisfactory. One key factor is the lack of sufficient temporal and semantic alignment annotations in open-source video-audio and text-audio benchmarks. Therefore, we propose a framework for audio generation from videos, leveraging the internal chain-of-thought (CoT) of a multi-modal large language model (MLLM) to enable step-by-step reasoning without requiring additional annotations. Additionally, a corresponding multi-modal reasoning dataset is constructed to facilitate the learning of initial reasoning in audio generation. In the experiments, we demonstrate the effectiveness of the proposed framework in reducing misalignment (voice-over) in generated audio and achieving competitive performance compared to various state-of-the-art models. The evaluation results show that the proposed method outperforms state-of-the-art approaches across multiple metrics. Specifically, the F DP aSST indicator is reduced by up to 10.07%, the F DP AN N s indicator by up to 11.62%, and the F DV GG indicator by up to 38.61%. Furthermore, the IS indicator improves by up to 4.95%, the IB-score indicator increases by up to 6.39%, and the DeSync indicator is reduced by up to 0.89%.
中文: 该框架利用多模态大语言模型的内在思维链从视频生成音频,无需额外标注即可实现更好的视听对齐,在多项评估指标上显著优于现有最优方法。
English: The proposed framework utilizes a multi-modal large language model's internal chain-of-thought to generate well-aligned audio from videos without extra annotations, significantly outperforming state-of-the-art methods across multiple evaluation metrics.

Authors:Haomin Zhang, Sizhe Shan, Haoyu Wang, Zihao Chen, Xiulong Liu, Chaofan Ding, Xinhan Di
Title: Enhance Generation Quality of Flow Matching V2A Model via Multi-Step CoT-Like Guidance and Combined Preference Optimization
Abstract:
Creating high-quality sound effects from videos and text prompts requires precise alignment between visual and audio domains, both semantically and temporally, along with step-by-step guidance for professional audio generation. However, current state-of-the-art video-guided audio generation models often fall short of producing high-quality audio for both general and specialized use cases. To address this challenge, we introduce a multi-stage, multi-modal, end-to-end generative framework with Chain-of-Thought-like (CoT-like) guidance learning, termed Chain-of-Perform (CoP). First, we employ a transformer-based network architecture designed to achieve CoP guidance, enabling the generation of both general and professional audio. Second, we implement a multi-stage training framework that follows step-by-step guidance to ensure the generation of high-quality sound effects. Third, we develop a CoP multi-modal dataset, guided by video, to support step-by-step sound effects generation. Evaluation results highlight the advantages of the proposed multi-stage CoP generative framework compared to the state-of-the-art models on a variety of datasets, with FAD 0.79 to 0.74 (+6.33%), CLIP 16.12 to 17.70 (+9.80%) on VGGSound, SI-SDR 1.98dB to 3.35dB (+69.19%), MOS 2.94 to 3.49(+18.71%) on PianoYT-2h, and SI-SDR 2.22dB to 3.21dB (+44.59%), MOS 3.07 to 3.42 (+11.40%) on Piano-10h.
中文: 针对现有视频引导音频生成模型难以产生高质量音效的问题,我们提出具有逐步指导功能的多阶段链式表现框架,在多个数据集上显著超越了现有最优模型。
English: Current video-guided audio generation models often fail to produce high-quality sound effects, so we propose a multi-stage Chain-of-Perform (CoP) framework with step-by-step guidance that significantly outperforms state-of-the-art models across multiple datasets.

Authors:Armin Abdollahi, Mehdi Kamal, Massoud Pedram
Title: RocketPPA: Code-Level Power, Performance, and Area Prediction via LLM and Mixture of Experts
Abstract:
This paper presents RocketPPA, a novel ultra-fast power, performance (delay), and area (PPA) estimator operating directly at the code-level abstraction using HDL code as input. The key technical innovation is its LLM-based regression model, which uniquely integrates a large language model (LLM) with a mixture-of-experts (MoE) architecture composed of multilayer perceptrons (MLPs). The LLM interprets the input HDL code and then utilizes its final hidden-layer representations to predict PPA metrics. Low-rank adaptation (LoRA) is used for parameter-efficient fine-tuning to enable efficient LLM training. Furthermore, the work includes the development of an LLM-based HDL code repair framework to generate a large and synthesizable training dataset. Experimental results on the VerilogEval benchmark demonstrate that RocketPPA achieves significant improvements in the accuracy of PPA estimation compared to previous state-of-the-art methods like Llama3-MetRex-8B. Specifically, at a 10% relative error threshold, RocketPPA enhances the pass rate for area prediction by 13.6%, delay by 9.4%, and power by 14.7%. At a 20% threshold, the improvements are 9.6% for area, 10.8% for delay, and 18.5% for power. Moreover, RocketPPA achieves a speedup of over 20x compared to MetRex and 30x over MasterRTL in processing the test set. The impact of RocketPPA is the potential to substantially accelerate the hardware design process by providing accurate PPA estimations early in the design cycle, thus avoiding the overhead of manual feature engineering and time-consuming synthesis flows.
中文: RocketPPA是一种基于LLM-MoE架构的超高速PPA估算器,可直接分析HDL代码,在功耗、延迟和面积预测上比现有方法提速20倍以上且精度显著提升。
English: RocketPPA is an ultra-fast code-level PPA estimator that uses an LLM-MoE architecture to directly analyze HDL code, achieving over 20x speedup and significantly higher accuracy than previous methods across power, delay, and area predictions.

Authors:Weihao Yu, Yuanhao Cai, Ruyi Zha, Zhiwen Fan, Chenxin Li, Yixuan Yuan
Title: X$^{2}$-Gaussian: 4D Radiative Gaussian Splatting for Continuous-time Tomographic Reconstruction
Abstract:
Four-dimensional computed tomography (4D CT) reconstruction is crucial for capturing dynamic anatomical changes but faces inherent limitations from conventional phase-binning workflows. Current methods discretize temporal resolution into fixed phases with respiratory gating devices, introducing motion misalignment and restricting clinical practicality. In this paper, We propose X$^2$-Gaussian, a novel framework that enables continuous-time 4D-CT reconstruction by integrating dynamic radiative Gaussian splatting with self-supervised respiratory motion learning. Our approach models anatomical dynamics through a spatiotemporal encoder-decoder architecture that predicts time-varying Gaussian deformations, eliminating phase discretization. To remove dependency on external gating devices, we introduce a physiology-driven periodic consistency loss that learns patient-specific breathing cycles directly from projections via differentiable optimization. Extensive experiments demonstrate state-of-the-art performance, achieving a 9.93 dB PSNR gain over traditional methods and 2.25 dB improvement against prior Gaussian splatting techniques. By unifying continuous motion modeling with hardware-free period learning, X$^2$-Gaussian advances high-fidelity 4D CT reconstruction for dynamic clinical imaging. Project website at: https://x2-gaussian.github.io/.
Chinese: X²-高斯框架通过将动态辐射高斯泼溅与自监督呼吸运动学习相结合,实现了连续时间四维CT重建,消除了相位离散化和外部门控设备的依赖,并在现有方法基础上取得了显著的性能提升。
English: The X²-Gaussian framework enables continuous-time 4D-CT reconstruction by integrating dynamic radiative Gaussian splatting with self-supervised respiratory motion learning, eliminating phase discretization and external gating devices while achieving significant performance improvements over existing methods.

Authors:Jiefu Ou, William Gantt Walden, Kate Sanders, Zhengping Jiang, Kaiser Sun, Jeffrey Cheng, William Jurayj, Miriam Wanner, Shaobo Liang, Candice Morgan, Seunghoon Han, Weiqi Wang, Chandler May, Hannah Recknor, Daniel Khashabi, Benjamin Van Durme
Title: CLAIMCHECK: How Grounded are LLM Critiques of Scientific Papers?
Abstract:
A core part of scientific peer review involves providing expert critiques that directly assess the scientific claims a paper makes. While it is now possible to automatically generate plausible (if generic) reviews, ensuring that these reviews are sound and grounded in the papers' claims remains challenging. To facilitate LLM benchmarking on these challenges, we introduce CLAIMCHECK, an annotated dataset of NeurIPS 2023 and 2024 submissions and reviews mined from OpenReview. CLAIMCHECK is richly annotated by ML experts for weakness statements in the reviews and the paper claims that they dispute, as well as fine-grained labels of the validity, objectivity, and type of the identified weaknesses. We benchmark several LLMs on three claim-centric tasks supported by CLAIMCHECK, requiring models to (1) associate weaknesses with the claims they dispute, (2) predict fine-grained labels for weaknesses and rewrite the weaknesses to enhance their specificity, and (3) verify a paper's claims with grounded reasoning. Our experiments reveal that cutting-edge LLMs, while capable of predicting weakness labels in (2), continue to underperform relative to human experts on all other tasks.
中文: CLAIMCHECK是一个新数据集,用于评估大语言模型在批判科学主张方面的能力,结果显示尽管先进模型能预测弱点标签,但在关联弱点与主张及验证论文论断方面仍不及人类专家。
English: CLAIMCHECK is a new dataset designed to benchmark large language models on their ability to critique scientific claims, revealing that while advanced models can predict weakness labels, they still fall short of human expertise in associating weaknesses with claims and verifying paper assertions.

Authors:Guang Zhao, Xihaier Luo, Seungjun Lee, Yihui Ren, Shinjae Yoo, Luke Van Roekel, Balu Nadiga, Sri Hari Krishna Narayanan, Yixuan Sun, Wei Xu
Title: Generalizable Implicit Neural Representations via Parameterized Latent Dynamics for Baroclinic Ocean Forecasting
Abstract:
Mesoscale ocean dynamics play a critical role in climate systems, governing heat transport, hurricane genesis, and drought patterns. However, simulating these processes at high resolution remains computationally prohibitive due to their nonlinear, multiscale nature and vast spatiotemporal domains. Implicit neural representations (INRs) reduce the computational costs as resolution-independent surrogates but fail in many-query scenarios (inverse modeling) requiring rapid evaluations across diverse parameters. We present PINROD, a novel framework combining dynamics-aware implicit neural representations with parameterized neural ordinary differential equations to address these limitations. By integrating parametric dependencies into latent dynamics, our method efficiently captures nonlinear oceanic behavior across varying boundary conditions and physical parameters. Experiments on ocean mesoscale activity data show superior accuracy over existing baselines and improved computational efficiency compared to standard numerical simulations.
中文摘要:PINROD是一种创新框架,通过将动态感知隐式神经表示与参数化神经常微分方程相结合,有效模拟不同参数下的非线性海洋中尺度动态,实验证明其具有卓越的精度和计算效率。
English Summary: PINROD is a novel framework that integrates dynamics-aware implicit neural representations with parameterized neural ordinary differential equations to efficiently simulate nonlinear ocean mesoscale dynamics across varying parameters, demonstrating superior accuracy and computational efficiency in experiments.

Authors:Ningyu He, Shangtong Cao, Haoyu Wang, Yao Guo, Xiapu Luo
Title: The Promise and Pitfalls of WebAssembly: Perspectives from the Industry
Abstract:
As JavaScript has been criticized for performance and security issues in web applications, WebAssembly (Wasm) was proposed in 2017 and is regarded as the complementation for JavaScript. Due to its advantages like compact-size, native-like speed, and portability, Wasm binaries are gradually used as the compilation target for industrial projects in other high-level programming languages and are responsible for computation-intensive tasks in browsers, e.g., 3D graphic rendering and video decoding. Intuitively, characterizing in-the-wild adopted Wasm binaries from different perspectives, like their metadata, relation with source programming language, existence of security threats, and practical purpose, is the prerequisite before delving deeper into the Wasm ecosystem and beneficial to its roadmap selection. However, currently, there is no work that conducts a large-scale measurement study on in-the-wild adopted Wasm binaries. To fill this gap, we collect the largest-ever dataset to the best of our knowledge, and characterize the status quo of them from industry perspectives. According to the different roles of people engaging in the community, i.e., web developers, Wasm maintainers, and researchers, we reorganized our findings to suggestions and best practices for them accordingly. We believe this work can shed light on the future direction of the web and Wasm.
中文: WebAssembly(Wasm)作为JavaScript的高性能补充,用于处理网页应用中的计算密集型任务,本研究首次对实际使用的Wasm二进制文件进行大规模分析,为开发者、维护者和研究人员提供见解和最佳实践,以指引未来发展。
English: WebAssembly (Wasm) serves as a high-performance complement to JavaScript, enabling computation-intensive tasks in web applications, and this study provides the first large-scale analysis of in-the-wild Wasm binaries, offering insights and best practices for developers, maintainers, and researchers to guide future development.

Authors:Jinwei Qi, Chaonan Ji, Sheng Xu, Peng Zhang, Bang Zhang, Liefeng Bo
Title: ChatAnyone: Stylized Real-time Portrait Video Generation with Hierarchical Motion Diffusion Model
Abstract:
Real-time interactive video-chat portraits have been increasingly recognized as the future trend, particularly due to the remarkable progress made in text and voice chat technologies. However, existing methods primarily focus on real-time generation of head movements, but struggle to produce synchronized body motions that match these head actions. Additionally, achieving fine-grained control over the speaking style and nuances of facial expressions remains a challenge. To address these limitations, we introduce a novel framework for stylized real-time portrait video generation, enabling expressive and flexible video chat that extends from talking head to upper-body interaction. Our approach consists of the following two stages. The first stage involves efficient hierarchical motion diffusion models, that take both explicit and implicit motion representations into account based on audio inputs, which can generate a diverse range of facial expressions with stylistic control and synchronization between head and body movements. The second stage aims to generate portrait video featuring upper-body movements, including hand gestures. We inject explicit hand control signals into the generator to produce more detailed hand movements, and further perform face refinement to enhance the overall realism and expressiveness of the portrait video. Additionally, our approach supports efficient and continuous generation of upper-body portrait video in maximum 512 * 768 resolution at up to 30fps on 4090 GPU, supporting interactive video-chat in real-time. Experimental results demonstrate the capability of our approach to produce portrait videos with rich expressiveness and natural upper-body movements.
Chinese: 本文提出了一种新颖的实时风格化人像视频生成框架,通过采用包含分层运动扩散模型和显式手势控制信号的两阶段方法,解决了头部与身体运动同步及表情控制的难题,实现了30fps下富有表现力的上半身视频聊天。
English: This paper introduces a novel framework for stylized real-time portrait video generation that overcomes limitations in synchronizing head and body movements and controlling facial expressions by employing a two-stage process with hierarchical motion diffusion models and explicit hand control signals, enabling expressive upper-body video chat at 30fps.

Authors:Xiaoran Xu, Zhaoqian Xue, Chi Zhang, Jhonatan Medri, Junjie Xiong, Jiayan Zhou, Jin Jin, Yongfeng Zhang, Siyuan Ma, Lingyao Li
Title: Patients Speak, AI Listens: LLM-based Analysis of Online Reviews Uncovers Key Drivers for Urgent Care Satisfaction
Abstract:
Investigating the public experience of urgent care facilities is essential for promoting community healthcare development. Traditional survey methods often fall short due to limited scope, time, and spatial coverage. Crowdsourcing through online reviews or social media offers a valuable approach to gaining such insights. With recent advancements in large language models (LLMs), extracting nuanced perceptions from reviews has become feasible. This study collects Google Maps reviews across the DMV and Florida areas and conducts prompt engineering with the GPT model to analyze the aspect-based sentiment of urgent care. We first analyze the geospatial patterns of various aspects, including interpersonal factors, operational efficiency, technical quality, finances, and facilities. Next, we determine Census Block Group(CBG)-level characteristics underpinning differences in public perception, including population density, median income, GINI Index, rent-to-income ratio, household below poverty rate, no insurance rate, and unemployment rate. Our results show that interpersonal factors and operational efficiency emerge as the strongest determinants of patient satisfaction in urgent care, while technical quality, finances, and facilities show no significant independent effects when adjusted for in multivariate models. Among socioeconomic and demographic factors, only population density demonstrates a significant but modest association with patient ratings, while the remaining factors exhibit no significant correlations. Overall, this study highlights the potential of crowdsourcing to uncover the key factors that matter to residents and provide valuable insights for stakeholders to improve public satisfaction with urgent care.
中文: 本研究通过众包的谷歌地图评论和GPT模型分析发现,人际因素和运营效率是紧急护理患者满意度的主要决定因素,人口密度是唯一呈现适度相关的社会经济因素,证明了众包方法在医疗改进中的价值。
English: This study utilizes crowdsourced Google Maps reviews and GPT model analysis to reveal that interpersonal factors and operational efficiency are the primary drivers of patient satisfaction in urgent care, with population density being the only socioeconomic factor showing modest correlation, demonstrating crowdsourcing's value in healthcare improvement.

Authors:Silin Gao, Sheryl Mathew, Li Mi, Sepideh Mamooler, Mengjie Zhao, Hiromi Wakaki, Yuki Mitsufuji, Syrielle Montariol, Antoine Bosselut
Title: VinaBench: Benchmark for Faithful and Consistent Visual Narratives
Abstract:
Visual narrative generation transforms textual narratives into sequences of images illustrating the content of the text. However, generating visual narratives that are faithful to the input text and self-consistent across generated images remains an open challenge, due to the lack of knowledge constraints used for planning the stories. In this work, we propose a new benchmark, VinaBench, to address this challenge. Our benchmark annotates the underlying commonsense and discourse constraints in visual narrative samples, offering systematic scaffolds for learning the implicit strategies of visual storytelling. Based on the incorporated narrative constraints, we further propose novel metrics to closely evaluate the consistency of generated narrative images and the alignment of generations with the input textual narrative. Our results across three generative vision models demonstrate that learning with VinaBench's knowledge constraints effectively improves the faithfulness and cohesion of generated visual narratives.
中文: 视觉叙事生成面临文本忠实性和图像间一致性的挑战,VinaBench通过引入常识与语篇约束作为系统性支架,有效提升了生成叙事图像的真实性与连贯性。
English: Visual narrative generation faces challenges in maintaining text faithfulness and self-consistency, which VinaBench addresses by incorporating commonsense and discourse constraints to improve the quality of generated image sequences.

Authors:Lee Chae-Yeon, Oh Hyun-Bin, Han EunGi, Kim Sung-Bin, Suekyeong Nam, Tae-Hyun Oh
Title: Perceptually Accurate 3D Talking Head Generation: New Definitions, Speech-Mesh Representation, and Evaluation Metrics
Abstract:
Recent advancements in speech-driven 3D talking head generation have made significant progress in lip synchronization. However, existing models still struggle to capture the perceptual alignment between varying speech characteristics and corresponding lip movements. In this work, we claim that three criteria -- Temporal Synchronization, Lip Readability, and Expressiveness -- are crucial for achieving perceptually accurate lip movements. Motivated by our hypothesis that a desirable representation space exists to meet these three criteria, we introduce a speech-mesh synchronized representation that captures intricate correspondences between speech signals and 3D face meshes. We found that our learned representation exhibits desirable characteristics, and we plug it into existing models as a perceptual loss to better align lip movements to the given speech. In addition, we utilize this representation as a perceptual metric and introduce two other physically grounded lip synchronization metrics to assess how well the generated 3D talking heads align with these three criteria. Experiments show that training 3D talking head generation models with our perceptual loss significantly improve all three aspects of perceptually accurate lip synchronization. Codes and datasets are available at https://perceptual-3d-talking-head.github.io/.
中文总结:本研究提出了一种语音-网格同步表示法,通过感知损失函数提升三维说话头生成在时间同步性、唇部可读性和表现力三个关键维度的性能。
English Summary: This research introduces a speech-mesh synchronized representation to enhance 3D talking head generation by improving temporal synchronization, lip readability, and expressiveness through a perceptual loss function.

Authors:Laura Balzano, Tianjiao Ding, Benjamin D. Haeffele, Soo Min Kwon, Qing Qu, Peng Wang, Zhangyang Wang, Can Yaras
Title: An Overview of Low-Rank Structures in the Training and Adaptation of Large Models
Abstract:
The rise of deep learning has revolutionized data processing and prediction in signal processing and machine learning, yet the substantial computational demands of training and deploying modern large-scale deep models present significant challenges, including high computational costs and energy consumption. Recent research has uncovered a widespread phenomenon in deep networks: the emergence of low-rank structures in weight matrices and learned representations during training. These implicit low-dimensional patterns provide valuable insights for improving the efficiency of training and fine-tuning large-scale models. Practical techniques inspired by this phenomenon, such as low-rank adaptation (LoRA) and training, enable significant reductions in computational cost while preserving model performance. In this paper, we present a comprehensive review of recent advances in exploiting low-rank structures for deep learning and shed light on their mathematical foundations. Mathematically, we present two complementary perspectives on understanding the low-rankness in deep networks: (i) the emergence of low-rank structures throughout the whole optimization dynamics of gradient and (ii) the implicit regularization effects that induce such low-rank structures at convergence. From a practical standpoint, studying the low-rank learning dynamics of gradient descent offers a mathematical foundation for understanding the effectiveness of LoRA in fine-tuning large-scale models and inspires parameter-efficient low-rank training strategies. Furthermore, the implicit low-rank regularization effect helps explain the success of various masked training approaches in deep neural networks, ranging from dropout to masked self-supervised learning.
中文摘要:深度网络中低秩结构的出现为LoRA等高效训练与微调技术提供了基础,在降低计算成本的同时保持模型性能,其数学原理基于优化动力学和隐式正则化效应。
English Summary: The emergence of low-rank structures in deep networks enables efficient training and fine-tuning techniques like LoRA, reducing computational costs while maintaining performance, with mathematical foundations in optimization dynamics and implicit regularization.

Authors:Eshed Gal, Moshe Eliasof, Carola-Bibiane Schönlieb, Ivan I. Kyrchei, Eldad Haber, Eran Treister
Title: Towards Efficient Training of Graph Neural Networks: A Multiscale Approach
Abstract:
Graph Neural Networks (GNNs) have become powerful tools for learning from graph-structured data, finding applications across diverse domains. However, as graph sizes and connectivity increase, standard GNN training methods face significant computational and memory challenges, limiting their scalability and efficiency. In this paper, we present a novel framework for efficient multiscale training of GNNs. Our approach leverages hierarchical graph representations and subgraphs, enabling the integration of information across multiple scales and resolutions. By utilizing coarser graph abstractions and subgraphs, each with fewer nodes and edges, we significantly reduce computational overhead during training. Building on this framework, we propose a suite of scalable training strategies, including coarse-to-fine learning, subgraph-to-full-graph transfer, and multiscale gradient computation. We also provide some theoretical analysis of our methods and demonstrate their effectiveness across various datasets and learning tasks. Our results show that multiscale training can substantially accelerate GNN training for large scale problems while maintaining, or even improving, predictive performance.
中文: 本文提出了一种新颖的图神经网络多尺度训练框架,通过利用层次化图表示和子图结构,在保持甚至提升预测性能的同时显著降低了大规模图数据训练的计算开销。
English: This paper introduces a novel multiscale training framework for Graph Neural Networks that uses hierarchical graph representations to reduce computational costs while maintaining or enhancing predictive performance across various tasks.

Authors:Zhiping Xiao, Xinyu Wang, Yifang Qin, Zijie Huang, Mason A. Porter, Yizhou Sun
Title: A Social Dynamical System for Twitter Analysis
Abstract:
Understanding the evolution of public opinion is crucial for informed decision-making in various domains, particularly public affairs. The rapid growth of social networks, such as Twitter (now rebranded as X), provides an unprecedented opportunity to analyze public opinion at scale without relying on traditional surveys. With the rise of deep learning, Graph Neural Networks (GNNs) have shown great promise in modeling online opinion dynamics. Notably, classical opinion dynamics models, such as DeGroot, can be reformulated within a GNN framework. We introduce Latent Social Dynamical System (LSDS), a novel framework for modeling the latent dynamics of social media users' opinions based on textual content. Since expressed opinions may not fully reflect underlying beliefs, LSDS first encodes post content into latent representations. It then leverages a GraphODE framework, using a GNN-based ODE function to predict future opinions. A decoder subsequently utilizes these predicted latent opinions to perform downstream tasks, such as interaction prediction, which serve as benchmarks for model evaluation. Our framework is highly flexible, supporting various opinion dynamic models as ODE functions, provided they can be adapted into a GNN-based form. It also accommodates different encoder architectures and is compatible with diverse downstream tasks. To validate our approach, we constructed dynamic datasets from Twitter data. Experimental results demonstrate the effectiveness of LSDS, highlighting its potential for future applications. We plan to publicly release our dataset and code upon the publication of this paper.
中文摘要:LSDS框架通过编码用户帖子内容,利用GraphODE预测潜在观点动态,并应用于互动预测等下游任务,在Twitter数据上验证了其有效性。
English Summary: The LSDS framework models latent opinion dynamics on social media by encoding user posts, predicting future opinions through GraphODE, and using these for tasks like interaction prediction, demonstrating effectiveness on Twitter data.

Authors:Chenyangguang Zhang, Alexandros Delitzas, Fangjinhua Wang, Ruida Zhang, Xiangyang Ji, Marc Pollefeys, Francis Engelmann
Title: Open-Vocabulary Functional 3D Scene Graphs for Real-World Indoor Spaces
Abstract:
We introduce the task of predicting functional 3D scene graphs for real-world indoor environments from posed RGB-D images. Unlike traditional 3D scene graphs that focus on spatial relationships of objects, functional 3D scene graphs capture objects, interactive elements, and their functional relationships. Due to the lack of training data, we leverage foundation models, including visual language models (VLMs) and large language models (LLMs), to encode functional knowledge. We evaluate our approach on an extended SceneFun3D dataset and a newly collected dataset, FunGraph3D, both annotated with functional 3D scene graphs. Our method significantly outperforms adapted baselines, including Open3DSG and ConceptGraph, demonstrating its effectiveness in modeling complex scene functionalities. We also demonstrate downstream applications such as 3D question answering and robotic manipulation using functional 3D scene graphs. See our project page at https://openfungraph.github.io
Chinese: 本研究提出一种从RGB-D图像预测功能性3D场景图的方法,利用基础模型编码功能知识,在3D问答和机器人操作等任务中显著优于现有基线方法。
English: This study introduces a method for predicting functional 3D scene graphs from RGB-D images, utilizing foundation models to encode functional knowledge and demonstrating superior performance over existing baselines in tasks like 3D question answering and robotic manipulation.

Authors:Chayan Banerjee, Kien Nguyen, Clinton Fookes
Title: Mining-Gym: A Configurable RL Benchmarking Environment for Truck Dispatch Scheduling
Abstract:
Mining process optimization particularly truck dispatch scheduling is a critical factor in enhancing the efficiency of open pit mining operations However the dynamic and stochastic nature of mining environments characterized by uncertainties such as equipment failures truck maintenance and variable haul cycle times poses significant challenges for traditional optimization methods While Reinforcement Learning RL has shown promise in adaptive decision making for mining logistics its practical deployment requires rigorous evaluation in realistic and customizable simulation environments The lack of standardized benchmarking environments limits fair algorithm comparisons reproducibility and the real world applicability of RL based approaches in open pit mining settings To address this challenge we introduce Mining Gym a configurable open source benchmarking environment designed for training testing and comparing RL algorithms in mining process optimization Built on Discrete Event Simulation DES and seamlessly integrated with the OpenAI Gym interface Mining Gym provides a structured testbed that enables the direct application of advanced RL algorithms from Stable Baselines The framework models key mining specific uncertainties such as equipment failures queue congestion and the stochasticity of mining processes ensuring a realistic and adaptive learning environment Additionally Mining Gym features a graphical user interface GUI for intuitive mine site configuration a comprehensive data logging system a built in KPI dashboard and real time visual representation of the mine site These capabilities facilitate standardized reproducible evaluations across multiple RL strategies and baseline heuristics
中文摘要:针对露天矿卡车调度优化中传统方法难以应对动态不确定性的问题,本文提出可配置开源基准环境Mining Gym,通过离散事件仿真和实时可视化工具,为强化学习算法提供包含设备故障、队列拥堵等真实矿场不确定性的标准化测试平台。
English Summary: Mining Gym is introduced as a configurable open-source benchmarking environment that enables standardized training and evaluation of reinforcement learning algorithms for optimizing truck dispatch in open-pit mining, addressing challenges posed by dynamic mining conditions through realistic simulation and integrated analytical tools.

Authors:Tommie Kerssies, Niccolò Cavagnero, Alexander Hermans, Narges Norouzi, Giuseppe Averta, Bastian Leibe, Gijs Dubbelman, Daan de Geus
Title: Your ViT is Secretly an Image Segmentation Model
Abstract:
Vision Transformers (ViTs) have shown remarkable performance and scalability across various computer vision tasks. To apply single-scale ViTs to image segmentation, existing methods adopt a convolutional adapter to generate multi-scale features, a pixel decoder to fuse these features, and a Transformer decoder that uses the fused features to make predictions. In this paper, we show that the inductive biases introduced by these task-specific components can instead be learned by the ViT itself, given sufficiently large models and extensive pre-training. Based on these findings, we introduce the Encoder-only Mask Transformer (EoMT), which repurposes the plain ViT architecture to conduct image segmentation. With large-scale models and pre-training, EoMT obtains a segmentation accuracy similar to state-of-the-art models that use task-specific components. At the same time, EoMT is significantly faster than these methods due to its architectural simplicity, e.g., up to 4x faster with ViT-L. Across a range of model sizes, EoMT demonstrates an optimal balance between segmentation accuracy and prediction speed, suggesting that compute resources are better spent on scaling the ViT itself rather than adding architectural complexity. Code: https://www.tue-mps.org/eomt/.
中文: 仅编码器掩码变换器(EoMT)利用大规模视觉变换器实现图像分割,其精度与最先进方法相当,同时因架构简化而显著提升了速度。
English: The Encoder-only Mask Transformer (EoMT) leverages large-scale Vision Transformers to achieve image segmentation with accuracy comparable to state-of-the-art methods while being significantly faster due to its simplified architecture.

Authors:Yifei Feng, Mingxin Yang, Shuhui Yang, Sheng Zhang, Jiaao Yu, Zibo Zhao, Yuhong Liu, Jie Jiang, Chunchao Guo
Title: RomanTex: Decoupling 3D-aware Rotary Positional Embedded Multi-Attention Network for Texture Synthesis
Abstract:
Painting textures for existing geometries is a critical yet labor-intensive process in 3D asset generation. Recent advancements in text-to-image (T2I) models have led to significant progress in texture generation. Most existing research approaches this task by first generating images in 2D spaces using image diffusion models, followed by a texture baking process to achieve UV texture. However, these methods often struggle to produce high-quality textures due to inconsistencies among the generated multi-view images, resulting in seams and ghosting artifacts. In contrast, 3D-based texture synthesis methods aim to address these inconsistencies, but they often neglect 2D diffusion model priors, making them challenging to apply to real-world objects To overcome these limitations, we propose RomanTex, a multiview-based texture generation framework that integrates a multi-attention network with an underlying 3D representation, facilitated by our novel 3D-aware Rotary Positional Embedding. Additionally, we incorporate a decoupling characteristic in the multi-attention block to enhance the model's robustness in image-to-texture task, enabling semantically-correct back-view synthesis. Furthermore, we introduce a geometry-related Classifier-Free Guidance (CFG) mechanism to further improve the alignment with both geometries and images. Quantitative and qualitative evaluations, along with comprehensive user studies, demonstrate that our method achieves state-of-the-art results in texture quality and consistency.
中文: RomanTex是一种创新的多视图纹理生成框架,通过结合三维表示与多重注意力网络,并引入三维感知位置编码和几何对齐引导机制,有效解决了现有方法中的不一致性问题,实现了纹理质量和一致性的最先进水平。
English: RomanTex is a novel multiview texture generation framework that integrates 3D representations with multi-attention networks to overcome inconsistencies in existing methods, achieving state-of-the-art texture quality and consistency through innovative mechanisms like 3D-aware positional embedding and geometry-aligned guidance.

Authors:Jian Ma, Xinchen Lyu, Jun Jiang, Qimei Cui, Haipeng Yao, Xiaofeng Tao
Title: SplitFrozen: Split Learning with Device-side Model Frozen for Fine-Tuning LLM on Heterogeneous Resource-Constrained Devices
Abstract:
Fine-tuning large language models (LLMs) on private, on-device data can empower tailored personalized AI agents. However, fine-tuning LLMs on resource-constrained edge devices faces significant challenges, including excessive computation overhead, device heterogeneity, and data imbalance. This paper proposes SplitFrozen, a split learning framework that enables efficient LLM fine-tuning by strategically freezing device-side model layers while centralizing parameter-efficient fine-tuning on the server. Our framework partitions LLMs into device-side frozen layers and server-side fine-tuning layers, where heterogeneous resource-constrained devices execute only forward propagation. To minimize server-side training costs, we integrate Low-Rank Adaptation (LoRA) into the server-side layers. A pipeline parallelism strategy further optimizes training efficiency by decoupling device-server computations and leveraging decomposed backward propagation. Experiments on GPT-2 with the MRPC, MNLI-matched, and SST-2 datasets demonstrate that SplitFrozen outperforms FedLoRA and SplitLoRA by 69.4\% model accuracy under extremely imbalanced data, while reducing up to 86.8\% device-side computations and 50.2\% total training time. Experiments also validate the scalability of SplitFrozen on content generation task using Llama-3.2 model on GSM8K dataset.
Chinese: 本文提出SplitFrozen框架,通过冻结设备端层并在服务器端集中进行参数高效微调,实现在资源受限边缘设备上高效微调大语言模型,在显著提高精度的同时大幅减少计算开销和训练时间。
English: This paper introduces SplitFrozen, a split learning framework that enables efficient fine-tuning of large language models on resource-constrained edge devices by freezing device-side layers and centralizing parameter-efficient fine-tuning on the server, significantly improving accuracy while reducing computational overhead and training time.

Authors:Karim Abou Zeid, Kadir Yilmaz, Daan de Geus, Alexander Hermans, David Adrian, Timm Linder, Bastian Leibe
Title: DINO in the Room: Leveraging 2D Foundation Models for 3D Segmentation
Abstract:
Vision foundation models (VFMs) trained on large-scale image datasets provide high-quality features that have significantly advanced 2D visual recognition. However, their potential in 3D vision remains largely untapped, despite the common availability of 2D images alongside 3D point cloud datasets. While significant research has been dedicated to 2D-3D fusion, recent state-of-the-art 3D methods predominantly focus on 3D data, leaving the integration of VFMs into 3D models underexplored. In this work, we challenge this trend by introducing DITR, a simple yet effective approach that extracts 2D foundation model features, projects them to 3D, and finally injects them into a 3D point cloud segmentation model. DITR achieves state-of-the-art results on both indoor and outdoor 3D semantic segmentation benchmarks. To enable the use of VFMs even when images are unavailable during inference, we further propose to distill 2D foundation models into a 3D backbone as a pretraining task. By initializing the 3D backbone with knowledge distilled from 2D VFMs, we create a strong basis for downstream 3D segmentation tasks, ultimately boosting performance across various datasets.
中文: 视觉基础模型(VFMs)虽在二维视觉识别中表现出色,但其在三维视觉中的应用尚未充分开发;本文提出的DITR方法通过提取并融合二维特征到三维点云分割中,实现了领先性能,并利用蒸馏技术在没有图像时也能提升模型表现。
English: Vision foundation models (VFMs) offer powerful 2D features that remain underutilized in 3D vision, leading to the development of DITR, which integrates these features into 3D point cloud segmentation and achieves state-of-the-art results, with a distillation method further enhancing performance when images are unavailable.

Authors:Fangfu Liu, Hanyang Wang, Yimo Cai, Kaiyan Zhang, Xiaohang Zhan, Yueqi Duan
Title: Video-T1: Test-Time Scaling for Video Generation
Abstract:
With the scale capability of increasing training data, model size, and computational cost, video generation has achieved impressive results in digital creation, enabling users to express creativity across various domains. Recently, researchers in Large Language Models (LLMs) have expanded the scaling to test-time, which can significantly improve LLM performance by using more inference-time computation. Instead of scaling up video foundation models through expensive training costs, we explore the power of Test-Time Scaling (TTS) in video generation, aiming to answer the question: if a video generation model is allowed to use non-trivial amount of inference-time compute, how much can it improve generation quality given a challenging text prompt. In this work, we reinterpret the test-time scaling of video generation as a searching problem to sample better trajectories from Gaussian noise space to the target video distribution. Specifically, we build the search space with test-time verifiers to provide feedback and heuristic algorithms to guide searching process. Given a text prompt, we first explore an intuitive linear search strategy by increasing noise candidates at inference time. As full-step denoising all frames simultaneously requires heavy test-time computation costs, we further design a more efficient TTS method for video generation called Tree-of-Frames (ToF) that adaptively expands and prunes video branches in an autoregressive manner. Extensive experiments on text-conditioned video generation benchmarks demonstrate that increasing test-time compute consistently leads to significant improvements in the quality of videos. Project page: https://liuff19.github.io/Video-T1
中文: 本研究探索了视频生成中的测试时扩展方法,通过线性搜索和树状帧结构等策略优化推理计算,实验证明增加测试时计算能显著提升视频生成质量。
English: This study explores Test-Time Scaling (TTS) for video generation, introducing methods like linear search and Tree-of-Frames (ToF) to enhance video quality by optimizing inference-time computation, with experiments confirming significant improvements from increased test-time compute.

Authors:Danish Nisar, Saif Khan Mohammed, Ronny Hadani, Ananthanarayanan Chockalingam, Robert Calderbank
Title: Zak-OTFS for Identification of Linear Time-Varying Systems
Abstract:
Linear time-varying (LTV) systems model radar scenes where each reflector/target applies a delay, Doppler shift and complex amplitude scaling to a transmitted waveform. The receiver processes the received signal using the transmitted signal as a reference. The self-ambiguity function of the transmitted signal captures the cross-correlation of delay and Doppler shifts of the transmitted waveform. It acts as a blur that limits resolution, at the receiver, of the delay and Doppler shifts of targets in close proximity. This paper considers resolution of multiple targets and compares performance of traditional chirp waveforms with the Zak-OTFS waveform. The self-ambiguity function of a chirp is a line in the delay-Doppler domain, whereas the self-ambiguity function of the Zak-OTFS waveform is a lattice. The advantage of lattices over lines is better localization, and we show lattices provide superior noise-free estimation of the range and velocity of multiple targets. When the delay spread of the radar scene is less than the delay period of the Zak-OTFS modulation, and the Doppler spread is less than the Doppler period, we describe how to localize targets by calculating cross-ambiguities in the delay-Doppler domain. We show that the signal processing complexity of our approach is superior to the traditional approach of computing cross-ambiguities in the continuous time / frequency domain.
中文: 本文证明Zak-OTFS波形的晶格状自模糊函数在时延-多普勒域中比传统线性调频波形具有更优的多目标定位分辨率,同时显著降低了信号处理的计算复杂度。
English: This paper demonstrates that the Zak-OTFS waveform's lattice-shaped self-ambiguity function provides superior resolution for localizing multiple targets in delay-Doppler domains compared to traditional chirp waveforms, while also reducing computational complexity.

Authors:Samuel Rota Bulò, Nemanja Bartolovic, Lorenzo Porzi, Peter Kontschieder
Title: Hardware-Rasterized Ray-Based Gaussian Splatting
Abstract:
We present a novel, hardware rasterized rendering approach for ray-based 3D Gaussian Splatting (RayGS), obtaining both fast and high-quality results for novel view synthesis. Our work contains a mathematically rigorous and geometrically intuitive derivation about how to efficiently estimate all relevant quantities for rendering RayGS models, structured with respect to standard hardware rasterization shaders. Our solution is the first enabling rendering RayGS models at sufficiently high frame rates to support quality-sensitive applications like Virtual and Mixed Reality. Our second contribution enables alias-free rendering for RayGS, by addressing MIP-related issues arising when rendering diverging scales during training and testing. We demonstrate significant performance gains, across different benchmark scenes, while retaining state-of-the-art appearance quality of RayGS.
中文: 本文提出了一种基于硬件光栅化的射线3D高斯泼溅渲染方法,实现了快速高质量的新视角合成和无走样渲染,能够在保持顶尖视觉效果的同时满足虚拟/混合现实的实时性能需求。
English: This paper introduces a hardware rasterized rendering method for ray-based 3D Gaussian Splatting that achieves fast, high-quality novel view synthesis and alias-free rendering, enabling real-time performance for VR/MR applications while maintaining state-of-the-art visual quality.

Authors:Takehiro Imamura, Yuka Hashizume, Wen-Chin Huang, Tomoki Toda
Title: Music Similarity Representation Learning Focusing on Individual Instruments with Source Separation and Human Preference
Abstract:
This paper proposes music similarity representation learning (MSRL) based on individual instrument sounds (InMSRL) utilizing music source separation (MSS) and human preference without requiring clean instrument sounds during inference. We propose three methods that effectively improve performance. First, we introduce end-to-end fine-tuning (E2E-FT) for the Cascade approach that sequentially performs MSS and music similarity feature extraction. E2E-FT allows the model to minimize the adverse effects of a separation error on the feature extraction. Second, we propose multi-task learning for the Direct approach that directly extracts disentangled music similarity features using a single music similarity feature extractor. Multi-task learning, which is based on the disentangled music similarity feature extraction and MSS based on reconstruction with disentangled music similarity features, further enhances instrument feature disentanglement. Third, we employ perception-aware fine-tuning (PAFT). PAFT utilizes human preference, allowing the model to perform InMSRL aligned with human perceptual similarity. We conduct experimental evaluations and demonstrate that 1) E2E-FT for Cascade significantly improves InMSRL performance, 2) the multi-task learning for Direct is also helpful to improve disentanglement performance in the feature extraction, 3) PAFT significantly enhances the perceptual InMSRL performance, and 4) Cascade with E2E-FT and PAFT outperforms Direct with the multi-task learning and PAFT.
本文提出了一种基于乐器声音的音乐相似性表示学习方法(InMSRL),利用音乐源分离和人类偏好,并通过端到端微调、多任务学习和感知感知微调三项技术,显著提升了特征提取和感知对齐的性能。
This paper introduces InMSRL, a method for learning music similarity from individual instrument sounds using source separation and human preference, and proposes three techniques—end-to-end fine-tuning, multi-task learning, and perception-aware fine-tuning—that significantly enhance performance in feature extraction and perceptual alignment.

Authors:Dawit Ketema Gete, Bedru Yimam Ahmed, Tadesse Destaw Belay, Yohannes Ayana Ejigu, Sukairaj Hafiz Imam, Alemu Belay Tessema, Mohammed Oumer Adem, Tadesse Amare Belay, Robert Geislinger, Umma Aliyu Musa, Martin Semmann, Shamsuddeen Hassan Muhammad, Henning Schreiber, Seid Muhie Yimam
Title: Whispering in Amharic: Fine-tuning Whisper for Low-resource Language
Abstract:
This work explores fine-tuning OpenAI's Whisper automatic speech recognition (ASR) model for Amharic, a low-resource language, to improve transcription accuracy. While the foundational Whisper model struggles with Amharic due to limited representation in its training data, we fine-tune it using datasets like Mozilla Common Voice, FLEURS, and the BDU-speech dataset. The best-performing model, Whispersmall-am, significantly improves when finetuned on a mix of existing FLEURS data and new, unseen Amharic datasets. Training solely on new data leads to poor performance, but combining it with FLEURS data reinforces the model, enabling better specialization in Amharic. We also demonstrate that normalizing Amharic homophones significantly enhances Word Error Rate (WER) and Bilingual Evaluation Understudy (BLEU) scores. This study underscores the importance of fine-tuning strategies and dataset composition for improving ASR in low-resource languages, providing insights for future Amharic speech recognition research.
中文摘要:本研究通过微调OpenAI的Whisper模型进行阿姆哈拉语语音识别,发现结合现有FLEURS数据与新数据集能显著提升转录准确率,同时同音词归一化处理有效改善了词错误率和BLEU评分。
English Summary: This study fine-tunes OpenAI's Whisper model for Amharic speech recognition, demonstrating that combining existing FLEURS data with new datasets significantly improves transcription accuracy and that homophone normalization enhances performance metrics.

Authors:Tadesse Destaw Belay, Dawit Ketema Gete, Abinew Ali Ayele, Olga Kolesnikova, Grigori Sidorov, Seid Muhie Yimam
Title: Enhancing Multi-Label Emotion Analysis and Corresponding Intensities for Ethiopian Languages
Abstract:
In this digital world, people freely express their emotions using different social media platforms. As a result, modeling and integrating emotion-understanding models are vital for various human-computer interaction tasks such as decision-making, product and customer feedback analysis, political promotions, marketing research, and social media monitoring. As users express different emotions simultaneously in a single instance, annotating emotions in a multilabel setting such as the EthioEmo (Belay et al., 2025) dataset effectively captures this dynamic. Additionally, incorporating intensity, or the degree of emotion, is crucial, as emotions can significantly differ in their expressive strength and impact. This intensity is significant for assessing whether further action is necessary in decision-making processes, especially concerning negative emotions in applications such as healthcare and mental health studies. To enhance the EthioEmo dataset, we include annotations for the intensity of each labeled emotion. Furthermore, we evaluate various state-of-the-art encoder-only Pretrained Language Models (PLMs) and decoder-only Large Language Models (LLMs) to provide comprehensive benchmarking.
中文: 在数字时代,情感建模对人机交互至关重要,增强版EthioEmo数据集新增了情感强度标注,并通过评估先进语言模型提供了全面的基准测试。
English: In the digital era, modeling emotions is essential for human-computer interaction, and the enhanced EthioEmo dataset now includes emotion intensity annotations alongside evaluations of advanced language models for comprehensive benchmarking.

Authors:Jiachen Jiang, Tianyu Ding, Ke Zhang, Jinxin Zhou, Tianyi Chen, Ilya Zharkov, Zhihui Zhu, Luming Liang
Title: Cat-AIR: Content and Task-Aware All-in-One Image Restoration
Abstract:
All-in-one image restoration seeks to recover high-quality images from various types of degradation using a single model, without prior knowledge of the corruption source. However, existing methods often struggle to effectively and efficiently handle multiple degradation types. We present Cat-AIR, a novel \textbf{C}ontent \textbf{A}nd \textbf{T}ask-aware framework for \textbf{A}ll-in-one \textbf{I}mage \textbf{R}estoration. Cat-AIR incorporates an alternating spatial-channel attention mechanism that adaptively balances the local and global information for different tasks. Specifically, we introduce cross-layer channel attentions and cross-feature spatial attentions that allocate computations based on content and task complexity. Furthermore, we propose a smooth learning strategy that allows for seamless adaptation to new restoration tasks while maintaining performance on existing ones. Extensive experiments demonstrate that Cat-AIR achieves state-of-the-art results across a wide range of restoration tasks, requiring fewer FLOPs than previous methods, establishing new benchmarks for efficient all-in-one image restoration.
Chinese: Cat-AIR提出了一种内容与任务感知的框架,通过交替空间-通道注意机制和平滑学习策略,在多种图像修复任务中以更低计算成本实现了高效的先进性能。
English: Cat-AIR introduces a content and task-aware framework with alternating spatial-channel attention and a smooth learning strategy, achieving state-of-the-art, efficient all-in-one image restoration across various tasks with reduced computational costs.

Authors:Mingyue Yuan, Jieshan Chen, Zhenchang Xing, Gelareh Mohammadi, Aaron Quigley
Title: A Case Study of Scalable Content Annotation Using Multi-LLM Consensus and Human Review
Abstract:
Content annotation at scale remains challenging, requiring substantial human expertise and effort. This paper presents a case study in code documentation analysis, where we explore the balance between automation efficiency and annotation accuracy. We present MCHR (Multi-LLM Consensus with Human Review), a novel semi-automated framework that enhances annotation scalability through the systematic integration of multiple LLMs and targeted human review. Our framework introduces a structured consensus-building mechanism among LLMs and an adaptive review protocol that strategically engages human expertise. Through our case study, we demonstrate that MCHR reduces annotation time by 32% to 100% compared to manual annotation while maintaining high accuracy (85.5% to 98%) across different difficulty levels, from basic binary classification to challenging open-set scenarios.
中文摘要:本文提出MCHR半自动化框架,通过整合多LLM共识机制与定向人工审核,在代码文档标注任务中将标注时间减少32%-100%,同时在不同难度场景下保持85.5%-98%的准确率。
English Summary: This paper introduces MCHR, a semi-automated framework combining multiple LLMs with strategic human review to significantly reduce annotation time by 32-100% while maintaining 85.5-98% accuracy across various code documentation tasks.

Authors:Minsu Kim, Jiayao Gu, Ye Yuan, Taeyoung Yun, Zixuan Liu, Yoshua Bengio, Can Chen
Title: Offline Model-Based Optimization: Comprehensive Review
Abstract:
Offline optimization is a fundamental challenge in science and engineering, where the goal is to optimize black-box functions using only offline datasets. This setting is particularly relevant when querying the objective function is prohibitively expensive or infeasible, with applications spanning protein engineering, material discovery, neural architecture search, and beyond. The main difficulty lies in accurately estimating the objective landscape beyond the available data, where extrapolations are fraught with significant epistemic uncertainty. This uncertainty can lead to objective hacking(reward hacking), exploiting model inaccuracies in unseen regions, or other spurious optimizations that yield misleadingly high performance estimates outside the training distribution. Recent advances in model-based optimization(MBO) have harnessed the generalization capabilities of deep neural networks to develop offline-specific surrogate and generative models. Trained with carefully designed strategies, these models are more robust against out-of-distribution issues, facilitating the discovery of improved designs. Despite its growing impact in accelerating scientific discovery, the field lacks a comprehensive review. To bridge this gap, we present the first thorough review of offline MBO. We begin by formalizing the problem for both single-objective and multi-objective settings and by reviewing recent benchmarks and evaluation metrics. We then categorize existing approaches into two key areas: surrogate modeling, which emphasizes accurate function approximation in out-of-distribution regions, and generative modeling, which explores high-dimensional design spaces to identify high-performing designs. Finally, we examine the key challenges and propose promising directions for advancement in this rapidly evolving field including safe control of superintelligent systems.
中文: 离线优化利用离线数据集优化黑盒函数,避免高成本查询,近期基于模型的方法在应对分布外问题方面更稳健,推动了蛋白质工程等领域的发现。
English: Offline optimization tackles black-box function optimization using offline datasets to avoid costly queries, with recent advances in model-based methods improving robustness against out-of-distribution issues and accelerating discoveries in fields like protein engineering.

Authors:Fanghua Yu, Jinjin Gu, Jinfan Hu, Zheyuan Li, Chao Dong
Title: UniCon: Unidirectional Information Flow for Effective Control of Large-Scale Diffusion Models
Abstract:
We introduce UniCon, a novel architecture designed to enhance control and efficiency in training adapters for large-scale diffusion models. Unlike existing methods that rely on bidirectional interaction between the diffusion model and control adapter, UniCon implements a unidirectional flow from the diffusion network to the adapter, allowing the adapter alone to generate the final output. UniCon reduces computational demands by eliminating the need for the diffusion model to compute and store gradients during adapter training. Our results indicate that UniCon reduces GPU memory usage by one-third and increases training speed by 2.3 times, while maintaining the same adapter parameter size. Additionally, without requiring extra computational resources, UniCon enables the training of adapters with double the parameter volume of existing ControlNets. In a series of image conditional generation tasks, UniCon has demonstrated precise responsiveness to control inputs and exceptional generation capabilities.
中文: UniCon采用单向架构提升大规模扩散模型适配器训练效率,通过降低计算需求实现GPU内存减少三分之一、训练速度提升2.3倍,在保持性能的同时支持更大参数容量的适配器训练。
English: UniCon introduces a unidirectional architecture that enhances adapter training efficiency for large-scale diffusion models by reducing computational demands, achieving a one-third GPU memory reduction and 2.3 times faster training while maintaining performance and enabling larger adapter capacities.

Authors:Patrick Rim, Hyoungseob Park, Vadim Ezhov, Jeffrey Moon, Alex Wong
Title: Radar-Guided Polynomial Fitting for Metric Depth Estimation
Abstract:
We propose POLAR, a novel radar-guided depth estimation method that introduces polynomial fitting to efficiently transform scaleless depth predictions from pretrained monocular depth estimation (MDE) models into metric depth maps. Unlike existing approaches that rely on complex architectures or expensive sensors, our method is grounded in a fundamental insight: although MDE models often infer reasonable local depth structure within each object or local region, they may misalign these regions relative to one another, making a linear scale and shift (affine) transformation insufficient given three or more of these regions. To address this limitation, we use polynomial coefficients predicted from cheap, ubiquitous radar data to adaptively adjust depth predictions non-uniformly across depth ranges. In this way, POLAR generalizes beyond affine transformations and is able to correct such misalignments by introducing inflection points. Importantly, our polynomial fitting framework preserves structural consistency through a novel training objective that enforces local monotonicity via first-derivative regularization. POLAR achieves state-of-the-art performance across three datasets, outperforming existing methods by an average of 24.9% in MAE and 33.2% in RMSE, while also achieving state-of-the-art efficiency in terms of latency and computational cost.
Chinese: POLAR提出了一种雷达引导的多项式拟合方法,将无尺度深度预测转换为精确度量地图,通过校正单目深度估计中的错位问题,实现了卓越的性能和效率。
English: POLAR introduces a radar-guided polynomial fitting method to transform scaleless depth predictions into accurate metric maps, correcting misalignments in monocular depth estimation with superior performance and efficiency.

Authors:Gaojie Jin, Tianjin Huang, Ronghui Mu, Xiaowei Huang
Title: Principal Eigenvalue Regularization for Improved Worst-Class Certified Robustness of Smoothed Classifiers
Abstract:
Recent studies have identified a critical challenge in deep neural networks (DNNs) known as ``robust fairness", where models exhibit significant disparities in robust accuracy across different classes. While prior work has attempted to address this issue in adversarial robustness, the study of worst-class certified robustness for smoothed classifiers remains unexplored. Our work bridges this gap by developing a PAC-Bayesian bound for the worst-class error of smoothed classifiers. Through theoretical analysis, we demonstrate that the largest eigenvalue of the smoothed confusion matrix fundamentally influences the worst-class error of smoothed classifiers. Based on this insight, we introduce a regularization method that optimizes the largest eigenvalue of smoothed confusion matrix to enhance worst-class accuracy of the smoothed classifier and further improve its worst-class certified robustness. We provide extensive experimental validation across multiple datasets and model architectures to demonstrate the effectiveness of our approach.
Chinese: 本研究针对深度神经网络中的鲁棒公平性挑战,通过为平滑分类器的最差类别误差开发PAC-Bayesian边界,提出了一种优化平滑混淆矩阵最大特征值的正则化方法,有效提升了最差类别准确性和认证鲁棒性。
English: This study addresses the challenge of robust fairness in deep neural networks by developing a PAC-Bayesian bound for worst-class error in smoothed classifiers, introducing a regularization method that optimizes the largest eigenvalue of the smoothed confusion matrix to enhance worst-class accuracy and certified robustness.

Authors:Weihao Yu, Xiaoqing Guo, Chenxin Li, Yifan Liu, Yixuan Yuan
Title: GeoT: Geometry-guided Instance-dependent Transition Matrix for Semi-supervised Tooth Point Cloud Segmentation
Abstract:
Achieving meticulous segmentation of tooth point clouds from intra-oral scans stands as an indispensable prerequisite for various orthodontic applications. Given the labor-intensive nature of dental annotation, a significant amount of data remains unlabeled, driving increasing interest in semi-supervised approaches. One primary challenge of existing semi-supervised medical segmentation methods lies in noisy pseudo labels generated for unlabeled data. To address this challenge, we propose GeoT, the first framework that employs instance-dependent transition matrix (IDTM) to explicitly model noise in pseudo labels for semi-supervised dental segmentation. Specifically, to handle the extensive solution space of IDTM arising from tens of thousands of dental points, we introduce tooth geometric priors through two key components: point-level geometric regularization (PLGR) to enhance consistency between point adjacency relationships in 3D and IDTM spaces, and class-level geometric smoothing (CLGS) to leverage the fixed spatial distribution of tooth categories for optimal IDTM estimation. Extensive experiments performed on the public Teeth3DS dataset and private dataset demonstrate that our method can make full utilization of unlabeled data to facilitate segmentation, achieving performance comparable to fully supervised methods with only $20\%$ of the labeled data.
中文摘要:GeoT框架通过结合实例相关转移矩阵与牙齿几何先验知识,有效解决半监督牙齿分割中的伪标签噪声问题,仅用20%标注数据即可达到接近全监督方法的性能。
English Summary: The proposed GeoT framework addresses noisy pseudo labels in semi-supervised dental segmentation by incorporating instance-dependent transition matrices with tooth geometric priors, achieving near-fully-supervised performance using only 20% labeled data.

Authors:Ji-Hoon Kim, Jeongsoo Choi, Jaehun Kim, Chaeyoung Jung, Joon Son Chung
Title: From Faces to Voices: Learning Hierarchical Representations for High-quality Video-to-Speech
Abstract:
The objective of this study is to generate high-quality speech from silent talking face videos, a task also known as video-to-speech synthesis. A significant challenge in video-to-speech synthesis lies in the substantial modality gap between silent video and multi-faceted speech. In this paper, we propose a novel video-to-speech system that effectively bridges this modality gap, significantly enhancing the quality of synthesized speech. This is achieved by learning of hierarchical representations from video to speech. Specifically, we gradually transform silent video into acoustic feature spaces through three sequential stages -- content, timbre, and prosody modeling. In each stage, we align visual factors -- lip movements, face identity, and facial expressions -- with corresponding acoustic counterparts to ensure the seamless transformation. Additionally, to generate realistic and coherent speech from the visual representations, we employ a flow matching model that estimates direct trajectories from a simple prior distribution to the target speech distribution. Extensive experiments demonstrate that our method achieves exceptional generation quality comparable to real utterances, outperforming existing methods by a significant margin.
本研究提出了一种新颖的视频到语音合成系统,通过内容、音色和韵律建模学习分层表征,有效弥合了无声说话人脸视频与语音之间的模态差距,实现了卓越的语音生成质量。
This study introduces a novel video-to-speech synthesis system that bridges the modality gap between silent talking face videos and speech by learning hierarchical representations through content, timbre, and prosody modeling, achieving superior speech quality.

Authors:Zeqing He, Zhibo Wang, Huiyu Xu, Kui Ren
Title: Towards LLM Guardrails via Sparse Representation Steering
Abstract:
Large Language Models (LLMs) have demonstrated remarkable performance in natural language generation tasks, yet their uncontrolled outputs pose significant ethical and safety risks. Recently, representation engineering methods have shown promising results in steering model behavior by modifying the rich semantic information encoded in activation vectors. However, due to the difficulty of precisely disentangling semantic directions within high-dimensional representation space, existing approaches suffer from three major limitations: lack of fine-grained control, quality degradation of generated content, and poor interpretability. To address these challenges, we propose a sparse encoding-based representation engineering method, named SRE, which decomposes polysemantic activations into a structured, monosemantic feature space. By leveraging sparse autoencoding, our approach isolates and adjusts only task-specific sparse feature dimensions, enabling precise and interpretable steering of model behavior while preserving content quality. We validate our method on three critical domains, i.e., safety, fairness, and truthfulness using the open-source LLM Gemma-2-2B-it. Experimental results show that SRE achieves superior controllability while maintaining the overall quality of generated content (i.e., controllability and quality), demonstrating its effectiveness as a fine-grained and interpretable activation steering framework.
中文: 本研究提出的稀疏表征工程方法通过稀疏自编码将多义激活分解为结构化特征空间,实现了对大型语言模型行为的精细可控调节,在保持生成质量的同时显著提升了安全、公平和真实性三个关键领域的控制效果。
English: The proposed Sparse Representation Engineering (SRE) method addresses limitations in existing representation engineering by using sparse autoencoding to precisely control LLM behavior through task-specific feature adjustments, achieving superior controllability while maintaining content quality across safety, fairness, and truthfulness domains.

Authors:Xiaoran Zhang, Byung-Woo Hong, Hyoungseob Park, Daniel H. Pak, Anne-Marie Rickmann, Lawrence H. Staib, James S. Duncan, Alex Wong
Title: Progressive Test Time Energy Adaptation for Medical Image Segmentation
Abstract:
We propose a model-agnostic, progressive test-time energy adaptation approach for medical image segmentation. Maintaining model performance across diverse medical datasets is challenging, as distribution shifts arise from inconsistent imaging protocols and patient variations. Unlike domain adaptation methods that require multiple passes through target data - impractical in clinical settings - our approach adapts pretrained models progressively as they process test data. Our method leverages a shape energy model trained on source data, which assigns an energy score at the patch level to segmentation maps: low energy represents in-distribution (accurate) shapes, while high energy signals out-of-distribution (erroneous) predictions. By minimizing this energy score at test time, we refine the segmentation model to align with the target distribution. To validate the effectiveness and adaptability, we evaluated our framework on eight public MRI (bSSFP, T1- and T2-weighted) and X-ray datasets spanning cardiac, spinal cord, and lung segmentation. We consistently outperform baselines both quantitatively and qualitatively.
中文摘要:我们提出了一种与模型无关的渐进式测试时能量自适应方法,通过最小化基于形状的能量分数在推理过程中优化预训练模型,在多种MRI和X射线数据集上实现了优于基线模型的医学图像分割性能。
English Summary: We introduce a model-agnostic, progressive test-time energy adaptation method for medical image segmentation that refines pretrained models during inference by minimizing a shape-based energy score, achieving superior performance across diverse MRI and X-ray datasets.

Authors:Katja Schwarz, Denys Rozumnyi, Samuel Rota Bulò, Lorenzo Porzi, Peter Kontschieder
Title: A Recipe for Generating 3D Worlds From a Single Image
Abstract:
We introduce a recipe for generating immersive 3D worlds from a single image by framing the task as an in-context learning problem for 2D inpainting models. This approach requires minimal training and uses existing generative models. Our process involves two steps: generating coherent panoramas using a pre-trained diffusion model and lifting these into 3D with a metric depth estimator. We then fill unobserved regions by conditioning the inpainting model on rendered point clouds, requiring minimal fine-tuning. Tested on both synthetic and real images, our method produces high-quality 3D environments suitable for VR display. By explicitly modeling the 3D structure of the generated environment from the start, our approach consistently outperforms state-of-the-art, video synthesis-based methods along multiple quantitative image quality metrics. Project Page: https://katjaschwarz.github.io/worlds/
中文: 该方法通过将单张图像生成沉浸式3D场景构建为2D修复模型的情境学习任务,仅需少量训练即可超越现有视频合成方法,在多项图像质量指标上表现优异。
English: This method generates immersive 3D worlds from a single image using in-context learning with 2D inpainting models, requiring minimal training and outperforming existing video-based techniques in quality metrics.

Authors:Quanhao Li, Zhen Xing, Rui Wang, Hui Zhang, Qi Dai, Zuxuan Wu
Title: MagicMotion: Controllable Video Generation with Dense-to-Sparse Trajectory Guidance
Abstract:
Recent advances in video generation have led to remarkable improvements in visual quality and temporal coherence. Upon this, trajectory-controllable video generation has emerged to enable precise object motion control through explicitly defined spatial paths. However, existing methods struggle with complex object movements and multi-object motion control, resulting in imprecise trajectory adherence, poor object consistency, and compromised visual quality. Furthermore, these methods only support trajectory control in a single format, limiting their applicability in diverse scenarios. Additionally, there is no publicly available dataset or benchmark specifically tailored for trajectory-controllable video generation, hindering robust training and systematic evaluation. To address these challenges, we introduce MagicMotion, a novel image-to-video generation framework that enables trajectory control through three levels of conditions from dense to sparse: masks, bounding boxes, and sparse boxes. Given an input image and trajectories, MagicMotion seamlessly animates objects along defined trajectories while maintaining object consistency and visual quality. Furthermore, we present MagicData, a large-scale trajectory-controlled video dataset, along with an automated pipeline for annotation and filtering. We also introduce MagicBench, a comprehensive benchmark that assesses both video quality and trajectory control accuracy across different numbers of objects. Extensive experiments demonstrate that MagicMotion outperforms previous methods across various metrics. Our project page are publicly available at https://quanhaol.github.io/magicmotion-site.
中文摘要:MagicMotion是一种新型图像转视频框架,通过三种轨迹格式实现精确的多目标运动控制并保持视觉质量,同时配备了专用数据集和评估基准。
English Summary: MagicMotion is a novel image-to-video framework that enables precise multi-object motion control through three trajectory formats while maintaining visual quality, supported by a dedicated dataset and benchmark.

Authors:Shengjun Zhang, Xin Fei, Fangfu Liu, Haixu Song, Yueqi Duan
Title: Gaussian Graph Network: Learning Efficient and Generalizable Gaussian Representations from Multi-view Images
Abstract:
3D Gaussian Splatting (3DGS) has demonstrated impressive novel view synthesis performance. While conventional methods require per-scene optimization, more recently several feed-forward methods have been proposed to generate pixel-aligned Gaussian representations with a learnable network, which are generalizable to different scenes. However, these methods simply combine pixel-aligned Gaussians from multiple views as scene representations, thereby leading to artifacts and extra memory cost without fully capturing the relations of Gaussians from different images. In this paper, we propose Gaussian Graph Network (GGN) to generate efficient and generalizable Gaussian representations. Specifically, we construct Gaussian Graphs to model the relations of Gaussian groups from different views. To support message passing at Gaussian level, we reformulate the basic graph operations over Gaussian representations, enabling each Gaussian to benefit from its connected Gaussian groups with Gaussian feature fusion. Furthermore, we design a Gaussian pooling layer to aggregate various Gaussian groups for efficient representations. We conduct experiments on the large-scale RealEstate10K and ACID datasets to demonstrate the efficiency and generalization of our method. Compared to the state-of-the-art methods, our model uses fewer Gaussians and achieves better image quality with higher rendering speed.
中文:提出的高斯图网络通过构建高斯图建模多视角关联,结合高斯特征融合与池化操作,以更少的高斯数量实现了更优的渲染质量与速度。
English: The proposed Gaussian Graph Network (GGN) models inter-view Gaussian relationships through graph operations and feature fusion, achieving superior rendering quality and speed with fewer Gaussians compared to state-of-the-art methods.

Authors:Ying Shen, Lifu Huang
Title: LLM Braces: Straightening Out LLM Predictions with Relevant Sub-Updates
Abstract:
Recent findings reveal that much of the knowledge in a Transformer-based Large Language Model (LLM) is encoded in its feed-forward (FFN) layers, where each FNN layer can be interpreted as the summation of sub-updates, each corresponding to a weighted column vector from the FFN's value parameter matrix that often encodes human-interpretable concepts. In light of this, we hypothesize that model performance and behaviors can be further enhanced and controlled by modulating the contributions of these sub-updates based on their relevance to the input or target output style, and propose LLMBRACES, a novel and efficient method that computes relevance scores associated with value vectors in FFN layers and leverages these scores to dynamically adjust the contribution of sub-updates. By optimizing sub-update contributions, LLMBRACES refines the prediction process, leading to more accurate and reliable outputs, much like a 'brace' providing support and stability. Moreover, LLMBRACES can be extended to support conditional control over generation characteristics, such as sentiment, thereby offering fine-grained steering of LLM outputs. Extensive experiments on various LLMs-including Qwen2.5-1.5B, Llama2-7B, and Llama3-8B-demonstrate that LLMBRACES outperforms baseline approaches in both fine-tuning and zero-shot settings while requiring significantly fewer tunable parameters, up to 75% fewer compared to LoRA. Furthermore, LLMBRACES excels in sentiment-controlled generation and toxicity reduction, highlighting its potential for flexible, controlled text generation across applications.
中文: 最新研究表明,基于Transformer的大语言模型知识主要编码在前馈层中,而提出的LLMBRACES方法通过相关性评分动态调整子更新贡献,不仅能提升模型性能,还可实现细粒度生成控制,在减少高达75%参数的情况下仍优于基线方法。
English: Recent research shows that knowledge in Transformer-based LLMs is encoded in feed-forward layers, and the proposed LLMBRACES method dynamically adjusts sub-update contributions using relevance scores to enhance model performance and enable fine-grained control over generation, achieving superior results with fewer parameters.

Authors:Jinghan Zhang, Xiting Wang, Fengran Mo, Yeyang Zhou, Wanfu Gao, Kunpeng Liu
Title: Entropy-based Exploration Conduction for Multi-step Reasoning
Abstract:
Multi-step processes via large language models (LLMs) have proven effective for solving complex reasoning tasks. However, the depth of exploration of the reasoning procedure can significantly affect the task performance. Existing methods to automatically decide the depth often lead to high cost and a lack of flexibility. To address these issues, we propose Entropy-based Exploration Depth Conduction (Entro-duction), a novel method that dynamically adjusts the exploration depth during multi-step reasoning by monitoring LLM's output entropy and variance entropy. We employ these two features to capture the model's uncertainty of the current step and the fluctuation of uncertainty across consecutive reasoning steps. Based on the observed entropy changes, the LLM selects whether to deepen, expand, or stop exploration according to the probability, which facilitates the trade-off between the reasoning accuracy and exploration effectiveness. Experimental results across four benchmark datasets demonstrate the efficacy of Entro-duction.
中文: Entro-duction是一种通过监测大语言模型输出熵和方差熵来动态调整多步推理探索深度的新方法,在多个基准数据集上有效平衡了推理准确性和探索效率。
English: Entro-duction is a novel method that dynamically adjusts exploration depth in multi-step reasoning by monitoring LLM output entropy and variance entropy, effectively balancing reasoning accuracy and exploration efficiency across benchmark datasets.

Authors:Baolu Li, Zongzhe Xu, Jinlong Li, Xinyu Liu, Jianwu Fang, Xiaopeng Li, Hongkai Yu
Title: V2X-DG: Domain Generalization for Vehicle-to-Everything Cooperative Perception
Abstract:
LiDAR-based Vehicle-to-Everything (V2X) cooperative perception has demonstrated its impact on the safety and effectiveness of autonomous driving. Since current cooperative perception algorithms are trained and tested on the same dataset, the generalization ability of cooperative perception systems remains underexplored. This paper is the first work to study the Domain Generalization problem of LiDAR-based V2X cooperative perception (V2X-DG) for 3D detection based on four widely-used open source datasets: OPV2V, V2XSet, V2V4Real and DAIR-V2X. Our research seeks to sustain high performance not only within the source domain but also across other unseen domains, achieved solely through training on source domain. To this end, we propose Cooperative Mixup Augmentation based Generalization (CMAG) to improve the model generalization capability by simulating the unseen cooperation, which is designed compactly for the domain gaps in cooperative perception. Furthermore, we propose a constraint for the regularization of the robust generalized feature representation learning: Cooperation Feature Consistency (CFC), which aligns the intermediately fused features of the generalized cooperation by CMAG and the early fused features of the original cooperation in source domain. Extensive experiments demonstrate that our approach achieves significant performance gains when generalizing to other unseen datasets while it also maintains strong performance on the source dataset.
中文摘要:本文首次研究基于LiDAR的V2X协同感知领域泛化问题,提出CMAG方法与CFC正则化约束,在保持源域性能的同时实现跨未知域的高效泛化能力。
English Summary: This paper introduces the first study on domain generalization for LiDAR-based V2X cooperative perception, proposing a CMAG method with CFC regularization to maintain high performance across unseen domains while preserving source domain effectiveness.

Authors:Ruowen Zhao, Junliang Ye, Zhengyi Wang, Guangce Liu, Yiwen Chen, Yikai Wang, Jun Zhu
Title: DeepMesh: Auto-Regressive Artist-mesh Creation with Reinforcement Learning
Abstract:
Triangle meshes play a crucial role in 3D applications for efficient manipulation and rendering. While auto-regressive methods generate structured meshes by predicting discrete vertex tokens, they are often constrained by limited face counts and mesh incompleteness. To address these challenges, we propose DeepMesh, a framework that optimizes mesh generation through two key innovations: (1) an efficient pre-training strategy incorporating a novel tokenization algorithm, along with improvements in data curation and processing, and (2) the introduction of Reinforcement Learning (RL) into 3D mesh generation to achieve human preference alignment via Direct Preference Optimization (DPO). We design a scoring standard that combines human evaluation with 3D metrics to collect preference pairs for DPO, ensuring both visual appeal and geometric accuracy. Conditioned on point clouds and images, DeepMesh generates meshes with intricate details and precise topology, outperforming state-of-the-art methods in both precision and quality. Project page: https://zhaorw02.github.io/DeepMesh/
中文: DeepMesh提出了一种创新框架,通过结合高效预训练与强化学习优化,实现了基于点云和图像的精细三维网格生成,在几何精度和视觉质量上均超越现有先进方法。
English: DeepMesh introduces a novel framework that enhances 3D mesh generation through efficient pre-training with advanced tokenization and reinforcement learning for human preference alignment, producing detailed and topologically accurate meshes from point clouds and images.

Authors:Mojtaba Esfandiari, Pengyuan Du, Haochen Wei, Peter Gehlbach, Adnan Munawar, Peter Kazanzides, Iulian Iordachita
Title: Model Predictive Path Integral Control of I2RIS Robot Using RBF Identifier and Extended Kalman Filter
Abstract:
Modeling and controlling cable-driven snake robots is a challenging problem due to nonlinear mechanical properties such as hysteresis, variable stiffness, and unknown friction between the actuation cables and the robot body. This challenge is more significant for snake robots in ophthalmic surgery applications, such as the Improved Integrated Robotic Intraocular Snake (I$^2$RIS), given its small size and lack of embedded sensory feedback. Data-driven models take advantage of global function approximations, reducing complicated analytical models' challenge and computational costs. However, their performance might deteriorate in case of new data unseen in the training phase. Therefore, adding an adaptation mechanism might improve these models' performance during snake robots' interactions with unknown environments. In this work, we applied a model predictive path integral (MPPI) controller on a data-driven model of the I$^2$RIS based on the Gaussian mixture model (GMM) and Gaussian mixture regression (GMR). To analyze the performance of the MPPI in unseen robot-tissue interaction situations, unknown external disturbances and environmental loads are simulated and added to the GMM-GMR model. These uncertainties of the robot model are then identified online using a radial basis function (RBF) whose weights are updated using an extended Kalman filter (EKF). Simulation results demonstrated the robustness of the optimal control solutions of the MPPI algorithm and its computational superiority over a conventional model predictive control (MPC) algorithm.
中文: 本研究通过将数据驱动的GMM-GMR模型与MPPI控制器相结合,并采用EKF更新的RBF网络实现在线自适应,提升了眼科手术中缆驱蛇形机器人的控制性能,在处理不确定性方面展现出比传统MPC更优的鲁棒性和计算效率。
English: This study enhances the control of cable-driven snake robots for ophthalmic surgery by combining a data-driven GMM-GMR model with an MPPI controller and online adaptation using an RBF network updated via EKF, demonstrating superior robustness and computational efficiency over traditional MPC in handling uncertainties.

Authors:Yuheng Li, Mingzhe Hu, Richard L. J. Qiu, Maria Thor, Andre Williams, Deborah Marshall, Xiaofeng Yang
Title: RoMedFormer: A Rotary-Embedding Transformer Foundation Model for 3D Genito-Pelvic Structure Segmentation in MRI and CT
Abstract:
Deep learning-based segmentation of genito-pelvic structures in MRI and CT is crucial for applications such as radiation therapy, surgical planning, and disease diagnosis. However, existing segmentation models often struggle with generalizability across imaging modalities, and anatomical variations. In this work, we propose RoMedFormer, a rotary-embedding transformer-based foundation model designed for 3D female genito-pelvic structure segmentation in both MRI and CT. RoMedFormer leverages self-supervised learning and rotary positional embeddings to enhance spatial feature representation and capture long-range dependencies in 3D medical data. We pre-train our model using a diverse dataset of 3D MRI and CT scans and fine-tune it for downstream segmentation tasks. Experimental results demonstrate that RoMedFormer achieves superior performance segmenting genito-pelvic organs. Our findings highlight the potential of transformer-based architectures in medical image segmentation and pave the way for more transferable segmentation frameworks.
中文: RoMedFormer是一种基于Transformer和旋转位置编码的模型,通过增强空间特征表示和捕获长程依赖关系,显著提升了MRI和CT中女性生殖盆腔结构的三维分割性能。
English: RoMedFormer, a transformer-based model with rotary embeddings, enhances 3D female genito-pelvic structure segmentation in MRI and CT by improving spatial feature representation and long-range dependency capture, achieving superior performance.

Authors:Tianshu Wu, Jiyao Zhang, Shiqian Liang, Zhengxiao Han, Hao Dong
Title: Foundation Feature-Driven Online End-Effector Pose Estimation: A Marker-Free and Learning-Free Approach
Abstract:
Accurate transformation estimation between camera space and robot space is essential. Traditional methods using markers for hand-eye calibration require offline image collection, limiting their suitability for online self-calibration. Recent learning-based robot pose estimation methods, while advancing online calibration, struggle with cross-robot generalization and require the robot to be fully visible. This work proposes a Foundation feature-driven online End-Effector Pose Estimation (FEEPE) algorithm, characterized by its training-free and cross end-effector generalization capabilities. Inspired by the zero-shot generalization capabilities of foundation models, FEEPE leverages pre-trained visual features to estimate 2D-3D correspondences derived from the CAD model and target image, enabling 6D pose estimation via the PnP algorithm. To resolve ambiguities from partial observations and symmetry, a multi-historical key frame enhanced pose optimization algorithm is introduced, utilizing temporal information for improved accuracy. Compared to traditional hand-eye calibration, FEEPE enables marker-free online calibration. Unlike robot pose estimation, it generalizes across robots and end-effectors in a training-free manner. Extensive experiments demonstrate its superior flexibility, generalization, and performance.
中文摘要:本文提出FEEPE算法,利用基础模型特征实现无需训练、支持跨机器人泛化的在线末端执行器姿态估计,通过2D-3D对应点匹配和时间优化技术,摆脱了传统手眼校准对标记物的依赖。
English Summary: This paper introduces FEEPE, a training-free online algorithm that leverages foundation model features for marker-free end-effector pose estimation, enabling cross-robot generalization through 2D-3D correspondence matching and temporal optimization.

Authors:Zhenyu Wu, Yuheng Zhou, Xiuwei Xu, Ziwei Wang, Haibin Yan
Title: MoManipVLA: Transferring Vision-language-action Models for General Mobile Manipulation
Abstract:
Mobile manipulation is the fundamental challenge for robotics to assist humans with diverse tasks and environments in everyday life. However, conventional mobile manipulation approaches often struggle to generalize across different tasks and environments because of the lack of large-scale training. In contrast, recent advances in vision-language-action (VLA) models have shown impressive generalization capabilities, but these foundation models are developed for fixed-base manipulation tasks. Therefore, we propose an efficient policy adaptation framework named MoManipVLA to transfer pre-trained VLA models of fix-base manipulation to mobile manipulation, so that high generalization ability across tasks and environments can be achieved in mobile manipulation policy. Specifically, we utilize pre-trained VLA models to generate waypoints of the end-effector with high generalization ability. We design motion planning objectives for the mobile base and the robot arm, which aim at maximizing the physical feasibility of the trajectory. Finally, we present an efficient bi-level objective optimization framework for trajectory generation, where the upper-level optimization predicts waypoints for base movement to enhance the manipulator policy space, and the lower-level optimization selects the optimal end-effector trajectory to complete the manipulation task. In this way, MoManipVLA can adjust the position of the robot base in a zero-shot manner, thus making the waypoints predicted from the fixed-base VLA models feasible. Extensive experimental results on OVMM and the real world demonstrate that MoManipVLA achieves a 4.2% higher success rate than the state-of-the-art mobile manipulation, and only requires 50 training cost for real world deployment due to the strong generalization ability in the pre-trained VLA models.
中文摘要:MoManipVLA通过双级优化框架将预训练的视觉-语言-动作模型从固定基座操作迁移到移动操作,在提升任务成功率的同时显著降低了实际部署的训练成本。
English Summary: MoManipVLA is an efficient framework that adapts pre-trained vision-language-action models from fixed-base to mobile manipulation by optimizing both base movement and end-effector trajectories, achieving higher success rates with minimal training costs.

Authors:Lijie Fan, Luming Tang, Siyang Qin, Tianhong Li, Xuan Yang, Siyuan Qiao, Andreas Steiner, Chen Sun, Yuanzhen Li, Tao Zhu, Michael Rubinstein, Michalis Raptis, Deqing Sun, Radu Soricut
Title: Unified Autoregressive Visual Generation and Understanding with Continuous Tokens
Abstract:
We present UniFluid, a unified autoregressive framework for joint visual generation and understanding leveraging continuous visual tokens. Our unified autoregressive architecture processes multimodal image and text inputs, generating discrete tokens for text and continuous tokens for image. We find though there is an inherent trade-off between the image generation and understanding task, a carefully tuned training recipe enables them to improve each other. By selecting an appropriate loss balance weight, the unified model achieves results comparable to or exceeding those of single-task baselines on both tasks. Furthermore, we demonstrate that employing stronger pre-trained LLMs and random-order generation during training is important to achieve high-fidelity image generation within this unified framework. Built upon the Gemma model series, UniFluid exhibits competitive performance across both image generation and understanding, demonstrating strong transferability to various downstream tasks, including image editing for generation, as well as visual captioning and question answering for understanding.
中文: UniFluid提出了一种统一的自回归框架,通过连续视觉令牌协同处理视觉生成与理解任务,借助优化的训练方案和强大预训练模型,在两项任务上均展现出卓越性能。
English: UniFluid introduces a unified autoregressive framework that effectively balances visual generation and understanding through continuous tokens, achieving competitive performance in both tasks by leveraging optimized training strategies and strong pre-trained models.

Authors:Wenyi Xu, Yuren Mao, Xiaolu Zhang, Chao Zhang, Xuemei Dong, Mengfei Zhang, Yunjun Gao
Title: DAgent: A Relational Database-Driven Data Analysis Report Generation Agent
Abstract:
Relational database-driven data analysis (RDB-DA) report generation, which aims to generate data analysis reports after querying relational databases, has been widely applied in fields such as finance and healthcare. Typically, these tasks are manually completed by data scientists, making the process very labor-intensive and showing a clear need for automation. Although existing methods (e.g., Table QA or Text-to-SQL) have been proposed to reduce human dependency, they cannot handle complex analytical tasks that require multi-step reasoning, cross-table associations, and synthesizing insights into reports. Moreover, there is no dataset available for developing automatic RDB-DA report generation. To fill this gap, this paper proposes an LLM agent system for RDB-DA report generation tasks, dubbed DAgent; moreover, we construct a benchmark for automatic data analysis report generation, which includes a new dataset DA-Dataset and evaluation metrics. DAgent integrates planning, tools, and memory modules to decompose natural language questions into logically independent sub-queries, accurately retrieve key information from relational databases, and generate analytical reports that meet the requirements of completeness, correctness, and conciseness through multi-step reasoning and effective data integration. Experimental analysis on the DA-Dataset demonstrates that DAgent's superiority in retrieval performance and analysis report generation quality, showcasing its strong potential for tackling complex database analysis report generation tasks.
Chinese: 本文提出了DAgent,一种用于自动化复杂关系数据库驱动数据分析报告生成的LLM代理系统,通过集成规划、工具和记忆模块实现多步推理和数据整合,并构建了包含新数据集和评估指标的基准以填补现有资源空白。
English: This paper introduces DAgent, an LLM agent system designed to automate complex relational database-driven data analysis report generation by integrating planning, tools, and memory modules for multi-step reasoning and data synthesis, alongside a new benchmark dataset and metrics to address the lack of existing resources.

Authors:Shuaifan Jin, Xiaoyi Pang, Zhibo Wang, He Wang, Jiacheng Du, Jiahui Hu, Kui Ren
Title: Safeguarding LLM Embeddings in End-Cloud Collaboration via Entropy-Driven Perturbation
Abstract:
Recent studies improve on-device language model (LM) inference through end-cloud collaboration, where the end device retrieves useful information from cloud databases to enhance local processing, known as Retrieval-Augmented Generation (RAG). Typically, to retrieve information from the cloud while safeguarding privacy, the end device transforms original data into embeddings with a local embedding model. However, the recently emerging Embedding Inversion Attacks (EIAs) can still recover the original data from text embeddings (e.g., training a recovery model to map embeddings back to original texts), posing a significant threat to user privacy. To address this risk, we propose EntroGuard, an entropy-driven perturbation-based embedding privacy protection method, which can protect the privacy of text embeddings while maintaining retrieval accuracy during the end-cloud collaboration. Specifically, to defeat various EIAs, we perturb the embeddings to increase the entropy of the recovered text in the common structure of recovery models, thus steering the embeddings toward meaningless texts rather than original sensitive texts during the recovery process. To maintain retrieval performance in the cloud, we constrain the perturbations within a bound, applying the strategy of reducing them where redundant and increasing them where sparse. Moreover, EntroGuard can be directly integrated into end devices without requiring any modifications to the embedding model. Extensive experimental results demonstrate that EntroGuard can reduce the risk of privacy leakage by up to 8 times at most with negligible loss of retrieval performance compared to existing privacy-preserving methods.
中文摘要:最新研究提出EntroGuard这一基于熵驱动的扰动方法,能在终端-云端协作的语言模型推理中有效保护文本嵌入免遭逆向攻击,同时保持检索准确性。
English Summary: Recent research proposes EntroGuard, an entropy-driven perturbation method that protects text embeddings from inversion attacks while maintaining retrieval accuracy in end-cloud collaborative language model inference.

Authors:Yifan Zhan, Wangze Xu, Qingtian Zhu, Muyao Niu, Mingze Ma, Yifei Liu, Zhihang Zhong, Xiao Sun, Yinqiang Zheng
Title: R3-Avatar: Record and Retrieve Temporal Codebook for Reconstructing Photorealistic Human Avatars
Abstract:
We present R3-Avatar, incorporating a temporal codebook, to overcome the inability of human avatars to be both animatable and of high-fidelity rendering quality. Existing video-based reconstruction of 3D human avatars either focuses solely on rendering, lacking animation support, or learns a pose-appearance mapping for animating, which degrades under limited training poses or complex clothing. In this paper, we adopt a "record-retrieve-reconstruct" strategy that ensures high-quality rendering from novel views while mitigating degradation in novel poses. Specifically, disambiguating timestamps record temporal appearance variations in a codebook, ensuring high-fidelity novel-view rendering, while novel poses retrieve corresponding timestamps by matching the most similar training poses for augmented appearance. Our R3-Avatar outperforms cutting-edge video-based human avatar reconstruction, particularly in overcoming visual quality degradation in extreme scenarios with limited training human poses and complex clothing.
中文: R3-Avatar采用"记录-检索-重建"策略和时序码本,实现了可动画的高保真人体化身,有效解决了新姿态和复杂服装场景下的渲染质量退化问题。
English: R3-Avatar introduces a "record-retrieve-reconstruct" strategy with a temporal codebook to enable animatable human avatars with high-fidelity rendering, effectively overcoming quality degradation in novel poses and complex clothing scenarios.

Authors:Patrick Rim, Hyoungseob Park, S. Gangopadhyay, Ziyao Zeng, Younjoon Chung, Alex Wong
Title: ProtoDepth: Unsupervised Continual Depth Completion with Prototypes
Abstract:
We present ProtoDepth, a novel prototype-based approach for continual learning of unsupervised depth completion, the multimodal 3D reconstruction task of predicting dense depth maps from RGB images and sparse point clouds. The unsupervised learning paradigm is well-suited for continual learning, as ground truth is not needed. However, when training on new non-stationary distributions, depth completion models will catastrophically forget previously learned information. We address forgetting by learning prototype sets that adapt the latent features of a frozen pretrained model to new domains. Since the original weights are not modified, ProtoDepth does not forget when test-time domain identity is known. To extend ProtoDepth to the challenging setting where the test-time domain identity is withheld, we propose to learn domain descriptors that enable the model to select the appropriate prototype set for inference. We evaluate ProtoDepth on benchmark dataset sequences, where we reduce forgetting compared to baselines by 52.2% for indoor and 53.2% for outdoor to achieve the state of the art.
Chinese: ProtoDepth提出了一种基于原型的连续无监督深度补全方法,通过原型集调整潜在特征,显著减少了灾难性遗忘,并在基准数据集上达到了最先进的性能。
English: ProtoDepth introduces a prototype-based method for continual unsupervised depth completion, effectively mitigating catastrophic forgetting by adapting latent features with prototype sets and achieving state-of-the-art performance on benchmark datasets.

Authors:Lester Phillip Violeta, Wen-Chin Huang, Tomoki Toda
Title: Serenade: A Singing Style Conversion Framework Based On Audio Infilling
Abstract:
We propose Serenade, a novel framework for the singing style conversion (SSC) task. Although singer identity conversion has made great strides in the previous years, converting the singing style of a singer has been an unexplored research area. We find three main challenges in SSC: modeling the target style, disentangling source style, and retaining the source melody. To model the target singing style, we use an audio infilling task by predicting a masked segment of the target mel-spectrogram with a flow-matching model using the complement of the masked target mel-spectrogram along with disentangled acoustic features. On the other hand, to disentangle the source singing style, we use a cyclic training approach, where we use synthetic converted samples as source inputs and reconstruct the original source mel-spectrogram as a target. Finally, to retain the source melody better, we investigate a post-processing module using a source-filter-based vocoder and resynthesize the converted waveforms using the original F0 patterns. Our results showed that the Serenade framework can handle generalized SSC tasks with the best overall similarity score, especially in modeling breathy and mixed singing styles. We also found that resynthesizing with the original F0 patterns alleviated out-of-tune singing and improved naturalness, but found a slight tradeoff in similarity due to not changing the F0 patterns into the target style.
中文: Serenade框架通过流匹配进行目标风格建模、循环训练实现源风格解耦,并利用原始F0模式保持旋律,在歌唱风格转换任务中展现出最佳整体相似度,尤其擅长处理气声与混合唱法风格。
English: The Serenade framework introduces a novel approach for singing style conversion by addressing key challenges through target style modeling with flow-matching, source style disentanglement via cyclic training, and melody retention using original F0 patterns, achieving superior similarity scores particularly for breathy and mixed styles.

Authors:Ruoyu Wang, Yukai Ma, Yi Yao, Sheng Tao, Haoang Li, Zongzhi Zhu, Yong Liu, Xingxing Zuo
Title: L2COcc: Lightweight Camera-Centric Semantic Scene Completion via Distillation of LiDAR Model
Abstract:
Semantic Scene Completion (SSC) constitutes a pivotal element in autonomous driving perception systems, tasked with inferring the 3D semantic occupancy of a scene from sensory data. To improve accuracy, prior research has implemented various computationally demanding and memory-intensive 3D operations, imposing significant computational requirements on the platform during training and testing. This paper proposes L2COcc, a lightweight camera-centric SSC framework that also accommodates LiDAR inputs. With our proposed efficient voxel transformer (EVT) and cross-modal knowledge modules, including feature similarity distillation (FSD), TPV distillation (TPVD) and prediction alignment distillation (PAD), our method substantially reduce computational burden while maintaining high accuracy. The experimental evaluations demonstrate that our proposed method surpasses the current state-of-the-art vision-based SSC methods regarding accuracy on both the SemanticKITTI and SSCBench-KITTI-360 benchmarks, respectively. Additionally, our method is more lightweight, exhibiting a reduction in both memory consumption and inference time by over 23% compared to the current state-of-the-arts method. Code is available at our project page:https://studyingfufu.github.io/L2COcc/.
中文: 本文提出L2COcc轻量级相机中心语义场景补全框架,通过高效体素变换器和跨模态知识蒸馏模块,在保持高精度的同时显著降低计算负担,在多个基准测试中超越现有最优方法。
English: This paper introduces L2COcc, a lightweight camera-centric framework for Semantic Scene Completion that uses efficient voxel transformers and cross-modal knowledge distillation to significantly reduce computational costs while achieving superior accuracy on benchmark datasets.

Authors:Badr Souani, Ezekiel Soremekun, Mike Papadakis, Setsuko Yokoyama, Sudipta Chattopadhyay, Yves Le Traon
Title: HInter: Exposing Hidden Intersectional Bias in Large Language Models
Abstract:
Large Language Models (LLMs) may portray discrimination towards certain individuals, especially those characterized by multiple attributes (aka intersectional bias). Discovering intersectional bias in LLMs is challenging, as it involves complex inputs on multiple attributes (e.g. race and gender). To address this challenge, we propose HInter, a test technique that synergistically combines mutation analysis, dependency parsing and metamorphic oracles to automatically detect intersectional bias in LLMs. HInter generates test inputs by systematically mutating sentences using multiple mutations, validates inputs via a dependency invariant and detects biases by checking the LLM response on the original and mutated sentences. We evaluate HInter using six LLM architectures and 18 LLM models (GPT3.5, Llama2, BERT, etc) and find that 14.61% of the inputs generated by HInter expose intersectional bias. Results also show that our dependency invariant reduces false positives (incorrect test inputs) by an order of magnitude. Finally, we observed that 16.62% of intersectional bias errors are hidden, meaning that their corresponding atomic cases do not trigger biases. Overall, this work emphasize the importance of testing LLMs for intersectional bias.
中文: HInter技术通过结合变异分析和依存解析,有效检测大语言模型中的交叉偏见,并在多个模型中揭示了显著的隐藏偏差。
English: The HInter technique effectively detects intersectional bias in Large Language Models by combining mutation analysis and dependency parsing, revealing significant hidden biases across multiple models.

Authors:Chengxuan Qian, Shuo Xing, Shawn Li, Yue Zhao, Zhengzhong Tu
Title: DecAlign: Hierarchical Cross-Modal Alignment for Decoupled Multimodal Representation Learning
Abstract:
Multimodal representation learning aims to capture both shared and complementary semantic information across multiple modalities. However, the intrinsic heterogeneity of diverse modalities presents substantial challenges to achieve effective cross-modal collaboration and integration. To address this, we introduce DecAlign, a novel hierarchical cross-modal alignment framework designed to decouple multimodal representations into modality-unique (heterogeneous) and modality-common (homogeneous) features. For handling heterogeneity, we employ a prototype-guided optimal transport alignment strategy leveraging gaussian mixture modeling and multi-marginal transport plans, thus mitigating distribution discrepancies while preserving modality-unique characteristics. To reinforce homogeneity, we ensure semantic consistency across modalities by aligning latent distribution matching with Maximum Mean Discrepancy regularization. Furthermore, we incorporate a multimodal transformer to enhance high-level semantic feature fusion, thereby further reducing cross-modal inconsistencies. Our extensive experiments on four widely used multimodal benchmarks demonstrate that DecAlign consistently outperforms existing state-of-the-art methods across five metrics. These results highlight the efficacy of DecAlign in enhancing superior cross-modal alignment and semantic consistency while preserving modality-unique features, marking a significant advancement in multimodal representation learning scenarios. Our project page is at https://taco-group.github.io/DecAlign.
中文摘要:DecAlign框架通过将多模态表征解耦为独有与共享特征,采用原型引导的最优传输对齐策略,在保留模态特性的同时显著提升了跨模态语义一致性,并在多项基准测试中超越现有最优方法。
English Summary: The DecAlign framework effectively separates multimodal representations into unique and shared features, using advanced alignment strategies to enhance cross-modal collaboration and achieve state-of-the-art performance across multiple benchmarks.

Authors:Jieming Bian, Lei Wang, Letian Zhang, Jie Xu
Title: FedALT: Federated Fine-Tuning through Adaptive Local Training with Rest-of-World LoRA
Abstract:
Fine-tuning large language models (LLMs) in federated settings enables privacy-preserving adaptation but suffers from cross-client interference due to model aggregation. Existing federated LoRA fine-tuning methods, primarily based on FedAvg, struggle with data heterogeneity, leading to harmful cross-client interference and suboptimal personalization. In this work, we propose \textbf{FedALT}, a novel personalized federated LoRA fine-tuning algorithm that fundamentally departs from FedAvg. Instead of using an aggregated model to initialize local training, each client continues training its individual LoRA while incorporating shared knowledge through a separate Rest-of-World (RoW) LoRA component. To effectively balance local adaptation and global information, FedALT introduces an adaptive mixer that dynamically learns input-specific weightings between the individual and RoW LoRA components, drawing conceptual foundations from the Mixture-of-Experts (MoE) paradigm. Through extensive experiments on NLP benchmarks, we demonstrate that FedALT significantly outperforms state-of-the-art personalized federated LoRA fine-tuning methods, achieving superior local adaptation without sacrificing computational efficiency.
中文摘要:FedALT提出了一种新颖的个性化联邦LoRA微调算法,通过自适应混合器动态平衡个体与共享知识,在保持计算效率的同时显著提升了本地适应性能。
English Summary: FedALT introduces a personalized federated LoRA fine-tuning algorithm that replaces FedAvg with an adaptive mixer to dynamically balance individual and shared knowledge, achieving superior local adaptation without compromising efficiency.

Authors:Moreno D'IncÃ, Elia Peruzzo, Xingqian Xu, Humphrey Shi, Nicu Sebe, Massimiliano Mancini
Title: Safe Vision-Language Models via Unsafe Weights Manipulation
Abstract:
Vision-language models (VLMs) often inherit the biases and unsafe associations present within their large-scale training dataset. While recent approaches mitigate unsafe behaviors, their evaluation focuses on how safe the model is on unsafe inputs, ignoring potential shortcomings on safe ones. In this paper, we first revise safety evaluation by introducing SafeGround, a new set of metrics that evaluate safety at different levels of granularity. With this metric, we uncover a surprising issue of training-based methods: they make the model less safe on safe inputs. From this finding, we take a different direction and explore whether it is possible to make a model safer without training, introducing Unsafe Weights Manipulation (UWM). UWM uses a calibration set of safe and unsafe instances to compare activations between safe and unsafe content, identifying the most important parameters for processing the latter. Their values are then manipulated via negation. Experiments show that UWM achieves the best tradeoff between safety and knowledge preservation, consistently improving VLMs on unsafe queries while outperforming even training-based state-of-the-art methods on safe ones.
Chinese: 视觉语言模型(VLM)可能从训练数据中继承偏见和不安全行为,为此引入了SafeGround评估指标和无训练方法UWM,该方法通过操纵关键参数在提升不安全查询安全性的同时,保持安全内容的性能。
English: Vision-language models (VLMs) can exhibit biases and unsafe behaviors from training data, prompting the introduction of SafeGround metrics and Unsafe Weights Manipulation (UWM), a training-free method that enhances safety on unsafe queries while preserving performance on safe ones.

Authors:Srikar Yellapragada, Alexandros Graikos, Kostas Triaridis, Zilinghan Li, Tarak Nath Nandi, Ravi K Madduri, Prateek Prasanna, Joel Saltz, Dimitris Samaras
Title: Pathology Image Compression with Pre-trained Autoencoders
Abstract:
The growing volume of high-resolution Whole Slide Images in digital histopathology poses significant storage, transmission, and computational efficiency challenges. Standard compression methods, such as JPEG, reduce file sizes but often fail to preserve fine-grained phenotypic details critical for downstream tasks. In this work, we repurpose autoencoders (AEs) designed for Latent Diffusion Models as an efficient learned compression framework for pathology images. We systematically benchmark three AE models with varying compression levels and evaluate their reconstruction ability using pathology foundation models. We introduce a fine-tuning strategy to further enhance reconstruction fidelity that optimizes a pathology-specific learned perceptual metric. We validate our approach on downstream tasks, including segmentation, patch classification, and multiple instance learning, showing that replacing images with AE-compressed reconstructions leads to minimal performance degradation. Additionally, we propose a K-means clustering-based quantization method for AE latents, improving storage efficiency while maintaining reconstruction quality. We provide the weights of the fine-tuned autoencoders at https://huggingface.co/collections/StonyBrook-CVLab/pathology-fine-tuned-aes-67d45f223a659ff2e3402dd0.
中文: 本研究开发了一种基于自动编码器的病理图像压缩框架,在显著提升存储和计算效率的同时保留关键诊断细节,并通过下游任务验证其性能损失极小。
English: This study develops an autoencoder-based compression framework for pathology images that preserves critical diagnostic details while significantly improving storage and computational efficiency, validated through minimal performance loss in downstream tasks.

Authors:Hang Shao, Lei Luo, Jianjun Qian, Mengkai Yan, Shuo Chen, Jian Yang
Title: Remote Photoplethysmography in Real-World and Extreme Lighting Scenarios
Abstract:
Physiological activities can be manifested by the sensitive changes in facial imaging. While they are barely observable to our eyes, computer vision manners can, and the derived remote photoplethysmography (rPPG) has shown considerable promise. However, existing studies mainly rely on spatial skin recognition and temporal rhythmic interactions, so they focus on identifying explicit features under ideal light conditions, but perform poorly in-the-wild with intricate obstacles and extreme illumination exposure. In this paper, we propose an end-to-end video transformer model for rPPG. It strives to eliminate complex and unknown external time-varying interferences, whether they are sufficient to occupy subtle biosignal amplitudes or exist as periodic perturbations that hinder network training. In the specific implementation, we utilize global interference sharing, subject background reference, and self-supervised disentanglement to eliminate interference, and further guide learning based on spatiotemporal filtering, reconstruction guidance, and frequency domain and biological prior constraints to achieve effective rPPG. To the best of our knowledge, this is the first robust rPPG model for real outdoor scenarios based on natural face videos, and is lightweight to deploy. Extensive experiments show the competitiveness and performance of our model in rPPG prediction across datasets and scenes.
Chinese: 本文提出了一种用于远程光电容积描记术(rPPG)的端到端视频变换器模型,该模型能有效消除外部干扰,在真实户外场景中使用自然面部视频实现稳健性能。
English: This paper introduces an end-to-end video transformer model for remote photoplethysmography (rPPG) that effectively eliminates external interferences and performs robustly in real outdoor scenarios using natural face videos.

Authors:Xiaokang Wei, Bowen Zhang, Xianghui Yang, Yuxuan Wang, Chunchao Guo, Xi Zhao, Yan Luximon
Title: PBR3DGen: A VLM-guided Mesh Generation with High-quality PBR Texture
Abstract:
Generating high-quality physically based rendering (PBR) materials is important to achieve realistic rendering in the downstream tasks, yet it remains challenging due to the intertwined effects of materials and lighting. While existing methods have made breakthroughs by incorporating material decomposition in the 3D generation pipeline, they tend to bake highlights into albedo and ignore spatially varying properties of metallicity and roughness. In this work, we present PBR3DGen, a two-stage mesh generation method with high-quality PBR materials that integrates the novel multi-view PBR material estimation model and a 3D PBR mesh reconstruction model. Specifically, PBR3DGen leverages vision language models (VLM) to guide multi-view diffusion, precisely capturing the spatial distribution and inherent attributes of reflective-metalness material. Additionally, we incorporate view-dependent illumination-aware conditions as pixel-aware priors to enhance spatially varying material properties. Furthermore, our reconstruction model reconstructs high-quality mesh with PBR materials. Experimental results demonstrate that PBR3DGen significantly outperforms existing methods, achieving new state-of-the-art results for PBR estimation and mesh generation. More results and visualization can be found on our project page: https://pbr3dgen1218.github.io/.
中文: PBR3DGen提出了一种两阶段网格生成方法,通过结合多视角材质估计和三维重建技术,实现了高质量的物理渲染材质生成,性能显著优于现有方法。
English: PBR3DGen introduces a two-stage mesh generation method that integrates multi-view material estimation and 3D reconstruction to produce high-quality physically based rendering materials, significantly outperforming existing approaches.

Authors:Lingyu Zhu, Xiangrui Zeng, Bolin Chen, Peilin Chen, Yung-Hui Li, Shiqi Wang
Title: Leveraging Diffusion Knowledge for Generative Image Compression with Fractal Frequency-Aware Band Learning
Abstract:
By optimizing the rate-distortion-realism trade-off, generative image compression approaches produce detailed, realistic images instead of the only sharp-looking reconstructions produced by rate-distortion-optimized models. In this paper, we propose a novel deep learning-based generative image compression method injected with diffusion knowledge, obtaining the capacity to recover more realistic textures in practical scenarios. Efforts are made from three perspectives to navigate the rate-distortion-realism trade-off in the generative image compression task. First, recognizing the strong connection between image texture and frequency-domain characteristics, we design a Fractal Frequency-Aware Band Image Compression (FFAB-IC) network to effectively capture the directional frequency components inherent in natural images. This network integrates commonly used fractal band feature operations within a neural non-linear mapping design, enhancing its ability to retain essential given information and filter out unnecessary details. Then, to improve the visual quality of image reconstruction under limited bandwidth, we integrate diffusion knowledge into the encoder and implement diffusion iterations into the decoder process, thus effectively recovering lost texture details. Finally, to fully leverage the spatial and frequency intensity information, we incorporate frequency- and content-aware regularization terms to regularize the training of the generative image compression network. Extensive experiments in quantitative and qualitative evaluations demonstrate the superiority of the proposed method, advancing the boundaries of achievable distortion-realism pairs, i.e., our method achieves better distortions at high realism and better realism at low distortion than ever before.
中文: 本文提出了一种融合扩散知识和频率感知网络的新型生成式图像压缩方法,通过优化率-失真-真实感权衡,在压缩图像中恢复更真实的纹理细节,显著提升了失真与真实感的综合性能边界。
English: This paper introduces a novel generative image compression method that integrates diffusion knowledge and a frequency-aware network to optimize the rate-distortion-realism trade-off, producing more realistic textures and advancing the achievable boundaries of distortion-realism performance.

Authors:Rabimba Karanjai, Sam Blackshear, Lei Xu, Weidong Shi
Title: Collaboration is all you need: LLM Assisted Safe Code Translation
Abstract:
This paper introduces UniTranslator, a visionary framework that re-imagines code translation as a collaborative endeavor among multiple, compact LLMs. By orchestrating the interaction of specialized agents, each focused on different aspects of the translation process and grounded in a deep understanding of programming concepts, UniTranslator achieves a level of accuracy and efficiency that rivals larger, monolithic models. Our preliminary evaluation demonstrates the potential of UniTranslator to overcome the limitations of existing approaches and unlock the power of smaller LLMs for complex code translation tasks. We explore the effectiveness of this dynamic multi-agent paradigm in handling diverse language pairs, including low-resource languages, and in mitigating common issues such as code artifacts and hallucinations through the use of Natural Language Inference (NLI) grounding and iterative feedback mechanisms
UniTranslator 是一种创新框架,通过协调多个专业化的小型大语言模型,利用自然语言推理和迭代反馈机制,显著提升了代码翻译的准确性和效率,有效处理多语言任务并减少错误。
UniTranslator is a novel framework that enhances code translation accuracy and efficiency by coordinating multiple specialized small LLMs, leveraging natural language inference and iterative feedback to handle diverse languages and reduce errors.

Authors:Norbert Tihanyi, Tamas Bisztray, Mohamed Amine Ferrag, Bilel Cherif, Richard A. Dubniczky, Ridhi Jain, Lucas C. Cordeiro
Title: Vulnerability Detection: From Formal Verification to Large Language Models and Hybrid Approaches: A Comprehensive Overview
Abstract:
Software testing and verification are critical for ensuring the reliability and security of modern software systems. Traditionally, formal verification techniques, such as model checking and theorem proving, have provided rigorous frameworks for detecting bugs and vulnerabilities. However, these methods often face scalability challenges when applied to complex, real-world programs. Recently, the advent of Large Language Models (LLMs) has introduced a new paradigm for software analysis, leveraging their ability to understand insecure coding practices. Although LLMs demonstrate promising capabilities in tasks such as bug prediction and invariant generation, they lack the formal guarantees of classical methods. This paper presents a comprehensive study of state-of-the-art software testing and verification, focusing on three key approaches: classical formal methods, LLM-based analysis, and emerging hybrid techniques, which combine their strengths. We explore each approach's strengths, limitations, and practical applications, highlighting the potential of hybrid systems to address the weaknesses of standalone methods. We analyze whether integrating formal rigor with LLM-driven insights can enhance the effectiveness and scalability of software verification, exploring their viability as a pathway toward more robust and adaptive testing frameworks.
Chinese: 本文研究了软件测试中的传统形式化方法、基于大语言模型的分析及混合技术,强调将形式化严谨性与大语言模型的洞察力相结合,可提升验证效果与可扩展性。
English: This paper examines classical formal methods, LLM-based analysis, and hybrid techniques in software testing, highlighting how combining formal rigor with LLM insights can enhance verification effectiveness and scalability.

Authors:Tadesse Destaw Belay, Ahmed Haj Ahmed, Alvin Grissom, Iqra Ameer, Grigori Sidorov, Olga Kolesnikova, Seid Muhie Yimam
Title: CULEMO: Cultural Lenses on Emotion -- Benchmarking LLMs for Cross-Cultural Emotion Understanding
Abstract:
NLP research has increasingly focused on subjective tasks such as emotion analysis. However, existing emotion benchmarks suffer from two major shortcomings: (1) they largely rely on keyword-based emotion recognition, overlooking crucial cultural dimensions required for deeper emotion understanding, and (2) many are created by translating English-annotated data into other languages, leading to potentially unreliable evaluation. To address these issues, we introduce Cultural Lenses on Emotion (CuLEmo), the first benchmark designed to evaluate culture-aware emotion prediction across six languages: Amharic, Arabic, English, German, Hindi, and Spanish. CuLEmo comprises 400 crafted questions per language, each requiring nuanced cultural reasoning and understanding. We use this benchmark to evaluate several state-of-the-art LLMs on culture-aware emotion prediction and sentiment analysis tasks. Our findings reveal that (1) emotion conceptualizations vary significantly across languages and cultures, (2) LLMs performance likewise varies by language and cultural context, and (3) prompting in English with explicit country context often outperforms in-language prompts for culture-aware emotion and sentiment understanding. The dataset and evaluation code are publicly available.
中文:现有情感基准常忽略文化细微差别,依赖关键词或翻译数据,为此我们开发了CuLEmo这一文化感知基准,在六种语言中评估大语言模型,揭示了情感理解中显著的文化和语言差异。
English: Current emotion benchmarks often overlook cultural nuances and rely on keyword-based methods or translated data, prompting the creation of CuLEmo, a culture-aware benchmark that evaluates LLMs across six languages and reveals significant cultural and linguistic variations in emotion understanding.

Authors:Yijiang Fan, Yuren Mao, Longbin Lai, Ying Zhang, Zhengping Qian, Yunjun Gao
Title: G-Boost: Boosting Private SLMs with General LLMs
Abstract:
Due to the limited computational resources, most Large Language Models (LLMs) developers can only fine-tune Small Language Models (SLMs) on their own data. These private SLMs typically have limited effectiveness. To boost the performance of private SLMs, this paper proposes to ask general LLMs for help. The general LLMs can be APIs or larger LLMs whose inference cost the developers can afford. Specifically, we propose the G-Boost framework where a private SLM adaptively performs collaborative inference with a general LLM under the guide of process reward. Experiments demonstrate that our framework can significantly boost the performance of private SLMs.
中文摘要:本文提出G-Boost框架,通过让私有小型语言模型在过程奖励指导下与通用大语言模型进行自适应协同推理,显著提升私有模型的性能表现。
English Summary: This paper introduces the G-Boost framework, enabling private Small Language Models (SLMs) to enhance their performance through adaptive collaborative inference with general LLMs guided by process rewards.

Authors:Luyao Gao, Jianchun Liu, Hongli Xu, Xichong Zhang, Yunming Liao, Liusheng Huang
Title: Collaborative Speculative Inference for Efficient LLM Inference Serving
Abstract:
Speculative inference is a promising paradigm employing small speculative models (SSMs) as drafters to generate draft tokens, which are subsequently verified in parallel by the target large language model (LLM). This approach enhances the efficiency of inference serving by reducing LLM inference latency and costs while preserving generation quality. However, existing speculative methods face critical challenges, including inefficient resource utilization and limited draft acceptance, which constrain their scalability and overall effectiveness. To overcome these obstacles, we present CoSine, a novel speculative inference system that decouples sequential speculative decoding from parallel verification, enabling efficient collaboration among multiple nodes. Specifically, CoSine routes inference requests to specialized drafters based on their expertise and incorporates a confidence-based token fusion mechanism to synthesize outputs from cooperating drafters, ensuring high-quality draft generation. Additionally, CoSine dynamically orchestrates the execution of speculative decoding and verification in a pipelined manner, employing batch scheduling to selectively group requests and adaptive speculation control to minimize idle periods. By optimizing parallel workflows through heterogeneous node collaboration, CoSine balances draft generation and verification throughput in real-time, thereby maximizing resource utilization. Experimental results demonstrate that CoSine achieves superior performance compared to state-of-the-art speculative approaches. Notably, with equivalent resource costs, CoSine achieves up to a 23.2% decrease in latency and a 32.5% increase in throughput compared to baseline methods.
Chinese: 推测推理利用小型模型生成草稿令牌供大型语言模型并行验证以提高效率,但面临资源利用不足等挑战,CoSine通过解耦草稿生成与验证并优化工作流,显著降低延迟并提升吞吐量。
English: Speculative inference uses small models to draft tokens for parallel verification by large language models, improving efficiency but facing challenges like resource underutilization, which CoSine addresses by decoupling drafting from verification and optimizing workflows to significantly reduce latency and boost throughput.

Authors:Shilong Wang, Jianchun Liu, Hongli Xu, Jiaming Yan, Xianjun Gao
Title: Efficient Federated Fine-Tuning of Large Language Models with Layer Dropout
Abstract:
Fine-tuning plays a crucial role in enabling pre-trained LLMs to evolve from general language comprehension to task-specific expertise. To preserve user data privacy, federated fine-tuning is often employed and has emerged as the de facto paradigm. However, federated fine-tuning is prohibitively inefficient due to the tension between LLM complexity and the resource constraint of end devices, incurring unaffordable fine-tuning overhead. Existing literature primarily utilizes parameter-efficient fine-tuning techniques to mitigate communication costs, yet computational and memory burdens continue to pose significant challenges for developers. This work proposes DropPEFT, an innovative federated PEFT framework that employs a novel stochastic transformer layer dropout method, enabling devices to deactivate a considerable fraction of LLMs layers during training, thereby eliminating the associated computational load and memory footprint. In DropPEFT, a key challenge is the proper configuration of dropout ratios for layers, as overhead and training performance are highly sensitive to this setting. To address this challenge, we adaptively assign optimal dropout-ratio configurations to devices through an exploration-exploitation strategy, achieving efficient and effective fine-tuning. Extensive experiments show that DropPEFT can achieve a 1.3-6.3\times speedup in model convergence and a 40%-67% reduction in memory footprint compared to state-of-the-art methods.
Chinese: DropPEFT通过引入随机Transformer层丢弃方法,在联邦PEFT中自适应配置丢弃率,有效降低了计算和内存负担,实现了模型收敛加速和内存占用的显著优化。
English: DropPEFT introduces a stochastic transformer layer dropout method within federated PEFT to reduce computational and memory overhead, achieving significant speedup and memory savings through adaptive dropout-ratio optimization.

Authors:Peng Hu, Chunming He, Lei Xu, Jingduo Tian, Sina Farsiu, Yulun Zhang, Pei Liu, Xiu Li
Title: IQPFR: An Image Quality Prior for Blind Face Restoration and Beyond
Abstract:
Blind Face Restoration (BFR) addresses the challenge of reconstructing degraded low-quality (LQ) facial images into high-quality (HQ) outputs. Conventional approaches predominantly rely on learning feature representations from ground-truth (GT) data; however, inherent imperfections in GT datasets constrain restoration performance to the mean quality level of the training data, rather than attaining maximally attainable visual quality. To overcome this limitation, we propose a novel framework that incorporates an Image Quality Prior (IQP) derived from No-Reference Image Quality Assessment (NR-IQA) models to guide the restoration process toward optimal HQ reconstructions. Our methodology synergizes this IQP with a learned codebook prior through two critical innovations: (1) During codebook learning, we devise a dual-branch codebook architecture that disentangles feature extraction into universal structural components and HQ-specific attributes, ensuring comprehensive representation of both common and high-quality facial characteristics. (2) In the codebook lookup stage, we implement a quality-conditioned Transformer-based framework. NR-IQA-derived quality scores act as dynamic conditioning signals to steer restoration toward the highest feasible quality standard. This score-conditioned paradigm enables plug-and-play enhancement of existing BFR architectures without modifying the original structure. We also formulate a discrete representation-based quality optimization strategy that circumvents over-optimization artifacts prevalent in continuous latent space approaches. Extensive experiments demonstrate that our method outperforms state-of-the-art techniques across multiple benchmarks. Besides, our quality-conditioned framework demonstrates consistent performance improvements when integrated with prior BFR models. The code will be released.
中文: 本文提出了一种新颖的盲人脸恢复框架,通过结合无参考图像质量评估模型提供的图像质量先验与双分支码本,利用质量条件化Transformer实现超越传统方法限制的优质重建效果。
English: This paper introduces a novel Blind Face Restoration framework that integrates an Image Quality Prior from No-Reference Image Quality Assessment models with a dual-branch codebook, using quality-conditioned Transformers to achieve superior high-quality reconstructions beyond conventional methods' limitations.

Authors:Yu Bu, Yulin Zhu, Kai Zhou
Title: Crowdsourced Homophily Ties Based Graph Annotation Via Large Language Model
Abstract:
Accurate graph annotation typically requires substantial labeled data, which is often challenging and resource-intensive to obtain. In this paper, we present Crowdsourced Homophily Ties Based Graph Annotation via Large Language Model (CSA-LLM), a novel approach that combines the strengths of crowdsourced annotations with the capabilities of large language models (LLMs) to enhance the graph annotation process. CSA-LLM harnesses the structural context of graph data by integrating information from 1-hop and 2-hop neighbors. By emphasizing homophily ties - key connections that signify similarity within the graph - CSA-LLM significantly improves the accuracy of annotations. Experimental results demonstrate that this method enhances the performance of Graph Neural Networks (GNNs) by delivering more precise and reliable annotations.
Chinese: CSA-LLM方法结合众包标注与大型语言模型,通过利用1跳和2跳邻居的结构上下文并强调同质性连接来提升图标注的准确性,从而优化图神经网络的性能。
English: The CSA-LLM method leverages crowdsourced annotations and large language models to improve graph annotation accuracy by utilizing structural context from 1-hop and 2-hop neighbors and emphasizing homophily ties, thereby enhancing Graph Neural Network performance.

Authors:Zirui Gong, Yanjun Zhang, Leo Yu Zhang, Zhaoxi Zhang, Yong Xiang, Shirui Pan
Title: Not All Edges are Equally Robust: Evaluating the Robustness of Ranking-Based Federated Learning
Abstract:
Federated Ranking Learning (FRL) is a state-of-the-art FL framework that stands out for its communication efficiency and resilience to poisoning attacks. It diverges from the traditional FL framework in two ways: 1) it leverages discrete rankings instead of gradient updates, significantly reducing communication costs and limiting the potential space for malicious updates, and 2) it uses majority voting on the server side to establish the global ranking, ensuring that individual updates have minimal influence since each client contributes only a single vote. These features enhance the system's scalability and position FRL as a promising paradigm for FL training. However, our analysis reveals that FRL is not inherently robust, as certain edges are particularly vulnerable to poisoning attacks. Through a theoretical investigation, we prove the existence of these vulnerable edges and establish a lower bound and an upper bound for identifying them in each layer. Based on this finding, we introduce a novel local model poisoning attack against FRL, namely the Vulnerable Edge Manipulation (VEM) attack. The VEM attack focuses on identifying and perturbing the most vulnerable edges in each layer and leveraging an optimization-based approach to maximize the attack's impact. Through extensive experiments on benchmark datasets, we demonstrate that our attack achieves an overall 53.23% attack impact and is 3.7x more impactful than existing methods. Our findings highlight significant vulnerabilities in ranking-based FL systems and underline the urgency for the development of new robust FL frameworks.
Chinese: 联邦排序学习(FRL)作为一种通信高效且抗中毒攻击的框架,仍存在特定边缘的脆弱性,易受针对性操纵,脆弱边缘操纵(VEM)攻击的高影响力揭示了其安全隐患。
English: Federated Ranking Learning (FRL) is a communication-efficient framework resilient to poisoning attacks, but it remains vulnerable to targeted manipulation of specific edges, as demonstrated by the high-impact Vulnerable Edge Manipulation (VEM) attack.

Authors:Zhiyuan Zeng, Yizhong Wang, Hannaneh Hajishirzi, Pang Wei Koh
Title: EvalTree: Profiling Language Model Weaknesses via Hierarchical Capability Trees
Abstract:
An ideal model evaluation should achieve two goals: identifying where the model fails and providing actionable improvement guidance. Toward these goals for language model (LM) evaluations, we formulate the problem of generating a weakness profile, a set of weaknesses expressed in natural language, given an LM's performance on every individual instance in a benchmark. We introduce a suite of quantitative assessments to compare different weakness profiling methods. We also introduce a weakness profiling method EvalTree. EvalTree constructs a capability tree where each node represents a capability described in natural language and is linked to a subset of benchmark instances that specifically evaluate this capability; it then extracts nodes where the LM performs poorly to generate a weakness profile. On the MATH and WildChat benchmarks, we show that EvalTree outperforms baseline weakness profiling methods by identifying weaknesses more precisely and comprehensively. Weakness profiling further enables weakness-guided data collection, and training data collection guided by EvalTree-identified weaknesses improves LM performance more than other data collection strategies. We also show how EvalTree exposes flaws in Chatbot Arena's human-voter-based evaluation practice. To facilitate future work, we provide an interface that allows practitioners to interactively explore the capability trees built by EvalTree.
An ideal model evaluation should pinpoint model failures and offer actionable improvement guidance, which EvalTree achieves by generating precise weakness profiles through capability tree analysis, outperforming baseline methods and enabling more effective training data collection.
English Summary:

Authors:Hubert Baniecki, Przemyslaw Biecek
Title: Birds look like cars: Adversarial analysis of intrinsically interpretable deep learning
Abstract:
A common belief is that intrinsically interpretable deep learning models ensure a correct, intuitive understanding of their behavior and offer greater robustness against accidental errors or intentional manipulation. However, these beliefs have not been comprehensively verified, and growing evidence casts doubt on them. In this paper, we highlight the risks related to overreliance and susceptibility to adversarial manipulation of these so-called "intrinsically (aka inherently) interpretable" models by design. We introduce two strategies for adversarial analysis with prototype manipulation and backdoor attacks against prototype-based networks, and discuss how concept bottleneck models defend against these attacks. Fooling the model's reasoning by exploiting its use of latent prototypes manifests the inherent uninterpretability of deep neural networks, leading to a false sense of security reinforced by a visual confirmation bias. The reported limitations of part-prototype networks put their trustworthiness and applicability into question, motivating further work on the robustness and alignment of (deep) interpretable models.
Chinese: 该研究通过展示原型操纵和后门攻击等对抗性策略如何利用潜在原型破坏模型的可信度,质疑了内在可解释深度学习模型所假定的可靠性。
English: The study challenges the assumed reliability of intrinsically interpretable deep learning models by demonstrating their vulnerability to adversarial attacks, such as prototype manipulation and backdoor attacks, which exploit latent prototypes and undermine their trustworthiness.

Authors:Oana Balmau, Anne-Marie Kermarrec, Rafael Pires, André Loureiro Espírito Santo, Martijn de Vos, Milos Vujasinovic
Title: Accelerating MoE Model Inference with Expert Sharding
Abstract:
Mixture of experts (MoE) models achieve state-of-the-art results in language modeling but suffer from inefficient hardware utilization due to imbalanced token routing and communication overhead. While prior work has focused on optimizing MoE training and decoder architectures, inference for encoder-based MoE models in a multi-GPU with expert parallelism setting remains underexplored. We introduce MoEShard, an inference system that achieves perfect load balancing through tensor sharding of MoE experts. Unlike existing approaches that rely on heuristic capacity factors or drop tokens, MoEShard evenly distributes computation across GPUs and ensures full token retention, maximizing utilization regardless of routing skewness. We achieve this through a strategic row- and column-wise decomposition of expert matrices. This reduces idle time and avoids bottlenecks caused by imbalanced expert assignments. Furthermore, MoEShard minimizes kernel launches by fusing decomposed expert computations, significantly improving throughput. We evaluate MoEShard against DeepSpeed on encoder-based architectures, demonstrating speedups of up to 6.4$\times$ in time to first token (TTFT). Our results show that tensor sharding, when properly applied to experts, is a viable and effective strategy for efficient MoE inference.
中文: MoEShard是一种创新的推理系统,通过专家张量的行列分解策略,在编码器混合专家模型中实现完美负载均衡,相比现有方法可保留全部令牌并提升首令牌生成速度达6.4倍。
English: MoEShard is an innovative inference system that achieves perfect load balancing in encoder-based mixture of experts models through strategic tensor sharding of experts, enabling full token retention and up to 6.4× faster time to first token compared to existing approaches.

Authors:Tao Shen, Didi Zhu, Ziyu Zhao, Zexi Li, Chao Wu, Fei Wu
Title: Will LLMs Scaling Hit the Wall? Breaking Barriers via Distributed Resources on Massive Edge Devices
Abstract:
The remarkable success of foundation models has been driven by scaling laws, demonstrating that model performance improves predictably with increased training data and model size. However, this scaling trajectory faces two critical challenges: the depletion of high-quality public data, and the prohibitive computational power required for larger models, which have been monopolized by tech giants. These two bottlenecks pose significant obstacles to the further development of AI. In this position paper, we argue that leveraging massive distributed edge devices can break through these barriers. We reveal the vast untapped potential of data and computational resources on massive edge devices, and review recent technical advancements in distributed/federated learning that make this new paradigm viable. Our analysis suggests that by collaborating on edge devices, everyone can participate in training large language models with small edge devices. This paradigm shift towards distributed training on edge has the potential to democratize AI development and foster a more inclusive AI community.
中文: 基础模型的扩展受限于高质量数据和计算资源,但通过联邦学习利用分布式边缘设备,可以实现协作训练,从而推动人工智能的民主化发展。
English: The scaling of foundation models is hindered by limited high-quality data and computational resources, but leveraging distributed edge devices through federated learning can democratize AI development by enabling collaborative training.

Authors:Philipp Straubinger, Marvin Kreis, Stephan Lukasczyk, Gordon Fraser
Title: Mutation Testing via Iterative Large Language Model-Driven Scientific Debugging
Abstract:
Large Language Models (LLMs) can generate plausible test code. Intuitively they generate this by imitating tests seen in their training data, rather than reasoning about execution semantics. However, such reasoning is important when applying mutation testing, where individual tests need to demonstrate differences in program behavior between a program and specific artificial defects (mutants). In this paper, we evaluate whether Scientific Debugging, which has been shown to help LLMs when debugging, can also help them to generate tests for mutants. In the resulting approach, LLMs form hypotheses about how to kill specific mutants, and then iteratively generate and refine tests until they succeed, all with detailed explanations for each step. We compare this method to three baselines: (1) directly asking the LLM to generate tests, (2) repeatedly querying the LLM when tests fail, and (3) search-based test generation with Pynguin. Our experiments evaluate these methods based on several factors, including mutation score, code coverage, success rate, and the ability to identify equivalent mutants. The results demonstrate that LLMs, although requiring higher computation cost, consistently outperform Pynguin in generating tests with better fault detection and coverage. Importantly, we observe that the iterative refinement of test cases is important for achieving high-quality test suites.
中文: 采用科学调试方法的大语言模型通过迭代生成和优化针对变异体的测试,尽管计算成本更高,但在缺陷检测和覆盖率方面始终优于Pynguin等传统方法。
English: Large Language Models (LLMs) using Scientific Debugging iteratively generate and refine tests for mutants, outperforming traditional methods like Pynguin in fault detection and coverage despite higher computational costs.

Authors:Jianhui Wang, Zhifei Yang, Yangfan He, Huixiong Zhang, Yuxuan Chen, Jingwei Huang
Title: MaRI: Material Retrieval Integration across Domains
Abstract:
Accurate material retrieval is critical for creating realistic 3D assets. Existing methods rely on datasets that capture shape-invariant and lighting-varied representations of materials, which are scarce and face challenges due to limited diversity and inadequate real-world generalization. Most current approaches adopt traditional image search techniques. They fall short in capturing the unique properties of material spaces, leading to suboptimal performance in retrieval tasks. Addressing these challenges, we introduce MaRI, a framework designed to bridge the feature space gap between synthetic and real-world materials. MaRI constructs a shared embedding space that harmonizes visual and material attributes through a contrastive learning strategy by jointly training an image and a material encoder, bringing similar materials and images closer while separating dissimilar pairs within the feature space. To support this, we construct a comprehensive dataset comprising high-quality synthetic materials rendered with controlled shape variations and diverse lighting conditions, along with real-world materials processed and standardized using material transfer techniques. Extensive experiments demonstrate the superior performance, accuracy, and generalization capabilities of MaRI across diverse and complex material retrieval tasks, outperforming existing methods.
Chinese: 我们提出了MaRI框架,通过对比学习构建共享嵌入空间,弥合合成与现实世界材料之间的差距,显著提升了材料检索的准确性和在不同任务中的泛化能力。
English: We introduce MaRI, a framework that bridges the gap between synthetic and real-world materials by creating a shared embedding space through contrastive learning, significantly improving material retrieval accuracy and generalization across diverse tasks.

Authors:Kanghui Ning, Zijie Pan, Yu Liu, Yushan Jiang, James Y. Zhang, Kashif Rasul, Anderson Schneider, Lintao Ma, Yuriy Nevmyvaka, Dongjin Song
Title: TS-RAG: Retrieval-Augmented Generation based Time Series Foundation Models are Stronger Zero-Shot Forecaster
Abstract:
Large Language Models (LLMs) and Foundation Models (FMs) have recently become prevalent for time series forecasting tasks. While fine-tuning LLMs enables domain adaptation, they often struggle to generalize across diverse and unseen datasets. Moreover, existing Time Series Foundation Models (TSFMs) still face challenges in handling non-stationary dynamics and distribution shifts, largely due to the lack of effective mechanisms for adaptation. To this end, we present TS-RAG, a retrieval-augmented generation framework for time series forecasting that enhances the generalization and interpretability of TSFMs. Specifically, TS-RAG leverages pre-trained time series encoders to retrieve semantically relevant segments from a dedicated knowledge base, enriching the contextual representation of the input query. Furthermore, we propose an Adaptive Retrieval Mixer (ARM) module that dynamically fuses the retrieved patterns with the TSFM's internal representation, improving forecasting accuracy without requiring task-specific fine-tuning. Thorough empirical studies on seven public benchmark datasets demonstrate that TS-RAG achieves state-of-the-art zero-shot forecasting performance, outperforming the existing TSFMs by up to 6.84% across diverse domains while also providing desirable interpretability.
中文:TS-RAG提出了一种检索增强生成框架,通过动态融合相关历史模式来改进时间序列预测,无需任务特定微调即可实现领先的零样本预测性能。
English: TS-RAG introduces a retrieval-augmented generation framework that enhances time series forecasting by dynamically integrating relevant historical patterns, achieving state-of-the-art zero-shot performance without task-specific fine-tuning.

Authors:Chengmeng Li, Junjie Wen, Yan Peng, Yaxin Peng, Feifei Feng, Yichen Zhu
Title: PointVLA: Injecting the 3D World into Vision-Language-Action Models
Abstract:
Vision-Language-Action (VLA) models excel at robotic tasks by leveraging large-scale 2D vision-language pretraining, but their reliance on RGB images limits spatial reasoning critical for real-world interaction. Retraining these models with 3D data is computationally prohibitive, while discarding existing 2D datasets wastes valuable resources. To bridge this gap, we propose PointVLA, a framework that enhances pre-trained VLAs with point cloud inputs without requiring retraining. Our method freezes the vanilla action expert and injects 3D features via a lightweight modular block. To identify the most effective way of integrating point cloud representations, we conduct a skip-block analysis to pinpoint less useful blocks in the vanilla action expert, ensuring that 3D features are injected only into these blocks--minimizing disruption to pre-trained representations. Extensive experiments demonstrate that PointVLA outperforms state-of-the-art 2D imitation learning methods, such as OpenVLA, Diffusion Policy and DexVLA, across both simulated and real-world robotic tasks. Specifically, we highlight several key advantages of PointVLA enabled by point cloud integration: (1) Few-shot multi-tasking, where PointVLA successfully performs four different tasks using only 20 demonstrations each; (2) Real-vs-photo discrimination, where PointVLA distinguishes real objects from their images, leveraging 3D world knowledge to improve safety and reliability; (3) Height adaptability, Unlike conventional 2D imitation learning methods, PointVLA enables robots to adapt to objects at varying table height that unseen in train data. Furthermore, PointVLA achieves strong performance in long-horizon tasks, such as picking and packing objects from a moving conveyor belt, showcasing its ability to generalize across complex, dynamic environments.
中文: PointVLA通过轻量级模块将点云输入集成到预训练的视觉-语言-动作模型中,无需重新训练即可在机器人任务中实现卓越性能,具备少样本多任务处理和真实世界适应性等优势。
English: PointVLA enhances pre-trained Vision-Language-Action models by integrating point cloud inputs through a lightweight modular block without retraining, enabling superior performance in robotic tasks with advantages like few-shot multi-tasking and real-world adaptability.

Authors:Bangyan Li, Wenxuan Huang, Zhenkun Gao, Yeqiang Wang, Yunhang Shen, Jingzhong Lin, Ling You, Yuxiang Shen, Shaohui Lin, Wanli Ouyang, Yuling Sun
Title: LLaVA-RadZ: Can Multimodal Large Language Models Effectively Tackle Zero-shot Radiology Recognition?
Abstract:
Recently, Multimodal Large Language Models (MLLMs) have demonstrated exceptional capabilities in visual understanding and reasoning across various vision-language tasks. However, we found that MLLMs cannot process effectively from fine-grained medical image data in the traditional Visual Question Answering (VQA) pipeline, as they do not exploit the captured features and available medical knowledge fully, results in MLLMs usually performing poorly in zero-shot medical disease recognition. Fortunately, this limitation does not indicate that MLLMs are fundamentally incapable of addressing fine-grained recognition tasks. From a feature representation perspective, MLLMs demonstrate considerable potential for tackling such challenging problems. Thus, to address this challenge, we propose LLaVA-RadZ, a simple yet effective framework for zero-shot medical disease recognition via utilizing the existing MLLM features. Specifically, we design an end-to-end training strategy, termed Decoding-Side Feature Alignment Training (DFAT) to take advantage of the characteristics of the MLLM decoder architecture and incorporate modality-specific tokens tailored for different modalities. Additionally, we introduce a Domain Knowledge Anchoring Module (DKAM) to exploit the intrinsic medical knowledge of large models, which mitigates the category semantic gap in image-text alignment. Extensive experiments demonstrate that our LLaVA-RadZ significantly outperforms traditional MLLMs in zero-shot disease recognition, achieving the comparable performance to the well-established and highly-optimized CLIP-based approaches.
中文: 多模态大语言模型在传统视觉问答流程中因未能充分利用医学图像特征和知识而表现不佳,但通过提出的LLaVA-RadZ框架,结合特征对齐和领域知识锚定模块,显著提升了零样本疾病识别能力,达到与优化CLIP方法相当的性能。
English: Multimodal Large Language Models (MLLMs) struggle with fine-grained medical image analysis in traditional VQA pipelines due to underutilized features and medical knowledge, but the proposed LLaVA-RadZ framework, incorporating feature alignment and domain knowledge modules, significantly enhances zero-shot disease recognition to match CLIP-based methods.

Authors:Haoran Xu, Peixi Peng, Guang Tan, Yiqian Chang, Yisen Zhao, Yonghong Tian
Title: Temporal Triplane Transformers as Occupancy World Models
Abstract:
World models aim to learn or construct representations of the environment that enable the prediction of future scenes, thereby supporting intelligent motion planning. However, existing models often struggle to produce fine-grained predictions and to operate in real time. In this work, we propose T$^3$Former, a novel 4D occupancy world model for autonomous driving. T$^3$Former begins by pre-training a compact {\em triplane} representation that efficiently encodes 3D occupancy. It then extracts multi-scale temporal motion features from historical triplanes and employs an autoregressive approach to iteratively predict future triplane changes. Finally, these triplane changes are combined with previous states to decode future occupancy and ego-motion trajectories. Experimental results show that T$^3$Former achieves 1.44$\times$ speedup (26 FPS), improves mean IoU to 36.09, and reduces mean absolute planning error to 1.0 meters. Demos are available in the supplementary material.
中文: 提出的Delta-Triplane Transformers(DTT)模型通过紧凑三平面表示和增量式占据预测,实现了更快速、更精确的自动驾驶场景预测与运动规划。
English: The proposed Delta-Triplane Transformers (DTT) model introduces a compact triplane representation and incremental occupancy prediction to achieve faster, more accurate autonomous driving scene forecasting and motion planning.

Authors:Haoran Xu, Peixi Peng, Guang Tan, Yiqian Chang, Yisen Zhao, Yonghong Tian
Title: Delta-Triplane Transformers as Occupancy World Models
Abstract:
Occupancy World Models (OWMs) aim to predict future scenes via 3D voxelized representations of the environment to support intelligent motion planning. Existing approaches typically generate full future occupancy states from VAE-style latent encodings, which can be computationally expensive and redundant. We propose Delta-Triplane Transformers (DTT), a novel 4D OWM for autonomous driving, that introduces two key innovations: (1) a triplane based representation that encodes 3D occupancy more compactly than previous approaches, and (2) an incremental prediction strategy for OWM that models {\em changes} in occupancy rather than dealing with full states. The core insight is that changes in the compact 3D latent space are naturally sparser and easier to model, enabling higher accuracy with a lighter-weight architecture. Building on this representation, DTT extracts multi-scale motion features from historical data and iteratively predict future triplane deltas. These deltas are combined with past states to decode future occupancy and ego-motion trajectories. Extensive experiments demonstrate that DTT delivers a 1.44$\times$ speedup (26 FPS) over the state of the art, improves mean IoU to 30.85, and reduces the mean absolute planning error to 1.0 meters. Demo videos are provided in the supplementary material.
中文: 提出的Delta-Triplane Transformers(DTT)模型通过紧凑三平面表示和增量式占据预测,实现了更快速、更精确的自动驾驶场景预测与运动规划。
English: The proposed Delta-Triplane Transformers (DTT) model introduces a compact triplane representation and incremental occupancy prediction to achieve faster, more accurate autonomous driving scene forecasting and motion planning.

Authors:Ziliang Miao, Runjian Chen, Yixi Cai, Buwei He, Wenquan Zhao, Wenqi Shao, Bo Zhang, Fu Zhang
Title: Temporal Overlapping Prediction: A Self-supervised Pre-training Method for LiDAR Moving Object Segmentation
Abstract:
Moving object segmentation (MOS) on LiDAR point clouds is crucial for autonomous systems like self-driving vehicles. Previous supervised approaches rely heavily on costly manual annotations, while LiDAR sequences naturally capture temporal motion cues that can be leveraged for self-supervised learning. In this paper, we propose \textbf{T}emporal \textbf{O}verlapping \textbf{P}rediction (\textbf{TOP}), a self-supervised pre-training method that alleviate the labeling burden for MOS. \textbf{TOP} explores the temporal overlapping points that commonly observed by current and adjacent scans, and learns spatiotemporal representations by predicting the occupancy states of temporal overlapping points. Moreover, we utilize current occupancy reconstruction as an auxiliary pre-training objective, which enhances the current structural awareness of the model. We conduct extensive experiments and observe that the conventional metric Intersection-over-Union (IoU) shows strong bias to objects with more scanned points, which might neglect small or distant objects. To compensate for this bias, we introduce an additional metric called $\text{mIoU}_{\text{obj}}$ to evaluate object-level performance. Experiments on nuScenes and SemanticKITTI show that \textbf{TOP} outperforms both supervised training-from-scratch baseline and other self-supervised pre-training baselines by up to 28.77\% relative improvement, demonstrating strong transferability across LiDAR setups and generalization to other tasks. Code and pre-trained models will be publicly available upon publication.
中文摘要:本文提出时序重叠预测(TOP)方法,通过利用激光雷达序列中的时序运动线索进行自监督预训练,有效降低移动物体分割对标注数据的依赖,在多个数据集上实现了显著的性能提升。
English Summary: This paper introduces Temporal Overlapping Prediction (TOP), a self-supervised pre-training method that leverages temporal motion cues in LiDAR sequences to reduce annotation dependency for moving object segmentation, achieving significant performance improvements across multiple datasets.

Authors:Ziliang Miao, Runjian Chen, Yixi Cai, Buwei He, Wenquan Zhao, Wenqi Shao, Bo Zhang, Fu Zhang
Title: Temporal Overlapping Prediction: A Self-supervised Pre-training Method for LiDAR Moving Object Segmentation
Abstract:
Moving object segmentation (MOS) on LiDAR point clouds is crucial for autonomous systems like self-driving vehicles. Previous supervised approaches rely heavily on costly manual annotations, while LiDAR sequences naturally capture temporal motion cues that can be leveraged for self-supervised learning. In this paper, we propose Temporal Overlapping Prediction (TOP), a self-supervised pre-training method that alleviate the labeling burden for MOS. TOP explores the temporal overlapping points that commonly observed by current and adjacent scans, and learns spatiotemporal representations by predicting the occupancy states of temporal overlapping points. Moreover, we utilize current occupancy reconstruction as an auxiliary pre-training objective, which enhances the current structural awareness of the model. We conduct extensive experiments and observe that the conventional metric Intersection-over-Union (IoU) shows strong bias to objects with more scanned points, which might neglect small or distant objects. To compensate for this bias, we introduce an additional metric called mIoU_obj to evaluate object-level performance. Experiments on nuScenes and SemanticKITTI show that TOPoutperforms both supervised training-from-scratch baseline and other self-supervised pre-training baselines by up to 28.77% relative improvement, demonstrating strong transferability across LiDAR setups and generalization to other tasks. Code and pre-trained models will be publicly available upon publication.
中文摘要:本文提出时序重叠预测(TOP)方法,通过利用激光雷达序列中的时序运动线索进行自监督预训练,有效降低移动物体分割对标注数据的依赖,在多个数据集上实现了显著的性能提升。
English Summary: This paper introduces Temporal Overlapping Prediction (TOP), a self-supervised pre-training method that leverages temporal motion cues in LiDAR sequences to reduce annotation dependency for moving object segmentation, achieving significant performance improvements across multiple datasets.

Authors:Apivich Hemachandra, Gregory Kang Ruey Lau, See-Kiong Ng, Bryan Kian Hsiang Low
Title: PIED: Physics-Informed Experimental Design for Inverse Problems
Abstract:
In many science and engineering settings, system dynamics are characterized by governing PDEs, and a major challenge is to solve inverse problems (IPs) where unknown PDE parameters are inferred based on observational data gathered under limited budget. Due to the high costs of setting up and running experiments, experimental design (ED) is often done with the help of PDE simulations to optimize for the most informative design parameters to solve such IPs, prior to actual data collection. This process of optimizing design parameters is especially critical when the budget and other practical constraints make it infeasible to adjust the design parameters between trials during the experiments. However, existing experimental design (ED) methods tend to require sequential and frequent design parameter adjustments between trials. Furthermore, they also have significant computational bottlenecks due to the need for complex numerical simulations for PDEs, and do not exploit the advantages provided by physics informed neural networks (PINNs), such as its meshless solutions, differentiability, and amortized training. This work presents PIED, the first ED framework that makes use of PINNs in a fully differentiable architecture to perform continuous optimization of design parameters for IPs for one-shot deployments. PIED overcomes existing methods' computational bottlenecks through parallelized computation and meta-learning of PINN parameter initialization, and proposes novel methods to effectively take into account PINN training dynamics in optimizing the ED parameters. Through experiments based on noisy simulated data and even real world experimental data, we empirically show that given limited observation budget, PIED significantly outperforms existing ED methods in solving IPs, including challenging settings where the inverse parameters are unknown functions rather than just finite-dimensional.
Chinese: 本研究提出了PIED框架,首次利用物理信息神经网络(PINN)的全可微分架构,通过并行计算和元学习优化实验设计参数,在有限观测预算下显著提升了求解反演问题的性能,尤其适用于参数为未知函数的复杂场景。
English: This work introduces PIED, a novel experimental design framework that leverages physics-informed neural networks (PINNs) in a fully differentiable architecture to optimize design parameters for one-shot deployment in solving inverse problems, overcoming computational bottlenecks and outperforming existing methods under limited observation budgets.

Authors:Jongwoo Ko, Tianyi Chen, Sungnyun Kim, Tianyu Ding, Luming Liang, Ilya Zharkov, Se-Young Yun
Title: DistiLLM-2: A Contrastive Approach Boosts the Distillation of LLMs
Abstract:
Despite the success of distillation in large language models (LLMs), most prior work applies identical loss functions to both teacher- and student-generated data. These strategies overlook the synergy between loss formulations and data types, leading to a suboptimal performance boost in student models. To address this, we propose DistiLLM-2, a contrastive approach that simultaneously increases the likelihood of teacher responses and decreases that of student responses by harnessing this synergy. Our extensive experiments show that DistiLLM-2 not only builds high-performing student models across a wide range of tasks, including instruction-following and code generation, but also supports diverse applications, such as preference alignment and vision-language extensions. These findings highlight the potential of a contrastive approach to enhance the efficacy of LLM distillation by effectively aligning teacher and student models across varied data types.
中文: DistiLLM-2提出了一种对比方法,通过提高教师回答概率并降低学生回答概率来增强学生模型性能,在多种任务和应用中展现出显著效果。
English: DistiLLM-2 introduces a contrastive method that enhances student model performance by increasing the likelihood of teacher responses while decreasing student-generated ones, demonstrating effectiveness across diverse tasks and applications.

Authors:Keyu Du, Hao Xu, Haipeng Li, Hong Qu, Chi-Wing Fu, Shuaicheng Liu
Title: HybridReg: Robust 3D Point Cloud Registration with Hybrid Motions
Abstract:
Scene-level point cloud registration is very challenging when considering dynamic foregrounds. Existing indoor datasets mostly assume rigid motions, so the trained models cannot robustly handle scenes with non-rigid motions. On the other hand, non-rigid datasets are mainly object-level, so the trained models cannot generalize well to complex scenes. This paper presents HybridReg, a new approach to 3D point cloud registration, learning uncertainty mask to account for hybrid motions: rigid for backgrounds and non-rigid/rigid for instance-level foregrounds. First, we build a scene-level 3D registration dataset, namely HybridMatch, designed specifically with strategies to arrange diverse deforming foregrounds in a controllable manner. Second, we account for different motion types and formulate a mask-learning module to alleviate the interference of deforming outliers. Third, we exploit a simple yet effective negative log-likelihood loss to adopt uncertainty to guide the feature extraction and correlation computation. To our best knowledge, HybridReg is the first work that exploits hybrid motions for robust point cloud registration. Extensive experiments show HybridReg's strengths, leading it to achieve state-of-the-art performance on both widely-used indoor and outdoor datasets.
Chinese: HybridReg提出了一种新的三维点云配准方法,通过学习不确定性掩码处理混合运动(刚性背景与非刚性/刚性前景),在室内外数据集上均实现了最先进的性能。
English: HybridReg introduces a novel 3D point cloud registration method that learns uncertainty masks to handle hybrid motions—rigid backgrounds and non-rigid/rigid foregrounds—achieving state-of-the-art results on indoor and outdoor datasets.

Authors:Wentao Wu, Chenglong Li, Xiao Wang, Bin Luo, Qi Liu
Title: Large Language Model Guided Progressive Feature Alignment for Multimodal UAV Object Detection
Abstract:
Existing multimodal UAV object detection methods often overlook the impact of semantic gaps between modalities, which makes it difficult to achieve accurate semantic and spatial alignments, limiting detection performance. To address this problem, we propose a Large Language Model (LLM) guided Progressive feature Alignment Network called LPANet, which leverages the semantic features extracted from a large language model to guide the progressive semantic and spatial alignment between modalities for multimodal UAV object detection. To employ the powerful semantic representation of LLM, we generate the fine-grained text descriptions of each object category by ChatGPT and then extract the semantic features using the large language model MPNet. Based on the semantic features, we guide the semantic and spatial alignments in a progressive manner as follows. First, we design the Semantic Alignment Module (SAM) to pull the semantic features and multimodal visual features of each object closer, alleviating the semantic differences of objects between modalities. Second, we design the Explicit Spatial alignment Module (ESM) by integrating the semantic relations into the estimation of feature-level offsets, alleviating the coarse spatial misalignment between modalities. Finally, we design the Implicit Spatial alignment Module (ISM), which leverages the cross-modal correlations to aggregate key features from neighboring regions to achieve implicit spatial alignment. Comprehensive experiments on two public multimodal UAV object detection datasets demonstrate that our approach outperforms state-of-the-art multimodal UAV object detectors.
中文摘要:提出的LPANet利用大语言模型特征逐步实现多模态无人机数据的语义和空间对齐,在目标检测性能上超越了现有最优方法。
English Summary: The proposed LPANet uses large language model features to progressively align multimodal UAV data both semantically and spatially, achieving superior object detection performance over existing methods.

Authors:Zhiheng Yu, Jiancheng An, Lu Gan, Hongbin Li, Symeon Chatzinotas
Title: Weighted Codebook Scheme for RIS-Assisted Point-to-Point MIMO Communications
Abstract:
Reconfigurable intelligent surfaces (RIS) can reshape the characteristics of wireless channels by intelligently regulating the phase shifts of reflecting elements. Recently, various codebook schemes have been utilized to optimize the reflection coefficients (RCs); however, the selection of the optimal codeword is usually obtained by evaluating a metric of interest. In this letter, we propose a novel weighted design on the discrete Fourier transform (DFT) codebook to obtain the optimal RCs for RIS-assisted point-to-point multiple-input multiple-output (MIMO) systems. Specifically, we first introduce a channel training protocol where we configure the RIS RCs using the DFT codebook to obtain a set of observations through the uplink training process. Secondly, based on these observed samples, the Lagrange multiplier method is utilized to optimize the weights in an iterative manner, which could result in a higher channel capacity for assisting in the downlink data transmission. Thirdly, we investigate the effect of different codeword configuration orders on system performance and design an efficient codeword configuration method based on statistical channel state information (CSI). Finally, numerical simulations are provided to demonstrate the performance of the proposed scheme.
中文: 本文提出了一种新型加权DFT码本设计,通过拉格朗日乘子迭代优化和基于统计CSI的码字配置方法,有效提升RIS辅助MIMO系统的信道容量。
English: This letter introduces a novel weighted DFT codebook design to optimize reflection coefficients for RIS-assisted MIMO systems, employing iterative Lagrange multiplier optimization and efficient codeword configuration to enhance channel capacity.

Authors:Hanqing Liu, Shouwei Ruan, Yao Huang, Shiji Zhao, Xingxing Wei
Title: When Lighting Deceives: Exposing Vision-Language Models' Illumination Vulnerability Through Illumination Transformation Attack
Abstract:
Vision-Language Models (VLMs) have achieved remarkable success in various tasks, yet their robustness to real-world illumination variations remains largely unexplored. To bridge this gap, we propose \textbf{I}llumination \textbf{T}ransformation \textbf{A}ttack (\textbf{ITA}), the first framework to systematically assess VLMs' robustness against illumination changes. However, there still exist two key challenges: (1) how to model global illumination with fine-grained control to achieve diverse lighting conditions and (2) how to ensure adversarial effectiveness while maintaining naturalness. To address the first challenge, we innovatively decompose global illumination into multiple parameterized point light sources based on the illumination rendering equation. This design enables us to model more diverse lighting variations that previous methods could not capture. Then, by integrating these parameterized lighting variations with physics-based lighting reconstruction techniques, we could precisely render such light interactions in the original scenes, finally meeting the goal of fine-grained lighting control. For the second challenge, by controlling illumination through the lighting reconstrution model's latent space rather than direct pixel manipulation, we inherently preserve physical lighting priors. Furthermore, to prevent potential reconstruction artifacts, we design additional perceptual constraints for maintaining visual consistency with original images and diversity constraints for avoiding light source convergence. Extensive experiments demonstrate that our ITA could significantly reduce the performance of advanced VLMs, e.g., LLaVA-1.6, while possessing competitive naturalness, exposing VLMS' critical illuminiation vulnerabilities.
中文: 提出的光照变换攻击(ITA)通过参数化光源建模和基于物理的渲染系统评估视觉语言模型对光照变化的脆弱性,同时利用潜在空间控制和感知约束保持图像自然性,实验表明该方法能显著降低先进模型的性能。
English: The proposed Illumination Transformation Attack (ITA) systematically evaluates vision-language models' vulnerability to illumination changes by modeling diverse lighting through parameterized light sources and physics-based rendering, while maintaining naturalness through latent space control and perceptual constraints.

Authors:Xin Liu, Jie Liu, Jie Tang, Gangshan Wu
Title: CATANet: Efficient Content-Aware Token Aggregation for Lightweight Image Super-Resolution
Abstract:
Transformer-based methods have demonstrated impressive performance in low-level visual tasks such as Image Super-Resolution (SR). However, its computational complexity grows quadratically with the spatial resolution. A series of works attempt to alleviate this problem by dividing Low-Resolution images into local windows, axial stripes, or dilated windows. SR typically leverages the redundancy of images for reconstruction, and this redundancy appears not only in local regions but also in long-range regions. However, these methods limit attention computation to content-agnostic local regions, limiting directly the ability of attention to capture long-range dependency. To address these issues, we propose a lightweight Content-Aware Token Aggregation Network (CATANet). Specifically, we propose an efficient Content-Aware Token Aggregation module for aggregating long-range content-similar tokens, which shares token centers across all image tokens and updates them only during the training phase. Then we utilize intra-group self-attention to enable long-range information interaction. Moreover, we design an inter-group cross-attention to further enhance global information interaction. The experimental results show that, compared with the state-of-the-art cluster-based method SPIN, our method achieves superior performance, with a maximum PSNR improvement of 0.33dB and nearly double the inference speed.
中文: 基于Transformer的方法在图像超分辨率任务中表现出色但计算复杂度高,为此提出的CATANet通过高效聚合内容相似的标记来捕获长程依赖,实现了更优的性能和更快的推理速度。
English: Transformer-based methods excel in image super-resolution but face high computational costs, which the proposed CATANet addresses by efficiently aggregating content-similar tokens to capture long-range dependencies, achieving superior performance and faster inference.

Authors:Xiangyan Qu, Jing Yu, Jiamin Zhuang, Gaopeng Gou, Gang Xiong, Qi Wu
Title: MADS: Multi-Attribute Document Supervision for Zero-Shot Image Classification
Abstract:
Zero-shot learning (ZSL) aims to train a model on seen classes and recognize unseen classes by knowledge transfer through shared auxiliary information. Recent studies reveal that documents from encyclopedias provide helpful auxiliary information. However, existing methods align noisy documents, entangled in visual and non-visual descriptions, with image regions, yet solely depend on implicit learning. These models fail to filter non-visual noise reliably and incorrectly align non-visual words to image regions, which is harmful to knowledge transfer. In this work, we propose a novel multi-attribute document supervision framework to remove noises at both document collection and model learning stages. With the help of large language models, we introduce a novel prompt algorithm that automatically removes non-visual descriptions and enriches less-described documents in multiple attribute views. Our proposed model, MADS, extracts multi-view transferable knowledge with information decoupling and semantic interactions for semantic alignment at local and global levels. Besides, we introduce a model-agnostic focus loss to explicitly enhance attention to visually discriminative information during training, also improving existing methods without additional parameters. With comparable computation costs, MADS consistently outperforms the SOTA by 7.2% and 8.2% on average in three benchmarks for document-based ZSL and GZSL settings, respectively. Moreover, we qualitatively offer interpretable predictions from multiple attribute views.
中文摘要:提出的MADS框架利用大语言模型过滤文档中的非视觉噪声,通过语义交互提取多视角知识,在零样本学习基准测试中实现了最先进的性能并提升了可解释性。
English Summary: The proposed MADS framework leverages large language models to filter non-visual noise from documents and extracts multi-view knowledge through semantic interactions, achieving state-of-the-art performance in zero-shot learning benchmarks with improved interpretability.

Authors:Hritik Bansal, Clark Peng, Yonatan Bitton, Roman Goldenberg, Aditya Grover, Kai-Wei Chang
Title: VideoPhy-2: A Challenging Action-Centric Physical Commonsense Evaluation in Video Generation
Abstract:
Large-scale video generative models, capable of creating realistic videos of diverse visual concepts, are strong candidates for general-purpose physical world simulators. However, their adherence to physical commonsense across real-world actions remains unclear (e.g., playing tennis, backflip). Existing benchmarks suffer from limitations such as limited size, lack of human evaluation, sim-to-real gaps, and absence of fine-grained physical rule analysis. To address this, we introduce VideoPhy-2, an action-centric dataset for evaluating physical commonsense in generated videos. We curate 200 diverse actions and detailed prompts for video synthesis from modern generative models. We perform human evaluation that assesses semantic adherence, physical commonsense, and grounding of physical rules in the generated videos. Our findings reveal major shortcomings, with even the best model achieving only 22% joint performance (i.e., high semantic and physical commonsense adherence) on the hard subset of VideoPhy-2. We find that the models particularly struggle with conservation laws like mass and momentum. Finally, we also train VideoPhy-AutoEval, an automatic evaluator for fast, reliable assessment on our dataset. Overall, VideoPhy-2 serves as a rigorous benchmark, exposing critical gaps in video generative models and guiding future research in physically-grounded video generation. The data and code is available at https://videophy2.github.io/.
Chinese: VideoPhy-2作为评估视频生成模型物理常识的新基准,揭示了模型在遵循现实世界物理规则方面的重大缺陷,尤其在守恒定律方面表现不佳,最优模型在其困难子集上仅达到22%的综合性能。
English: VideoPhy-2 is a new benchmark that evaluates video generative models' physical commonsense, revealing significant shortcomings in adhering to real-world physics, particularly in conservation laws, with the best model achieving only 22% joint performance on its hard subset.

Authors:Yihong Luo, Tianyang Hu, Yifan Song, Jiacheng Sun, Zhenguo Li, Jing Tang
Title: Adding Additional Control to One-Step Diffusion with Joint Distribution Matching
Abstract:
While diffusion distillation has enabled one-step generation through methods like Variational Score Distillation, adapting distilled models to emerging new controls -- such as novel structural constraints or latest user preferences -- remains challenging. Conventional approaches typically requires modifying the base diffusion model and redistilling it -- a process that is both computationally intensive and time-consuming. To address these challenges, we introduce Joint Distribution Matching (JDM), a novel approach that minimizes the reverse KL divergence between image-condition joint distributions. By deriving a tractable upper bound, JDM decouples fidelity learning from condition learning. This asymmetric distillation scheme enables our one-step student to handle controls unknown to the teacher model and facilitates improved classifier-free guidance (CFG) usage and seamless integration of human feedback learning (HFL). Experimental results demonstrate that JDM surpasses baseline methods such as multi-step ControlNet by mere one-step in most cases, while achieving state-of-the-art performance in one-step text-to-image synthesis through improved usage of CFG or HFL integration.
中文摘要:JDM通过非对称蒸馏方案使学生模型能够处理教师模型未知的新控制条件,并通过改进分类器无引导机制和人类反馈学习实现了一步生成的最优性能。
English Summary: JDM introduces a novel asymmetric distillation approach that enables one-step student models to adapt to new controls unknown to the teacher model, achieving state-of-the-art performance through improved CFG usage and human feedback integration.

Authors:Yuchen Yang, Wei Wang, Yifei Liu, Linfeng Dong, Hao Wu, Mingxin Zhang, Zhihang Zhong, Xiao Sun
Title: SGA-INTERACT: A 3D Skeleton-based Benchmark for Group Activity Understanding in Modern Basketball Tactic
Abstract:
Group Activity Understanding is predominantly studied as Group Activity Recognition (GAR) task. However, existing GAR benchmarks suffer from coarse-grained activity vocabularies and the only data form in single-view, which hinder the evaluation of state-of-the-art algorithms. To address these limitations, we introduce SGA-INTERACT, the first 3D skeleton-based benchmark for group activity understanding. It features complex activities inspired by basketball tactics, emphasizing rich spatial interactions and long-term dependencies. SGA-INTERACT introduces Temporal Group Activity Localization (TGAL) task, extending group activity understanding to untrimmed sequences, filling the gap left by GAR as a standalone task. In addition to the benchmark, we propose One2Many, a novel framework that employs a pretrained 3D skeleton backbone for unified individual feature extraction. This framework aligns with the feature extraction paradigm in RGB-based methods, enabling direct evaluation of RGB-based models on skeleton-based benchmarks. We conduct extensive evaluations on SGA-INTERACT using two skeleton-based methods, three RGB-based methods, and a proposed baseline within the One2Many framework. The general low performance of baselines highlights the benchmark's challenges, motivating advancements in group activity understanding.
中文摘要:作者提出了SGA-INTERACT这一基于3D骨架的群体活动理解基准,通过设计篮球战术启发的复杂互动和时序群体活动定位任务解决现有方法的局限性,同时开发的One2Many框架实现了统一特征提取,基线评估结果凸显了该基准的前沿挑战性。
English Summary: The authors introduce SGA-INTERACT, a 3D skeleton-based benchmark addressing limitations in group activity recognition by featuring complex basketball-inspired interactions and proposing the Temporal Group Activity Localization task, alongside the One2Many framework for unified feature extraction that reveals significant challenges through baseline evaluations.

Authors:Wenhui Zhang, Huiyu Xu, Zhibo Wang, Zeqing He, Ziqi Zhu, Kui Ren
Title: Can Small Language Models Reliably Resist Jailbreak Attacks? A Comprehensive Evaluation
Abstract:
Small language models (SLMs) have emerged as promising alternatives to large language models (LLMs) due to their low computational demands, enhanced privacy guarantees and comparable performance in specific domains through light-weight fine-tuning. Deploying SLMs on edge devices, such as smartphones and smart vehicles, has become a growing trend. However, the security implications of SLMs have received less attention than LLMs, particularly regarding jailbreak attacks, which is recognized as one of the top threats of LLMs by the OWASP. In this paper, we conduct the first large-scale empirical study of SLMs' vulnerabilities to jailbreak attacks. Through systematically evaluation on 63 SLMs from 15 mainstream SLM families against 8 state-of-the-art jailbreak methods, we demonstrate that 47.6% of evaluated SLMs show high susceptibility to jailbreak attacks (ASR > 40%) and 38.1% of them can not even resist direct harmful query (ASR > 50%). We further analyze the reasons behind the vulnerabilities and identify four key factors: model size, model architecture, training datasets and training techniques. Moreover, we assess the effectiveness of three prompt-level defense methods and find that none of them achieve perfect performance, with detection accuracy varying across different SLMs and attack methods. Notably, we point out that the inherent security awareness play a critical role in SLM security, and models with strong security awareness could timely terminate unsafe response with little reminder. Building upon the findings, we highlight the urgent need for security-by-design approaches in SLM development and provide valuable insights for building more trustworthy SLM ecosystem.
中文: 小语言模型对越狱攻击表现出高度脆弱性,近半数模型易受攻击,其安全性受模型规模、训练数据等因素影响,亟需采用安全设计方法加强防护。
English: Small language models (SLMs) are highly vulnerable to jailbreak attacks, with nearly half showing significant susceptibility, and their security depends on factors like model size and training data, requiring security-by-design approaches for improvement.

Authors:Zhefan Wang, Huanjun Kong, Jie Ying, Wanli Ouyang, Nanqing Dong
Title: ROGRAG: A Robustly Optimized GraphRAG Framework
Abstract:
Large language models (LLMs) commonly struggle with specialized or emerging topics which are rarely seen in the training corpus. Graph-based retrieval-augmented generation (GraphRAG) addresses this by structuring domain knowledge as a graph for dynamic retrieval. However, existing pipelines involve complex engineering workflows, making it difficult to isolate the impact of individual components. It is also challenging to evaluate the retrieval effectiveness due to the overlap between the pretraining and evaluation datasets. In this work, we introduce ROGRAG, a Robustly Optimized GraphRAG framework. Specifically, we propose a multi-stage retrieval mechanism that integrates dual-level with logic form retrieval methods to improve retrieval robustness without increasing computational cost. To further refine the system, we incorporate various result verification methods and adopt an incremental database construction approach. Through extensive ablation experiments, we rigorously assess the effectiveness of each component. Our implementation includes comparative experiments on SeedBench, where Qwen2.5-7B-Instruct initially underperformed. ROGRAG significantly improves the score from 60.0% to 75.0% and outperforms mainstream methods. Experiments on domain-specific datasets reveal that dual-level retrieval enhances fuzzy matching, while logic form retrieval improves structured reasoning, highlighting the importance of multi-stage retrieval.ROGRAG is released as an open-source resource and supports installation with pip.
中文摘要:ROGRAG是一个优化后的图检索增强生成框架,通过多阶段检索和验证方法显著提升检索性能,在基准测试中将模型得分从60.0%提高至75.0%。
English Summary: ROGRAG is a robustly optimized GraphRAG framework that enhances retrieval performance through multi-stage retrieval and verification methods, significantly improving model scores from 60.0% to 75.0% on benchmarks.

Authors:Alexander Scarlatos, Naiming Liu, Jaewook Lee, Richard Baraniuk, Andrew Lan
Title: Training LLM-based Tutors to Improve Student Learning Outcomes in Dialogues
Abstract:
Generative artificial intelligence (AI) has the potential to scale up personalized tutoring through large language models (LLMs). Recent AI tutors are adapted for the tutoring task by training or prompting LLMs to follow effective pedagogical principles, though they are not trained to maximize student learning throughout the course of a dialogue. Therefore, they may engage with students in a suboptimal way. We address this limitation by introducing an approach to train LLMs to generate tutor utterances that maximize the likelihood of student correctness, while still encouraging the model to follow good pedagogical practice. Specifically, we generate a set of candidate tutor utterances and score them using (1) an LLM-based student model to predict the chance of correct student responses and (2) a pedagogical rubric evaluated by GPT-4o. We then use the resulting data to train an open-source LLM, Llama 3.1 8B, using direct preference optimization. We show that tutor utterances generated by our model lead to significantly higher chances of correct student responses while maintaining the pedagogical quality of GPT-4o. We also conduct qualitative analyses and a human evaluation to demonstrate that our model generates high quality tutor utterances.
中文: 本研究提出一种训练大语言模型的方法,通过结合学生反应预测与教学评估来生成能提高学生答题正确率的辅导对话,在保证教学质量的同时显著提升学习效果。
English: This study introduces a method to train LLMs for generating tutor responses that enhance student correctness by combining predictive student modeling with pedagogical evaluation, resulting in improved learning outcomes while maintaining teaching quality.

Authors:Zongren Zou, Zhicheng Wang, George Em Karniadakis
Title: Learning and discovering multiple solutions using physics-informed neural networks with random initialization and deep ensemble
Abstract:
We explore the capability of physics-informed neural networks (PINNs) to discover multiple solutions. Many real-world phenomena governed by nonlinear differential equations (DEs), such as fluid flow, exhibit multiple solutions under the same conditions, yet capturing this solution multiplicity remains a significant challenge. A key difficulty is giving appropriate initial conditions or initial guesses, to which the widely used time-marching schemes and Newton's iteration method are very sensitive in finding solutions for complex computational problems. While machine learning models, particularly PINNs, have shown promise in solving DEs, their ability to capture multiple solutions remains underexplored. In this work, we propose a simple and practical approach using PINNs to learn and discover multiple solutions. We first reveal that PINNs, when combined with random initialization and deep ensemble method -- originally developed for uncertainty quantification -- can effectively uncover multiple solutions to nonlinear ordinary and partial differential equations (ODEs/PDEs). Our approach highlights the critical role of initialization in shaping solution diversity, addressing an often-overlooked aspect of machine learning for scientific computing. Furthermore, we propose utilizing PINN-generated solutions as initial conditions or initial guesses for conventional numerical solvers to enhance accuracy and efficiency in capturing multiple solutions. Extensive numerical experiments, including the Allen-Cahn equation and cavity flow, where our approach successfully identifies both stable and unstable solutions, validate the effectiveness of our method. These findings establish a general and efficient framework for addressing solution multiplicity in nonlinear differential equations.
中文: 本研究证明,结合随机初始化和深度集成方法的物理信息神经网络能有效发现非线性微分方程的多个解,同时通过提供更优初始猜测来增强传统数值求解器的性能。
English: This study demonstrates that physics-informed neural networks (PINNs) combined with random initialization and deep ensembles can effectively discover multiple solutions to nonlinear differential equations, while also enhancing traditional numerical solvers by providing improved initial guesses.

Authors:Kedi Xie, Martin Guay, Shimin Wang, Fang Deng, Maobin Lu
Title: Optimal Output Feedback Learning Control for Discrete-Time Linear Quadratic Regulation
Abstract:
This paper studies the linear quadratic regulation (LQR) problem of unknown discrete-time systems via dynamic output feedback learning control. In contrast to the state feedback, the optimality of the dynamic output feedback control for solving the LQR problem requires an implicit condition on the convergence of the state observer. Moreover, due to unknown system matrices and the existence of observer error, it is difficult to analyze the convergence and stability of most existing output feedback learning-based control methods. To tackle these issues, we propose a generalized dynamic output feedback learning control approach with guaranteed convergence, stability, and optimality performance for solving the LQR problem of unknown discrete-time linear systems. In particular, a dynamic output feedback controller is designed to be equivalent to a state feedback controller. This equivalence relationship is an inherent property without requiring convergence of the estimated state by the state observer, which plays a key role in establishing the off-policy learning control approaches. By value iteration and policy iteration schemes, the adaptive dynamic programming based learning control approaches are developed to estimate the optimal feedback control gain. In addition, a model-free stability criterion is provided by finding a nonsingular parameterization matrix, which contributes to establishing a switched iteration scheme. Furthermore, the convergence, stability, and optimality analyses of the proposed output feedback learning control approaches are given. Finally, the theoretical results are validated by two numerical examples.
中文: 本文针对未知离散时间系统的线性二次调节问题,提出了一种广义动态输出反馈学习控制方法,通过将控制器等效设计为状态反馈形式,无需状态观测器收敛即可保证系统的收敛性、稳定性和最优性。
English: This paper introduces a dynamic output feedback learning control method for solving the linear quadratic regulation problem in unknown discrete-time systems, ensuring convergence, stability, and optimality without requiring observer convergence by equivalently designing the controller to function like state feedback.

Authors:Yue Jin, Yongchao Liu, Chuntao Hong
Title: GraphGen+: Advancing Distributed Subgraph Generation and Graph Learning On Industrial Graphs
Abstract:
Graph-based computations are crucial in a wide range of applications, where graphs can scale to trillions of edges. To enable efficient training on such large graphs, mini-batch subgraph sampling is commonly used, which allows training without loading the entire graph into memory. However, existing solutions face significant trade-offs: online subgraph generation, as seen in frameworks like DGL and PyG, is limited to a single machine, resulting in severe performance bottlenecks, while offline precomputed subgraphs, as in GraphGen, improve sampling efficiency but introduce large storage overhead and high I/O costs during training. To address these challenges, we propose \textbf{GraphGen+}, an integrated framework that synchronizes distributed subgraph generation with in-memory graph learning, eliminating the need for external storage while significantly improving efficiency. GraphGen+ achieves a \textbf{27$\times$} speedup in subgraph generation compared to conventional SQL-like methods and a \textbf{1.3$\times$} speedup over GraphGen, supporting training on 1 million nodes per iteration and removing the overhead associated with precomputed subgraphs, making it a scalable and practical solution for industry-scale graph learning.
中文: GraphGen+ 是一种创新框架,通过分布式子图生成与内存学习同步,无需外部存储,效率显著提升,相比类SQL方法提速27倍、超越GraphGen 1.3倍,支持大规模图的可扩展训练。
English: GraphGen+ is an innovative framework that synchronizes distributed subgraph generation with in-memory learning, eliminating storage needs and boosting efficiency with a 27× speedup over SQL-like methods and 1.3× over GraphGen, enabling scalable training on large graphs.

Authors:Xiabao Wu, Yongchao Liu, Wei Qin, Chuntao Hong
Title: Distributed Graph Neural Network Inference With Just-In-Time Compilation For Industry-Scale Graphs
Abstract:
Graph neural networks (GNNs) have delivered remarkable results in various fields. However, the rapid increase in the scale of graph data has introduced significant performance bottlenecks for GNN inference. Both computational complexity and memory usage have risen dramatically, with memory becoming a critical limitation. Although graph sampling-based subgraph learning methods can help mitigate computational and memory demands, they come with drawbacks such as information loss and high redundant computation among subgraphs. This paper introduces an innovative processing paradgim for distributed graph learning that abstracts GNNs with a new set of programming interfaces and leverages Just-In-Time (JIT) compilation technology to its full potential. This paradigm enables GNNs to highly exploit the computational resources of distributed clusters by eliminating the drawbacks of subgraph learning methods, leading to a more efficient inference process. Our experimental results demonstrate that on industry-scale graphs of up to \textbf{500 million nodes and 22.4 billion edges}, our method can produce a performance boost of up to \textbf{27.4 times}.
中文: 本文提出了一种分布式图学习新范式,通过创新编程接口和即时编译技术解决传统图神经网络的计算与内存瓶颈,在亿级规模图上实现了最高27.4倍的性能提升。
English: This paper introduces a distributed graph learning paradigm that uses novel programming interfaces and Just-In-Time compilation to overcome the computational and memory limitations of traditional graph neural networks, achieving up to 27.4 times performance improvement on large-scale graphs.

Authors:Anil Palepu, Valentin Liévin, Wei-Hung Weng, Khaled Saab, David Stutz, Yong Cheng, Kavita Kulkarni, S. Sara Mahdavi, Joëlle Barral, Dale R. Webster, Katherine Chou, Avinatan Hassidim, Yossi Matias, James Manyika, Ryutaro Tanno, Vivek Natarajan, Adam Rodman, Tao Tu, Alan Karthikesalingam, Mike Schaekermann
Title: Towards Conversational AI for Disease Management
Abstract:
While large language models (LLMs) have shown promise in diagnostic dialogue, their capabilities for effective management reasoning - including disease progression, therapeutic response, and safe medication prescription - remain under-explored. We advance the previously demonstrated diagnostic capabilities of the Articulate Medical Intelligence Explorer (AMIE) through a new LLM-based agentic system optimised for clinical management and dialogue, incorporating reasoning over the evolution of disease and multiple patient visit encounters, response to therapy, and professional competence in medication prescription. To ground its reasoning in authoritative clinical knowledge, AMIE leverages Gemini's long-context capabilities, combining in-context retrieval with structured reasoning to align its output with relevant and up-to-date clinical practice guidelines and drug formularies. In a randomized, blinded virtual Objective Structured Clinical Examination (OSCE) study, AMIE was compared to 21 primary care physicians (PCPs) across 100 multi-visit case scenarios designed to reflect UK NICE Guidance and BMJ Best Practice guidelines. AMIE was non-inferior to PCPs in management reasoning as assessed by specialist physicians and scored better in both preciseness of treatments and investigations, and in its alignment with and grounding of management plans in clinical guidelines. To benchmark medication reasoning, we developed RxQA, a multiple-choice question benchmark derived from two national drug formularies (US, UK) and validated by board-certified pharmacists. While AMIE and PCPs both benefited from the ability to access external drug information, AMIE outperformed PCPs on higher difficulty questions. While further research would be needed before real-world translation, AMIE's strong performance across evaluations marks a significant step towards conversational AI as a tool in disease management.
中文: 该研究通过新型基于大语言模型的系统增强了AMIE的临床管理能力,在虚拟考核中其管理推理不逊于初级保健医生,并在治疗精准度和临床指南遵循度方面表现更优。
English: The study enhances AMIE's diagnostic abilities with a new LLM-based system for clinical management, demonstrating non-inferiority to primary care physicians in management reasoning and superior performance in treatment precision and guideline adherence during virtual examinations.

Authors:Shai Bergman, Zhang Ji, Anne-Marie Kermarrec, Diana Petrescu, Rafael Pires, Mathis Randl, Martijn de Vos
Title: Leveraging Approximate Caching for Faster Retrieval-Augmented Generation
Abstract:
Retrieval-augmented generation (RAG) improves the reliability of large language model (LLM) answers by integrating external knowledge. However, RAG increases the end-to-end inference time since looking for relevant documents from large vector databases is computationally expensive. To address this, we introduce Proximity, an approximate key-value cache that optimizes the RAG workflow by leveraging similarities in user queries. Instead of treating each query independently, Proximity reuses previously retrieved documents when similar queries appear, substantially reducing reliance on expensive vector database lookups. To scale efficiently, Proximity employs a locality-sensitive hashing (LSH) scheme that enables fast cache lookups while preserving retrieval accuracy. We evaluate Proximity using the MMLU and MedRAG question answering benchmarks. Our experiments demonstrate that Proximity with our LSH scheme and a realistically skewed MedRAG workload reduces database calls by 78.9% while maintaining database recall and test accuracy. We experiment with different similarity tolerances and cache capacities, and show that the time spent within the Proximity cache remains low and constant (4.8 microseconds) even as the cache grows substantially in size. Our work highlights that approximate caching is a viable and effective strategy for optimizing RAG-based systems.
中文: Proximity是一种近似键值缓存,通过重用相似查询的文档来优化检索增强生成,在保持准确性和召回率的同时,将向量数据库调用减少了78.9%。
English: Proximity is an approximate key-value cache that enhances retrieval-augmented generation by reusing documents for similar queries, reducing vector database calls by 78.9% while maintaining accuracy and recall.

Authors:Akash Dhasade, Anne-Marie Kermarrec, Erick Lavoie, Johan Pouwelse, Rishi Sharma, Martijn de Vos
Title: Practical Federated Learning without a Server
Abstract:
Federated Learning (FL) enables end-user devices to collaboratively train ML models without sharing raw data, thereby preserving data privacy. In FL, a central parameter server coordinates the learning process by iteratively aggregating the trained models received from clients. Yet, deploying a central server is not always feasible due to hardware unavailability, infrastructure constraints, or operational costs. We present Plexus, a fully decentralized FL system for large networks that operates without the drawbacks originating from having a central server. Plexus distributes the responsibilities of model aggregation and sampling among participating nodes while avoiding network-wide coordination. We evaluate Plexus using realistic traces for compute speed, pairwise latency and network capacity. Our experiments on three common learning tasks and with up to 1000 nodes empirically show that Plexus reduces time-to-accuracy by 1.4-1.6x, communication volume by 15.8-292x and training resources needed for convergence by 30.5-77.9x compared to conventional decentralized learning algorithms.
Chinese: Plexus 是一种完全去中心化的联邦学习系统,通过将模型聚合与采样任务分配给各节点来消除中央服务器的依赖,相比传统去中心化算法,在训练效率、通信量和资源消耗方面实现了显著优化。
English: Plexus is a fully decentralized federated learning system that eliminates the need for a central server by distributing model aggregation and sampling tasks among nodes, achieving significant improvements in time-to-accuracy, communication efficiency, and resource utilization compared to traditional decentralized algorithms.

Authors:Siyeop Yoon, Yujin Oh, Matthew Tivnan, Sifan Song, Pengfei Jin, Sekeun Kim, Hyun Jin Cho, Dufan Wu, Raul Uppot, Quanzheng Li
Title: Prediction of Frozen Region Growth in Kidney Cryoablation Intervention Using a 3D Flow-Matching Model
Abstract:
This study presents a 3D flow-matching model designed to predict the progression of the frozen region (iceball) during kidney cryoablation. Precise intraoperative guidance is critical in cryoablation to ensure complete tumor eradication while preserving adjacent healthy tissue. However, conventional methods, typically based on physics driven or diffusion based simulations, are computationally demanding and often struggle to represent complex anatomical structures accurately. To address these limitations, our approach leverages intraoperative CT imaging to inform the model. The proposed 3D flow matching model is trained to learn a continuous deformation field that maps early-stage CT scans to future predictions. This transformation not only estimates the volumetric expansion of the iceball but also generates corresponding segmentation masks, effectively capturing spatial and morphological changes over time. Quantitative analysis highlights the model robustness, demonstrating strong agreement between predictions and ground-truth segmentations. The model achieves an Intersection over Union (IoU) score of 0.61 and a Dice coefficient of 0.75. By integrating real time CT imaging with advanced deep learning techniques, this approach has the potential to enhance intraoperative guidance in kidney cryoablation, improving procedural outcomes and advancing the field of minimally invasive surgery.
中文: 本研究提出了一种三维流匹配模型,利用术中CT成像预测肾脏冷冻消融中冰球的进展,实现了0.61的交并比和0.75的Dice系数的高精度,从而提升微创手术的术中指导和治疗效果。
English: This study introduces a 3D flow-matching model that uses intraoperative CT imaging to predict the progression of the iceball during kidney cryoablation, achieving high accuracy with an IoU of 0.61 and a Dice coefficient of 0.75, thereby enhancing procedural guidance and outcomes in minimally invasive surgery.

Authors:Hyeonseok Moon, Jaehyung Seo, Heuiseok Lim
Title: Call for Rigor in Reporting Quality of Instruction Tuning Data
Abstract:
Instruction tuning is crucial for adapting large language models (LLMs) to align with user intentions. Numerous studies emphasize the significance of the quality of instruction tuning (IT) data, revealing a strong correlation between IT data quality and the alignment performance of LLMs. In these studies, the quality of IT data is typically assessed by evaluating the performance of LLMs trained with that data. However, we identified a prevalent issue in such practice: hyperparameters for training models are often selected arbitrarily without adequate justification. We observed significant variations in hyperparameters applied across different studies, even when training the same model with the same data. In this study, we demonstrate the potential problems arising from this practice and emphasize the need for careful consideration in verifying data quality. Through our experiments on the quality of LIMA data and a selected set of 1,000 Alpaca data points, we demonstrate that arbitrary hyperparameter decisions can make any arbitrary conclusion.
中文: 指令调优数据的质量对于使大语言模型与用户意图对齐至关重要,但训练中任意选择超参数会导致不可靠的结论,这在LIMA和Alpaca数据集的实验中得到了验证。
English: The quality of instruction tuning data is vital for aligning large language models with user intentions, but arbitrary hyperparameter choices in training can lead to unreliable conclusions, as demonstrated in experiments with LIMA and Alpaca datasets.

Authors:Wenjie Qiu, Yi-Chen Li, Xuqin Zhang, Tianyi Zhang, Yihang Zhang, Zongzhang Zhang, Yang Yu
Title: Sentence-level Reward Model can Generalize Better for Aligning LLM from Human Preference
Abstract:
Learning reward models from human preference datasets and subsequently optimizing language models via reinforcement learning has emerged as a fundamental paradigm for aligning LLMs with human preferences. The performance of the reward model plays a crucial role in the effectiveness of alignment. Previous reward models operate at a coarse-grained level, requiring the generation of a complete response to obtain a reward value. The sparse reward may present challenges for downstream reinforcement learning. While recent efforts have attempted to learn token-level reward models, the lack of explicit semantic information makes it difficult to model the credit of every individual token. In this paper, we propose assigning scores to every sentence, introducing an intermediate-grained reward model. By segmenting the complete response into sentences and applying differential operations to reward output at the start and end positions of each sentence, we can effectively model the rewards of sentences. Moreover, a novel attention mechanism is introduced to aggregate the scores of all sentences into a response-level score, which allows it to be trained using the Bradley-Terry model. On common benchmarks, our method outperforms the response-level reward model by 2.7% on RewardBench (for reward modeling evaluation) and surpasses all baselines on AlpacaEval (for alignment evaluation).
中文摘要:本文提出了一种中间粒度的奖励模型,通过为语言模型生成的每个句子分配分数,并采用新型注意力机制聚合句级奖励,在主流评测基准上比现有方法取得了更优的对齐效果。
English Summary: This paper introduces an intermediate-grained reward model that assigns scores to individual sentences in language model responses, using a novel attention mechanism to aggregate sentence-level rewards, achieving superior performance on alignment benchmarks compared to existing methods.

Authors:Miao Li, Michael Klamkin, Mathieu Tanneau, Reza Zandehshahvar, Pascal Van Hentenryck
Title: Conformal Prediction with Upper and Lower Bound Models
Abstract:
This paper studies a Conformal Prediction (CP) methodology for building prediction intervals in a regression setting, given only deterministic lower and upper bounds on the target variable. It proposes a new CP mechanism (CPUL) that goes beyond post-processing by adopting a model selection approach over multiple nested interval construction methods. Paradoxically, many well-established CP methods, including CPUL, may fail to provide adequate coverage in regions where the bounds are tight. To remedy this limitation, the paper proposes an optimal thresholding mechanism, OMLT, that adjusts CPUL intervals in tight regions with undercoverage. The combined CPUL-OMLT is validated on large-scale learning tasks where the goal is to bound the optimal value of a parametric optimization problem. The experimental results demonstrate substantial improvements over baseline methods across various datasets.
Chinese: 本文提出了一种新的共形预测方法CPUL和最优阈值机制OMLT,以解决在边界紧密区域覆盖不足的问题,并在有界目标变量的回归任务中显示出显著的改进效果。
English: This paper introduces a new conformal prediction method, CPUL, and an optimal thresholding mechanism, OMLT, to address coverage failures in tight-bound regions, showing significant improvements in regression tasks with bounded target variables.

Authors:Taixian Hou, Yueqi Zhang, Xiaoyi Wei, Zhiyan Dong, Jiafu Yi, Peng Zhai, Lihua Zhang
Title: Music-Driven Legged Robots: Synchronized Walking to Rhythmic Beats
Abstract:
We address the challenge of effectively controlling the locomotion of legged robots by incorporating precise frequency and phase characteristics, which is often ignored in locomotion policies that do not account for the periodic nature of walking. We propose a hierarchical architecture that integrates a low-level phase tracker, oscillators, and a high-level phase modulator. This controller allows quadruped robots to walk in a natural manner that is synchronized with external musical rhythms. Our method generates diverse gaits across different frequencies and achieves real-time synchronization with music in the physical world. This research establishes a foundational framework for enabling real-time execution of accurate rhythmic motions in legged robots. Video is available at website: https://music-walker.github.io/.
中文: 本研究提出了一种结合相位跟踪与调制功能的分层控制器,使足式机器人能够实现与音乐节奏同步的自然步态,并具备多样化实时步态调整能力。
English: This study introduces a hierarchical controller integrating phase tracking and modulation to enable legged robots to achieve natural, musically synchronized locomotion with diverse real-time gait adaptations.

Authors:Jeremy McMahan, Young Wu, Yudong Chen, Xiaojin Zhu, Qiaomin Xie
Title: Optimally Installing Strict Equilibria
Abstract:
In this work, we develop a reward design framework for installing a desired behavior as a strict equilibrium across standard solution concepts: dominant strategy equilibrium, Nash equilibrium, correlated equilibrium, and coarse correlated equilibrium. We also extend our framework to capture the Markov-perfect equivalents of each solution concept. Central to our framework is a comprehensive mathematical characterization of strictly installable, based on the desired solution concept and the behavior's structure. These characterizations lead to efficient iterative algorithms, which we generalize to handle optimization objectives through linear programming. Finally, we explore how our results generalize to bounded rational agents.
中文摘要:本研究提出了一种奖励设计框架,通过数学特征描述和高效算法,将期望行为确立为多种解概念及其马尔可夫完美扩展中的严格均衡。
English Summary: This study introduces a reward design framework that establishes desired behaviors as strict equilibria across various solution concepts and their Markov-perfect extensions, using mathematical characterizations and efficient algorithms for implementation.

Authors:Jiaxin Tu, Xiaoyi Wei, Yueqi Zhang, Taixian Hou, Xiaofei Gao, Zhiyan Dong, Peng Zhai, Lihua Zhang
Title: Continuous Control of Diverse Skills in Quadruped Robots Without Complete Expert Datasets
Abstract:
Learning diverse skills for quadruped robots presents significant challenges, such as mastering complex transitions between different skills and handling tasks of varying difficulty. Existing imitation learning methods, while successful, rely on expensive datasets to reproduce expert behaviors. Inspired by introspective learning, we propose Progressive Adversarial Self-Imitation Skill Transition (PASIST), a novel method that eliminates the need for complete expert datasets. PASIST autonomously explores and selects high-quality trajectories based on predefined target poses instead of demonstrations, leveraging the Generative Adversarial Self-Imitation Learning (GASIL) framework. To further enhance learning, We develop a skill selection module to mitigate mode collapse by balancing the weights of skills with varying levels of difficulty. Through these methods, PASIST is able to reproduce skills corresponding to the target pose while achieving smooth and natural transitions between them. Evaluations on both simulation platforms and the Solo 8 robot confirm the effectiveness of PASIST, offering an efficient alternative to expert-driven learning.
中文:提出的PASIST方法通过对抗性自我模仿和技能选择机制,使四足机器人无需专家数据集即可自主学习多样化技能并实现流畅转换,仿真和实物实验均验证了其有效性。
English: The proposed PASIST method enables quadruped robots to autonomously learn diverse skills and achieve smooth transitions between them without relying on expert datasets, by leveraging adversarial self-imitation and a skill selection mechanism, as validated through simulations and real-world experiments.

Authors:Youngjoon Jang, Jeongsoo Choi, Junseok Ahn, Joon Son Chung
Title: Deep Understanding of Sign Language for Sign to Subtitle Alignment
Abstract:
The objective of this work is to align asynchronous subtitles in sign language videos with limited labelled data. To achieve this goal, we propose a novel framework with the following contributions: (1) we leverage fundamental grammatical rules of British Sign Language (BSL) to pre-process the input subtitles, (2) we design a selective alignment loss to optimise the model for predicting the temporal location of signs only when the queried sign actually occurs in a scene, and (3) we conduct self-training with refined pseudo-labels which are more accurate than the heuristic audio-aligned labels. From this, our model not only better understands the correlation between the text and the signs, but also holds potential for application in the translation of sign languages, particularly in scenarios where manual labelling of large-scale sign data is impractical or challenging. Extensive experimental results demonstrate that our approach achieves state-of-the-art results, surpassing previous baselines by substantial margins in terms of both frame-level accuracy and F1-score. This highlights the effectiveness and practicality of our framework in advancing the field of sign language video alignment and translation.
中文: 本研究提出了一种新颖框架,利用英国手语语法规则、选择性对齐损失和优化的伪标签,在有限标注数据下对齐手语视频中的异步字幕,实现了在帧级准确率和F1分数上的最先进性能。
English: This study introduces a novel framework for aligning asynchronous subtitles in sign language videos with limited labeled data, utilizing BSL grammatical rules, a selective alignment loss, and refined pseudo-labels to achieve state-of-the-art accuracy and F1-scores.

Authors:Runlin Lei, Jiarui Ji, Haipeng Ding, Lu Yi, Zhewei Wei, Yongchao Liu, Chuntao Hong
Title: Exploring the Potential of Large Language Models as Predictors in Dynamic Text-Attributed Graphs
Abstract:
With the rise of large language models (LLMs), there has been growing interest in Graph Foundation Models (GFMs) for graph-based tasks. By leveraging LLMs as predictors, GFMs have demonstrated impressive generalizability across various tasks and datasets. However, existing research on LLMs as predictors has predominantly focused on static graphs, leaving their potential in dynamic graph prediction unexplored. In this work, we pioneer using LLMs for predictive tasks on dynamic graphs. We identify two key challenges: the constraints imposed by context length when processing large-scale historical data and the significant variability in domain characteristics, both of which complicate the development of a unified predictor. To address these challenges, we propose the GraphAgent-Dynamic (GAD) Framework, a multi-agent system that leverages collaborative LLMs. In contrast to using a single LLM as the predictor, GAD incorporates global and local summary agents to generate domain-specific knowledge, enhancing its transferability across domains. Additionally, knowledge reflection agents enable adaptive updates to GAD's knowledge, maintaining a unified and self-consistent architecture. In experiments, GAD demonstrates performance comparable to or even exceeds that of full-supervised graph neural networks without dataset-specific training. Finally, to enhance the task-specific performance of LLM-based predictors, we discuss potential improvements, such as dataset-specific fine-tuning to LLMs. By developing tailored strategies for different tasks, we provide new insights for the future design of LLM-based predictors.
中文摘要:本研究开创性地提出GraphAgent-Dynamic(GAD)多智能体框架,通过协同大语言模型解决动态图预测任务,成功克服上下文长度限制和领域差异等关键挑战,在无需特定数据集训练的情况下实现了与全监督图神经网络相媲美甚至更优的性能表现。
English Summary: This study introduces GraphAgent-Dynamic (GAD), a pioneering multi-agent framework that employs collaborative large language models to tackle predictive tasks on dynamic graphs, overcoming challenges like context length limitations and domain variability while achieving performance rivaling supervised graph neural networks without dataset-specific training.

Authors:Awais Nizamani, Hamid Laga, Guanjin Wang, Farid Boussaid, Mohammed Bennamoun, Anuj Srivastava
Title: Dynamic Neural Surfaces for Elastic 4D Shape Representation and Analysis
Abstract:
We propose a novel framework for the statistical analysis of genus-zero 4D surfaces, i.e., 3D surfaces that deform and evolve over time. This problem is particularly challenging due to the arbitrary parameterizations of these surfaces and their varying deformation speeds, necessitating effective spatiotemporal registration. Traditionally, 4D surfaces are discretized, in space and time, before computing their spatiotemporal registrations, geodesics, and statistics. However, this approach may result in suboptimal solutions and, as we demonstrate in this paper, is not necessary. In contrast, we treat 4D surfaces as continuous functions in both space and time. We introduce Dynamic Spherical Neural Surfaces (D-SNS), an efficient smooth and continuous spatiotemporal representation for genus-0 4D surfaces. We then demonstrate how to perform core 4D shape analysis tasks such as spatiotemporal registration, geodesics computation, and mean 4D shape estimation, directly on these continuous representations without upfront discretization and meshing. By integrating neural representations with classical Riemannian geometry and statistical shape analysis techniques, we provide the building blocks for enabling full functional shape analysis. We demonstrate the efficiency of the framework on 4D human and face datasets. The source code and additional results are available at https://4d-dsns.github.io/DSNS/.
中文摘要:本文提出动态球面神经表面(D-SNS)这一连续框架,无需预先离散化即可直接对四维形状进行时空配准、测地线计算和统计均值分析。
English Summary: This paper introduces Dynamic Spherical Neural Surfaces (D-SNS), a continuous framework for analyzing 4D shapes without prior discretization, enabling direct computation of registrations, geodesics, and statistical means.

Authors:Ziying Song, Caiyan Jia, Lin Liu, Hongyu Pan, Yongchang Zhang, Junming Wang, Xingyu Zhang, Shaoqing Xu, Lei Yang, Yadan Luo
Title: Don't Shake the Wheel: Momentum-Aware Planning in End-to-End Autonomous Driving
Abstract:
End-to-end autonomous driving frameworks enable seamless integration of perception and planning but often rely on one-shot trajectory prediction, which may lead to unstable control and vulnerability to occlusions in single-frame perception. To address this, we propose the Momentum-Aware Driving (MomAD) framework, which introduces trajectory momentum and perception momentum to stabilize and refine trajectory predictions. MomAD comprises two core components: (1) Topological Trajectory Matching (TTM) employs Hausdorff Distance to select the optimal planning query that aligns with prior paths to ensure coherence;(2) Momentum Planning Interactor (MPI) cross-attends the selected planning query with historical queries to expand static and dynamic perception files. This enriched query, in turn, helps regenerate long-horizon trajectory and reduce collision risks. To mitigate noise arising from dynamic environments and detection errors, we introduce robust instance denoising during training, enabling the planning model to focus on critical signals and improve its robustness. We also propose a novel Trajectory Prediction Consistency (TPC) metric to quantitatively assess planning stability. Experiments on the nuScenes dataset demonstrate that MomAD achieves superior long-term consistency (>=3s) compared to SOTA methods. Moreover, evaluations on the curated Turning-nuScenes shows that MomAD reduces the collision rate by 26% and improves TPC by 0.97m (33.45%) over a 6s prediction horizon, while closedloop on Bench2Drive demonstrates an up to 16.3% improvement in success rate.
中文摘要:MomAD框架通过引入轨迹动量和感知动量,利用拓扑轨迹匹配与动量规划交互器来稳定和优化轨迹预测,在nuScenes数据集上展现出更优的长期一致性,并显著降低了碰撞风险。
English Summary: The MomAD framework introduces trajectory and perception momentum to enhance autonomous driving by stabilizing trajectory predictions through topological matching and cross-attention mechanisms, achieving superior long-term consistency and collision reduction in evaluations.

Authors:Ru Ito, Supatta Viriyavisuthisakul, Kazuhiko Kawamoto, Hiroshi Kera
Title: Undertrained Image Reconstruction for Realistic Degradation in Blind Image Super-Resolution
Abstract:
Most super-resolution (SR) models struggle with real-world low-resolution (LR) images. This issue arises because the degradation characteristics in the synthetic datasets differ from those in real-world LR images. Since SR models are trained on pairs of high-resolution (HR) and LR images generated by downsampling, they are optimized for simple degradation. However, real-world LR images contain complex degradation caused by factors such as the imaging process and JPEG compression. Due to these differences in degradation characteristics, most SR models perform poorly on real-world LR images. This study proposes a dataset generation method using undertrained image reconstruction models. These models have the property of reconstructing low-quality images with diverse degradation from input images. By leveraging this property, this study generates LR images with diverse degradation from HR images to construct the datasets. Fine-tuning pre-trained SR models on our generated datasets improves noise removal and blur reduction, enhancing performance on real-world LR images. Furthermore, an analysis of the datasets reveals that degradation diversity contributes to performance improvements, whereas color differences between HR and LR images may degrade performance. 11 pages, (11 figures and 2 tables)
中文: 本研究提出了一种利用训练不足的图像重建模型生成具有多样化退化的低分辨率图像的数据集构建方法,通过微调预训练超分辨率模型,有效提升了真实世界图像的噪声去除和模糊减少性能。
English: This study introduces a dataset generation method using undertrained image reconstruction models to create low-resolution images with diverse degradation, enabling fine-tuning of super-resolution models for improved performance on real-world images by enhancing noise removal and blur reduction.

Authors:Zhenhua Liu, Lijun Li, Ruizhe Chen, Yuxian Jiang, Tong Zhu, Zhaochen Su, Wenliang Chen, Jing Shao
Title: Iterative Value Function Optimization for Guided Decoding
Abstract:
While Reinforcement Learning from Human Feedback (RLHF) has become the predominant method for controlling language model outputs, it suffers from high computational costs and training instability. Guided decoding, especially value-guided methods, offers a cost-effective alternative by controlling outputs without re-training models. However, the accuracy of the value function is crucial for value-guided decoding, as inaccuracies can lead to suboptimal decision-making and degraded performance. Existing methods struggle with accurately estimating the optimal value function, leading to less effective control. We propose Iterative Value Function Optimization, a novel framework that addresses these limitations through two key components: Monte Carlo Value Estimation, which reduces estimation variance by exploring diverse trajectories, and Iterative On-Policy Optimization, which progressively improves value estimation through collecting trajectories from value-guided policies. Extensive experiments on text summarization, multi-turn dialogue, and instruction following demonstrate the effectiveness of value-guided decoding approaches in aligning language models. These approaches not only achieve alignment but also significantly reduce computational costs by leveraging principled value function optimization for efficient and effective control.
中文摘要:迭代价值优化框架通过价值探索与迭代自优化的结合,有效解决了引导式解码中的分布差距问题,在实现语言模型对齐的同时显著降低了计算成本。
English Summary: Iterative Value Refinement is a novel framework that bridges the distributional gap in value-guided decoding by combining value exploration with iterative self-refinement, effectively aligning language models while reducing computational costs.

Authors:Zhenhua Liu, Lijun Li, Ruizhe Chen, Yuxian Jiang, Tong Zhu, Zhaochen Su, Wenliang Chen, Jing Shao
Title: Evolutionary Guided Decoding: Iterative Value Refinement for LLMs
Abstract:
While guided decoding, especially value-guided methods, has emerged as a cost-effective alternative for controlling language model outputs without re-training models, its effectiveness is limited by the accuracy of the value function. We identify that this inaccuracy stems from a core distributional gap: existing methods train static value functions on trajectories sampled exclusively from the base policy, which inherently confines their training to a narrow and suboptimal view of the potential output space. We propose Iterative Value Refinement, a novel framework designed to bridge this gap. It employs Value Exploration to provide a more comprehensive and robust training signal, complemented by Iterative Self-Refinement, which uses the improved value function from one iteration to guide the generation of higher-quality data for the next. Extensive experiments on text summarization, multi-turn dialogue, and instruction following demonstrate the effectiveness of our framework in aligning language models. Our approach not only achieves alignment but also significantly reduces computational costs by leveraging principled value function optimization for efficient and effective control.
中文摘要:迭代价值优化框架通过价值探索与迭代自优化的结合,有效解决了引导式解码中的分布差距问题,在实现语言模型对齐的同时显著降低了计算成本。
English Summary: Iterative Value Refinement is a novel framework that bridges the distributional gap in value-guided decoding by combining value exploration with iterative self-refinement, effectively aligning language models while reducing computational costs.

Authors:Guangyin Bao, Qi Zhang, Zixuan Gong, Zhuojia Wu, Duoqian Miao
Title: MindSimulator: Exploring Brain Concept Localization via Synthetic FMRI
Abstract:
Concept-selective regions within the human cerebral cortex exhibit significant activation in response to specific visual stimuli associated with particular concepts. Precisely localizing these regions stands as a crucial long-term goal in neuroscience to grasp essential brain functions and mechanisms. Conventional experiment-driven approaches hinge on manually constructed visual stimulus collections and corresponding brain activity recordings, constraining the support and coverage of concept localization. Additionally, these stimuli often consist of concept objects in unnatural contexts and are potentially biased by subjective preferences, thus prompting concerns about the validity and generalizability of the identified regions. To address these limitations, we propose a data-driven exploration approach. By synthesizing extensive brain activity recordings, we statistically localize various concept-selective regions. Our proposed MindSimulator leverages advanced generative technologies to learn the probability distribution of brain activity conditioned on concept-oriented visual stimuli. This enables the creation of simulated brain recordings that reflect real neural response patterns. Using the synthetic recordings, we successfully localize several well-studied concept-selective regions and validate them against empirical findings, achieving promising prediction accuracy. The feasibility opens avenues for exploring novel concept-selective regions and provides prior hypotheses for future neuroscience research.
中文: 该研究提出MindSimulator,一种数据驱动方法,利用生成模型从基于概念的视觉刺激创建模拟脑活动记录,从而准确定位和验证概念选择性脑区,同时克服传统方法的偏见问题。
English: The study introduces MindSimulator, a data-driven method using generative models to create simulated brain activity recordings from concept-based visual stimuli, enabling accurate localization and validation of concept-selective brain regions while overcoming biases in traditional approaches.

Authors:Ganghui Cao, Shimin Wang, Martin Guay, Jinzhi Wang, Zhisheng Duan, Marios M. Polycarpou
Title: Deficient Excitation in Parameter Learning
Abstract:
This paper investigates parameter learning problems under deficient excitation (DE). The DE condition is a rank-deficient, and therefore, a more general evolution of the well-known persistent excitation condition. Under the DE condition, a proposed online algorithm is able to calculate the identifiable and non-identifiable subspaces, and finally give an optimal parameter estimate in the sense of least squares. In particular, the learning error within the identifiable subspace exponentially converges to zero in the noise-free case, even without persistent excitation. The DE condition also provides a new perspective for solving distributed parameter learning problems, where the challenge is posed by local regressors that are often insufficiently excited. To improve knowledge of the unknown parameters, a cooperative learning protocol is proposed for a group of estimators that collect measured information under complementary DE conditions. This protocol allows each local estimator to operate locally in its identifiable subspace, and reach a consensus with neighbours in its non-identifiable subspace. As a result, the task of estimating unknown parameters can be achieved in a distributed way using cooperative local estimators. Application examples in system identification are given to demonstrate the effectiveness of the theoretical results developed in this paper.
中文: 本文提出了一种在不足激励条件下的参数学习在线算法,能够识别并分离可辨识与不可辨识子空间以获得最优最小二乘估计,在无噪声情况下实现指数收敛,并应用于分布式协同学习。
English: This paper introduces an online algorithm for parameter learning under deficient excitation, which identifies and separates identifiable and non-identifiable subspaces to achieve optimal least squares estimates, with exponential convergence in noise-free cases and applications in distributed cooperative learning.

Authors:Hang Xu, Tailong Xiao, Jingzheng Huang, Ming He, Jianping Fan, Guihua Zeng
Title: Towards Heisenberg limit without critical slowing down via quantum reinforcement learning
Abstract:
Critical ground states of quantum many-body systems have emerged as vital resources for quantum-enhanced sensing. Traditional methods to prepare these states often rely on adiabatic evolution, which may diminish the quantum sensing advantage. In this work, we propose a quantum reinforcement learning (QRL)-enhanced critical sensing protocol for quantum many-body systems with exotic phase diagrams. Starting from product states and utilizing QRL-discovered gate sequences, we explore sensing accuracy in the presence of unknown external magnetic fields, covering both local and global regimes. Our results demonstrate that QRL-learned sequences reach the finite quantum speed limit and generalize effectively across systems of arbitrary size, ensuring accuracy regardless of preparation time. This method can robustly achieve Heisenberg and super-Heisenberg limits, even in noisy environments with practical Pauli measurements. Our study highlights the efficacy of QRL in enabling precise quantum state preparation, thereby advancing scalable, high-accuracy quantum critical sensing.
中文: 量子强化学习能够有效制备临界量子态,在噪声环境和任意系统尺寸下实现海森堡及超海森堡极限的高精度量子传感。
English: Quantum reinforcement learning enables efficient preparation of critical quantum states, achieving high-precision sensing with Heisenberg and super-Heisenberg limits even under noise and arbitrary system sizes.

Authors:Sheng Yue, Zerui Qin, Yongheng Deng, Ju Ren, Yaoxue Zhang, Junshan Zhang
Title: AugFL: Augmenting Federated Learning with Pretrained Models
Abstract:
Federated Learning (FL) has garnered widespread interest in recent years. However, owing to strict privacy policies or limited storage capacities of training participants such as IoT devices, its effective deployment is often impeded by the scarcity of training data in practical decentralized learning environments. In this paper, we study enhancing FL with the aid of (large) pre-trained models (PMs), that encapsulate wealthy general/domain-agnostic knowledge, to alleviate the data requirement in conducting FL from scratch. Specifically, we consider a networked FL system formed by a central server and distributed clients. First, we formulate the PM-aided personalized FL as a regularization-based federated meta-learning problem, where clients join forces to learn a meta-model with knowledge transferred from a private PM stored at the server. Then, we develop an inexact-ADMM-based algorithm, AugFL, to optimize the problem with no need to expose the PM or incur additional computational costs to local clients. Further, we establish theoretical guarantees for AugFL in terms of communication complexity, adaptation performance, and the benefit of knowledge transfer in general non-convex cases. Extensive experiments corroborate the efficacy and superiority of AugFL over existing baselines.
中文:联邦学习通过利用预训练模型减少数据依赖,并采用名为AugFL的新算法提高效率,该算法在保障隐私和计算经济性的同时实现了卓越性能。
English: Federated Learning is enhanced by leveraging pre-trained models to reduce data dependency and improve efficiency through a novel algorithm called AugFL, which ensures privacy and computational economy while achieving superior performance.

Authors:Elizabeth G. Campolongo, Yuan-Tang Chou, Ekaterina Govorkova, Wahid Bhimji, Wei-Lun Chao, Chris Harris, Shih-Chieh Hsu, Hilmar Lapp, Mark S. Neubauer, Josephine Namayanja, Aneesh Subramanian, Philip Harris, Advaith Anand, David E. Carlyn, Subhankar Ghosh, Christopher Lawrence, Eric Moreno, Ryan Raikman, Jiaman Wu, Ziheng Zhang, Bayu Adhi, Mohammad Ahmadi Gharehtoragh, Saúl Alonso Monsalve, Marta Babicz, Furqan Baig, Namrata Banerji, William Bardon, Tyler Barna, Tanya Berger-Wolf, Adji Bousso Dieng, Micah Brachman, Quentin Buat, David C. Y. Hui, Phuong Cao, Franco Cerino, Yi-Chun Chang, Shivaji Chaulagain, An-Kai Chen, Deming Chen, Eric Chen, Chia-Jui Chou, Zih-Chen Ciou, Miles Cochran-Branson, Artur Cordeiro Oudot Choi, Michael Coughlin, Matteo Cremonesi, Maria Dadarlat, Peter Darch, Malina Desai, Daniel Diaz, Steven Dillmann, Javier Duarte, Isla Duporge, Urbas Ekka, Saba Entezari Heravi, Hao Fang, Rian Flynn, Geoffrey Fox, Emily Freed, Hang Gao, Jing Gao, Julia Gonski, Matthew Graham, Abolfazl Hashemi, Scott Hauck, James Hazelden, Joshua Henry Peterson, Duc Hoang, Wei Hu, Mirco Huennefeld, David Hyde, Vandana Janeja, Nattapon Jaroenchai, Haoyi Jia, Yunfan Kang, Maksim Kholiavchenko, Elham E. Khoda, Sangin Kim, Aditya Kumar, Bo-Cheng Lai, Trung Le, Chi-Wei Lee, JangHyeon Lee, Shaocheng Lee, Suzan van der Lee, Charles Lewis, Haitong Li, Haoyang Li, Henry Liao, Mia Liu, Xiaolin Liu, Xiulong Liu, Vladimir Loncar, Fangzheng Lyu, Ilya Makarov, Abhishikth Mallampalli Chen-Yu Mao, Alexander Michels, Alexander Migala, Farouk Mokhtar, Mathieu Morlighem, Min Namgung, Andrzej Novak, Andrew Novick, Amy Orsborn, Anand Padmanabhan, Jia-Cheng Pan, Sneh Pandya, Zhiyuan Pei, Ana Peixoto, George Percivall, Alex Po Leung, Sanjay Purushotham, Zhiqiang Que, Melissa Quinnan, Arghya Ranjan, Dylan Rankin, Christina Reissel, Benedikt Riedel, Dan Rubenstein, Argyro Sasli, Eli Shlizerman, Arushi Singh, Kim Singh, Eric R. Sokol, Arturo Sorensen, Yu Su, Mitra Taheri, Vaibhav Thakkar, Ann Mariam Thomas, Eric Toberer, Chenghan Tsai, Rebecca Vandewalle, Arjun Verma, Ricco C. Venterea, He Wang, Jianwu Wang, Sam Wang, Shaowen Wang, Gordon Watts, Jason Weitz, Andrew Wildridge, Rebecca Williams, Scott Wolf, Yue Xu, Jianqi Yan, Jai Yu, Yulei Zhang, Haoran Zhao, Ying Zhao, Yibo Zhong
Title: Building Machine Learning Challenges for Anomaly Detection in Science
Abstract:
Scientific discoveries are often made by finding a pattern or object that was not predicted by the known rules of science. Oftentimes, these anomalous events or objects that do not conform to the norms are an indication that the rules of science governing the data are incomplete, and something new needs to be present to explain these unexpected outliers. The challenge of finding anomalies can be confounding since it requires codifying a complete knowledge of the known scientific behaviors and then projecting these known behaviors on the data to look for deviations. When utilizing machine learning, this presents a particular challenge since we require that the model not only understands scientific data perfectly but also recognizes when the data is inconsistent and out of the scope of its trained behavior. In this paper, we present three datasets aimed at developing machine learning-based anomaly detection for disparate scientific domains covering astrophysics, genomics, and polar science. We present the different datasets along with a scheme to make machine learning challenges around the three datasets findable, accessible, interoperable, and reusable (FAIR). Furthermore, we present an approach that generalizes to future machine learning challenges, enabling the possibility of large, more compute-intensive challenges that can ultimately lead to scientific discovery.
Chinese: 科学发现常源于对挑战现有规则之异常的探测,本文介绍了来自天体物理学、基因组学和极地科学的三个数据集,旨在开发可识别此类异常并推动科学发现的FAIR机器学习模型。
English: Scientific discoveries often arise from detecting anomalies that challenge existing rules, and this paper introduces three datasets from astrophysics, genomics, and polar science to develop FAIR machine learning models for identifying such outliers and advancing discovery.

Authors:Adrià López Escoriza, Nicklas Hansen, Stone Tao, Tongzhou Mu, Hao Su
Title: Multi-Stage Manipulation with Demonstration-Augmented Reward, Policy, and World Model Learning
Abstract:
Long-horizon tasks in robotic manipulation present significant challenges in reinforcement learning (RL) due to the difficulty of designing dense reward functions and effectively exploring the expansive state-action space. However, despite a lack of dense rewards, these tasks often have a multi-stage structure, which can be leveraged to decompose the overall objective into manageable subgoals. In this work, we propose DEMO3, a framework that exploits this structure for efficient learning from visual inputs. Specifically, our approach incorporates multi-stage dense reward learning, a bi-phasic training scheme, and world model learning into a carefully designed demonstration-augmented RL framework that strongly mitigates the challenge of exploration in long-horizon tasks. Our evaluations demonstrate that our method improves data-efficiency by an average of 40% and by 70% on particularly difficult tasks compared to state-of-the-art approaches. We validate this across 16 sparse-reward tasks spanning four domains, including challenging humanoid visual control tasks using as few as five demonstrations.
中文:DEMO3框架通过将长时程机器人操作任务分解为子目标,并融合多阶段密集奖励学习、双阶段训练方案与世界模型学习,在稀疏奖励环境中实现了高达70%的数据效率提升。
English: The DEMO3 framework addresses the challenges of long-horizon robotic manipulation tasks by decomposing them into subgoals and integrating multi-stage dense reward learning, bi-phasic training, and world model learning, achieving up to 70% higher data efficiency in sparse-reward environments.

Authors:Adrià López Escoriza, Nicklas Hansen, Stone Tao, Tongzhou Mu, Hao Su
Title: Multi-Stage Manipulation with Demonstration-Augmented Reward, Policy, and World Model Learning
Abstract:
Long-horizon tasks in robotic manipulation present significant challenges in reinforcement learning (RL) due to the difficulty of designing dense reward functions and effectively exploring the expansive state-action space. However, despite a lack of dense rewards, these tasks often have a multi-stage structure, which can be leveraged to decompose the overall objective into manageable subgoals. In this work, we propose DEMO3, a framework that exploits this structure for efficient learning from visual inputs. Specifically, our approach incorporates multi-stage dense reward learning, a bi-phasic training scheme, and world model learning into a carefully designed demonstration-augmented RL framework that strongly mitigates the challenge of exploration in long-horizon tasks. Our evaluations demonstrate that our method improves data-efficiency by an average of 40% and by 70% on particularly difficult tasks compared to state-of-the-art approaches. We validate this across 16 sparse-reward tasks spanning four domains, including challenging humanoid visual control tasks using as few as five demonstrations.
中文:DEMO3框架通过将长时程机器人操作任务分解为子目标,并融合多阶段密集奖励学习、双阶段训练方案与世界模型学习,在稀疏奖励环境中实现了高达70%的数据效率提升。
English: The DEMO3 framework addresses the challenges of long-horizon robotic manipulation tasks by decomposing them into subgoals and integrating multi-stage dense reward learning, bi-phasic training, and world model learning, achieving up to 70% higher data efficiency in sparse-reward environments.

Authors:Wenjie Wu, Yongcheng Jing, Yingjie Wang, Wenbin Hu, Dacheng Tao
Title: Graph-Augmented Reasoning: Evolving Step-by-Step Knowledge Graph Retrieval for LLM Reasoning
Abstract:
Recent large language model (LLM) reasoning, despite its success, suffers from limited domain knowledge, susceptibility to hallucinations, and constrained reasoning depth, particularly in small-scale models deployed in resource-constrained environments. This paper presents the first investigation into integrating step-wise knowledge graph retrieval with step-wise reasoning to address these challenges, introducing a novel paradigm termed as graph-augmented reasoning. Our goal is to enable frozen, small-scale LLMs to retrieve and process relevant mathematical knowledge in a step-wise manner, enhancing their problem-solving abilities without additional training. To this end, we propose KG-RAR, a framework centered on process-oriented knowledge graph construction, a hierarchical retrieval strategy, and a universal post-retrieval processing and reward model (PRP-RM) that refines retrieved information and evaluates each reasoning step. Experiments on the Math500 and GSM8K benchmarks across six models demonstrate that KG-RAR yields encouraging results, achieving a 20.73\% relative improvement with Llama-3B on Math500.
中文摘要:本文提出KG-RAR框架,通过将逐步知识图谱检索与推理相结合,增强小规模大语言模型的数学解题能力,无需额外训练即可实现显著性能提升。
English Summary: This paper introduces KG-RAR, a graph-augmented reasoning framework that enhances small-scale LLMs' mathematical problem-solving by integrating step-wise knowledge graph retrieval and reasoning, achieving significant performance improvements without additional training.

Authors:Lu Dai, Yijie Xu, Jinhui Ye, Hao Liu, Hui Xiong
Title: SePer: Measure Retrieval Utility Through The Lens Of Semantic Perplexity Reduction
Abstract:
Large Language Models (LLMs) have demonstrated improved generation performance by incorporating externally retrieved knowledge, a process known as retrieval-augmented generation (RAG). Despite the potential of this approach, existing studies evaluate RAG effectiveness by 1) assessing retrieval and generation components jointly, which obscures retrieval's distinct contribution, or 2) examining retrievers using traditional metrics such as NDCG, which creates a gap in understanding retrieval's true utility in the overall generation process. To address the above limitations, in this work, we introduce an automatic evaluation method that measures retrieval quality through the lens of information gain within the RAG framework. Specifically, we propose Semantic Perplexity (SePer), a metric that captures the LLM's internal belief about the correctness of the retrieved information. We quantify the utility of retrieval by the extent to which it reduces semantic perplexity post-retrieval. Extensive experiments demonstrate that SePer not only aligns closely with human preferences but also offers a more precise and efficient evaluation of retrieval utility across diverse RAG scenarios.
中文: 本研究提出了语义困惑度(SePer)这一自动评估指标,通过量化检索信息降低大语言模型对正确性的不确定程度来衡量检索增强生成中的检索质量,实验表明该指标与人类偏好高度一致,并在多种场景下提供更精准高效的评估。
English: This study introduces Semantic Perplexity (SePer), an automatic evaluation metric that measures retrieval quality in retrieval-augmented generation by quantifying how much retrieved information reduces an LLM's uncertainty about correctness, showing strong alignment with human judgment and improved precision across diverse scenarios.

Authors:Chong Wang, Lanqing Guo, Zixuan Fu, Siyuan Yang, Hao Cheng, Alex C. Kot, Bihan Wen
Title: Reconciling Stochastic and Deterministic Strategies for Zero-shot Image Restoration using Diffusion Model in Dual
Abstract:
Plug-and-play (PnP) methods offer an iterative strategy for solving image restoration (IR) problems in a zero-shot manner, using a learned \textit{discriminative denoiser} as the implicit prior. More recently, a sampling-based variant of this approach, which utilizes a pre-trained \textit{generative diffusion model}, has gained great popularity for solving IR problems through stochastic sampling. The IR results using PnP with a pre-trained diffusion model demonstrate distinct advantages compared to those using discriminative denoisers, \ie improved perceptual quality while sacrificing the data fidelity. The unsatisfactory results are due to the lack of integration of these strategies in the IR tasks. In this work, we propose a novel zero-shot IR scheme, dubbed Reconciling Diffusion Model in Dual (RDMD), which leverages only a \textbf{single} pre-trained diffusion model to construct \textbf{two} complementary regularizers. Specifically, the diffusion model in RDMD will iteratively perform deterministic denoising and stochastic sampling, aiming to achieve high-fidelity image restoration with appealing perceptual quality. RDMD also allows users to customize the distortion-perception tradeoff with a single hyperparameter, enhancing the adaptability of the restoration process in different practical scenarios. Extensive experiments on several IR tasks demonstrate that our proposed method could achieve superior results compared to existing approaches on both the FFHQ and ImageNet datasets.
中文摘要:提出的RDMD方法利用单个预训练扩散模型构建双重正则化器,通过迭代执行确定性去噪和随机采样,在多个数据集上实现了可定制失真-感知平衡的优越图像恢复效果。
English Summary: The proposed RDMD method uses a single pre-trained diffusion model to create dual regularizers that iteratively perform deterministic denoising and stochastic sampling, achieving superior image restoration with customizable distortion-perception tradeoff across multiple datasets.

Authors:Liping Liu, Chunhong Zhang, Likang Wu, Chuang Zhao, Zheng Hu, Ming He, Jianping Fan
Title: Instruct-of-Reflection: Enhancing Large Language Models Iterative Reflection Capabilities via Dynamic-Meta Instruction
Abstract:
Self-reflection for Large Language Models (LLMs) has gained significant attention. Existing approaches involve models iterating and improving their previous responses based on LLMs' internal reflection ability or external feedback. However, recent research has raised doubts about whether intrinsic self-correction without external feedback may even degrade performance. Based on our empirical evidence, we find that current static reflection methods may lead to redundant, drift, and stubborn issues. To mitigate this, we introduce Instruct-of-Reflection (IoRT), a novel and general reflection framework that leverages dynamic-meta instruction to enhance the iterative reflection capability of LLMs. Specifically, we propose the instructor driven by the meta-thoughts and self-consistency classifier, generates various instructions, including refresh, stop, and select, to guide the next reflection iteration. Our experiments demonstrate that IoRT achieves an average improvement of 10.1% over established baselines in mathematical and commonsense reasoning tasks, highlighting its efficacy and applicability.
中文摘要:Instruct-of-Reflection(IoRT)框架通过动态元指令引导反思迭代,有效解决了当前大型语言模型静态反思方法存在的冗余和僵化问题,在推理任务中实现了10.1%的平均性能提升。
English Summary: The Instruct-of-Reflection (IoRT) framework addresses limitations in current static self-reflection methods for LLMs by introducing dynamic-meta instructions that guide reflection iterations, achieving a 10.1% average improvement in reasoning tasks.

Authors:Zikuan Li, Honghua Chen, Yuecheng Wang, Sibo Wu, Mingqiang Wei, Jun Wang
Title: STAR-Edge: Structure-aware Local Spherical Curve Representation for Thin-walled Edge Extraction from Unstructured Point Clouds
Abstract:
Extracting geometric edges from unstructured point clouds remains a significant challenge, particularly in thin-walled structures that are commonly found in everyday objects. Traditional geometric methods and recent learning-based approaches frequently struggle with these structures, as both rely heavily on sufficient contextual information from local point neighborhoods. However, 3D measurement data of thin-walled structures often lack the accurate, dense, and regular neighborhood sampling required for reliable edge extraction, resulting in degraded performance. In this work, we introduce STAR-Edge, a novel approach designed for detecting and refining edge points in thin-walled structures. Our method leverages a unique representation-the local spherical curve-to create structure-aware neighborhoods that emphasize co-planar points while reducing interference from close-by, non-co-planar surfaces. This representation is transformed into a rotation-invariant descriptor, which, combined with a lightweight multi-layer perceptron, enables robust edge point classification even in the presence of noise and sparse or irregular sampling. Besides, we also use the local spherical curve representation to estimate more precise normals and introduce an optimization function to project initially identified edge points exactly on the true edges. Experiments conducted on the ABC dataset and thin-walled structure-specific datasets demonstrate that STAR-Edge outperforms existing edge detection methods, showcasing better robustness under various challenging conditions.
中文: STAR-Edge提出了一种创新方法,利用局部球面曲线构建结构感知邻域和旋转不变描述符,在薄壁结构中实现稳健的边缘检测,在多种挑战性条件下优于现有方法。
English: STAR-Edge introduces a novel method using local spherical curves to create structure-aware neighborhoods and rotation-invariant descriptors for robust edge detection in thin-walled structures, outperforming existing approaches under challenging conditions.

Authors:Alexander H. Liu, Sang-gil Lee, Chao-Han Huck Yang, Yuan Gong, Yu-Chiang Frank Wang, James R. Glass, Rafael Valle, Bryan Catanzaro
Title: UniWav: Towards Unified Pre-training for Speech Representation Learning and Generation
Abstract:
Pre-training and representation learning have been playing an increasingly important role in modern speech processing. Nevertheless, different applications have been relying on different foundation models, since predominant pre-training techniques are either designed for discriminative tasks or generative tasks. In this work, we make the first attempt at building a unified pre-training framework for both types of tasks in speech. We show that with the appropriate design choices for pre-training, one can jointly learn a representation encoder and generative audio decoder that can be applied to both types of tasks. We propose UniWav, an encoder-decoder framework designed to unify pre-training representation learning and generative tasks. On speech recognition, text-to-speech, and speech tokenization, UniWav achieves comparable performance to different existing foundation models, each trained on a specific task. Our findings suggest that a single general-purpose foundation model for speech can be built to replace different foundation models, reducing the overhead and cost of pre-training.
中文: 本研究提出UniWav这一统一的语音预训练编码器-解码器框架,能同时处理判别式和生成式任务,在保持与专业模型相当性能的同时显著降低了预训练成本。
English: This work introduces UniWav, a unified encoder-decoder framework for speech pre-training that effectively handles both discriminative and generative tasks, achieving comparable performance to specialized models while reducing pre-training overhead.

Authors:Zhiwei Ling, Yachen Chang, Hailiang Zhao, Xinkui Zhao, Kingsum Chow, Shuiguang Deng
Title: CADRef: Robust Out-of-Distribution Detection via Class-Aware Decoupled Relative Feature Leveraging
Abstract:
Deep neural networks (DNNs) have been widely criticized for their overconfidence when dealing with out-of-distribution (OOD) samples, highlighting the critical need for effective OOD detection to ensure the safe deployment of DNNs in real-world settings. Existing post-hoc OOD detection methods primarily enhance the discriminative power of logit-based approaches by reshaping sample features, yet they often neglect critical information inherent in the features themselves. In this paper, we propose the Class-Aware Relative Feature-based method (CARef), which utilizes the error between a sample's feature and its class-aware average feature as a discriminative criterion. To further refine this approach, we introduce the Class-Aware Decoupled Relative Feature-based method (CADRef), which decouples sample features based on the alignment of signs between the relative feature and corresponding model weights, enhancing the discriminative capabilities of CARef. Extensive experimental results across multiple datasets and models demonstrate that both proposed methods exhibit effectiveness and robustness in OOD detection compared to state-of-the-art methods. Specifically, our two methods outperform the best baseline by 2.82% and 3.27% in AUROC, with improvements of 4.03% and 6.32% in FPR95, respectively.
中文: 本文提出的CARef和CADRef两种基于类别感知相对特征的方法,通过利用特征误差和解耦技术显著提升了分布外检测性能,在多个基准测试中均展现出优于现有方法的有效性。
English: The paper introduces CARef and CADRef, two class-aware relative feature-based methods that significantly improve out-of-distribution detection by leveraging feature errors and decoupling techniques, demonstrating superior performance over existing approaches in multiple benchmarks.

Authors:Keqiang Yan, Xiner Li, Hongyi Ling, Kenna Ashen, Carl Edwards, Raymundo Arróyave, Marinka Zitnik, Heng Ji, Xiaofeng Qian, Xiaoning Qian, Shuiwang Ji
Title: Invariant Tokenization of Crystalline Materials for Language Model Enabled Generation
Abstract:
We consider the problem of crystal materials generation using language models (LMs). A key step is to convert 3D crystal structures into 1D sequences to be processed by LMs. Prior studies used the crystallographic information framework (CIF) file stream, which fails to ensure SE(3) and periodic invariance and may not lead to unique sequence representations for a given crystal structure. Here, we propose a novel method, known as Mat2Seq, to tackle this challenge. Mat2Seq converts 3D crystal structures into 1D sequences and ensures that different mathematical descriptions of the same crystal are represented in a single unique sequence, thereby provably achieving SE(3) and periodic invariance. Experimental results show that, with language models, Mat2Seq achieves promising performance in crystal structure generation as compared with prior methods.
中文: Mat2Seq是一种创新方法,可将三维晶体结构转化为独特的一维序列,确保SE(3)和周期性不变性,从而在使用语言模型生成晶体结构方面优于现有方法。
English: Mat2Seq is a novel method that converts 3D crystal structures into unique 1D sequences, ensuring SE(3) and periodic invariance for improved crystal generation using language models.

Authors:Hao Wang, Ligong Han, Kai Xu, Akash Srivastava
Title: SQuat: Subspace-orthogonal KV Cache Quantization
Abstract:
The key-value (KV) cache accelerates LLMs decoding by storing KV tensors from previously generated tokens. It reduces redundant computation at the cost of increased memory usage. To mitigate this overhead, existing approaches compress KV tensors into lower-bit representations; however, quantization errors can accumulate as more tokens are generated, potentially resulting in undesired outputs. In this paper, we introduce SQuat (Subspace-orthogonal KV cache quantization). It first constructs a subspace spanned by query tensors to capture the most critical task-related information. During key tensor quantization, it enforces that the difference between the (de)quantized and original keys remains orthogonal to this subspace, minimizing the impact of quantization errors on the attention mechanism's outputs. SQuat requires no model fine-tuning, no additional calibration dataset for offline learning, and is grounded in a theoretical framework we develop. Through numerical experiments, we show that our method reduces peak memory by 2.17 to 2.82, improves throughput by 2.45 to 3.60, and achieves more favorable benchmark scores than existing KV cache quantization algorithms.
中文: SQuat提出了一种基于子空间正交化的KV缓存量化方法,通过使量化误差与查询张量子空间正交来最小化对注意力机制的影响,无需模型微调或额外数据即可显著降低内存占用并提升处理速度。
English: SQuat introduces a novel subspace-orthogonal quantization method for KV cache compression that minimizes quantization errors by aligning key tensor differences orthogonal to a query-based subspace, achieving significant memory reduction and throughput improvement without requiring fine-tuning or additional datasets.

Authors:Wesley A. Suttle, Jesse Milzman, Mustafa O. Karabag, Brian M. Sadler, Ufuk Topcu
Title: Value of Information-based Deceptive Path Planning Under Adversarial Interventions
Abstract:
Existing methods for deceptive path planning (DPP) address the problem of designing paths that conceal their true goal from a passive, external observer. Such methods do not apply to problems where the observer has the ability to perform adversarial interventions to impede the path planning agent. In this paper, we propose a novel Markov decision process (MDP)-based model for the DPP problem under adversarial interventions and develop new value of information (VoI) objectives to guide the design of DPP policies. Using the VoI objectives we propose, path planning agents deceive the adversarial observer into choosing suboptimal interventions by selecting trajectories that are of low informational value to the observer. Leveraging connections to the linear programming theory for MDPs, we derive computationally efficient solution methods for synthesizing policies for performing DPP under adversarial interventions. In our experiments, we illustrate the effectiveness of the proposed solution method in achieving deceptiveness under adversarial interventions and demonstrate the superior performance of our approach to both existing DPP methods and conservative path planning approaches on illustrative gridworld problems.
Chinese: 本文提出了一种新颖的马尔可夫决策过程模型,用于对抗性干预下的欺骗路径规划,通过信息价值目标误导观察者采取次优干预,并利用高效计算方法在实验中展现了优越性能。
English: This paper introduces a novel Markov decision process model for deceptive path planning under adversarial interventions, using value of information objectives to mislead observers into suboptimal actions and demonstrating superior performance through efficient computational methods.

Authors:Dun Yuan, Hao Zhou, Di Wu, Xue Liu, Hao Chen, Yan Xin, Jianzhong, Zhang
Title: Enhancing Large Language Models (LLMs) for Telecommunications using Knowledge Graphs and Retrieval-Augmented Generation
Abstract:
Large language models (LLMs) have made significant progress in general-purpose natural language processing tasks. However, LLMs are still facing challenges when applied to domain-specific areas like telecommunications, which demands specialized expertise and adaptability to evolving standards. This paper presents a novel framework that combines knowledge graph (KG) and retrieval-augmented generation (RAG) techniques to enhance LLM performance in the telecom domain. The framework leverages a KG to capture structured, domain-specific information about network protocols, standards, and other telecom-related entities, comprehensively representing their relationships. By integrating KG with RAG, LLMs can dynamically access and utilize the most relevant and up-to-date knowledge during response generation. This hybrid approach bridges the gap between structured knowledge representation and the generative capabilities of LLMs, significantly enhancing accuracy, adaptability, and domain-specific comprehension. Our results demonstrate the effectiveness of the KG-RAG framework in addressing complex technical queries with precision. The proposed KG-RAG model attained an accuracy of 88% for question answering tasks on a frequently used telecom-specific dataset, compared to 82% for the RAG-only and 48% for the LLM-only approaches.
中文: 本文提出一种KG-RAG框架,通过将知识图谱与检索增强生成技术相结合,使大语言模型能动态获取电信领域最新专业知识,在技术问答任务中实现88%的准确率,显著提升了领域适应能力。
English: This paper introduces a KG-RAG framework that integrates knowledge graphs with retrieval-augmented generation to enhance LLMs' accuracy and adaptability in the telecom domain, achieving 88% QA accuracy by leveraging structured, up-to-date technical knowledge.

Authors:Shanze Wang, Mingao Tan, Zhibo Yang, Biao Huang, Xiaoyu Shen, Hailong Huang, Wei Zhang
Title: MAER-Nav: Bidirectional Motion Learning Through Mirror-Augmented Experience Replay for Robot Navigation
Abstract:
Deep Reinforcement Learning (DRL) based navigation methods have demonstrated promising results for mobile robots, but suffer from limited action flexibility in confined spaces. Conventional DRL approaches predominantly learn forward-motion policies, causing robots to become trapped in complex environments where backward maneuvers are necessary for recovery. This paper presents MAER-Nav (Mirror-Augmented Experience Replay for Robot Navigation), a novel framework that enables bidirectional motion learning without requiring explicit failure-driven hindsight experience replay or reward function modifications. Our approach integrates a mirror-augmented experience replay mechanism with curriculum learning to generate synthetic backward navigation experiences from successful trajectories. Experimental results in both simulation and real-world environments demonstrate that MAER-Nav significantly outperforms state-of-the-art methods while maintaining strong forward navigation capabilities. The framework effectively bridges the gap between the comprehensive action space utilization of traditional planning methods and the environmental adaptability of learning-based approaches, enabling robust navigation in scenarios where conventional DRL methods consistently fail.
中文摘要:MAER-Nav框架通过镜像增强经验回放与课程学习机制,在不修改奖励函数的情况下实现双向运动学习,显著提升了机器人在复杂受限空间中的导航能力,同时保持优异的正向导航性能。
English Summary: MAER-Nav is a novel DRL framework that enhances robot navigation in confined spaces by enabling bidirectional motion learning through mirror-augmented experience replay and curriculum learning, outperforming existing methods while preserving forward navigation capabilities.

Authors:Lucas Heublein, Nisha L. Raichur, Tobias Feigl, Tobias Brieger, Fin Heuer, Lennart Asbach, Alexander Rügamer, Felix Ott
Title: Evaluation of (Un-)Supervised Machine Learning Methods for GNSS Interference Classification with Real-World Data Discrepancies
Abstract:
The accuracy and reliability of vehicle localization on roads are crucial for applications such as self-driving cars, toll systems, and digital tachographs. To achieve accurate positioning, vehicles typically use global navigation satellite system (GNSS) receivers to validate their absolute positions. However, GNSS-based positioning can be compromised by interference signals, necessitating the identification, classification, determination of purpose, and localization of such interference to mitigate or eliminate it. Recent approaches based on machine learning (ML) have shown superior performance in monitoring interference. However, their feasibility in real-world applications and environments has yet to be assessed. Effective implementation of ML techniques requires training datasets that incorporate realistic interference signals, including real-world noise and potential multipath effects that may occur between transmitter, receiver, and satellite in the operational area. Additionally, these datasets require reference labels. Creating such datasets is often challenging due to legal restrictions, as causing interference to GNSS sources is strictly prohibited. Consequently, the performance of ML-based methods in practical applications remains unclear. To address this gap, we describe a series of large-scale measurement campaigns conducted in real-world settings at two highway locations in Germany and the Seetal Alps in Austria, and in large-scale controlled indoor environments. We evaluate the latest supervised ML-based methods to report on their performance in real-world settings and present the applicability of pseudo-labeling for unsupervised learning. We demonstrate the challenges of combining datasets due to data discrepancies and evaluate outlier detection, domain adaptation, and data augmentation techniques to present the models' capabilities to adapt to changes in the datasets.
Chinese: 基于GNSS的车辆精确定位对自动驾驶至关重要,但易受干扰影响,机器学习方法虽在监测干扰方面表现优异,其实用性仍需通过现实数据集验证,而此类数据因法律限制难以获取。
English: Accurate vehicle localization via GNSS is vital for autonomous driving but faces interference challenges, which recent machine learning methods show promise in monitoring, though their real-world applicability requires further evaluation using realistic datasets that are difficult to obtain due to legal constraints.

Authors:Jiahui Zhang, Yurui Chen, Yanpeng Zhou, Yueming Xu, Ze Huang, Jilin Mei, Junhui Chen, Yu-Jie Yuan, Xinyue Cai, Guowei Huang, Xingyue Quan, Hang Xu, Li Zhang
Title: From Flatland to Space: Teaching Vision-Language Models to Perceive and Reason in 3D
Abstract:
Recent advances in LVLMs have improved vision-language understanding, but they still struggle with spatial perception, limiting their ability to reason about complex 3D scenes. Unlike previous approaches that incorporate 3D representations into models to improve spatial understanding, we aim to unlock the potential of VLMs by leveraging spatially relevant image data. To this end, we introduce a novel 2D spatial data generation and annotation pipeline built upon scene data with 3D ground-truth. This pipeline enables the creation of a diverse set of spatial tasks, ranging from basic perception tasks to more complex reasoning tasks. Leveraging this pipeline, we construct SPAR-7M, a large-scale dataset generated from thousands of scenes across multiple public datasets. In addition, we introduce SPAR-Bench, a benchmark designed to offer a more comprehensive evaluation of spatial capabilities compared to existing spatial benchmarks, supporting both single-view and multi-view inputs. Training on both SPAR-7M and large-scale 2D datasets enables our models to achieve state-of-the-art performance on 2D spatial benchmarks. Further fine-tuning on 3D task-specific datasets yields competitive results, underscoring the effectiveness of our dataset in enhancing spatial reasoning.
中文: 针对现有LVLMs在空间感知上的不足,我们构建了新型2D空间数据生成流程,开发了SPAR-7M数据集和SPAR-Bench基准测试,使模型在空间推理任务中达到最优性能。
English: Recent LVLMs still face spatial perception challenges, so we developed a novel 2D spatial data pipeline to create the SPAR-7M dataset and SPAR-Bench benchmark, enabling models to achieve state-of-the-art performance on spatial reasoning tasks.

Authors:Mingyuan Zhang, Yue Bai, Huan Wang, Yizhou Wang, Qihua Dong, Yun Fu
Title: Boosting Large Language Models with Mask Fine-Tuning
Abstract:
The model is usually kept integral in the mainstream large language model (LLM) fine-tuning protocols. No works have questioned whether maintaining the integrity of the model is indispensable for performance. In this work, we introduce Mask Fine-Tuning (MFT), a brand-new LLM fine-tuning paradigm to show that properly breaking the integrity of the model can surprisingly lead to improved performance. Specifically, MFT learns a set of binary masks supervised by the typical LLM fine-tuning objective. Extensive experiments show that MFT gains a consistent performance boost across various domains and backbones (e.g., 1.95%/1.88% average gain in coding with LLaMA2-7B/3.1-8B). Detailed procedures are provided to study the proposed MFT from different hyperparameter perspectives for better insight. In particular, MFT naturally updates the current LLM training protocol by deploying it on a complete well-trained model. This study extends the functionality of mask learning from its conventional network pruning context for model compression to a more general scope.
Chinese: 本文提出掩码微调(MFT)新范式,通过打破模型完整性提升性能,在多个领域和骨干网络中均实现稳定性能提升。
English: This paper introduces Mask Fine-Tuning (MFT), a novel paradigm that breaks model integrity to enhance performance, demonstrating consistent gains across domains and backbones.

Authors:Xianglong He, Zi-Xin Zou, Chia-Hao Chen, Yuan-Chen Guo, Ding Liang, Chun Yuan, Wanli Ouyang, Yan-Pei Cao, Yangguang Li
Title: SparseFlex: High-Resolution and Arbitrary-Topology 3D Shape Modeling
Abstract:
Creating high-fidelity 3D meshes with arbitrary topology, including open surfaces and complex interiors, remains a significant challenge. Existing implicit field methods often require costly and detail-degrading watertight conversion, while other approaches struggle with high resolutions. This paper introduces SparseFlex, a novel sparse-structured isosurface representation that enables differentiable mesh reconstruction at resolutions up to $1024^3$ directly from rendering losses. SparseFlex combines the accuracy of Flexicubes with a sparse voxel structure, focusing computation on surface-adjacent regions and efficiently handling open surfaces. Crucially, we introduce a frustum-aware sectional voxel training strategy that activates only relevant voxels during rendering, dramatically reducing memory consumption and enabling high-resolution training. This also allows, for the first time, the reconstruction of mesh interiors using only rendering supervision. Building upon this, we demonstrate a complete shape modeling pipeline by training a variational autoencoder (VAE) and a rectified flow transformer for high-quality 3D shape generation. Our experiments show state-of-the-art reconstruction accuracy, with a ~82% reduction in Chamfer Distance and a ~88% increase in F-score compared to previous methods, and demonstrate the generation of high-resolution, detailed 3D shapes with arbitrary topology. By enabling high-resolution, differentiable mesh reconstruction and generation with rendering losses, SparseFlex significantly advances the state-of-the-art in 3D shape representation and modeling.
中文: 本文提出SparseFlex稀疏结构等值面表示法,可直接通过渲染损失实现高分辨率可微分网格重建,在显著降低计算成本和内存消耗的同时达到了最先进的精度水平。
English: This paper introduces SparseFlex, a sparse-structured isosurface representation that enables high-resolution differentiable mesh reconstruction directly from rendering losses, achieving state-of-the-art accuracy with significantly reduced computational costs and memory usage.

Authors:Xiaoming Xue, Liang Feng, Yinglan Feng, Rui Liu, Kai Zhang, Kay Chen Tan
Title: A Theoretical Analysis of Analogy-Based Evolutionary Transfer Optimization
Abstract:
Evolutionary transfer optimization (ETO) has been gaining popularity in research over the years due to its outstanding knowledge transfer ability to address various challenges in optimization. However, a pressing issue in this field is that the invention of new ETO algorithms has far outpaced the development of fundamental theories needed to clearly understand the key factors contributing to the success of these algorithms for effective generalization. In response to this challenge, this study aims to establish theoretical foundations for analogy-based ETO, specifically to support various algorithms that frequently reference a key concept known as similarity. First, we introduce analogical reasoning and link its subprocesses to three key issues in ETO. Then, we develop theories for analogy-based knowledge transfer, rooted in the principles that underlie the subprocesses. Afterwards, we present two theorems related to the performance gain of analogy-based knowledge transfer, namely unconditionally nonnegative performance gain and conditionally positive performance gain, to theoretically demonstrate the effectiveness of various analogy-based ETO methods. Last but not least, we offer a novel insight into analogy-based ETO that interprets its conditional superiority over traditional evolutionary optimization through the lens of the no free lunch theorem for optimization.
中文摘要:本研究通过将类比推理子过程与进化迁移优化的关键问题相联系,建立了基于类比的ETO理论基础,提出迁移理论并通过性能增益定理证明其有效性,同时借助优化无免费午餐定理提供了新颖的理论视角。
English Summary: This study establishes theoretical foundations for analogy-based evolutionary transfer optimization (ETO) by linking analogical reasoning subprocesses to key ETO issues, developing transfer theories, and demonstrating effectiveness through performance gain theorems while offering novel insights via the no free lunch theorem.

Authors:Ziyi Zhou, Xiaoming Zhang, Shenghan Tan, Litian Zhang, Chaozhuo Li
Title: Collaborative Evolution: Multi-Round Learning Between Large and Small Language Models for Emergent Fake News Detection
Abstract:
The proliferation of fake news on social media platforms has exerted a substantial influence on society, leading to discernible impacts and deleterious consequences. Conventional deep learning methodologies employing small language models (SLMs) suffer from the necessity for extensive supervised training and the challenge of adapting to rapidly evolving circumstances. Large language models (LLMs), despite their robust zero-shot capabilities, have fallen short in effectively identifying fake news due to a lack of pertinent demonstrations and the dynamic nature of knowledge. In this paper, a novel framework Multi-Round Collaboration Detection (MRCD) is proposed to address these aforementioned limitations. The MRCD framework is capable of enjoying the merits from both LLMs and SLMs by integrating their generalization abilities and specialized functionalities, respectively. Our approach features a two-stage retrieval module that selects relevant and up-to-date demonstrations and knowledge, enhancing in-context learning for better detection of emerging news events. We further design a multi-round learning framework to ensure more reliable detection results. Our framework MRCD achieves SOTA results on two real-world datasets Pheme and Twitter16, with accuracy improvements of 7.4\% and 12.8\% compared to using only SLMs, which effectively addresses the limitations of current models and improves the detection of emergent fake news.
中文摘要:本文提出的多轮协作检测框架通过整合大小语言模型的优势,采用动态知识检索和多轮验证机制,在真实数据集上实现了7.4%-12.8%的假新闻检测准确率提升。
English Summary: The proposed Multi-Round Collaboration Detection (MRCD) framework effectively combines the strengths of large and small language models to significantly improve fake news detection accuracy by 7.4-12.8% through dynamic knowledge retrieval and multi-round verification.

Authors:Jinjie Mai, Wenxuan Zhu, Haozhe Liu, Bing Li, Cheng Zheng, Jürgen Schmidhuber, Bernard Ghanem
Title: Can Video Diffusion Model Reconstruct 4D Geometry?
Abstract:
Reconstructing dynamic 3D scenes (i.e., 4D geometry) from monocular video is an important yet challenging problem. Conventional multiview geometry-based approaches often struggle with dynamic motion, whereas recent learning-based methods either require specialized 4D representation or sophisticated optimization. In this paper, we present Sora3R, a novel framework that taps into the rich spatiotemporal priors of large-scale video diffusion models to directly infer 4D pointmaps from casual videos. Sora3R follows a two-stage pipeline: (1) we adapt a pointmap VAE from a pretrained video VAE, ensuring compatibility between the geometry and video latent spaces; (2) we finetune a diffusion backbone in combined video and pointmap latent space to generate coherent 4D pointmaps for every frame. Sora3R operates in a fully feedforward manner, requiring no external modules (e.g., depth, optical flow, or segmentation) or iterative global alignment. Extensive experiments demonstrate that Sora3R reliably recovers both camera poses and detailed scene geometry, achieving performance on par with state-of-the-art methods for dynamic 4D reconstruction across diverse scenarios.
Chinese: Sora3R是一种创新框架,利用大规模视频扩散模型直接从单目视频推断4D点云图,无需外部模块或迭代对齐即可实现高效精准的动态三维场景重建。
English: Sora3R is a novel framework that leverages large-scale video diffusion models to directly infer 4D pointmaps from monocular videos, enabling efficient and accurate dynamic 3D scene reconstruction without requiring external modules or iterative alignment.

Authors:Arnav Arora, Srishti Yadav, Maria Antoniak, Serge Belongie, Isabelle Augenstein
Title: Multi-Modal Framing Analysis of News
Abstract:
Automated frame analysis of political communication is a popular task in computational social science that is used to study how authors select aspects of a topic to frame its reception. So far, such studies have been narrow, in that they use a fixed set of pre-defined frames and focus only on the text, ignoring the visual contexts in which those texts appear. Especially for framing in the news, this leaves out valuable information about editorial choices, which include not just the written article but also accompanying photographs. To overcome such limitations, we present a method for conducting multi-modal, multi-label framing analysis at scale using large (vision-) language models. Grounding our work in framing theory, we extract latent meaning embedded in images used to convey a certain point and contrast that to the text by comparing the respective frames used. We also identify highly partisan framing of topics with issue-specific frame analysis found in prior qualitative work. We demonstrate a method for doing scalable integrative framing analysis of both text and image in news, providing a more complete picture for understanding media bias.
中文: 本研究提出了一种利用大型视觉语言模型的多模态多标签方法,通过整合文本和图像来分析新闻中的政治框架,克服了以往仅依赖文本的局限,为检测媒体偏见提供了更全面的工具。
English: This study introduces a multi-modal, multi-label method using large vision-language models to analyze political framing in news by integrating both text and images, overcoming previous limitations of text-only approaches and providing a comprehensive tool for detecting media bias.

Authors:Xueyin Li, Xinkai Jiang, Philipp Dahlinger, Gerhard Neumann, Rudolf Lioutikov
Title: Beyond Visuals: Investigating Force Feedback in Extended Reality for Robot Data Collection
Abstract:
This work explores how force feedback affects various aspects of robot data collection within the Extended Reality (XR) setting. Force feedback has been proved to enhance the user experience in Extended Reality (XR) by providing contact-rich information. However, its impact on robot data collection has not received much attention in the robotics community. This paper addresses this shortcoming by conducting an extensive user study on the effects of force feedback during data collection in XR. We extended two XR-based robot control interfaces, Kinesthetic Teaching and Motion Controllers, with haptic feedback features. The user study is conducted using manipulation tasks ranging from simple pick-place to complex peg assemble, requiring precise operations. The evaluations show that force feedback enhances task performance and user experience, particularly in tasks requiring high-precision manipulation. These improvements vary depending on the robot control interface and task complexity. This paper provides new insights into how different factors influence the impact of force feedback.
中文: 本研究探讨了力反馈在扩展现实中如何提升机器人数据收集,发现它能显著改善任务表现和用户体验,尤其在精密操作中,其效果因控制界面和任务复杂度而异。
English: This study investigates how force feedback enhances robot data collection in Extended Reality, finding it significantly improves task performance and user experience, especially in high-precision operations, with effects varying by control interface and task complexity.

Authors:Haci Ismail Aslan, Philipp Wiesner, Ping Xiong, Odej Kao
Title: $β$-GNN: A Robust Ensemble Approach Against Graph Structure Perturbation
Abstract:
Graph Neural Networks (GNNs) are playing an increasingly important role in the efficient operation and security of computing systems, with applications in workload scheduling, anomaly detection, and resource management. However, their vulnerability to network perturbations poses a significant challenge. We propose $β$-GNN, a model enhancing GNN robustness without sacrificing clean data performance. $β$-GNN uses a weighted ensemble, combining any GNN with a multi-layer perceptron. A learned dynamic weight, $β$, modulates the GNN's contribution. This $β$ not only weights GNN influence but also indicates data perturbation levels, enabling proactive mitigation. Experimental results on diverse datasets show $β$-GNN's superior adversarial accuracy and attack severity quantification. Crucially, $β$-GNN avoids perturbation assumptions, preserving clean data structure and performance.
中文: 提出的β-GNN模型通过带有动态参数β的加权集成方法,在保持洁净数据性能的同时增强图神经网络对抗扰动的鲁棒性,其中β既能调节GNN影响又可量化扰动程度,显著提升了对抗攻击下的准确率。
English: The proposed β-GNN model enhances the robustness of Graph Neural Networks against network perturbations through a weighted ensemble with a dynamic parameter β, which both modulates GNN influence and quantifies perturbation levels, achieving improved adversarial accuracy without compromising clean data performance.

Authors:Ryumei Nakada, Wenlong Ji, Tianxi Cai, James Zou, Linjun Zhang
Title: A Theoretical Framework for Prompt Engineering: Approximating Smooth Functions with Transformer Prompts
Abstract:
Prompt engineering has emerged as a powerful technique for guiding large language models (LLMs) toward desired responses, significantly enhancing their performance across diverse tasks. Beyond their role as static predictors, LLMs increasingly function as intelligent agents, capable of reasoning, decision-making, and adapting dynamically to complex environments. However, the theoretical underpinnings of prompt engineering remain largely unexplored. In this paper, we introduce a formal framework demonstrating that transformer models, when provided with carefully designed prompts, can act as a configurable computational system by emulating a ``virtual'' neural network during inference. Specifically, input prompts effectively translate into the corresponding network configuration, enabling LLMs to adjust their internal computations dynamically. Building on this construction, we establish an approximation theory for $β$-times differentiable functions, proving that transformers can approximate such functions with arbitrary precision when guided by appropriately structured prompts. Moreover, our framework provides theoretical justification for several empirically successful prompt engineering techniques, including the use of longer, structured prompts, filtering irrelevant information, enhancing prompt token diversity, and leveraging multi-agent interactions. By framing LLMs as adaptable agents rather than static models, our findings underscore their potential for autonomous reasoning and problem-solving, paving the way for more robust and theoretically grounded advancements in prompt engineering and AI agent design.
中文: 提示工程通过精心设计的提示使大语言模型能够作为可配置的计算系统,动态调整内部计算,从而近似复杂函数并为经验性技术提供理论依据,推动其向自适应智能代理发展。
English: Prompt engineering enables large language models to function as configurable computational systems by dynamically adjusting their internal processes through structured prompts, allowing them to approximate complex functions and validate empirical techniques with theoretical foundations.

Authors:Zhicheng Guo, Sijie Cheng, Yuchen Niu, Hao Wang, Sicheng Zhou, Wenbing Huang, Yang Liu
Title: StableToolBench-MirrorAPI: Modeling Tool Environments as Mirrors of 7,000+ Real-World APIs
Abstract:
The rapid advancement of large language models (LLMs) has spurred significant interest in tool learning, where LLMs are augmented with external tools to tackle complex tasks. However, existing tool environments face challenges in balancing stability, scalability, and realness, particularly for benchmarking purposes. To address this problem, we propose MirrorAPI, a novel framework that trains specialized LLMs to accurately simulate real API responses, effectively acting as "mirrors" to tool environments. Using a comprehensive dataset of request-response pairs from 7,000+ APIs, we employ supervised fine-tuning and chain-of-thought reasoning to enhance simulation fidelity. MirrorAPI achieves superior accuracy and stability compared to state-of-the-art methods, as demonstrated by its performance on the newly constructed MirrorAPI-Bench and its integration into StableToolBench.
Chinese: MirrorAPI 是一种新颖的框架,通过在有监督微调和思维链推理的基础上使用全面数据集,训练专门的大型语言模型来模拟真实 API 响应,从而在工具学习中实现了卓越的准确性和稳定性。
English: MirrorAPI is a novel framework that trains specialized large language models to simulate real API responses, achieving superior accuracy and stability for tool learning by using supervised fine-tuning and chain-of-thought reasoning on a comprehensive dataset.

Authors:Ji Woo Hong, Tri Ton, Trung X. Pham, Gwanhyeong Koo, Sunjae Yoon, Chang D. Yoo
Title: ITA-MDT: Image-Timestep-Adaptive Masked Diffusion Transformer Framework for Image-Based Virtual Try-On
Abstract:
This paper introduces ITA-MDT, the Image-Timestep-Adaptive Masked Diffusion Transformer Framework for Image-Based Virtual Try-On (IVTON), designed to overcome the limitations of previous approaches by leveraging the Masked Diffusion Transformer (MDT) for improved handling of both global garment context and fine-grained details. The IVTON task involves seamlessly superimposing a garment from one image onto a person in another, creating a realistic depiction of the person wearing the specified garment. Unlike conventional diffusion-based virtual try-on models that depend on large pre-trained U-Net architectures, ITA-MDT leverages a lightweight, scalable transformer-based denoising diffusion model with a mask latent modeling scheme, achieving competitive results while reducing computational overhead. A key component of ITA-MDT is the Image-Timestep Adaptive Feature Aggregator (ITAFA), a dynamic feature aggregator that combines all of the features from the image encoder into a unified feature of the same size, guided by diffusion timestep and garment image complexity. This enables adaptive weighting of features, allowing the model to emphasize either global information or fine-grained details based on the requirements of the denoising stage. Additionally, the Salient Region Extractor (SRE) module is presented to identify complex region of the garment to provide high-resolution local information to the denoising model as an additional condition alongside the global information of the full garment image. This targeted conditioning strategy enhances detail preservation of fine details in highly salient garment regions, optimizing computational resources by avoiding unnecessarily processing entire garment image. Comparative evaluations confirms that ITA-MDT improves efficiency while maintaining strong performance, reaching state-of-the-art results in several metrics.
中文: ITA-MDT框架采用自适应掩码扩散变换器进行虚拟试穿,通过动态特征聚合和针对性调节,在提升细节保留的同时优化了计算效率。
English: The ITA-MDT framework introduces an adaptive masked diffusion transformer for virtual try-on, enhancing detail preservation and efficiency through dynamic feature aggregation and targeted conditioning.

Authors:Muhammed Shafi K. P., Serena Nicolazzo, Antonino Nocera, Vinod P
Title: How Secure is Forgetting? Linking Machine Unlearning to Machine Learning Attacks
Abstract:
As Machine Learning (ML) evolves, the complexity and sophistication of security threats against this paradigm continue to grow as well, threatening data privacy and model integrity. In response, Machine Unlearning (MU) is a recent technology that aims to remove the influence of specific data from a trained model, enabling compliance with privacy regulations and user requests. This can be done for privacy compliance (e.g., GDPR's right to be forgotten) or model refinement. However, the intersection between classical threats in ML and MU remains largely unexplored. In this Systematization of Knowledge (SoK), we provide a structured analysis of security threats in ML and their implications for MU. We analyze four major attack classes, namely, Backdoor Attacks, Membership Inference Attacks (MIA), Adversarial Attacks, and Inversion Attacks, we investigate their impact on MU and propose a novel classification based on how they are usually used in this context. Finally, we identify open challenges, including ethical considerations, and explore promising future research directions, paving the way for future research in secure and privacy-preserving Machine Unlearning.
中文: 本知识系统化研究分析了机器学习中的安全威胁及其对机器遗忘的影响,提出了一种新的分类方法,并为安全遗忘的未来研究指明了方向。
English: This Systematization of Knowledge analyzes security threats in machine learning and their implications for machine unlearning, proposing a novel classification and identifying future research directions for secure unlearning.

Authors:Qiusheng Huang, Xiaohui Zhong, Xu Fan, Lei Chen, Hao Li
Title: FuXi-RTM: A Physics-Guided Prediction Framework with Radiative Transfer Modeling
Abstract:
Similar to conventional video generation, current deep learning-based weather prediction frameworks often lack explicit physical constraints, leading to unphysical outputs that limit their reliability for operational forecasting. Among various physical processes requiring proper representation, radiation plays a fundamental role as it drives Earth's weather and climate systems. However, accurate simulation of radiative transfer processes remains challenging for traditional numerical weather prediction (NWP) models due to their inherent complexity and high computational costs. Here, we propose FuXi-RTM, a hybrid physics-guided deep learning framework designed to enhance weather forecast accuracy while enforcing physical consistency. FuXi-RTM integrates a primary forecasting model (FuXi) with a fixed deep learning-based radiative transfer model (DLRTM) surrogate that efficiently replaces conventional radiation parameterization schemes. This represents the first deep learning-based weather forecasting framework to explicitly incorporate physical process modeling. Evaluated over a comprehensive 5-year dataset, FuXi-RTM outperforms its unconstrained counterpart in 88.51% of 3320 variable and lead time combinations, with improvements in radiative flux predictions. By incorporating additional physical processes, FuXi-RTM paves the way for next-generation weather forecasting systems that are both accurate and physically consistent.
中文:当前深度学习天气预报模型常缺乏物理约束,而FuXi-RTM提出了一种混合物理引导框架,通过集成辐射传输模型来提高预报精度和物理一致性。
English: Current deep learning weather prediction models often lack physical constraints, but FuXi-RTM introduces a hybrid physics-guided framework that integrates radiative transfer modeling to enhance forecast accuracy and physical consistency.

Authors:Yuncheng Lu, Shuang Liang, Hongxiang Fan, Ce Guo, Wayne Luk, Paul H. J. Kelly
Title: Versatile Cross-platform Compilation Toolchain for Schrödinger-style Quantum Circuit Simulation
Abstract:
While existing quantum hardware resources have limited availability and reliability, there is a growing demand for exploring and verifying quantum algorithms. Efficient classical simulators for high-performance quantum simulation are critical to meeting this demand. However, due to the vastly varied characteristics of classical hardware, implementing hardware-specific optimizations for different hardware platforms is challenging. To address such needs, we propose CAST (Cross-platform Adaptive Schrödiner-style Simulation Toolchain), a novel compilation toolchain with cross-platform (CPU and Nvidia GPU) optimization and high-performance backend supports. CAST exploits a novel sparsity-aware gate fusion algorithm that automatically selects the best fusion strategy and backend configuration for targeted hardware platforms. CAST also aims to offer versatile and high-performance backend for different hardware platforms. To this end, CAST provides an LLVM IR-based vectorization optimization for various CPU architectures and instruction sets, as well as a PTX-based code generator for Nvidia GPU support. We benchmark CAST against IBM Qiskit, Google QSimCirq, Nvidia cuQuantum backend, and other high-performance simulators. On various 32-qubit CPU-based benchmarks, CAST is able to achieve up to 8.03x speedup than Qiskit. On various 30-qubit GPU-based benchmarks, CAST is able to achieve up to 39.3x speedup than Nvidia cuQuantum backend.
中文:CAST是一个跨平台量子模拟工具链,通过自适应门融合和硬件后端优化,在CPU和GPU平台上均实现了相比现有模拟器的显著加速性能。
English: CAST is a cross-platform quantum simulation toolchain that optimizes performance for both CPUs and GPUs through adaptive gate fusion and backend configurations, achieving significant speedups over existing simulators.

Authors:Seungone Kim, Ian Wu, Jinu Lee, Xiang Yue, Seongyun Lee, Mingyeong Moon, Kiril Gashteovski, Carolin Lawrence, Julia Hockenmaier, Graham Neubig, Sean Welleck
Title: Scaling Evaluation-time Compute with Reasoning Models as Process Evaluators
Abstract:
As language model (LM) outputs get more and more natural, it is becoming more difficult than ever to evaluate their quality. Simultaneously, increasing LMs' "thinking" time through scaling test-time compute has proven an effective technique to solve challenging problems in domains such as math and code. This raises a natural question: can an LM's evaluation capability also be improved by spending more test-time compute? To answer this, we investigate employing reasoning models-LMs that natively generate long chain-of-thought reasoning-as evaluators. Specifically, we examine methods to leverage more test-time compute by (1) using reasoning models, and (2) prompting these models to evaluate not only the response as a whole (i.e., outcome evaluation) but also assess each step in the response separately (i.e., process evaluation). In experiments, we observe that the evaluator's performance improves monotonically when generating more reasoning tokens, similar to the trends observed in LM-based generation. Furthermore, we use these more accurate evaluators to rerank multiple generations, and demonstrate that spending more compute at evaluation time can be as effective as using more compute at generation time in improving an LM's problem-solving capability.
中文: 本研究探讨通过增加测试时计算能否提升语言模型的评估能力,采用推理模型对结果和过程进行双重评估,发现更多推理标记能提高评估性能,并通过重排序有效增强问题解决能力。
English: This study explores whether increasing test-time compute can enhance language models' evaluation capabilities by employing reasoning models that assess both outcomes and processes, finding that more reasoning tokens improve evaluator performance and can effectively boost problem-solving abilities through reranking.

Authors:Junzhi Ning, Dominic Marshall, Yijian Gao, Xiaodan Xing Yang Nan, Yingying Fang, Sheng Zhang, Matthieu Komorowski, Guang Yang
Title: Unpaired Translation of Chest X-ray Images for Lung Opacity Diagnosis via Adaptive Activation Masks and Cross-Domain Alignment
Abstract:
Chest X-ray radiographs (CXRs) play a pivotal role in diagnosing and monitoring cardiopulmonary diseases. However, lung opacities in CXRs frequently obscure anatomical structures, impeding clear identification of lung borders and complicating the localization of pathology. This challenge significantly hampers segmentation accuracy and precise lesion identification, which are crucial for diagnosis. To tackle these issues, our study proposes an unpaired CXR translation framework that converts CXRs with lung opacities into counterparts without lung opacities while preserving semantic features. Central to our approach is the use of adaptive activation masks to selectively modify opacity regions in lung CXRs. Cross-domain alignment ensures translated CXRs without opacity issues align with feature maps and prediction labels from a pre-trained CXR lesion classifier, facilitating the interpretability of the translation process. We validate our method using RSNA, MIMIC-CXR-JPG and JSRT datasets, demonstrating superior translation quality through lower Frechet Inception Distance (FID) and Kernel Inception Distance (KID) scores compared to existing methods (FID: 67.18 vs. 210.4, KID: 0.01604 vs. 0.225). Evaluation on RSNA opacity, MIMIC acute respiratory distress syndrome (ARDS) patient CXRs and JSRT CXRs show our method enhances segmentation accuracy of lung borders and improves lesion classification, further underscoring its potential in clinical settings (RSNA: mIoU: 76.58% vs. 62.58%, Sensitivity: 85.58% vs. 77.03%; MIMIC ARDS: mIoU: 86.20% vs. 72.07%, Sensitivity: 92.68% vs. 86.85%; JSRT: mIoU: 91.08% vs. 85.6%, Sensitivity: 97.62% vs. 95.04%). Our approach advances CXR imaging analysis, especially in investigating segmentation impacts through image translation techniques.
中文: 本研究提出了一种非配对胸片转换框架,能有效将含肺部阴影的X光图像转化为无阴影图像并保留关键特征,显著提升了多个数据集的肺部分割精度和病灶分类性能。
English: This study introduces an unpaired CXR translation framework that effectively converts chest X-rays with lung opacities into clear images while preserving diagnostic features, significantly improving segmentation accuracy and lesion classification across multiple datasets.